Big Data in Clinical Research
According to Wikipedia, “big data” consists of “data sets that are so large or complex that traditional data processing applications are inadequate… The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set.”
IBM has proposed a four-dimensional model of big data: volume, velocity, variety and veracity. By these measures, the biggest big data set is very large, grows very fast, includes many types of data from multiple sources, and provides the most accurate and complete picture.
In clinical research, the term “big data” is often used loosely to refer to large data sets, non-relational datasets, or advanced methods of visualization. For example, a “large simple trial” with 25,000 patients conducted 20 years ago would probably be considered an exercise in big data today, even though the data set is large only in comparison to other clinical trial data sets, is static in size, includes very few types of data (which can be captured in a relational database), and presents a very narrow, albeit accurate and precise, picture.
Big data technology is being used now, or soon will be, to address a variety of important clinical research challenges, including the following:
- Literature review
- Feasibility analysis
- Site selection
- Patient recruitment
- Social listening for patient sentiment and study participant chatter
- Pharmacovigilance/safety monitoring
- Data and safety monitoring by DSMBs
- Risk-based monitoring
- Study forecasting
- Meta-analysis across multiple studies (and multiple drugs), e.g., to measure efficacy, safety or placebo effects
A big data application is likely to use one or more of the following technologies/buzzwords (with simplified definitions):
- Machine learning. A computing process that gets “smarter” as it analyzes more and more data
- Deep learning. A form of machine learning that uses neural networks to model high-level abstractions
- Cognitive computing. A computing process modeled on the workings of the human brain
- Image processing. A computing process that transforms the data in an image into a form better suited for analysis
- Pattern recognition. A computing process that recognizes patterns in the data, e.g., of possible tumors in x-ray images, typically without being able to explain how it recognizes the patterns
- Visualization. The presentation and use of data in visual forms optimized for human analysis, typically in a rapid process of investigation
- Predictive analytics. Measures that can be used to predict the future, e.g., that patient enrollment will be insufficient (see the enrollment sketch after this list)
- Data mining. An automated or semi-automated process of exploring a large data set for meaning, e.g., looking for correlations between possible causes and effects; can be misused on a clinical trial data set that was created for the specific purpose of testing a study hypothesis
- Genomics. In this context, the use of large collections of DNA sequence data
- Bioinformatics. The application of computer technology to the management and analysis of biological information
- External/connected data. Data from multiple data sets, which might be combined with clinical study data, to form a more complete picture, e.g., of clinical research site performance
- Dynamic validation. Real-time validation of new data, especially when more sophisticated than simple field checks like age range (see the validation sketch after this list)
- Semantics. The assignment of meaning to data, most commonly when attempting to find shared meanings across multiple data sets, for example, the mapping of age in one data set to birth date in another
- Hybrid query. A question asked of a computer that requires examining multiple types of data (numbers, text, images), for example, the age, blood pressure, gender, progress notes, and x-ray images of a patient
- Terabyte. One trillion bytes, or a million megabytes. Some commercial computing systems can fit all the electronic medical records of a large health system into main memory (roughly 50 terabytes of RAM, not disk storage) for extremely fast processing
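To make “predictive analytics” concrete, here is a minimal sketch of an enrollment projection, assuming only a list of monthly accrual counts. The function name, the numbers, and the naive linear model are illustrative assumptions, not a production forecasting method.

```python
def project_enrollment(monthly_counts, target, months_remaining):
    """Project final enrollment from observed monthly accrual.

    A naive linear projection: it assumes the recent average rate
    continues unchanged. A real forecasting model would also adjust
    for site activation, seasonality, and dropout.
    """
    recent = monthly_counts[-3:]          # last three months of accrual
    rate = sum(recent) / len(recent)      # average patients per month
    projected = sum(monthly_counts) + rate * months_remaining
    shortfall = max(0.0, target - projected)
    return projected, shortfall

# Hypothetical study: accrual is slowing, 800 patients needed,
# 6 months left in the recruitment window.
projected, shortfall = project_enrollment(
    monthly_counts=[85, 75, 70, 60, 55, 45, 30],
    target=800,
    months_remaining=6)
print(f"Projected: {projected:.0f} of 800; shortfall: {shortfall:.0f}")
```

With these illustrative numbers, the projection (680 of 800) warns six months in advance that enrollment will fall short, which is exactly the kind of early signal predictive analytics is meant to provide.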
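The “dynamic validation” and “semantics” entries can be illustrated together: a minimal sketch, assuming hypothetical field names, that maps a birth date to an age (so two data sets keyed differently can be merged) and then applies cross-field edit checks that go beyond a simple age-range test.

```python
from datetime import date

def age_from_birth_date(birth_date: date, as_of: date) -> int:
    """Derive age so a data set keyed on birth date can be merged
    with one keyed on age (a simple semantic mapping)."""
    years = as_of.year - birth_date.year
    # Subtract a year if the birthday has not yet occurred this year.
    if (as_of.month, as_of.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

def validate_record(record: dict) -> list:
    """Dynamic validation: cross-field checks that go beyond a simple
    range check on a single field. All field names are hypothetical."""
    problems = []
    if not 0 <= record["age"] <= 120:                    # basic range check
        problems.append("age out of range")
    if record["sex"] == "M" and record.get("pregnant"):  # cross-field check
        problems.append("pregnancy recorded for male patient")
    if record["visit_date"] < record["consent_date"]:    # temporal check
        problems.append("visit precedes informed consent")
    return problems

record = {
    "age": age_from_birth_date(date(1980, 6, 15), date(2016, 3, 1)),
    "sex": "M", "pregnant": False,
    "consent_date": date(2016, 1, 10), "visit_date": date(2016, 2, 1),
}
print(validate_record(record))  # [] means no problems found
```

Cross-field and temporal checks like these catch errors that no single-field range check can, which is what distinguishes dynamic validation from traditional edit checks.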
One of the most exciting applications of big data technology is to analyze the data generated by wearable sensors, which can be millions of times more voluminous than the data that can be collected in study visits. For example, can QT interval safety data be monitored in real time to detect transitory adverse events? Can real-time temperature measurements track the effect of antibiotics? Can adjustments be made for placebo effects based on temporal patterns?
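To make the QT question concrete, here is a minimal sketch of a sliding-window monitor over a stream of QTc readings. The 500 ms threshold, the window length, and the hit count are illustrative assumptions, not clinical criteria, and the alarm logic is far simpler than a real pharmacovigilance system would use.

```python
from collections import deque

def monitor_qtc(stream, threshold_ms=500, window=5, min_hits=3):
    """Flag transient QTc prolongation in a continuous sensor stream.

    Emits an alert when at least `min_hits` of the last `window`
    readings exceed the threshold, so a single noisy reading from a
    wearable does not trigger a false alarm.
    """
    recent = deque(maxlen=window)
    for timestamp, qtc_ms in stream:
        recent.append(qtc_ms)
        hits = sum(1 for v in recent if v > threshold_ms)
        if hits >= min_hits:
            yield timestamp, qtc_ms, hits

# Simulated wearable readings: (seconds, QTc in ms).
readings = [(0, 430), (10, 440), (20, 505), (30, 512), (40, 520),
            (50, 450), (60, 445)]
for ts, qtc, hits in monitor_qtc(readings):
    print(f"t={ts}s: possible transient QT prolongation "
          f"({hits} of last 5 readings > 500 ms, latest {qtc} ms)")
```

Requiring several elevated readings within the window trades a little detection latency for robustness against sensor noise, a trade-off any real-time monitoring design must make.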
However, big data technology carries big risks. As the number of data sources and data types increases, risks to confidentiality and privacy grow: it becomes easier to re-identify masses of anonymized data and harder to predict how the re-identification might occur. Similarly, as data sets get bigger, they become more valuable and, thus, more attractive to hackers. For example, the Chinese government probably has the full records, including biometrics, of 22 million U.S. government employees, which could be combined with other data sets to identify the health status of those employees.
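How such re-identification happens can be shown with a toy linkage sketch: quasi-identifiers left in an “anonymized” data set are joined against an external data set that carries names. Every record below is fabricated for illustration.

```python
# An "anonymized" clinical data set still carries quasi-identifiers
# (ZIP code, birth year, sex) that can be joined against an external
# data set that includes names. All records here are fabricated.
anonymized = [
    {"zip": "94301", "birth_year": 1968, "sex": "F", "diagnosis": "T2DM"},
    {"zip": "10013", "birth_year": 1975, "sex": "M", "diagnosis": "CHF"},
]
external = [  # e.g., a voter roll or a breached personnel file
    {"name": "Jane Roe", "zip": "94301", "birth_year": 1968, "sex": "F"},
    {"name": "John Doe", "zip": "10013", "birth_year": 1975, "sex": "M"},
]

quasi_identifiers = ("zip", "birth_year", "sex")
index = {tuple(p[k] for k in quasi_identifiers): p["name"] for p in external}

for record in anonymized:
    key = tuple(record[k] for k in quasi_identifiers)
    if key in index:  # a unique match re-identifies the patient
        print(f"{index[key]} -> {record['diagnosis']}")
```

The more data sets an attacker can draw on, the more quasi-identifier combinations become unique, which is why re-identification risk grows with the number of available sources.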
Big data can only get bigger and more sophisticated. More data has been created in the last three years than in all of previous human history, and the trend can only accelerate as, for example, the cost of a whole-genome DNA sequence approaches the current cost of a comprehensive blood chemistry panel. New and unpredictable sources of data will prove useful, e.g., how does your mobile-phone Internet usage reflect your adherence to a drug regimen?
Author: Norman M. Goldfarb is Managing Director of First Clinical Research LLC, a provider of clinical research best practices information services. Contact him at 1.650.465.0119 or ngoldfarb@firstclinical.com.
First published in the Journal of Clinical Research Best Practices. © Norman M Goldfarb