Technologies such as genome sequencing, gene expression profiling, proteomic and metabolomic analyses, electronic medical records, and patient-reported health information have produced large amounts of data, from various populations, cell types, and disorders (big data). However, these data must be integrated and analyzed if they are to produce models or concepts about physiologic function or mechanisms of pathogenesis. Many of these data are available to the public, allowing researchers anywhere to search for markers of specific biologic processes or therapeutic targets for specific diseases or patient types.
We review recent advances in the fields of computational and systems biology, and highlight opportunities for researchers to use big data sets in the fields of gastroenterology and hepatology, to complement traditional means of diagnostic and therapeutic discovery.
In 2003, the completion of the Human Genome Project—which culminated in the public release of the first sequenced and annotated genome derived from human DNA— was heralded as the dawning of the genomic era.1 Since that time, continued technological advances have enabled the rapid and cost-effective analysis of DNA, RNA, protein, and other biomolecules in large cohorts of patients. The integration of multiple types of omics* experiments (see Glossary for terms marked with *) across populations and conditions, made possible by the rapid accumulation of data generated using these technologies, has begun to yield clinically impactful discoveries by reanalysis of the deposited data.2 However, this surfeit of data has also made the analysis of omics studies an increasingly challenging task. Illustrating the scope of this problem, the European Bioinformatics Institute (EBI) reported in early 2016 housing 75 petabytes of publicly-accessible data3 (a quantity that would take more than nineteen years to download on an exceedingly fast 1 gigabit-per-second Internet connection), and between the two major public repositories of genomics data, ArrayExpress4 and the Gene Expression Omnibus5 (GEO), there are nearly two million samples currently available (for an overview of big data resources, see Table 1 and Supplementary Table 1). In addition to transcriptomic and genomic (i.e., DNA sequence/variants) datasets, additional types of omics data, assaying the proteome, metabolome, kinome, methylome, acetylome, lipidome, microbiome, phenome, exposome, metagenome, and interactome, are increasingly being deposited for public use.
In parallel, the widespread adoption of electronic health records6 (EHR*) has also generated massive amounts of digitized personal health information, as has the increasing popularity of automatic serial data acquisition from wearable devices/technologies*7 and web applications that collect patient-reported health information* (e.g., the www.HepCure.org portal for hepatitis C patients and their physicians). Unprocessed clinical trial data will also soon become more widely accessible. Earlier this year, the International Committee of Medical Journal Editors issued a proposal that, if accepted, will require authors of clinical trials to make de-identified patient data publicly available after a 6-month embargo period, with the intention of increasing transparency and reproducibility of the trial results, and facilitating large-scale secondary analyses by external researchers.8 This and other open data initiatives—including the newly-launched Genomic Data Commons, which aims to serve as a hub for existing and future cancer research data9—will dramatically expand the public store of data available for both mining and integrated meta-analyses.
Developments like these have propelled biomedical research into the era of big data.* Given the hypothesis-free nature of data mining techniques, big data can be used to obtain a global perspective that complements the focused mechanistic studies typical of experimental biology, and enable the detection of high-level information patterns that would otherwise be impossible to perceive. Such approaches will help clarify the pathogenesis and proper classification of complex diseases, which typically involve a wide range of causal factors. For example, a recent integrated analysis of multiple datasets has defined three subtypes of type 2 diabetes that would not have been apparent based solely on clinical assessment.10
Additionally, the establishment of regional or national biobanks* (e.g., UK Biobank, www.ukbiobank.ac.uk) and large multicenter/national consortia (e.g., International Cancer Genome Consortium [ICGC], icgc.org) provide opportunities to more effectively integrate the breadth of human diversity (e.g., age group, sex, race/ethnicity, and environmental exposures) into biomarker and therapeutic discovery.
The potential value of big data in clinical medicine and basic science has been widely acknowledged.11 For instance, due to the practical limitations in designing and implementing randomized clinical trials (RCTs) to address many important clinical questions, the mining of retrospective EHR and other large-scale clinical outcomes data has been proposed as a supplement to RCTs in the generation of practice-guiding evidence.12 A restructured clinical taxonomy—one which moves away from the current organ/symptom-based classification system in favor of molecular descriptions of disease—has also been identified as a critical step toward precision medicine.*13 However, the current integration of molecular science into clinical medicine requires substantial progress before big data can make a meaningful impact on health care. The field of translational bioinformatics* has arisen over the past decade specifically to address this challenge, aiming to harness big data by developing statistical techniques and computational infrastructures capable of integrating and analyzing large, heterogeneous datasets, and ultimately deriving clinically-relevant insights that address unmet diagnostic and therapeutic needs across broad medical disciplines (see Figure 1 for an overview of the typical big data-driven workflow). Compared to a purely experimental approach, the incorporation of data mining and analytics tools into the biomedical pipeline is expected to shorten developmental timelines, reduce costs, and improve the success rate of candidate diagnostic and therapeutic tools (see Figure 2).
Given these advantages, big data-based approaches are likely to have many productive applications within gastroenterology and hepatology, particularly for diseases in which diagnostic methods and/or treatments are imprecise. For example, although our increased understanding of the associated immune dysregulation has yielded several important therapies for the inflammatory bowel diseases, their incidence is still increasing worldwide, and a significant proportion of patients do not achieve adequate remission of symptoms. Big data might also be analyzed in studies of gastrointestinal dysmotility, which is difficult to manage, or irritable bowel syndrome and other functional disorders, which are imperfectly understood and are difficult to diagnose and treat. There are also many sequence, gene expression, proteomic, and metabolomic data available on hepatic, colorectal, and gastric cancers, as well as pancreatic adenocarcinoma, that could be used to increase early detection or provide therapeutic targets.
In the liver, promising areas for data-driven discovery include viral hepatitis, where the complex interactions between viral heterogeneity, host genetic variations, and environmental factors in disease pathogenesis have not yet been satisfactorily integrated; liver cancer (hepatocellular carcinoma [HCC] and intrahepatic cholangiocarcinoma [ICC]), which is increasing in incidence yet for which therapies remain limited; progressive hepatic fibrosis, which likely shares core fibrosis pathways with other fibrotic diseases, yet for which there are no approved anti-fibrotic drugs; non-alcoholic fatty liver disease, in which recent epidemiological studies underscore the risk of developing liver cancer even in the absence of cirrhosis, emphasizing the need to identify biomarkers and targets for the “next global liver disease epidemic”;15 and acute-on-chronic alcoholic hepatitis, where mortality is still extremely high (up to 50%), treatment options are limited, and the precise identification of high-risk populations in need of therapy is still challenging.16
In this review, we will cover the currently available resources relevant to big data-driven research, and discuss future prospects for the integration of these resources within the fields of gastroenterology and hepatology.
Biomarkers may be classified as diagnostic, prognostic, or therapeutic response-predictive depending on their intended use. Diagnostic biomarkers are used to determine the likelihood that a patient is suffering from a specific disease. Prognostic biomarkers inform physicians regarding the risk of clinical outcomes, such as cancer recurrence or disease development and progression, which may be used to assist patients and physicians in determining the appropriate aggressiveness of follow-up and/or care. Therapeutic response-predictive biomarkers are more specific because they are used to predict an individual’s response to specific treatments. Biomarker development follows the sequential processes of discovery, validation, and clinical implementation, with the eventual goal of establishing accessible tests that can be used to guide clinical decision making.17,18
Many of the candidate biomarkers reported recently have not been successfully translated into clinical practice, often because they did not pass the rigorous validation phase assessing technical/analytical validity (reproducibility and robustness of measurement) and clinical utility (replicated diagnostic, prognostic, or predictive capability in specific clinical contexts).17,19–21 Optimal study design is a key issue in maximizing the reliable discovery and successful validation of biomarkers. In many cases, sample availability may be limited, and both prospective enrollment and longitudinal follow-up studies to validate biomarkers over time can be costly and challenging to manage. In addition, cultural, environmental, and other variations across populations often necessitate large sample sizes to ensure generalizability, further complicating the design of appropriate studies.
With omics technologies, relatively rare genetic aberrations are increasingly identified as candidate predictive biomarkers of drug response, especially in the field of oncology.22 The therapeutic benefit of experimental therapies targeting such pathogenic aberrations often cannot be detected in the traditional “all comer” clinical trial design, which enrolls patients irrespective of the presence of these aberrations. Alternatively, new clinical trial designs, first stratifying the enrolled patients by molecular tests and then assigning a potentially effective therapy to each individual, have been evaluated (referred to as “umbrella” or “basket” trial designs).23 However, it is worth noting that performance of the biomarkers, e.g., positive/negative predictive value, should ideally be well defined prior to conducting biomarker-enriched clinical trials to ensure proper interpretation of a therapeutic benefit. The emergence of publicly available randomized controlled trial data (especially those that include -omic characterization of study participants) may allow post hoc assessment of predictive biomarkers before adopting them in prospective biomarker-enriched trials. Detailed molecular characterization of extreme responders has also been explored as an option for clinical biomarker-drug testing.24
Emerging public and private big data resources (including those listed in Table 1) will help overcome these challenges by enhancing the availability of data and/or samples. As the diversity and comprehensiveness of patient cohorts in these databases expands, ‘virtual’ patient enrollment will shorten the discovery process, improve reliability, and reduce costs of biomarker assessment in clinical trials through the incorporation of in silico validation* (Figure 3). The National Cancer Institute (NCI) has recommended improved sharing of existing specimens and data to create a NCI-wide inventory of specimens and cancer diagnosis data, and is funding pilot projects to support these efforts.25 The Institute of Medicine also encourages public sharing of clinical trial data while minimizing the risks and burdens of sharing.8,26 Although a number of ethical and legal challenges remain, these and related publicly–supported efforts to share data will accelerate biomarker development.
Within gastroenterology and hepatology, big data-driven approaches have identified several promising biomarkers (Table 2). Meta-analyses of publicly available genomic datasets have identified colorectal cancer diagnostic biomarkers,27 as well as molecular signatures that sub-classify pancreatic cancer,28 HCC,29 and colorectal cancer.30 An analysis of multiple transcriptomic profiles from cell types within the liver has yielded a 122-gene signature that defines the presence of hepatic stellate cells in fibrotic livers and correlates with clinical outcomes.31 Multi-cohort transcriptome analysis has also identified and validated a 186-gene hepatic signature predicting increased HCC risk and poorer prognosis in cirrhotic subjects.32–34 A reanalysis of 2,000 publicly available colorectal cancer transcriptomic profiles has led to identification of CDX2 as a prognostic biomarker predictive of disease-free survival, as well as a predictive biomarker of response to adjuvant chemotherapy in stage II/III disease.35
Nevertheless, very few of the candidate biomarkers described in the literature have been implemented in clinical care because of several obstacles, including costly and lengthy assay development and prospective clinical utility validation, uncertain intellectual property regulations to protect omics/big data-driven biomarkers, and an unclear path toward regulatory approval and reimbursement.22 Instead, optimizing and validating biomarkers in silico using big data resources could reduce time and cost, and substantially lower the bar for clinical translation of molecular biomarkers.
Integrating big data analytics and validating drugs in silico has the potential to improve the cost-effectiveness of the drug development pipeline. Here we review the two major drug discovery approaches—de novo development and drug repurposing—and the related computational techniques and resources that support them.
Despite enormous investment in research and development (R&D) within the pharmaceutical industry, the rate at which new drugs are approved has not meaningfully increased over the past two decades.36 Further, the cost of developing a new drug remains high, ranging from $3 billion to more than $30 billion per approval between 2006 and 2014,37 reflecting the complex challenges involved in meeting current scientific, regulatory and commercial requirements. An over-reliance on in vitro high-throughput drug screening (HTS)* and the “one-drug-one-target-one-disease” concept is cited by some as a contributing factor in the abundance of late-stage R&D failures in recent years, many of which were the result of poor efficacy and unexpected toxicity of lead compounds developed using HTS technology.38,39 In contrast, certain experimental systems identify candidate drugs based on higher-level readouts of pharmacologic activity, in order to predict the effects of a compound in vivo.40,41 Phenotypic screens using animal or cell-based models of disease offer improved performance in this regard, but come with their own set of drawbacks, including relatively low throughput, high expense, mechanistic uncertainty, and limited coverage of the full spectrum of human disease.
Big data-driven strategies are being increasingly used to address these challenges. Computational prediction of drug toxicity and pharmacodynamic/pharmacokinetic properties, based on integration of multiple data types, helps prioritize compounds for in vivo and human testing, potentially reducing costs.42 In particular, computational exclusion of drugs that are likely to be toxic, prior to clinical assessment, will enhance patient safety while minimizing delays and expense, since drug toxicity is a major reason for failed clinical trials. For example, IL-17-targeting therapy, which has efficacy in rheumatic diseases, was found to be ineffective—and even harmful—in IBD, contrary to expectations based on the similar inflammatory features of these conditions. Global readouts of drug activity are expected to help clarify the causal relationships in such cases.
Similarly, chemical structure-based prediction of pharmacologic activity can identify more potent candidate compounds.43 Large-scale compound library screening datasets and cheminformatics* tools deposited in publicly available databases can enable in silico reanalysis for virtual drug exploration. The characterization of global transcriptional changes has been widely proposed as a universal readout to quantitatively assess disease states and drug responses. This approach allows drug-disease matching in a high-throughput, low-cost, and mechanistically revealing manner, while still providing the organism- or organ-system-level view of disease missing from target-driven studies.
A complementary approach to the discovery of new compounds is drug repurposing (also called drug repositioning), which entails the discovery of new indications for existing drugs. To date, successful drug repurposing has largely resulted from serendipity rather than systematic exploration.44 A classic example is sildenafil, which was repurposed from use in angina to erectile dysfunction based on an unexpected clinical effect. Similarly, thalidomide, in spite of its well-known teratogenicity, was successfully repositioned as an effective treatment for multiple myeloma and leprosy.45 As public big data continues to accumulate, computational screening methods will foster a more systematic and comprehensive approach to drug repurposing (an example of the repurposing pipeline is outlined in Figure 4).
Conceptually, drug repurposing can be viewed as an optimization of the pharmacopoeia, aiming to maximize therapeutic efficiency within a fixed catalog of drugs and diseases.46 As such, repurposing has several attractive features as a complement to de novo drug development. First, the costs and time requirements associated with drug repurposing are greatly reduced,47 particularly for medications that have already been approved for clinical use in another indication or have cleared safety issues in phase I clinical trials.48 Additionally, in the proper clinical setting, off-label use prior to regulatory approval could further reveal the full clinical potential of repurposed compounds in a time- and cost-efficient manner. Second, reduced financial and regulatory barriers make drug repurposing an attractive option for rare and neglected diseases, which are generally less likely to be targeted by pharmaceutical companies due to lower profit potentials.49 The Orphan Drug Act, which incentivizes the development of drugs for rare diseases, has increased industry interest in this area, as has the recognition that so-called “niche busters” may mitigate the financial risk of pursuing large-market blockbusters,50 but there is still a significant unmet worldwide need. Drug repurposing may therefore serve a critical role in bringing valuable treatments to underserved patients and populations.
Third, the growing availability of publicly-accessible cheminformatics* data and advanced computational tools is allowing academic researchers to assist and even replace industry partners as the primary drivers of drug repurposing efforts.51 Finally, big data-based drug repurposing will be closely aligned with precision medicine, which has recently been established as a national priority.52 As stated earlier, increased characterization of the molecular mechanisms of disease has led to a rethinking of the traditional clinical taxonomy, moving from symptom-based descriptors to a molecular classification system.13 Omics-guided drug repurposing aims to discover molecular taxonomy-based therapeutic indications, which is an integral goal of precision medicine.
The recent explosion of omics data has radically changed approaches to therapeutic discovery, particularly for drug repurposing. Cost-effective, high-throughput technologies can now characterize disease states at multiple levels to generate a multidimensional molecular “disease signature”*.44 Such a signature may include transcriptomic, proteomic or other changes as functional readouts of disease activity. In parallel, several search engines, most notably the Connectivity Map (CMap)53 and Library of Integrated Network-based Cellular Signatures (LINCS),54 have cataloged the effects of pharmacologic compounds on a variety of cell types. These databases may be queried in order to identify candidate compounds that are likely to either reverse a disease signature—a technique known as “signature inversion”44—or mimic desirable changes. Given that these databases contain data on many currently approved drugs, signature inversion studies can rapidly identify repurposing candidates while prioritizing widely available generic drugs. Several notable examples of this approach include the identification of topiramate as a potential treatment for inflammatory bowel disease,55 chlorpromazine, trifluoperazine56 and prenylamine57 to treat HCC, citalopram as a candidate therapy for colorectal cancer,58 and the HDAC inhibitor vorinostat for gastric cancer.59 Taking this concept further, Suthram et al derived a network of disease signatures mapped onto protein interaction data, and identified a subset of modules that were common to many diseases.60 Importantly, these shared modules were enriched with druggable proteins, confirming the potential of transcriptomic data to identify therapeutically relevant targets.
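The intuition behind signature inversion can be illustrated with a toy rank-based score. This is only a schematic sketch with fabricated gene names and drug profiles: the actual CMap/LINCS connectivity score is a Kolmogorov-Smirnov-style enrichment statistic, not the simple mean-rank difference used here.

```python
# Toy "signature inversion" score: a drug whose induced expression profile
# opposes the disease signature (disease-up genes pushed down, disease-down
# genes pushed up) receives a negative score, flagging it as a candidate.

def inversion_score(disease_up, disease_down, drug_profile):
    """drug_profile: genes ordered from most up- to most down-regulated
    by the drug; lower rank index = more strongly induced."""
    rank = {gene: i for i, gene in enumerate(drug_profile)}
    mean_up = sum(rank[g] for g in disease_up) / len(disease_up)
    mean_down = sum(rank[g] for g in disease_down) / len(disease_down)
    # negative when the drug pushes disease-up genes to the bottom of its
    # profile and disease-down genes to the top (i.e., inverts the signature)
    return (mean_down - mean_up) / len(drug_profile)

disease_up = ["IL6", "TNF"]        # hypothetical genes elevated in disease
disease_down = ["ALB", "CYP3A4"]   # hypothetical genes suppressed in disease
drug_profile = ["ALB", "CYP3A4", "GAPDH", "ACTB", "IL6", "TNF"]
print(round(inversion_score(disease_up, disease_down, drug_profile), 2))  # -0.67
```

Here the hypothetical drug up-regulates the disease-suppressed genes and down-regulates the disease-elevated ones, so the score is strongly negative, which is the pattern a repurposing screen would prioritize.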
Genome Wide Association Studies (GWAS)—linking germline DNA polymorphisms with clinical phenotypes—as well as expression quantitative trait locus (eQTL) analyses that integrate gene expression data, have enabled identification of numerous disease susceptibility genes and more efficient discovery of therapeutic strategies, without relying on a priori biological hypotheses. These advances have been a driving force in the creation of large patient cohorts accompanied by archived biospecimens and omics databases.61 Similarly, the study of somatic DNA structural and/or chemical alterations has greatly accelerated cancer drug discovery and development.62
It should be noted that there are a number of challenges when analyzing omics data, including high dimensionality (i.e., many more variables than samples, which can lead to the emergence of spurious associations),63 marked heterogeneity in data attributes (e.g., diversity in assay platform, experimental conditions, analytical methodologies, etc.),64 and an imperfect concordance between different types of biomolecules (e.g., mRNA transcripts and the corresponding translated proteins).65 Other types of data, such as chemical modifications, enzymatic activities, and genotypes, can also be integrated as additional inputs for a multilayered omics characterization,66–68 adding valuable information but also increasing the complexity of analysis.
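The high-dimensionality problem noted above can be demonstrated with a few lines of simulation. The sketch below uses purely random data: with hundreds of features and only twenty samples, at least one feature will appear strongly correlated with the outcome by chance alone.

```python
# Synthetic demonstration of spurious associations in high-dimensional data:
# every value below is random noise, yet the best-correlating "feature"
# still shows a sizeable correlation with the "outcome".
import random

random.seed(0)

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

n_samples, n_features = 20, 500   # omics-like regime: features >> samples
outcome = [random.gauss(0, 1) for _ in range(n_samples)]
best_r = max(
    abs(pearson([random.gauss(0, 1) for _ in range(n_samples)], outcome))
    for _ in range(n_features)
)
print(round(best_r, 2))  # a strong-looking correlation arising from pure noise
```

This is why multiple-testing correction and independent validation cohorts are indispensable when mining omics datasets.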
Omics-based big data may also be used to link drugs and indications by making novel connections between distinct data types and domains (the “guilt-by-association” approach).44,69 These connections can be made in a number of different ways (e.g., common molecular dysregulations in disease states, shared indications between apparently unrelated compounds,70 and drug side-effect overlap71), based on the assumption that one type of similarity implies another. Unexpected drug side effects— which may or may not be undesirable, depending on the clinical context—provide a rich source of functional drug information, both as a means of discovering related groups of compounds through their shared effect profiles,72 or by directly leading to the identification of repurposing applications (for example, a drug with a side-effect of urinary retention might be rationally repurposed to the treatment of urge incontinence).73 To facilitate side-effects-driven analyses, Kuhn et al compiled a database of known drug-side-effect associations in an easily mineable format, including side-effect frequencies for many drugs.74 Additionally, several user-friendly tools have been developed for the quantification of disease and drug similarity (see Table 1, Supplemental Table 1, and 75). There have also been recent attempts to integrate multiple types of disease-disease and drug-drug similarity within a single analytic pipeline for drug discovery.76–80
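A minimal guilt-by-association calculation might score drug-drug similarity by the overlap of side-effect profiles, as in the sketch below. The drugs and side effects are fabricated for illustration; real analyses mine curated resources such as the side-effect database of Kuhn et al cited above.

```python
# Guilt-by-association sketch: Jaccard overlap of side-effect profiles as a
# drug-drug similarity score. High overlap hints at a shared mechanism --
# and, by the similarity assumption, possibly a shared indication.

def jaccard(a, b):
    return len(a & b) / len(a | b)

side_effects = {               # hypothetical drugs and effect profiles
    "drug_A": {"nausea", "dizziness", "urinary_retention"},
    "drug_B": {"nausea", "dizziness", "urinary_retention", "dry_mouth"},
    "drug_C": {"rash", "headache"},
}

print(jaccard(side_effects["drug_A"], side_effects["drug_B"]))  # 0.75
print(jaccard(side_effects["drug_A"], side_effects["drug_C"]))  # 0.0
```

Under the guilt-by-association assumption, drug_A and drug_B would be investigated for overlapping therapeutic uses, whereas drug_C would not.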
Human diseases are generally the result of complex interactions between a variety of biomolecules within a multi-scale biological system, ultimately conspiring to bring about the organism-level phenotypes that are observed clinically.81 Network biology/medicine*, which conceptually represents different aspects of the cellular environment as nodes, connected by edges and spatially interrelated into modules, attempts to capture the full complexity of biological interactions on the system level. Compared to a reductionist approach, network medicine is thought to better recapitulate the fundamental biological processes and machinery—rather than their components— which bring about complex disease states. A central implication of the network hypothesis is that many diseases with multi-level causal event interactions will be more effectively treated by promiscuous (i.e., multitarget) drugs than by highly-selective individual compounds, since multiple targets are engaged to ameliorate the disease phenotype.82
In addition to providing an intuitive means of representing the interactions that drive cellular biology, networks also enable the use of sophisticated computational comparisons, a reflection of the field’s mathematical roots as a derivative of graph theory. For instance, a curated network of drugs, proteins and side-effects may be used to visually explore the topological connections between compounds, which would otherwise be difficult to chemically relate.83 Similarly, networks have been used as a means of relating drugs and disease through common protein interactions.84
Omics data may also be integrated into networks. For example, Iorio et al generated a drug network by iteratively refining relationships between drug-induced transcriptional signatures, ultimately discovering “drug neighborhoods” which implied shared mechanisms and indications for member compounds.85 Networks additionally provide a convenient means of combining disparate data types into a unified analysis. In a recent study, Menche et al. attempted to relate diseases by topological network parameters using an integrated “interactome”, which they compiled from all known intracellular interactions. The authors considered diseases to be related if they were relatively adjacent on the network, and were able to demonstrate significant associations between this network proximity and externally-derived disease features, including gene expression and symptomatology.86
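The notion of network proximity used by Menche et al can be sketched with a toy interactome. The graph and gene names below are hypothetical, and the closest-distance measure is a simplification of the separation metric used in the original study.

```python
# Toy network-proximity sketch: relate two disease "modules" (gene sets) by
# their average shortest-path distance on a small, fabricated interactome.
from collections import deque

interactome = {                       # undirected adjacency list
    "A": {"B", "C"}, "B": {"A", "D"}, "C": {"A", "D"},
    "D": {"B", "C", "E"}, "E": {"D", "F"}, "F": {"E"},
}

def shortest_path(graph, src, dst):
    # breadth-first search for unweighted shortest-path length
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in graph[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return float("inf")

def module_distance(graph, mod1, mod2):
    # mean, over genes in mod1, of the distance to the closest gene in mod2
    return sum(min(shortest_path(graph, g, h) for h in mod2)
               for g in mod1) / len(mod1)

print(module_distance(interactome, {"A", "B"}, {"C", "D"}))  # 1.0 (adjacent)
print(module_distance(interactome, {"A"}, {"F"}))            # 4.0 (distant)
```

In the Menche et al framework, modules that sit close together on the interactome (small distance) are considered biologically related, which was shown to track with shared gene expression and symptomatology.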
There is a vast amount of knowledge contained in the published literature, far more than could be assimilated by a single investigator through traditional means. Further, research often congregates in silos of specialized knowledge, limiting the dissemination of concepts between disparate areas of investigation. Literature mining,* which originally developed from Swanson’s ABC model (i.e., if A is connected to B, and B is connected to C, there is a further implied connection between A and C87), now aims to extract information from highly diverse semantic contexts through the use of natural language processing algorithms, and has recently been adapted specifically for drug discovery and repurposing. For example, an analogy-based literature mining approach was able to successfully predict the in vitro activity of nearly one-third of a small molecule library against prostate cancer cells, with the added advantage of uncovering plausible mechanisms of action.88 Similarly, an automated reasoning algorithm connected drug-target information obtained through database and literature mining with cancer target information, showing a significant ability to recover known drug-cancer connections.89 Further, literature mining can be used as a means of adding information content to separate analytic pipelines. For example, Gramatica et al used a combination of literature mining and graph theory to construct a network-based model capable of identifying non-obvious connections between drugs and diseases based on topological parameters.90 User-friendly literature mining tools have been developed; for instance, PolySearch is a web-based text mining application allowing users to quickly identify connections between a variety of biological entities, including drugs and diseases, using information drawn from a variety of literature sources and curated databases.91
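Swanson's ABC model can be reduced to a simple transitive lookup over literature co-occurrences. The sketch below hard-codes a miniature co-occurrence table modeled on Swanson's original fish-oil/Raynaud's example; a real pipeline would build this table from millions of abstracts using natural language processing.

```python
# Schematic ABC-model inference: if term A co-occurs with B in the literature,
# and B co-occurs with C, hypothesize a hidden A-C link. The co-occurrence
# table is fabricated (echoing Swanson's fish oil -> Raynaud's discovery).

cooccur = {
    "fish_oil": {"blood_viscosity", "platelet_aggregation"},
    "blood_viscosity": {"raynaud_syndrome"},
    "platelet_aggregation": {"raynaud_syndrome"},
}

def abc_hypotheses(table, a):
    direct = table.get(a, set())        # B terms directly linked to A
    hidden = set()
    for b in direct:                    # C terms linked to B but not to A
        hidden |= table.get(b, set()) - direct - {a}
    return hidden

print(abc_hypotheses(cooccur, "fish_oil"))  # {'raynaud_syndrome'}
```

The inferred A-C link is only a hypothesis; as in Swanson's work, it must then be tested experimentally or clinically.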
The virtual screening of compounds based on molecular structure has advanced significantly over the past two decades. This approach seeks to predict likely interactions between drugs and target proteins, based on their respective structures. In silico structural screening can be valuable for both de novo drug design, by enabling promising compounds to undergo an initial selection process prior to experimental screening and validation, and for drug repurposing, being far more cost-effective than experimental HTS methods.92 Additionally, virtual screening can be conducted in either “forward docking” (screening a protein target against a library of compounds) or “reverse docking” (in which individual compounds are screened against a library of protein targets) formats, facilitating both disease- and drug-specific discovery.92,93 Several web-tools have been implemented to enable the prediction of connections between drugs and target proteins using forward/reverse docking (or the related concept of pharmacophore* mapping) for user-input source compounds.94–97 For instance, an online tool specifically designed for repurposing studies has used virtual docking to compare input compounds against a library of molecules with known indications and side-effects, helping to predict the potential uses and adverse effects for compounds of interest.98 An alternative approach, comparing compound structures against sets of ligands known to bind a variety of target proteins, has been used to predict several novel drug-protein connections.99 A related structural-similarity-based screening tool, TargetHunter, was implemented to enable the identification of multiple targets for a given compound of interest.100
Quantitative structure-activity relationship (QSAR) algorithms attempt to predict the therapeutic, toxic, and pharmacologic activities of compounds by inferring likely physicochemical properties from the compounds’ molecular structures.101 A user-friendly machine learning platform, AutoWeka, has been developed to aid in the implementation of QSAR studies.102 Other web-based tools are available for the prediction of pharmacodynamics/pharmacokinetic and toxicity profiles of compounds based on structural input data.103,104 Recently, several QSAR-based methods have been used in the discovery of novel agents for the treatment of IBD,105 uncovering compounds capable of inhibiting NF-κB106 and TNF-α converting enzyme,107 two molecules implicated in disease pathogenesis.
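The core QSAR idea of inferring activity from structure can be illustrated with a deliberately simplified similarity-weighted ("read-across") predictor. The fingerprints and activity values below are fabricated; production QSAR models use thousands of physicochemical descriptors and trained statistical models rather than three toy neighbors.

```python
# Simplified QSAR-style sketch: predict a compound's activity from its
# structural fingerprint by similarity-weighted averaging over known compounds.

def tanimoto(fp1, fp2):
    # standard similarity measure for binary structural fingerprints
    return len(fp1 & fp2) / len(fp1 | fp2)

# hypothetical training set: fingerprint bit-sets -> measured activity
train = [
    ({1, 4, 7, 9}, 0.9),
    ({1, 4, 8}, 0.7),
    ({2, 3, 5}, 0.1),
]

def predict(query_fp):
    weights = [(tanimoto(query_fp, fp), act) for fp, act in train]
    total = sum(w for w, _ in weights)
    return sum(w * act for w, act in weights) / total

# a query structurally close to the two active compounds scores high
print(round(predict({1, 4, 7}), 2))  # 0.82
```

The prediction is dominated by the structurally similar, active neighbors, which is precisely the structure-activity assumption that QSAR formalizes.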
With the enhanced ability to predict systems-level disease pathogenesis and drug effects, the rational design of combination therapies has become feasible. There are several potential advantages of drug combinations: better coverage of multiple disease mechanisms than with a single agent;108 dose reduction for a potentially toxic component of the combination while maintaining therapeutic efficacy;109 synergistic effects, including synthetic lethality in cancer110; and the prevention of innate and acquired drug resistance.111 Efficient, high-throughput identification of effective drug combinations is therefore an important component of a successful therapeutic discovery pipeline. To this end, a number of methods for computational prediction of synergistic compounds have been developed,112 based on side-effect profiles,108 chemical and pathway data,113 network analyses,114 and drug-induced gene expression patterns.109 Curated databases of reported drug combinations and other resources are also publicly available.115
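One widely used criterion for calling a combination synergistic is Bliss independence, sketched below. The effect values are illustrative, not drawn from any study.

```python
# Bliss independence sketch: under independent action, drugs with fractional
# effects e_a and e_b should jointly give e_a + e_b - e_a * e_b. An observed
# combined effect exceeding that expectation suggests synergy.

def bliss_expected(e_a, e_b):
    return e_a + e_b - e_a * e_b

e_a, e_b = 0.4, 0.5      # e.g., 40% and 50% growth inhibition as single agents
observed = 0.85          # hypothetical combined effect measured in the assay

expected = bliss_expected(e_a, e_b)   # 0.4 + 0.5 - 0.2 = 0.7
print(observed > expected)            # True -> synergistic by this criterion
```

Computational combination screens apply tests like this at scale, replacing the measured effects with model-predicted ones derived from side-effect, pathway, network, or expression data.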
Machine learning*, an automated form of data modeling and inference, has become a dominant tool in computational drug discovery, particularly in the implementation of QSAR studies.116 Machine learning algorithms are generally categorized as either supervised (i.e., guided by external information) or unsupervised (i.e., exploring inherent data structure), with each individual algorithm within these categories having specific advantages and disadvantages.117 Traditional machine learning algorithms, such as the support vector machine (SVM), have been successfully applied in drug development by integrating drug and protein structures, disease states, and drug toxicity for repurposing and sensitivity prediction.118 More recently, deep learning, which uses multi-layer artificial neural networks to extract meaning from data, has shown promise because of its robustness when working with complex, heterogeneous datasets.119 For example, a recent deep neural network analysis was able to categorize drugs into therapeutic categories using pathway-enriched transcriptional signatures, with improved predictive performance compared to SVM.120 Another study used deep learning to predict drug toxicity by relating molecular structure to the risk of drug-induced liver injury.121 The simultaneous use of multiple machine learning algorithms (a technique known as ensemble learning) has been shown to combine the advantages of individual algorithms while minimizing their weaknesses, although the improved performance of ensembles often comes at the expense of increased computing time and reduced interpretability.122 An ensemble learning approach recently proved effective in the prediction of drug sensitivity across a variety of human cancer cell lines.123
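The intuition behind ensemble learning can be made concrete with a toy majority-vote example: several weak base classifiers, each unreliable alone, are combined so that their individual errors partially cancel. The base learners below are trivial threshold rules on a single feature and are purely illustrative; production ensembles combine far richer models (e.g., SVMs, trees, neural networks).

```python
# Toy illustration of ensemble learning by majority vote.
# Base "classifiers" are simple threshold rules on one feature (hypothetical).

def make_threshold_classifier(threshold):
    # Predicts class 1 when the feature value exceeds the threshold
    return lambda x: 1 if x > threshold else 0

def ensemble_predict(classifiers, x):
    """Majority vote across base classifiers (ties resolved toward class 0)."""
    votes = sum(clf(x) for clf in classifiers)
    return 1 if votes > len(classifiers) / 2 else 0

# Hypothetical base learners with slightly different decision boundaries
ensemble = [make_threshold_classifier(t) for t in (0.4, 0.5, 0.6)]
```

Averaging over diverse boundaries smooths out any single learner's bias, which is the trade the text describes: better accuracy at the cost of computing time and interpretability.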
Although machine learning is showing great promise within drug discovery pipelines, the vast size and complexity of future big data could easily exceed the capability of currently available computational infrastructure. To address this challenge, several statistical measures that are less computationally expensive, such as the maximal information coefficient (MIC), have been developed to promote the efficient handling of big data.124 In parallel, computational infrastructure that can be rapidly scaled, such as the Hadoop file system,125 crowdsourcing,126 and massively parallel processing hardware (including the recruitment of graphical processing units127), is being actively explored and adopted.
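The divide-and-combine pattern underlying such scalable infrastructure can be sketched in a few lines: split a dataset into chunks, compute partial statistics on each chunk in parallel, and merge the partial results. The sketch below uses a local thread pool for simplicity; systems such as Hadoop/MapReduce apply the same pattern with chunks distributed across many machines.

```python
# Hedged sketch of divide-and-combine (map/reduce-style) aggregation:
# compute a mean by summing chunks in parallel workers, then merging.
from concurrent.futures import ThreadPoolExecutor

def chunked(data, n_chunks):
    """Split a list into roughly equal contiguous chunks."""
    size = (len(data) + n_chunks - 1) // n_chunks
    return [data[i:i + size] for i in range(0, len(data), size)]

def parallel_mean(data, n_workers=4):
    chunks = chunked(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Map step: each worker returns a (partial_sum, count) pair
        partials = list(pool.map(lambda c: (sum(c), len(c)), chunks))
    # Reduce step: merge partial results into the global statistic
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count
```

Because the per-chunk results are tiny relative to the chunks themselves, the same structure scales to datasets that no single node could hold in memory.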
There are several challenges involved in the integration of a big data-driven pipeline for biomarker and drug discovery within gastroenterology and hepatology. As previously mentioned, the statistical complexity involved in the analysis of large, heterogeneous datasets is a major stumbling block in the successful generation of data-driven discoveries.128 Adding layers of omics information may facilitate the identification of better molecular correlates, but potentially at the expense of the larger sample sizes required to achieve proper statistical power, which in turn depend on the strength of association between molecular dysregulation and the phenotype of interest. In addition, assessing significance can be challenging when analyzing very large numbers of features. Thus, it is critical to: perform proper sample size calculations to ensure sufficient statistical power; correct for multiple hypotheses when estimating statistical significance; and reduce the dimensionality of molecular datasets by filtering out less informative features, extracting representative information from multiple features (e.g., principal components), and/or prioritizing small subsets of molecular features for analysis based on prior biological knowledge. Most importantly, external replication of the findings is key, because it is not practically feasible to eliminate all false positive associations within a single high-dimensional dataset.
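The multiple-hypothesis correction mentioned above is often done by controlling the false discovery rate with the Benjamini-Hochberg procedure, sketched below on invented p-values; this is one standard choice among several correction methods, not the specific procedure of any cited study.

```python
# Hedged sketch of Benjamini-Hochberg FDR control: reject the hypotheses
# whose sorted p-values fall under the stepped threshold (rank/m) * alpha.
# The p-values below are invented for illustration.

def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0  # largest rank whose p-value clears its threshold
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank
    # All hypotheses up to rank k_max (in sorted order) are rejected
    return sorted(order[:k_max])

pvals = [0.001, 0.008, 0.039, 0.041, 0.30, 0.76]
significant = benjamini_hochberg(pvals, alpha=0.05)
```

With thousands of molecular features, a naive 0.05 cutoff would admit hundreds of false positives; the stepped threshold keeps the expected fraction of false discoveries at or below alpha.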
In addition, in order to fully realize the potential of big data in the clinical sphere, there is still a need for more and better data. Many diseases lack sufficient molecular characterization, and existing datasets are only infrequently linked to specific clinical features and outcomes. Currently available patient-derived omics data are heavily biased towards easily accessible specimens such as blood and surgically resected tissues. In contrast, for example, there are very few genomic datasets for advanced hepatocellular carcinoma (HCC) because tissue acquisition is not recommended as part of routine clinical practice. Additionally, the uneven quality of publicly available data can make valid interpretation difficult. A number of quality control measures have been devised for omics experiments, but integrating data from multiple experiments and/or different technological platforms and experimental conditions remains an ongoing problem. Technical or clinical variation between individual experiments (so-called “batch effects”) can obscure or spuriously mimic the biological changes being sought through integrative analysis.64 It is therefore critical that publicly available data include detailed technical information on all factors that might contribute to experimental and clinical variation. On the other hand, given the growing body of big data automatically collected from sources such as wearable devices and electronic health records, perfect data curation and quality control is an unrealistic expectation; instead, novel methods that are less sensitive to data heterogeneity will need to be developed. Finally, it should be emphasized that while big data-driven approaches promise to accelerate the discovery of new therapies and diagnostics, all computational predictions must still be thoroughly validated in experimental and clinical settings prior to general use.
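The simplest form of batch-effect adjustment, per-batch mean-centering of a feature, can be sketched as follows. The expression values and batch labels are invented; real integrative analyses use richer methods (e.g., empirical-Bayes approaches such as ComBat) that also model batch-specific variance and preserve known biological covariates.

```python
# Hedged sketch of naive batch-effect adjustment: subtract each batch's mean
# so systematic shifts between experiments do not masquerade as biology.
# Values and batch labels are hypothetical.

def center_by_batch(values, batches):
    """Subtract each batch's mean from the values belonging to that batch."""
    grouped = {}
    for v, b in zip(values, batches):
        grouped.setdefault(b, []).append(v)
    means = {b: sum(vs) / len(vs) for b, vs in grouped.items()}
    return [v - means[b] for v, b in zip(values, batches)]

# Two batches measuring the same feature, with an obvious +10 shift in batch B
expr  = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
batch = ["A", "A", "A", "B",  "B",  "B"]
adjusted = center_by_batch(expr, batch)
```

A key caveat, and the reason detailed technical metadata matter: if batch is confounded with the biology of interest (e.g., all cases in one batch, all controls in another), centering removes the signal along with the artifact.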
We are moving towards big data-based healthcare, including data-driven methodologies to accelerate the discovery of new diagnostics and drugs. To maximize the benefit of these big data-based approaches in gastroenterology and hepatology, it will be essential for clinical researchers to systematically collect specimens and clinical information in order to create centralized, comprehensive repositories of mineable data to address unmet needs. Routine collection of omics data, such as whole genome or exome sequences, may become an option once it is proven to be cost-effective. This will require not only the incorporation of omics technologies into the clinical toolkit, but also the creation of medical information systems to regularly collect, curate, and analyze the data, and to deliver results and interpretation to the clinic. Regulatory mechanisms for patient privacy protection that do not unduly hamper the conduct of big data-based research will be another critical requirement for the realization of precision medicine. Such plans are already evolving using secure, open formats, and investigators conducting clinical trials should become familiar with these resources.129 At the same time, a new breed of scientists and clinicians must emerge who are facile with big data approaches and can translate these data into novel biomarkers and drugs that prevent disease or improve outcomes for patients with gastrointestinal and liver illnesses.
Grant support: This study was funded by National Institutes of Health grants 5T32DK007792, R01DK099558, and R01DK56621; the FLAGS Foundation; the Nuovo-Soldati Cancer Research Foundation; an advanced training grant from Geneva University Hospital; the Irma T. Hirschl Trust; and European Union grant ERC-AdG-2014 HEPCIR.
Author contributions: All authors participated in the composition and editing of this work.
Disclosures: The authors have no relevant conflicts.
Author names in bold designate shared co-first authorship.