With the development of novel assay technologies, biomedical experiments and analyses have gone through substantial evolution. Today, a typical experiment can simultaneously measure hundreds to thousands of individual features (e.g. genes) in dozens of biological conditions, resulting in gigabytes of data that need to be processed and analyzed. Because of the multiple steps involved in the data generation and analysis and the lack of details provided, it can be difficult for independent researchers to try to reproduce a published study. With the recent outrage following the halt of a cancer clinical trial due to the lack of reproducibility of the published study, researchers are now facing heavy pressure to ensure that their results are reproducible. Despite the global demand, too many published studies remain non-reproducible mainly due to the lack of availability of experimental protocol, data and/or computer code. Scientific discovery is an iterative process, where a published study generates new knowledge and data, resulting in new follow-up studies or clinical trials based on these results. As such, it is important for the results of a study to be quickly confirmed or discarded to avoid wasting time and money on novel projects. The availability of high-quality, reproducible data will also lead to more powerful analyses (or meta-analyses) where multiple data sets are combined to generate new knowledge. In this article, we review some of the recent developments regarding biomedical reproducibility and comparability and discuss some of the areas where the overall field could be improved.
Over the past two decades, the biomedical field has been transformed by the advent of new high-throughput technologies such as gene expression microarrays, protein arrays, flow cytometry and next-generation sequencing. Experiments and protocols have become increasingly complex, involving the use of instruments that can be very sensitive to specific settings. For example, small changes in the photomultiplier tube voltage of a flow cytometer or a microarray scanner could drastically change the output of an experiment. It is thus crucial that protocols be well described, standardized and shared in order for an experiment to be reproducible and comparable within and between laboratories.
Furthermore, these novel biomedical technologies generate large high-dimensional data sets from individual experiments. The growth of such data has highlighted the importance of implementing data management and analysis plans as an integral part of experimental design. As a consequence, data analysis procedures contribute significantly to the reproducibility or non-reproducibility of an experiment or publication. Unfortunately, as of today, too many published studies remain irreproducible because the data, computer code or software required to reproduce the study results are not shared. This lack of reproducibility has had significant impact, leading to the halt of a cancer clinical trial when key gene expression signatures used for decision making were found to be caused by analysis errors and could not be independently reproduced by researchers. Had the data and computer code been made available, the results of the study could have been invalidated more rapidly, which could have saved funding, avoided giving patients false hope and, most importantly, ensured that patients received effective treatment. Fortunately, over the past decade, computers, software tools and online resources have drastically improved to the point that it is easier than ever to share data and code and to construct fully reproducible data analysis pipelines.
In this article, we review some of the fundamental issues involved in the comparability and reproducibility (C&R) of biomedical data going from assay standardization to reproducible data analysis. Our intent is not to exhaustively review all possible problems with all existing assays, but rather to select a few concrete examples based on our own experience and present some thoughts and solutions toward the overall concept of C&R. This article is divided into two main sections, one related to the experiment reproducibility and one to the analysis reproducibility, though the two topics significantly overlap.
We examine a prototypical biomedical data generation process to illustrate factors that may negatively impact the C&R of the data throughout different stages of the process. As shown in Figure 1, a data generation process can be roughly broken down into three core stages (Steps 1–3) of information transformation from signals contained in biological samples to numeric values captured in data sets for analysis. In Step 1, biological samples are measured and raw instrument data are generated. There are several factors that may influence the C&R of data at this stage. These include some obvious factors such as the specific type of technologies (e.g. hybridization-based or sequence-based gene expression) [4–7] or platforms (e.g. Affymetrix, Illumina or Operon) [8–12], the Standard Operating Procedures (SOPs) for biological sample preparation, experimental design, experiment layout and measurement [13, 14], as well as other conditions that are often not specified in the experiment protocol. For example, the level of experience or expertise of the technicians performing the experiment [15, 16], or the origin of the reagents (e.g. batch effects [17, 18]) are also possible sources for differences between independent experimental results. Therefore, in Step 1, to increase the C&R of data, all these factors should be thought out and optimally controlled and standardized whenever possible. When factors such as technicians or reagent batches may not be standardizable across multiple studies or laboratories, a measuring system comprised of a specific platform using a specific technology should strive to minimize variations caused by these factors and increase robustness against changes in these factors. Whenever possible, the SOPs should be shared and made available to the community. Several online platforms are now available for storing and sharing such information including ‘elabprotocols’ (elabprotocols.com) and ‘figshare’ (figshare.com). 
In Step 2, raw information from an instrument is calibrated and quantified into numeric values. This step often involves image analyses for information alignment and/or dimension reduction. Consequently, the specific algorithms used to make such transformations, their implementation in software and the specific data storage structures, including data formats (e.g. databases or flat files) and variable naming conventions, are vital to maintaining data consistency and should be standardized and recorded to a maximal level for effective C&R of the data. We will refer to the data derived from this step as primary data versus the secondary data generated after Step 3. In some specific cases, primary data are derived directly from the instrument, but in many cases the extremely large size of the raw data (e.g. raw images) makes it prohibitive to share these and the lack of true raw data is accepted. In the ‘Standards and Data Sharing’ section, we provide more discussion on data standards and data sharing. Finally, in Step 3, data from Step 2 are further (pre-) processed before study objective-driven analyses are conducted. This latter step often involves further data alignment such as background adjustment or data aggregation such as per-biomarker summarization from multiple subset measurements. Certain quality assurance and control processing may also occur to remove unreliable data and reduce any systematic variations between data points. As in Step 2, the specification and implementation of the algorithms and the data storage structures should be tracked in the effort to maintain the C&R of the data. In the ‘Reproducibility of Assay Results and Derived Data’ section, we will discuss some of the tools available to share Step 2 data and associated computer code for data processing and analysis.
We use accuracy and precision as two building-block metrics to illustrate the concept of C&R. While the exact definition of C&R may vary depending on the context, accuracy and precision are two well-defined statistical concepts. Specifically, accuracy indicates how close a measurement is to its true (actual) value, whereas precision indicates how close measurements are to each other. Deviation from accuracy (i.e. bias) is often introduced by systematic sources of error. For example, factors mentioned earlier such as the measuring system or a poor reagent may be a primary source of bias that cannot be removed by repeating or averaging large numbers of measurements. On the other hand, precision (i.e. variability) of data can generally be improved by increasing the number of measurements. For this reason, biological and technical replicates are recommended in an experimental design to help distinguish biological variation from technical variation. In general, there is a trade-off between accuracy and precision, in the sense that one cannot optimize both simultaneously. For example, in microarray image analysis, spots can either be summarized by the estimated foreground intensity or the background-corrected intensity (foreground minus the background). Foreground intensities are typically less variable but can exhibit higher bias compared with background-corrected intensity. In this context, many research groups have proposed pre-processing techniques that aim at finding a good compromise between the two. A hypothetical example is shown in Figure 2, where comparable and reproducible data do not necessarily require unbiased measurements as long as they are ‘consistently inaccurate’ (Panel C). Imagine a hypothetical gene expression device that always measures the expression of a gene as being zero. The experiment is highly reproducible but completely biased and thus useless.
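The distinction between bias and variability can be illustrated with a small simulation. The sketch below is purely illustrative: the true value, the size of the systematic offset and the noise level are assumptions chosen for the example, and the function names are hypothetical. It draws replicate measurements and reports the estimated bias and the standard error of the mean for increasing numbers of replicates.

```python
import random
import statistics

def simulate_measurements(true_value, bias, noise_sd, n, rng):
    """Draw n replicate measurements with a fixed systematic bias and random noise."""
    return [true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]

def summarize(measurements, true_value):
    """Return (estimated bias, standard error of the mean) for a set of replicates."""
    mean = statistics.fmean(measurements)
    sem = statistics.stdev(measurements) / len(measurements) ** 0.5
    return mean - true_value, sem

rng = random.Random(42)   # fixed seed so the simulation can be rerun exactly
TRUE_VALUE = 10.0         # assumed true value (hypothetical)

for n in (5, 50, 5000):
    data = simulate_measurements(TRUE_VALUE, bias=2.0, noise_sd=1.0, n=n, rng=rng)
    est_bias, sem = summarize(data, TRUE_VALUE)
    print(f"n={n:5d}  estimated bias={est_bias:+.3f}  standard error={sem:.3f}")
```

Under these assumed settings, the standard error should shrink as the number of replicates grows, whereas the estimated bias should settle near the simulated offset rather than near zero: replication improves precision but does not remove a systematic error.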
It is not atypical for an experimentalist to compute a coefficient of correlation between two series of experiments and to be very pleased when he/she obtains a value close to 1. Unfortunately, the large correlation could be explained by the fact that the measurements are biased and both are correlated with the same experimental artifact. So it is important that when C&R is evaluated, accuracy is also taken into consideration. Therefore, to ensure meaningful integrative analysis of biomedical data from multiple sources, although there may be issues of reliability, we encourage the inclusion of a well-established ‘gold standard’ of measurement whenever possible such as the inclusion of ‘established’ positive and negative controls that provide reasonable upper limits on the sensitivity and specificity of the experimental measurements. In this way, any signals identified from comparable and reproducible data can also be scrutinized against the gold standard for true scientific values.
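A toy calculation shows how a shared artifact can produce a correlation close to 1 between two runs that both track the true signal poorly. All numbers below are invented for illustration and the variable names are hypothetical; the ‘truth’ plays the role of a gold standard.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical gold-standard values for six features.
truth = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
# A feature-specific artifact shared by both runs and unrelated to the truth.
artifact = [5.0, -4.0, 6.0, -5.0, 4.0, -6.0]
# Small run-specific noise.
noise_a = [0.1, -0.1, 0.0, 0.1, -0.1, 0.0]
noise_b = [-0.1, 0.0, 0.1, 0.0, 0.1, -0.1]

run_a = [t + f + e for t, f, e in zip(truth, artifact, noise_a)]
run_b = [t + f + e for t, f, e in zip(truth, artifact, noise_b)]

print("run A vs run B:", round(pearson(run_a, run_b), 3))
print("run A vs truth:", round(pearson(run_a, truth), 3))
```

In this constructed example the two runs agree almost perfectly with each other, yet neither agrees with the gold standard, which is why reproducibility between runs alone is not evidence of accuracy.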
In the presence of possible experiment-specific bias, data pre-processing methods can be used to improve C&R. It is common practice to reduce non-biological sources of variation via pre-processing techniques such as background correction, batch effect removal or normalization. Many of these methods were established during the early days of microarrays at a time when experimental procedures were still being optimized and technical variability was omnipresent. Such methods include lowess normalization, quantile normalization, ComBat, SVA and RUV-2 for batch effect removal and gcRMA for removing non-specific binding of oligonucleotides, to cite a few. Due to the positive impact these methods have had on C&R, many other fields have adopted similar pre-processing techniques, e.g. flow cytometry and next-generation sequencing. Most of these methods rely on the assumption that the majority of biomarkers (genes or proteins) are not differentially expressed and that the numbers of up- and down-regulated biomarkers are roughly equal across samples. Such an assumption can be reasonable when the number of biomarkers collected in each sample is large but may not be satisfied in lower-dimensional biomedical data. In the latter case, internal or external validation data are usually used to correct for experimental bias that may be related to measurement, instrument or sampling design. When there is no standard for a quantity’s true value and validation data are infeasible to generate, calibration methods based on paired samples can be adopted to adjust for experimental bias. For example, in the field of flow cytometry, true gold standards do not exist yet and it is thus difficult to evaluate C&R. The FlowCAP group (flowcap.flowsite.org) is currently working with the Human Immunology Project group to derive objective criteria and gold standards that will be used to standardize and evaluate pre-processing of flow cytometry data.
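To make the idea behind one of these methods concrete, the following is a minimal sketch of quantile normalization written from its general description, not the implementation used by any particular package. It forces every sample to share the same empirical distribution, which is only sensible under the assumption stated above that most biomarkers do not change between samples. The function name and the toy data are hypothetical, and ties are simply broken by order of appearance, a simplification.

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of samples (one list of feature values per sample).

    Each value is replaced by the mean, across samples, of the values that
    share its rank, so all returned samples have identical sorted values.
    """
    n_features = len(samples[0])
    if any(len(s) != n_features for s in samples):
        raise ValueError("all samples must have the same number of features")

    # Sort each sample, then average across samples at each rank.
    sorted_samples = [sorted(s) for s in samples]
    rank_means = [sum(col) / len(samples) for col in zip(*sorted_samples)]

    normalized = []
    for s in samples:
        # order[r] is the index of the feature with rank r in this sample.
        order = sorted(range(n_features), key=lambda i: s[i])
        out = [0.0] * n_features
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized

# Three hypothetical samples measured on four features.
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 3.0, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
for row in quantile_normalize(arrays):
    print([round(v, 3) for v in row])
```

After normalization, the within-sample ranking of features is preserved while sample-wide shifts in scale or location are removed; if a large fraction of features truly differed between samples, that biological signal would be distorted as well.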
As data sets get richer with more data points, more variables and more metadata, it is important to define standards that can be used to capture and distribute all necessary information toward achieving reproducibility. Several standards have been proposed for biomedical data to achieve these goals including MIAME for gene expression, MINSEQE for sequencing experiments, MIATA for T cell assays and MiFlowCyt for flow cytometry. In addition to assay protocol information, primary and secondary data, it is important that any pre-processing done to the data be fully described (e.g. normalization for microarrays). Unfortunately, too many assays are still lacking data standards (e.g. bead array multiplex assays) or, if data standards are available, manufacturers and/or software companies have been slow to adopt them. For example, despite the availability of data standards for defining preprocessing for flow cytometry, no analysis software has yet fully adopted this format and it is very difficult to share reproducible analyses across software platforms. We, the flow informatics community, basically had to reverse engineer commercial software file formats and write custom open-source software that can read these.
Funding agencies have been very supportive of the creation and adoption of standards for biomedical data, funding many of the existing standards. For example, as part of the Human Immunology Project Consortium (HIPC), a project funded by the NIH, we and other bioinformaticians are currently working toward the definition of novel standards for immunological data. Similarly, the Collaboration for AIDS Vaccine Discovery (CAVD), funded by the Bill and Melinda Gates Foundation (BMGF), has set up an immune monitoring consortium to establish validated T-cell and antibody immunological assays across a network of Good Clinical Laboratory Practices-certified laboratories that could monitor the anticipated pipeline of HIV vaccine trials emanating from the field. Once data and data formats have been standardized, it is important to make these data publicly available for the benefit of science, and to this end, funding agencies have an important role to play. Most funding agencies including the National Science Foundation and the NIH clearly encourage investigators to share data and/or have defined policies to this end. Similarly, charitable organizations such as the BMGF and the Wellcome Trust are also actively working with grantees to maximize the amount of data available to the research community. Example projects that we are personally involved in, and that have good data sharing policies and have set up databases for sharing data, are the HIV Vaccine Trials Network (HVTN), HIPC and the CAVD. In addition to helping retrieve data more efficiently (e.g. via queries), databases can help minimize human errors in data manipulation by ensuring that raw and processed data along with metadata are automatically uploaded with minimal manual intervention. Databases can also help maintain data consistency by checking that some standards are followed or by doing basic data quality checks.
For example, the Immunological Portal database (ImmPort.org) provides data templates that help investigators upload their data in a standardized format. It is thus a good idea to use specialized databases whenever possible to store and share data. Despite this global effort, many policies are still either too vague or not properly enforced and data are treated as the private property of investigators who aim to maximize their publication record at the expense of the widest possible use of the data. This situation threatens to limit both the progress of the related research and its application for public health benefit. We feel that it is important for funding agencies to set stricter and clearer data sharing policies, particularly for sensitive data (e.g. individual genomes and clinical data) where policies are often vague or industrial partnerships make the creation of such policies very difficult. In these cases, despite their sensitive nature, these data could and should be shared as long as they are properly de-identified to protect the patients’ identity under the Health Insurance Portability and Accountability Act.
Once data and all necessary information are made available, these data need to be appropriately cited when the study and its results are published. To this end, it is crucial that journals set data sharing policies or guidelines and that authors follow these guidelines. Unfortunately, as mentioned in a recent study, too few journals have clear policies for data deposition and even fewer make it mandatory for publication. That study found that even when data deposition is a requirement, the majority of authors did not fully follow the instructions. For example, it is common for researchers to share processed data only, which makes it nearly impossible to reproduce the results or use different analysis tools that require primary data. In the field of genomics, for instance, many researchers share processed sequence file formats (e.g. wiggle files), which prevents anyone from analyzing the data with an algorithm that requires primary data (e.g. raw or aligned reads).
Here, we discuss some of the tools available to researchers to perform reproducible analysis and share processed data, computer code and final results as detailed in the following subsections and summarized in Table 2. Analysis of data issued from high-throughput experiments can be extremely complex, involving multiple steps from data formatting and pre-processing to statistical inference. Thus, it is important that all steps be recorded for full reproducibility as shown in Figure 1 and Table 1 (Steps 2–5). This can be difficult to do with a point-and-click software interface, where there is no easy way to save intermediate results. Moreover, the ‘manual’ analysis of a high-throughput data set typically requires the use of multiple software tools and is very time consuming. In addition, it is not clear how robust the conclusions of a study are to small perturbations in any of these analysis steps. As such, it is valuable to be able to quickly redo an analysis after tuning some parameters, something that is not practical within a point-and-click environment.
In recent years, several open-source, community-based projects have emerged that enable researchers to construct and share complete and fully reproducible data analysis pipelines. The Bioconductor project, based on the R statistical language, provides >500 software packages for the analysis of a wide range of biomedical data, from gene expression microarrays to flow cytometry and next-generation sequencing. These packages can be combined via scripts written in the R language to form complex data analysis pipelines, connect to data repositories and generate high-quality graphics. The resulting R scripts can then be used to record and later reproduce the analysis (along with all input parameters). Because all steps of the analysis are automated when the script is executed, it is easy to assess the robustness of the results when tuning some parameters. Other similar projects with perhaps more focused capabilities include BioPython and BioPerl, which are based on the Python and Perl languages, respectively (to our knowledge, neither BioPython nor BioPerl has tools for the analysis of flow cytometry data).
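The benefit of a scripted analysis does not depend on the language used. As a rough, language-agnostic illustration (written in Python rather than R, and not using any Bioconductor, BioPython or BioPerl function), the sketch below runs a toy two-step analysis, stores the parameters and a checksum of the input alongside the result, and reruns the analysis over several cutoffs to see how stable the result is. The pipeline, its parameter names and the data are all hypothetical.

```python
import hashlib
import json
import statistics

def run_pipeline(intensities, params):
    """Toy analysis: subtract a background level, then flag features far above the median."""
    corrected = [max(x - params["background"], 0.0) for x in intensities]
    center = statistics.median(corrected)
    return [i for i, x in enumerate(corrected) if x - center > params["cutoff"]]

def run_and_record(intensities, params):
    """Run the pipeline and keep the parameters and an input checksum with the result."""
    payload = json.dumps(intensities).encode("utf-8")
    return {
        "params": dict(params),
        "input_sha256": hashlib.sha256(payload).hexdigest(),
        "hits": run_pipeline(intensities, params),
    }

# Hypothetical raw intensities for seven features.
data = [1.2, 1.0, 5.8, 1.1, 0.9, 4.1, 1.3]

# Rerun the whole analysis for several cutoffs to assess robustness.
for cutoff in (1.0, 2.0, 3.5):
    record = run_and_record(data, {"background": 0.5, "cutoff": cutoff})
    print(json.dumps(record, sort_keys=True))
```

Because each record carries its own parameters and input checksum, a reader can confirm which settings and which data produced a given list of hits, and can see directly which findings survive a change of cutoff.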
Even though several graphical user interfaces (e.g. RStudio for R) are available for writing computer scripts based on R/Bioconductor (or BioPerl, BioPython), the learning curve can still be steep for novice users. More user-friendly tools are now available to construct reproducible data analysis pipelines using combinations of available modules that are for the most part wrappers of packages written in R, Perl or Python (or some other language). For example, a popular platform for gene expression analysis, GenePattern, versions every pipeline and its methods, ensuring that each version of a pipeline (and its results) remains static. A more recent project, GenomeSpace (genomespace.org), funded by the National Human Genome Research Institute, can now combine GenePattern with other popular bioinformatics tools including Galaxy, Cytoscape and the UCSC genome browser. As such, users can perform all of their analysis using a single platform. In the clinical and immunological field, LabKey Server is a popular web-based tool for storing immunological data (via a database) and building complex analysis pipelines that can be shared with other users. LabKey Server also versions every pipeline for full reproducibility. LabKey Server is currently being used by large research networks including the CAVD, the HVTN and the Immune Tolerance Network, to name a few.
In the same fashion that experimental protocols need to be published in order for an experiment to be reproduced, computer code, software and data should also be published along with the results of a data analysis. Ideally, software would be open source and computer code would be well packaged and standardized to facilitate exchange and usability. Both Bioconductor and GenePattern, mentioned earlier, provide facilities for users to package and share code with other users. Bioconductor is based on the R packaging system, which is highly standardized and has been a driving force behind the wide adoption of both R and Bioconductor. Bioconductor goes even further by: (i) ensuring that all submitted packages are peer-reviewed and (ii) providing version control repositories and build systems where source code is maintained and versioned and binaries are automatically built for all computer operating systems. Among other things, the peer-review process ensures that packages follow some basic guidelines, are well documented, work as advertised and are useful to the community. The open-source and versioning system provides full access to algorithms and their implementation, which is crucial to obtain full reproducibility. For users who want to version and share software code outside of the Bioconductor (or similar) project, there are many free web-based hosting services to store, version and share code (and even data). One of our favorite platforms is GitHub, which the company markets as ‘Social Coding for all’. GitHub makes it easy for anyone to store and version control computer code, packages, documents, webpages and even wikis to document their code. The social aspect of GitHub makes it easy for users to work in teams on a common project, software or manuscript. GitHub is free for all open-source projects.
Unfortunately, very few journals have code/software sharing policies and even fewer have requirements that the code/software be open access. For example, BMC Bioinformatics only has policies for software articles and even for these the source code is not required, only an executable. PLoS One requires authors of manuscripts in which software is the central part of the paper to release software and make code open source for submission. Although this policy is clearer, it is still up to the editor/reviewers to decide whether software was a central part of the paper. In a day and age where most experiments generate large amounts of data, software is always going to play a central role, so why not make this policy universal for all submissions involving data analysis? Fortunately, based on our own experience, we feel that reviewers are pushing in the right direction by asking that code be open source and released along with the paper. So even if journals have no clear policies yet, we, the community, can enforce that code be released every time we review a paper.
In addition to ensuring reproducibility of assay data and results, it is always a good idea to try to validate the results of a study using an independent platform or data set. This is particularly relevant for studies involving large data sets that can generate long lists of novel findings such as a list of differentially expressed genes from a microarray experiment or a list of transcription factor binding sites from a ChIP-Seq (chromatin immunoprecipitation followed by sequencing) experiment. In the context of gene expression or ChIP-Seq, quantitative polymerase chain reaction (qPCR) can be used to validate some of the genes or sites [45, 46]. Note that such experimental assays (including qPCR) are also subject to variation, which can affect the validation. If direct experimental validation is not feasible, computational validation can be used instead. For example, the list of differentially expressed genes (or biomarkers) can be tested using an independent data set that was generated by a different group. In the context of ChIP-Seq, de novo motif finding tools have been used to validate binding sites that contain the expected motifs.
The lack of validation partially explains why very few published biomarkers have clinical utility. In addition, when it comes to statistical inference, robustness in model building and stability in feature selection under sampling variation may also contribute greatly to the reproducibility of analysis results. Intensive research has recently been dedicated to this area. For example, data mining or high-dimensional data analysis methods that incorporate resampling techniques, e.g. bagging or boosting, often provide more stable and hence more reproducible results. Similarly, predictions based on consensus of multiple analysis results are generally more robust and perform better than any single method.
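The effect of sampling variation on feature selection can be examined with a simple bootstrap, the resampling step that underlies bagging. The sketch below is an illustration under invented assumptions, not a method from any of the cited studies: it simulates two groups in which only the first feature truly differs, repeatedly resamples each group with replacement, selects the top-ranked features each time and reports how often each feature is selected. The function names, the simulated effect size and the ranking statistic (absolute difference in group means) are all hypothetical choices.

```python
import random
import statistics
from collections import Counter

def top_k_features(group_a, group_b, k):
    """Rank features by absolute difference in group means; return the top k indices."""
    n_features = len(group_a[0])
    scores = []
    for j in range(n_features):
        mean_a = statistics.fmean([row[j] for row in group_a])
        mean_b = statistics.fmean([row[j] for row in group_b])
        scores.append((abs(mean_a - mean_b), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

def selection_frequency(group_a, group_b, k, n_boot, rng):
    """Fraction of bootstrap resamples in which each feature is among the top k."""
    counts = Counter()
    for _ in range(n_boot):
        boot_a = rng.choices(group_a, k=len(group_a))  # resample with replacement
        boot_b = rng.choices(group_b, k=len(group_b))
        counts.update(top_k_features(boot_a, boot_b, k))
    return {j: counts[j] / n_boot for j in range(len(group_a[0]))}

def simulate_group(n_samples, shifts, rng):
    """Simulate samples whose features are standard normal plus a per-feature shift."""
    return [[shift + rng.gauss(0.0, 1.0) for shift in shifts] for _ in range(n_samples)]

rng = random.Random(1)
group_a = simulate_group(12, [0.0, 0.0, 0.0, 0.0, 0.0], rng)
group_b = simulate_group(12, [3.0, 0.0, 0.0, 0.0, 0.0], rng)  # only feature 0 differs

freq = selection_frequency(group_a, group_b, k=2, n_boot=200, rng=rng)
for feature, f in sorted(freq.items()):
    print(f"feature {feature}: selected in {f:.0%} of resamples")
```

A feature that is selected in nearly every resample is a more credible candidate for follow-up validation than one whose selection depends on which samples happened to be drawn.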
Several tools have been proposed to automatically incorporate reproducible data analysis pipelines or computer code into documents. An example is the GenePattern Word plugin that can be used to embed analysis pipelines in a document and rerun them on any GenePattern server from the Word application. Another example that is popular among statisticians and bioinformaticians is the Sweave literate language, which allows one to create dynamic reports by embedding R code in LaTeX documents. This is our preferred approach because it is open source and does not depend on proprietary software. As an example, every Bioconductor package is required to have fully reproducible documentation (called a vignette) written in the Sweave language. Recent software development tools such as RStudio (rstudio.org) and knitr (yihui.name/knitr) have made working with Sweave even more accessible, which should reduce the learning curve for most users. In fact, this article was written using the Sweave language and processed using RStudio and the source file (along with all versions of it) is available from GitHub (http://github.com/raphg/BiB-review-CR). Ideally, all material including the Sweave source file, computer code and data, which Gentleman and Temple Lang refer to as a ‘compendium’, would be made available along with the final version of the manuscript and be open access, allowing anyone to reproduce the results or identify potential problems in the analysis. An obvious option would be to package code, data and the Sweave source file into an R package for ease of distribution as is commonly done for Bioconductor data packages. Anyone could directly install this package in R and have access to all necessary materials. Journals that promote this openness should further improve their impact versus non-open journals by giving more credibility to the published results, in the same fashion that open access journals typically have greater impact factors.
Unfortunately, very few journals are currently pushing for full reproducibility and even fewer have clear reproducibility policies. An example of a journal moving in the right direction is Biostatistics. Biostatistics now has a reproducibility guideline and is working with authors toward making sure that published results are reproducible given that data and code are provided. When data and code are provided and results can be reproduced by the associate editor, the article is marked with an R for reproducible.
We have reviewed some of the key steps involved in the C&R of biomedical data going from protocols to code and data sharing. For ease of reference, Tables 1 and 2 summarize some of the ideas discussed including available resources and a checklist for a comparable and reproducible scientific discovery. Even though experiments, protocols and data analyses have become more complex than ever before, tools and methods for C&R have also significantly improved. Unfortunately, we are still far from the ideal situation where every study can be reproduced and relevant data be compared and pooled across laboratories or institutions. Besides experiment and protocol consistency, there is still a lot of work to be done in terms of data and analysis standardization that would not only improve reproducibility but also facilitate data exchange and meta-analyses. Perhaps one way to achieve this is for experimental and computational groups to work together when developing novel assays, standards and analysis tools. This is something that is integral to the CAVD and HIPC projects mentioned previously. For example, both the CAVD and HIPC have bioinformatics, biostatistics and assay subcommittees that work together to optimize and standardize novel assays and analysis tools.
In terms of data, code and software sharing, we cannot yet rely on goodwill and self discipline when it comes to sharing publication material and making studies fully reproducible. As such, we feel that today, the most important step forward toward improving C&R is for funding agencies, publishers and researchers to work together by setting very strict reproducibility guidelines and policies. Such policies could potentially save a great deal of money and resources by making sure that scientific errors can quickly be discovered and corrected instead of giving birth to new scientific projects and clinical trials based on erroneous results. Of course, no one should be afraid of making their publication material available because someone might identify a flaw in the study. As Alexander Pope said, ‘To err is human, to forgive is divine’; we all learn by our mistakes and this is the only way science can move forward.
This work was supported by Bill and Melinda Gates Foundation [OPP1032317] and National Institutes of Health [U01 AI068635-01 and U19 AI089986-01].
Yunda Huang specializes in the design and analysis of pre-clinical and clinical vaccine studies. She is currently Senior Staff Scientist at the Fred Hutchinson Cancer Research Center.
Raphael Gottardo specializes in the development of statistical methods and software tools for the analysis of high-throughput and high-dimensional biological assays. He is currently an associate member at the Fred Hutchinson Cancer Research Center and an affiliate associate professor at the University of Washington.