Aggregate analysis of gene expression microarrays (Lipshutz et al, 1995; Schena et al, 1995) across multiple studies is lending an unprecedented molecular view of the broad spectrum of human disease (Alizadeh et al, 2000; Golub et al, 1999). Ramaswamy et al (2003) were among the first to show how a taxonomy of cancers could be created after building a reference collection of gene expression profiles for multiple types of cancers. This approach was extended to find common changes in gene expression across publicly available cancer microarray experiments (Rhodes et al, 2004). Segal et al (2004) integrated 1975 microarrays, representing 22 tumor types, to uncover a ‘module map’ of gene modules with conditional expression patterns across tumor types. Despite these successes, the considerable variation inherent to microarray data greatly confounds efforts to integrate data across multiple experiments.
There have been a number of efforts to characterize and mitigate potentially confounding, non-biological sources of variance in microarray data. In 2006, the Microarray Quality Control Consortium (MAQC) showed that measurements are technically reproducible across test sites and manufacturers (Shi et al, 2006). Lab-to-lab variation has been shown to impart a significant effect on microarray measurements (Irizarry et al, 2005); however, a number of robust methods to handle such variation have been developed (Breitling et al, 2004; Choi et al, 2007; Huttenhower et al, 2006; Pihur et al, 2008; Zilliox and Irizarry, 2007). Although these efforts lend credence to the technical equivalence of microarray data across experiments, the biological equivalence of microarray data across experiments is not well characterized.
Recent studies suggest that gene expression measurements can be combined to gain new biological insights that are relevant beyond their original experimental context. Bild et al (2006) built a collection of genome-wide changes in breast cancer cell lines in response to the overexpression of several oncogenes, then used these to probe public microarray measurements of other types of cancers. Similarly, Lamb et al (2006) built a larger collection of responses in human breast cancer cell lines toward 164 different small molecules, then used these to probe previously unexplainable gene expression changes in completely different tissues and diseases, finding agonists with responses equivalent to a diet-induced obesity model in rat fat cells. These studies suggest that the signature of a disease is robust irrespective of the tissue in which it was studied; however, the generalization of this phenomenon across all of human disease has not been established. Fully evaluating such a hypothesis requires a sufficiently large and diverse collection of microarray data for human diseases.
Public microarray data repositories have emerged as enabling resources for the integrative genomic study of human disease (Rhodes and Chinnaiyan, 2005). Coincident with their successful use, and because many journals require the public availability of such data (Anonymous, 2002), the amount of microarray data in international repositories is now growing exponentially (Parkinson et al, 2009). The largest among these is the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) (Wheeler et al, 2006). As of this writing, GEO holds information on >300 000 samples (i.e. microarrays) from >12 000 experiments, and doubles in size each year. Enabled by the vast repertoire of GEO experiments studying numerous human diseases (e.g. diabetes) across a broad diversity of tissue types (e.g. muscle and fat), we can pose an important question in integrative biology: is there a general disease concordance across public microarray experiments irrespective of platform and tissue? In this study, we carried out a systematic evaluation of disease-associated experiments in GEO to evaluate the robustness of the disease signal across tissues and experiments.
To ensure our findings were robust and unbiased towards any specific choice of analytic methodology, we designed a computational ‘pipeline’ using 84 combinations of normalization, probe-level integration, and significance testing methods (Box 1). We find that there is a general concordance between disease states across tissues, irrespective of other confounding sources of biological or technical variation inherent in the data. Furthermore, we find that this disease concordance is more prominent than other potentially concordant biological factors, such as tissue type. Our results raise several important implications for the downstream translational research value of public microarray data in building systematic models of disease pathogenesis, prognosis, and treatment.
Schematic diagram for the pipeline used to evaluate disease concordance across public microarray experiments.
The full complement of 429 disease-associated microarray experiments was repeatedly evaluated by the pipeline under all 84 possible combinations of pipeline parameters to comprehensively evaluate the robustness of disease signatures.
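As a rough illustration of what evaluating every combination of pipeline parameters involves, the following minimal sketch enumerates a full Cartesian grid of method choices. The text states only that there are 84 combinations spanning normalization, probe-level integration, and significance testing; the 7 × 4 × 3 split and the placeholder labels (`norm_1`, `probe_1`, `test_1`, and so on) are assumptions chosen purely so that the grid has 84 entries, and they are not the methods actually used in the study.

```python
from itertools import product

# ASSUMPTION: the source gives only the total (84 combinations); the
# 7 x 4 x 3 split and all labels below are hypothetical placeholders.
normalizations = [f"norm_{i}" for i in range(1, 8)]        # 7 assumed options
probe_integrations = [f"probe_{i}" for i in range(1, 5)]   # 4 assumed options
significance_tests = [f"test_{i}" for i in range(1, 4)]    # 3 assumed options


def enumerate_pipeline_configs(norms, probes, tests):
    """Return one dict per combination of the three pipeline stages."""
    return [
        {"normalization": n, "probe_integration": p, "significance_test": t}
        for n, p, t in product(norms, probes, tests)
    ]


configs = enumerate_pipeline_configs(
    normalizations, probe_integrations, significance_tests
)

# Each configuration would then be applied to every experiment in turn.
print(len(configs))
```

Running each experiment through every entry of such a grid is what makes it possible to check that a finding does not hinge on any single analytic choice.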