Recent advances in high-throughput genomic and proteomic analysis are generating enormous amounts of quantitative biological data. While there are distinct advantages to comprehensively quantifying biological variables such as mRNA expression level or genomic copy number across many genes and genomic loci, the scale of the data presents statistical challenges. The ratio of the number of measurements to the number of samples can become very large, which can result in spurious statistical observations. Many groups have addressed this ‘multiple comparisons problem’. Resampling and permutation based approaches, for example, can help to establish meaningful significance cutoffs (Westfall and Young 1993; Tusher et al 2001; Segal et al 2003; Storey and Tibshirani 2003). Frequently, however, signals in such data sets are too subtle to yield statistics that pass rigorous significance cutoffs.
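To make the permutation idea concrete, the following is a minimal sketch of a maximum-statistic permutation adjustment for a two-group comparison. It is written in the spirit of resampling-based approaches and is not the exact procedure of any of the works cited above. The function names, the choice of the absolute difference in group means as the test statistic, and the default of 1000 permutations are all assumptions made for illustration.

```python
import random
from statistics import mean


def mean_difference(values, labels):
    """Absolute difference in group means for one variable (labels are 0 or 1)."""
    group_a = [v for v, g in zip(values, labels) if g == 0]
    group_b = [v for v, g in zip(values, labels) if g == 1]
    return abs(mean(group_a) - mean(group_b))


def max_stat_adjusted_pvalues(data, labels, n_permutations=1000, seed=0):
    """Permutation-adjusted p-values, one per variable.

    data   : list of variables, each a list of per-sample values
    labels : list of 0/1 group labels, one per sample
    (Hypothetical helper; the statistic and defaults are illustrative.)
    """
    rng = random.Random(seed)
    observed = [mean_difference(row, labels) for row in data]
    exceed_counts = [0] * len(data)
    shuffled = list(labels)
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        # Largest statistic over ALL variables under this relabelling;
        # comparing against it accounts for the number of variables tested.
        max_stat = max(mean_difference(row, shuffled) for row in data)
        for i, obs in enumerate(observed):
            if max_stat >= obs:
                exceed_counts[i] += 1
    # Add-one correction keeps every p-value strictly positive.
    return [(c + 1) / (n_permutations + 1) for c in exceed_counts]
```

Because each observed statistic is compared with the largest statistic found anywhere in the permuted data, the cutoff becomes more stringent as variables are added, which is one way to see why subtle signals can fail to reach significance in very high-dimensional data.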
One particularly straightforward method to reduce dimensionality is to select only those variables that meet criteria that are orthogonal to the property being investigated. For the purposes of this discussion, we define a ‘data type’ as a category of qualitative or quantitative information gathered from a biological sample. In this study, the data types include mRNA expression level measurements, genomic copy number measurements, and patient survival data associated with ovarian tumors. A ‘variable’ is defined as a single measurement of a multivariate data type. Examples of variables include mRNA measurements for individual genes and genomic copy number measurements for individual genomic regions.
Biological annotations can provide such a means of variable selection. By restricting a data set to only those variables whose annotations meet certain criteria, the dimensionality of the data set may be reduced so that the multiple comparisons burden no longer dominates the analysis. This concept, properly generalized, also supports integrated analysis across data types and across data sets. Given a series of tumor samples of known outcome, with experimental data comprising both DNA copy number and mRNA expression measurements, natural questions tend to span data types or require data annotation. Does genomic copy number directly account for some of the variation in gene expression across samples? Are the genes that map to loci that are frequently found to be of aberrant copy number more likely to show an association with outcome than other genes? Are genes that have functional annotations for processes involved in cancer (for example, adhesion, apoptosis, invasion, ...) more likely to be associated with tumor aggressiveness?
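As a small illustration of annotation-based variable selection, the sketch below restricts a set of variables to those carrying at least one annotation term of interest and shows how a per-test significance threshold relaxes as a result. The gene names and annotation terms are hypothetical placeholders, and the Bonferroni correction is used here only as a simple, familiar example of a multiple comparisons adjustment; it is an assumption of this sketch, not a description of the corrections implemented in Magellan.

```python
def select_by_annotation(annotations, required_terms):
    """Return the variables whose annotation set shares a term with required_terms.

    annotations    : dict mapping a variable name to a set of annotation terms
    required_terms : iterable of terms chosen independently of the outcome data
    """
    required = set(required_terms)
    return sorted(name for name, terms in annotations.items()
                  if required & set(terms))


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold for n_tests comparisons."""
    return alpha / n_tests


# Hypothetical annotations for four placeholder genes.
annotations = {
    "GENE_A": {"apoptosis"},
    "GENE_B": {"metabolism"},
    "GENE_C": {"adhesion", "invasion"},
    "GENE_D": {"transport"},
}

selected = select_by_annotation(annotations, {"adhesion", "apoptosis", "invasion"})
threshold_all = bonferroni_threshold(0.05, len(annotations))   # all four genes tested
threshold_selected = bonferroni_threshold(0.05, len(selected)) # annotated subset only
```

The selection uses only the annotations and never the outcome data, which is what keeps the criteria orthogonal to the property being investigated.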
Questions such as these, which combine aspects of biology with aspects of statistics, tend to be very difficult for biological researchers themselves to answer during exploratory data analysis. Conversely, expert data analysts who have the skills required to answer such questions do not generally have the background to ask the biologically motivated questions that might be stimulated by exploratory analysis. Since interpretation and exploration of data benefit from the background knowledge of the biologist, we have implemented a web-based system called Magellan that supports integrated analysis of complex, heterogeneous data sets. Magellan’s target audience is not experts in data analysis, but experimentalists who wish to explore their own data using established analytical methods while incorporating biologically meaningful annotations.
Here we present the Magellan system and its use in the analysis of combined comparative genomic hybridization (CGH), mRNA expression, and clinical data from 20 ovarian tumor samples (10 short survivors and 10 long survivors). Magellan supports univariate data analysis with corrections for multiple comparisons; this analysis identified one gene (spermidine/spermine N1-acetyltransferase) whose mRNA expression pattern correlated significantly with patient survival. In the CGH data, we observed no single genomic locus whose gain or loss correlated significantly with patient survival, but a simple classification approach correctly predicted patient survival in 85% of cases under cross-validation. By making use of biological annotations, including genomic mapping locations and functional gene annotations, more complex observations could be quantified: 1) DNA copy number and gene expression level are correlated genome-wide, on average; 2) genomic loci that were enriched for genes whose variation in expression correlated with survival exhibited copy number variation that correlated better with survival than did loci overall; and 3) genes that mapped to genomic loci that were frequently altered in copy number showed variation that correlated better with survival than did genes overall.
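The first of these observations depends on using genomic mapping annotations to pair each gene with the locus it maps to. A minimal sketch of one way such a genome-wide average could be computed is given below: a Pearson correlation across samples is calculated for every mapped gene and the results are averaged. The data layout, the function names, and the choice of Pearson correlation are assumptions made for illustration and may differ from the analysis actually performed in Magellan.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None  # a constant profile has no defined correlation
    return cov / sqrt(var_x * var_y)


def mean_copy_expression_correlation(expression, copy_number, gene_to_locus):
    """Average per-gene correlation between expression and copy number.

    expression    : dict gene  -> list of per-sample expression values
    copy_number   : dict locus -> list of per-sample copy number values
    gene_to_locus : dict gene  -> locus (the mapping annotation)
    Samples must be in the same order in both data types.
    """
    correlations = []
    for gene, locus in gene_to_locus.items():
        if gene in expression and locus in copy_number:
            r = pearson(expression[gene], copy_number[locus])
            if r is not None:
                correlations.append(r)
    return mean(correlations) if correlations else None
```

An average clearly above zero would be consistent with copy number accounting for part of the variation in expression; whether it differs significantly from zero would still need to be assessed, for example by permuting the sample order in one of the two data types.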
Magellan is part of the National Cancer Bioinformatics Grid (CaBIG) Project (http://cabig.nci.nih.gov/). Source code and documentation are available by email request. This system supports the use of open source analytical tools such that new methods of analysis can be rapidly prototyped and deployed. As part of the CaBIG project, instances of Magellan will be deployed at multiple institutions in support of cancer research across a wide community of experimentalists.