We describe here the creation of a new web-accessible gene-gene correlation resource, and demonstrate the power and utility of a large collection of gene expression microarray data for functional gene discovery and for prioritizing genes for mutation analysis within linkage regions.
We first expanded a list of known cartilage-selective genes in mouse and human. In this process, 7 of the 10 genes tested, all poorly annotated, were validated as showing cartilage-selective expression. Two novel genes tested, SDK2 and FLJ41170, are very selectively expressed in fetal cartilage. This example illustrates the power of a large dataset, as rare tissues (e.g. cartilage or fetal cartilage) and many novel genes (e.g. FLJ41170) are not present in Genesapiens. In addition, despite the relatively small number of human fetal cartilage gene arrays in the public domain (<14 vs. 10,000 arrays), UGET is remarkably sensitive, even for genes expressed in few tissues. Similarly, scientists can use UGET iteratively to identify genes with similar expression profiles across a variety of patterns, and thereby to identify genes that may be involved in specific biological processes.
Through a simple application of UGET, we correctly and retrospectively identified the known causal genes within linkage regions for several unrelated disease phenotypes in the majority of cases. This simple application of the tool is in itself a powerful strategy because many more linkage regions have been reported than causal genes, yet many disease genes have already been identified for similar phenotypes. Thus, we can effectively leverage current knowledge to prioritize genes within the many known linkage intervals. Our small number of examples indicates that, if at least a few known disease genes are available to create a profile of interest, the highest or second-highest correlated gene will be the mutant gene at up to 80% of the loci. The approach is also applicable to more heterogeneous traits such as neuropsychiatric disorders: using an autism-related expression module, 50% of previously identified genes within candidate intervals ranked as the highest or second-highest correlated gene. A subset of genetically heterogeneous disorders, however, shows no strong correlation of expression among the multiple genes causing a specific trait (for example, Hirschsprung's Disease). We note that the general process of gene prioritization applies to the entire genome and may also point to strategies for rare Mendelian disorders in the absence of linkage knowledge. Likewise, the genetic causes of highly complex disorders such as autism or schizophrenia are certain to be numerous, but as true disease genes are identified in these disorders and others, gene-gene correlation analyses should be applied to prioritize additional genes in the absence of linkage or genome-wide association signals. We expect the tool to be generally useful in a wide variety of human disease areas and to expedite the gene discovery process. While other gene expression prioritization tools and other prioritization approaches (e.g. interactome-based and literature-based methods) are also successful, the data suggest that UGET is a robust tool, especially when the genes and biological processes are better defined in rare or more complex datasets. In this regard, UGET takes advantage of the vast Celsius database, which provides additional insight not possible using smaller, more discrete, but defined annotated datasets.
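The prioritization strategy discussed above can be illustrated with a minimal sketch. It assumes, for illustration only, that each candidate gene in a linkage interval is scored by its mean Pearson correlation to a set of known disease ("seed") genes and that candidates are then sorted by that score. The scoring rule, the function names, and the toy gene names and expression values are all hypothetical and are not taken from UGET.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / (norm_x * norm_y)


def rank_candidates(expression, seed_genes, candidates):
    """Sort candidates by mean correlation to the seed genes (assumed score)."""
    scores = {}
    for gene in candidates:
        rs = [pearson(expression[gene], expression[s]) for s in seed_genes]
        scores[gene] = sum(rs) / len(rs)
    return sorted(candidates, key=lambda g: scores[g], reverse=True)


# Toy expression profiles across five arrays; all names are hypothetical.
expression = {
    "SEED_A": [1.0, 2.0, 8.0, 9.0, 1.5],
    "SEED_B": [0.5, 1.5, 7.0, 9.5, 1.0],
    "CAND_1": [5.0, 5.1, 4.9, 5.0, 5.2],
    "CAND_2": [1.2, 2.2, 7.5, 8.8, 1.1],
    "CAND_3": [9.0, 8.0, 1.0, 0.5, 9.5],
}
ranking = rank_candidates(
    expression, ["SEED_A", "SEED_B"], ["CAND_1", "CAND_2", "CAND_3"]
)
print(ranking)
```

In this toy example CAND_2, whose profile tracks both seed genes, ranks first, and the anti-correlated CAND_3 ranks last. A real analysis would draw the profiles from thousands of arrays rather than five.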
The tool is web accessible and easily searchable by scientists seeking to identify genome-wide gene-gene correlations. Data are returned as HTML lists for rapid perusal or as tab-delimited text files available for download. The power and versatility of this resource initially surprised us, and both will grow as microarray data accumulate. One strength of this approach is that the entire pipeline of methods used to assemble the correlation matrix is completely metadata independent: only the genomic alignment of probe sequences and the quantitative measurements made by the microarray were used. This results in a very heterogeneous dataset that reflects the diverse set of human experiments ongoing in the community. It is composed of microarray data generated in thousands of individual experiments by hundreds of individual scientists, with each experiment using different biological materials and different hybridization conditions and protocols. We conclude that the volume of data assembled here is sufficient, and that systematic biases arising from the site of origin of the data or from differences in the method of data generation do not appear to mask true gene-gene correlations.
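To make the metadata-independent idea concrete, the following minimal sketch builds a gene-gene correlation matrix from nothing but a table of quantitative measurements, with no sample labels or annotation. It assumes Pearson correlation computed from standardized profiles, which is an illustrative choice and not a statement of how the Celsius or UGET pipeline is implemented; the function names, gene labels, values, and the tab-delimited layout are likewise hypothetical.

```python
from math import sqrt


def standardize(values):
    """Rescale a profile to zero mean and unit (population) variance."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    if sd == 0:
        return [0.0] * n  # flat profile: no usable signal
    return [(v - mean) / sd for v in values]


def correlation_matrix(profiles):
    """Pearson correlation for every gene pair, using measurements only."""
    z = {gene: standardize(values) for gene, values in profiles.items()}
    genes = sorted(z)
    n = len(z[genes[0]])
    return {
        a: {b: sum(p * q for p, q in zip(z[a], z[b])) / n for b in genes}
        for a in genes
    }


def top_correlated_tsv(matrix, query, k):
    """Return the k genes most correlated with `query` as tab-delimited text."""
    rows = sorted(
        ((r, gene) for gene, r in matrix[query].items() if gene != query),
        reverse=True,
    )[:k]
    return "\n".join(f"{gene}\t{r:.3f}" for r, gene in rows)


# Columns are unlabeled arrays: no tissue, lab, or protocol metadata is used.
profiles = {
    "G1": [1.0, 2.0, 3.0, 4.0],
    "G2": [2.0, 4.0, 6.0, 8.5],
    "G3": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(profiles)
tsv = top_correlated_tsv(matrix, "G1", 2)
print(tsv)
```

Because the input is only a table of numbers, arrays from any experiment can be appended as extra columns without first reconciling their annotation.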
A variety of analyses presented here have established that rank ordering of genes within a linkage interval using UGET is a successful approach in many cases. Generally, UGET is highly successful in ranking candidates when the disease gene has an expression pattern specific to the biological process being studied. Conversely, genes involved in multiple biological processes (or genes not involved in the disease-relevant biological process in normal individuals) may not rank as highly. As an example, AHI1 was ranked the fourth best candidate out of 89 genes when UGET was used to retrospectively identify known Joubert Syndrome (JBST) genes. AHI1 is expressed in both central nervous system tissue and primitive hematopoietic cells, which diluted the strength of its co-expression correlation with other JBST genes. This demonstrates a key limitation in applying the guilt-by-association approach to disease gene identification: false negative results will occur in such cases. This restricts UGET (and other methods such as Endeavour and SUSPECTS) to providing a specific type of biological insight. The same limitation also leads to false positives when genes within a candidate interval are specifically expressed in the biological process of interest but do not contribute to disease. Despite these limitations, in the majority of retrospective cases presented here (including AHI1), UGET analysis rank ordered candidate genes such that sequentially sequencing the most highly correlated genes would identify the known disease gene far more efficiently than sequencing all genes within the candidate interval. Thus, the limited scope of biological insight provided by UGET is nonetheless highly informative for human disease gene discovery.
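The efficiency argument for the AHI1 example can be made explicit with a short calculation. The sketch below assumes that the interval contains exactly one causal gene and that sequencing stops as soon as it is found; under those assumptions, sequencing in rank order requires as many genes as the rank of the causal gene, whereas sequencing in a uniformly random order requires (n + 1) / 2 genes on average. The function names are hypothetical; only the figures 4 and 89 come from the text.

```python
def expected_genes_sequenced_random(n_genes):
    """Mean number of genes sequenced, in random order, until the single
    causal gene is reached (the causal gene itself is counted)."""
    return (n_genes + 1) / 2


def genes_sequenced_by_rank(rank):
    """Genes sequenced when candidates are processed in rank order."""
    return rank


n_genes, ahi1_rank = 89, 4  # values reported for the Joubert Syndrome example

ranked_effort = genes_sequenced_by_rank(ahi1_rank)
random_effort = expected_genes_sequenced_random(n_genes)

print(ranked_effort)                  # 4 genes when following the ranking
print(random_effort)                  # 45.0 genes expected in random order
print(random_effort / ranked_effort)  # 11.25-fold fewer than random order
print(n_genes / ranked_effort)        # 22.25-fold fewer than the full interval
```

Even a causal gene that is not ranked first, as with AHI1, can therefore reduce the sequencing effort by roughly an order of magnitude under these assumptions.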
While we demonstrate the utility of this tool particularly for human gene identification, the correlation matrices have been constructed for the entire Celsius dataset and are available for search in 14 different species across 41 different array designs. We note that the scale available through this resource is unprecedented and is the result of ignoring differences in annotation approaches among scientists. In general, the genomics community has placed high value on annotation information, such that publication policies typically require metadata to be deposited concomitantly with assay measurements. However, annotation that truly represents the vast diversity of experiments performed is a daunting task and has not yet been implemented. While annotation is useful, and in some cases necessary for supervised analysis, the practice of insisting on detailed metadata when raw microarray data are made available may actually limit the amount of raw data deposited. For instance, scientists have no strong motivation to provide annotation information on experiments that do not become part of the published experiment, and thus these data will be excluded from repositories. In contrast to the annotation-centric efforts of microarray repositories, the Celsius database can import CEL data without annotation data, and the work shown here demonstrates the enormous potential of growing these data further. We recommend the deposition of all CEL files into public repositories or directly into Celsius to expand our ability to detect gene-gene correlations.