Although it has been proposed that tissue-specific gene expression profiling may facilitate disease gene identification59,60 and although gene expression data sets exist for many tissue and cell types, the application of these resources to gene discovery, particularly in the context of disease, has been limited.61 This is primarily because such data sets are large and the route to efficient selection and prioritization of candidate genes is not straightforward, especially in the context of normal development and in the absence of clear control versus mutant gene expression comparisons. Several gene expression atlases based on in situ hybridization provide insight into developmental gene expression,62 but such information is typically nonquantitative and does not permit facile comparison of tissue-specific gene expression levels. In this work, we developed a strategy to subject tissue-specific microarray data sets to in silico subtraction, in which a tissue-specific data set is compared with a WB reference data set, allowing the systematic ranking of genes based on their tissue enrichment. Even with high-throughput sequencing, mutations that lie outside the coding regions may be difficult to identify. We demonstrate that this filter provides a highly effective way to identify candidate genes associated with the development of specific tissues for which gene expression profiles can be readily obtained.
The development of iSyTE was based on two basic hypotheses. The first is that genes that are highly expressed at critical stages of murine embryonic development in a specific organ are likely associated with mutations in human genes that are linked to an organ-specific birth defect. The second is that in silico subtraction of gene expression profiles for whole embryonic body from those for equivalently staged specific, microdissected embryonic tissue can effectively remove nonspecific but highly expressed genes, thereby revealing tissue-specific genes. Using lens and tooth as examples, we show that this relatively straightforward experimental and computational approach can effectively facilitate the identification of human disease–associated genes.
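The in silico subtraction described above can be illustrated with a minimal sketch. It assumes that enrichment is scored as a log2 ratio of tissue signal to WB signal with a small pseudocount; this scoring choice, the function names, and the gene names and intensity values are hypothetical and are not taken from the iSyTE implementation.

```python
import math


def enrichment_scores(tissue, whole_body, pseudocount=1.0):
    """Score each gene present in both data sets by log2(tissue / whole body).

    The pseudocount avoids division by zero and damps ratios at low signal.
    This scoring scheme is an assumption made for illustration only.
    """
    shared = tissue.keys() & whole_body.keys()
    return {
        gene: math.log2((tissue[gene] + pseudocount) / (whole_body[gene] + pseudocount))
        for gene in shared
    }


def rank_by_enrichment(tissue, whole_body, pseudocount=1.0):
    """Return gene names ordered from most to least tissue-enriched."""
    scores = enrichment_scores(tissue, whole_body, pseudocount)
    # Sort by descending score; break ties alphabetically for a stable order.
    return sorted(scores, key=lambda gene: (-scores[gene], gene))


# Made-up signal intensities (not real measurements).
lens = {"geneA": 4095.0, "geneB": 4095.0, "geneC": 63.0}
whole_body = {"geneA": 63.0, "geneB": 4095.0, "geneC": 63.0}

scores = enrichment_scores(lens, whole_body)
ranked = rank_by_enrichment(lens, whole_body)
print(ranked)
```

In this toy example, geneB is as highly expressed in the lens as geneA, but because it is equally abundant in the WB reference, the subtraction removes it from the top of the ranking. This is the behavior the second hypothesis relies on.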
As with any gene prediction tool, there is a false-negative rate associated with a given prediction, and it is important to consider the potential sources of false negatives when interpreting results from iSyTE. Our retrospective analysis of 24 known cataract genes indicates that approximately 10% of the genes do not have high lens expression or enriched expression as measured in the current microarray data, suggesting a false-negative rate of approximately 10%. This could result from the following factors: the sensitivity of the microarray probes for these genes may be poor; the expression of these genes may be restricted to a different developmental stage than those analyzed; or the lens-specific expression of the causative gene may be masked by neighboring genes within the candidate interval that have higher levels of lens-specific expression but are noncausative.
Indeed, such examples are evident in our present data analysis. In 3 of 24 cases (FYCO1, GCNT2, CHMP4B), iSyTE did not rank the correct gene within the top two candidates in the interval. On further analysis, in the case of FYCO1 (ranked 21/191), the mapped interval was large (12.21 Mb) and contained 191 candidate genes, several of which exhibited significantly higher lens-enriched expression than FYCO1. For GCNT2 (ranked 7/21 within a 5.26-Mb interval), we found very low expression of this gene in the microarrays, indicative of either suboptimal probe binding or genuinely low expression at the lens stages analyzed. CHMP4B (ranked 34/43 in a 3.03-Mb mapped interval) is significantly expressed in the lens (signal detection P < 0.002), but it is also significantly expressed in the WB control. As a result, it does not have a high lens-enrichment rank and is therefore not correctly identified by iSyTE as a likely candidate gene.
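The interval-based ranking discussed in these cases can be sketched as follows. The example assumes that each gene carries genomic coordinates and a precomputed enrichment score, and that candidates are simply those genes overlapping the mapped interval; the `Gene` record, the function names, and all coordinates and scores are hypothetical and serve only to illustrate how a gene can fall outside the top candidates.

```python
from typing import NamedTuple, Optional


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int         # first base of the gene (bp)
    end: int           # last base of the gene (bp)
    enrichment: float  # tissue-versus-WB enrichment score (hypothetical)


def candidates_in_interval(genes, chrom, start, end):
    """Return genes overlapping the interval, most enriched first."""
    hits = [
        g for g in genes
        if g.chrom == chrom and g.start <= end and g.end >= start
    ]
    # Descending enrichment; ties broken by name for a reproducible order.
    return sorted(hits, key=lambda g: (-g.enrichment, g.name))


def rank_of(name, ranked) -> Optional[int]:
    """Return the 1-based rank of a gene in the ranked list, or None if absent."""
    for position, gene in enumerate(ranked, start=1):
        if gene.name == name:
            return position
    return None


# Made-up genes; geneA stands in for a causative gene with little enrichment.
genes = [
    Gene("geneA", "chr1", 100, 200, 0.2),
    Gene("geneB", "chr1", 300, 400, 5.0),
    Gene("geneC", "chr1", 500, 600, 2.5),
    Gene("geneD", "chr2", 100, 200, 9.0),   # other chromosome, excluded
    Gene("geneE", "chr1", 900, 1000, 7.0),  # outside the interval, excluded
]

ranked = candidates_in_interval(genes, "chr1", 150, 650)
print([g.name for g in ranked], rank_of("geneA", ranked))
```

Here geneA ranks last of three candidates because its enrichment score is low, even though it lies within the interval. This mirrors the situation in which a gene expressed in both lens and WB, or one detected only weakly, is outranked by noncausative neighbors.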
In some cases, iSyTE does not predict any promising candidate genes based on lens enrichment (e.g., in the mapped human cataract intervals on 2q33 and 17p24). In yet another case (20p11.23-p12.1), iSyTE predicted BFSP1 from 29 candidates in the interval. However, in this interval, BFSP1 has been sequenced and found to harbor no exonic or exon junction mutation, suggesting that the mutation resides in a regulatory region or in another gene. Therefore, in all cases, further experimental validation through mutational sequence analysis will be necessary, in addition to the in silico predictions made by iSyTE.
Other genomewide in silico analyses have recently been applied to the interpretation of candidate SNPs in genomewide association studies (GWAS).63 For example, Ernst et al.64 showed that cell-type specific histone modification patterns can identify regulatory regions and that knowledge of the location of these regulatory regions and their associated genes can aid in the interpretation of GWAS by providing potential regulatory mechanisms for each candidate SNP. Similarly, Ozkul et al.65 have devised a strategy based on ChIP-seq data for the transcription factor CRX to rank candidate genes within mapped intervals for retinitis pigmentosa (RP). Combined with exome sequencing, this approach successfully identified a novel mutation in the gene MAK, which is associated with RP. In the work reported here, we demonstrate a cost-effective strategy to prioritize mutations for human disease gene identification. Because embryonic dissections can be readily performed in many research laboratories and because microarray analysis is increasingly affordable, the iSyTE approach should be applicable to other organ- and tissue-specific diseases, as demonstrated by our tooth germ analysis.
In conclusion, we describe a novel strategy for identifying disease-associated genes that is supported by a publicly available Web resource called iSyTE. We recently used a preliminary version of iSyTE to help identify two human genes associated with cataract, TDRD7 and PVRL3. Because there are likely many other candidate cataract-associated genes that have not yet been identified, this Web-based resource should provide a useful tool for the ocular genetics community. Besides serving to identify lens-specific disease genes, future versions of iSyTE that include expression data sets for other ocular components should further help identify additional genes that influence the development and biology of the eye.