|Home | About | Journals | Submit | Contact Us | Français|
Conceived and designed the experiments: UA RMP PP FDC. Analyzed the data: UA RMP. Contributed reagents/materials/analysis tools: EG CD LS MO. Wrote the paper: PP FDC.
Even in the post-genomic era, the identification of candidate genes within loci associated with human genetic diseases is a very demanding task, because the critical region may typically contain hundreds of positional candidates. Since genes implicated in similar phenotypes tend to share very similar expression profiles, high throughput gene expression data may represent a very important resource to identify the best candidates for sequencing. However, so far, gene coexpression has not been used very successfully to prioritize positional candidates.
We show that it is possible to reliably identify disease-relevant relationships among genes from massive microarray datasets by concentrating only on genes sharing similar expression profiles in both human and mouse. Moreover, we show systematically that the integration of human-mouse conserved coexpression with a phenotype similarity map allows the efficient identification of disease genes in large genomic regions. Finally, using this approach on 850 OMIM loci characterized by an unknown molecular basis, we propose high-probability candidates for 81 genetic diseases.
Our results demonstrate that conserved coexpression, even at the human-mouse phylogenetic distance, represents a very strong criterion to predict disease-relevant relationships among human genes.
One of the most limiting aspects of biological research in the post-genomic era is the capability to integrate massive datasets on gene structure and function for producing useful biological knowledge. In this report we have applied an integrative approach to address the problem of identifying likely candidate genes within loci associated with human genetic diseases. Despite the recent progress in sequencing technologies, approaching this problem from an experimental perspective still represents a very demanding task, because the critical region may typically contain hundreds of positional candidates. We found that by concentrating only on genes sharing similar expression profiles in both human and mouse, massive microarray datasets can be used to reliably identify disease-relevant relationships among genes. Moreover, we found that integrating the coexpression criterion with systematic phenome analysis allows efficient identification of disease genes in large genomic regions. Using this approach on 850 OMIM loci characterized by unknown molecular basis, we propose high-probability candidates for 81 genetic diseases.
In the last two decades, positional cloning has been remarkably successful in the identification of genes involved in human disorders. More recently, our ability to map genetic disease loci has strikingly increased due to the availability of the entire genome sequence. Nevertheless, once a disease locus has been mapped, the identification of the mutation responsible for the phenotype still represents a very demanding task, because the mapped region may typically contain hundreds of candidates . Accordingly, many phenotypes mapped on the genome by linkage analysis are not yet associated to any validated disease gene (850 OMIM entries for phenotypes with unknown molecular basis had at least one associated disease locus on July 2nd, 2007). Therefore, the definition of strategies that can pinpoint the most likely targets to be sequenced in patients is of critical importance . Many different strategies have been proposed to prioritize genes located in critical map intervals. Some of the methods so far developed rely on the observation that disease genes tend to share common global properties, which can be deduced directly by absolute and comparative sequence analysis . However, most of the available prioritization strategies are based on the widely accepted idea that genes and proteins of living organisms deploy their functions as part of sophisticated functional modules, based on a complex series of physical, metabolic and regulatory interactions ,. Although this principle has been extensively used even in the pre-genome era to identify the critical players of many different biological phenomena, the present availability of genome-scale information on gene function, protein-protein interactions and gene expression in different experimental models allows unprecedented opportunities for approaching the prioritization problem with greater efficiency.
In theory, the use of functional gene annotations would represent the most straightforward approach for candidate prioritization. However, although this strategy may be very useful in selected cases ,, at the present stage it has clear limitations, either because it overlooks non-annotated genes , or because it is not evident how the annotated functions of the candidates relate to the disease phenotype. Therefore, computational methods less biased toward already consolidated knowledge, may have strong advantages .
In particular, protein-protein interaction maps and gene coexpression data from microarray experiments represent extremely rich sources of potentially relevant information.
Recently, the direct integration of a very heterogeneous human interactome with a text mining-based map of phenotype similarity has allowed the prediction of high confidence candidates within large disease-associated loci .
Although this approach is highly efficient, it is clearly not exhaustive because very close functional relationships between genes and proteins are possible in the absence of direct molecular binding. In addition, the protein-protein interaction space is currently under-sampled and many genuine biological interactions have not yet been identified in experiments. Conversely, high-throughput experiments are known to result in a large fraction of false positives. The consistently low overlap of protein-protein interactions between large-scale experiments, even when the same proteins are considered, is testament to these problems . Finally, many of the known protein-protein interactions have been ascertained through low-throughput experiments and are thus strongly biased towards better-studied proteins .
Since genes involved in the same functions tend to show very similar expression profiles, coexpression analysis could be a very powerful approach for inferring functional relationships, which may correlate with similar disease phenotypes. Accordingly, global analyses have shown that genes highly coexpressed across microarray experiments display very similar functional annotation . However, with notable exceptions , so far the coexpression criterion has not been employed very successfully for the prediction of genetic disease candidates and has been used to this purpose only in combination with other independent evidence ,.
The noisy nature of high-throughput gene expression datasets may represent one of the possible explanations for this shortcoming. Moreover, even when the coexpression of two genes is reproducibly observed under a high number of experimental conditions, this does not necessarily imply that the genes are functionally related. For instance, extensive meta analysis of microarray data across different species has revealed that neighboring genes are more likely to be coexpressed than genes encoded in distant genomic regions, even if they are not functionally related in any obvious manner ,.
Phylogenetic conservation has been previously proposed as a very strong criterion to identify functionally relevant coexpression links among genes ,. Indeed, significant coexpression of two or more orthologous genes is very likely due to selective advantage, strongly suggesting a functional relation. Therefore, conserved coexpression could be a much stronger criterion than single species coexpression to relate genes involved in similar disease phenotypes.
In this report we show that conserved coexpression and phenome analysis can be effectively integrated to produce accurate predictions of human disease genes. Using this approach we were able to select a small number of strong candidates for 81 human diseases, corresponding to a wide spectrum of different phenotypes.
Figure 1 schematically illustrates our approach.
We have generated two human-mouse conserved coexpression networks (CCN), based on cDNA and oligonucleotide microarray platforms, respectively.
In the first case, the starting data were log-transformed values of human and mouse ratiometric experiments, downloaded from the Stanford Microarray Database (SMD)  (4129 experiments for 102296 EST probes for human and 467 experiments for 80595 EST probes for mouse). The resulting network will be referred to as ‘Stanford’ in the following (Text S1).
In the second case (‘Affy’, Text S2), the network was based on previously described series of normal tissue Affymetrix microarray experiments from human  (353 experiments corresponding to 65 different tissues for 46241 probe-sets associated to a known gene) and mouse  (122 experiments corresponding to 61 tissues for 19692 probe-sets). In the human case, the Affymetrix experiments corresponding to the same tissue were averaged to compensate for the different number of replicates available for the various tissues.
In both cases, we used the same procedure to generate a final CCN. In particular, we first generated single species gene coexpression networks (SCN) and then integrated them on the basis of human-mouse orthology, as detailed below.
SCNs were generated by first calculating the Pearson correlation coefficients of every row in the expression matrix (cDNA probe or Affymetrix probe-set) with all other rows. A directed edge was established from row r1 to row r2 if r2 fell within the top 1% rows in terms of correlation with r1. The threshold was first chosen on the basis of a previous study, showing that such 1% interval is most significantly enriched in terms of functionally relevant coexpression . Moreover, we confirmed that using a more stringent 0.5% threshold results in strongly reduced sensitivity (data not shown).
These directed networks where then converted into undirected SCNs by mapping the rows to the corresponding Entrez Gene identifiers : an edge is established between two Entrez gene IDs G1 and G2 if there is at least one edge from a row assigned to G1 to a row assigned to G2 and vice versa. The correspondence between probe-sets and Entrez gene IDs for the SMD data was established using Unigene (build 190 and 152 for human and mouse, respectively). For Affymetrix data we used the annotation files provided by Affymetrix for each platform.
Finally, CCNs were built from SCNs by mapping every Entrez Gene identifier to the corresponding Homologene cluster (build 55) and retaining only the cases in which a one-to-one correspondence could be established between the human and mouse Entrez gene IDs appearing in the SCNs.
Genetic disease phenotypes described in OMIM where correlated on the basis of MimMiner . MimMiner assigns a similarity score to all pairs of OMIM phenotype records, based on the text mining analysis of their phenotype descriptions . Two phenotypes were defined to be similar if their score was at least 0.4, because biologically meaningful relationships were mostly detected in phenotype pairs with a similarity score equal or greater than this value . About 1% of all possible pairs of phenotypes included in MimMiner pass this threshold.
The analysis of the CCNs was based on the construction of coexpression clusters, defined as a given gene (the center of the cluster) plus its nearest neighbors in the conserved coexpression network, thus obtaining one cluster for each gene. The prevalence of genes joining functionally related genes in the CCNs was tested by analyzing the prevalence of Gene Ontology terms within coexpression clusters, compared to the same prevalence in randomized coexpression clusters. We counted the number of coexpression clusters for which at least one Gene Ontology term was significantly overrepresented (P-value less than 10−4 with exact Fisher test), and compared this number with the same number averaged over 100 randomized CCNs. This was done separately for the Affy and Stanford networks, and the results are shown in Figure 2A.
The overlap between CCNs and protein interaction networks was evaluated by downloading the list of known interactions between human proteins from HPRD ,. To take into account the different experimental methods on which the HPRD interactions are based and their varying degree of reliability, we analyzed separately in-vivo, in vitro and yeast double hybrid interactions. In each case, separately for the Affy and Stanford networks, we compared the overlap between the CCNs and the protein interaction network to the same overlap averaged over 100 randomized CCNs. The results are shown in Figure 2B.
Finally to verify whether the CCNs were enriched in edges joining genes causing similar phenotypes, we constructed a network of human genes in which an edge was placed between every pair of genes known to be involved in the same disease or in diseases with MimMiner similarity score at least equal to 0.4. The mapping between OMIM phenotypes and genes known to cause them was obtained from Ensembl, version 45 . We then evaluated the overlap between this network and the CCNs, again compared to the same overlap averaged over 100 randomized CCNs (Figure 2C).
The lists of genes contained in OMIM loci of unknown molecular basis were obtained from Ensembl, version 45 . To identify likely candidates for a given disease-associated genomic locus, we first extracted from the networks disease-relevant conserved coexpression clusters. A cluster was considered relevant to a given disease d if it contained at least two genes experimentally known to cause phenotypes similar to d. These clusters will be called ‘disease clusters’ in the following. The genes of the disease clusters, which are also contained in the map interval of the locus associated to d, were retained for scoring as described below.
The size of the coexpression clusters and of the loci are very heterogeneous: coexpression clusters contain between 2 and 186 genes while loci associated to OMIM diseases with unknown molecular basis vary between 3 and 2153 genes. Therefore the genes selected in the previous step were assigned a probabilistic score based on the null hypothesis in which the coexpression clusters are random sets of genes. The score is essentially the probability that the two events leading to the identification of a candidate gene for a disease occur by chance. The two events are (1) the presence in a coexpression cluster of at least 2 genes known to be involved in phenotypes similar to the disease in question and (2) the presence in the same coexpression cluster of a gene located within the locus relevant to the disease. First, we computed a P-value p1 for associating a cluster to the given disease by chance. This P-value is given by the cumulative hypergeometric distribution considering: the number RDC of genes linked to similar phenotypes that have been found in the cluster (at least 2 to associate the cluster with the disease); the number Rall of genes in the network that are linked to similar phenotypes; the number GDC of genes in the cluster; and the total number of genes Gall in the network:
Second, we computed the P-value p2 of the overlap between the disease cluster and the genetic locus associated to the disease. This P-value is given by the cumulative hypergeometric distribution considering: the number LDC of genes in the locus that are also present in the given disease cluster; the number Lall of genes in the locus that are present in the network; and GDC and Gall are defined as above:
In the null hypothesis, associating a cluster with a disease and finding a gene in the cluster that belongs to the appropriate orphan locus are independent events. Thus, the total score for a predicted candidate is given by the product p1 • p2. When a candidate is found in more than one disease cluster, we consider only the lowest (best) score.
The cutoff on such scores was determined by estimating the false discovery rate (FDR) for each possible cutoff using 100 randomized CCNs per dataset: The false discovery rate is defined as the ratio between the average number of predictions made using randomized CCNs and the number of predictions made using the real CCN, with the same cutoff. The cutoffs thus obtained for a 10% FDR were 4.49·10−6 for the Affy and 2.67·10−6 for the Stanford network, respectively.
To estimate the precision of our procedure we used a leave-one-out strategy: For every gene experimentally associated to an OMIM phenotype we constructed artificial loci centered on the disease gene, and removed all associations between this particular phenotype and all the genes known to cause it, so as to simulate a phenotype with unknown molecular basis. The association between phenotypes similar to the one under examination and the corresponding genes was instead retained.
In order to take into account the variability of locus sizes we constructed artificial loci of various sizes, by taking the disease gene plus the N closest genes on each side of the chromosome (according to their start position on the chromosome). The artificial loci thus contained up to 2N+1 genes, but could contain fewer genes when the disease gene was close to one of the chromosome ends. In the following discussion, such artificial loci will be denoted as N20, N50, … N500 for locus sizes up to 41, 101 … 1001 genes, respectively. This range of locus sizes was chosen based on the observed size distribution of orphan loci: OMIM loci for diseases with unknown molecular basis contain an average number of about 273 genes (median: 180 genes).
We considered the disease gene as correctly identified if it was selected as a candidate by our method with the same score threshold that we used for the orphan loci. The precision is defined as the ratio between the number of cases with correctly identified disease genes and the number of cases with at least one selected candidate, that is, the fraction of cases with selected candidates in which the disease gene was among the candidate list.
Conserved coexpression has been previously reported to be an efficient criterion to identify functionally related genes ,. Therefore, to discover new relationships between human genes with a high potential relevance for disease phenotypes, we produced the two human-mouse gene coexpression networks described above, covering different platforms and experimental conditions. In particular, the Stanford network (supporting file S1) was generated from data based on cDNA platforms, corresponding mostly to experiments performed on tumor cell lines. In contrast, the Affy network (Text S2) was derived from normal tissue data, generated on Affymetrix platforms in two independent studies ,. The Stanford network has 8512 nodes (genes) and 56397 edges, with an average connectivity of 13.2 edges per node. The Affy network is composed of 12766 nodes and 155403 edges, with an average connectivity of 24.3 edges per node. Both networks contain a large connected component of 2305 and 4122 genes, respectively, with some other small connected components containing only a few nodes.
As expected from previous studies on gene coexpression networks , the two networks are topologically similar to other biological networks, characterized by the existence of a few highly connected nodes (hubs), but they show a connectivity distribution more similar to an exponential law than to a scale-free one (data not shown). More importantly, if compared with 100 random permutations, both networks show a strong prevalence of edges between genes that are annotated to the same Gene Ontology (GO) keyword (Figure 2A). This confirms that human-mouse conserved coexpression is a valuable criterion to identify functionally related genes. Accordingly, both networks show a highly significant overlap with protein-protein interactions reported in the Human Protein Reference Database (HPRD) , (Figure 2B).
Since many genes, such as those involved in basic cellular functions, should be coexpressed regardless of the particular experimental situation, we would expect the two networks to have many common links. Indeed, they share 2305 edges, between the 7332 common nodes, which represents a striking overlap (the randomized Affy networks had on average 88.4 edges in common with the Stanford network, with a standard deviation of 8.7). On the other hand, the large number of specific links that characterize the two networks indicates that they provide highly complementary information.
Finally, to evaluate the capability of conserved coexpression to link genes involved in similar disease phenotypes, we measured in both networks the prevalence of links between genes associated to phenotypes with similar descriptions . Interestingly, both networks showed a strong enrichment, if compared with the average number obtained from the randomized networks (Figure 2C). We concluded that the two networks represent complementary resources that could efficiently predict disease-relevant relationships among human genes.
The high prevalence of links between genes involved in similar disease phenotypes, observed in both networks, suggests that they could provide valuable information to identify likely candidates in mapped disease loci. Therefore, we devised an algorithm that integrates our CCNs with phenotype and mapping information to predict candidate disease genes in large genomic regions (Figure 1). As detailed in Materials and Methods, the procedure is based on the extraction from the network of the disease clusters, which we consider to be associated to a given disease since they contain at least two genes involved in similar phenotypes. The genes that are present in both the OMIM phenotype loci and the corresponding disease clusters are considered as candidates and assigned a score based on the size of the locus and of the disease cluster. Randomized runs allowed us to select a score threshold corresponding to a 10% FDR.
To evaluate how our procedure could perform on the loci characterized by unknown molecular basis, we applied a leave-one-out strategy to all the Ensembl genes associated to at least one OMIM disease ID, by constructing artificial loci of variable size around each gene. We then measured the fraction of artificial loci for which we obtained at least one candidate (Figure 3A), the average number of candidates found for these loci (Figure 3B) and the precision (Figure 3C), defined as explained in Materials and Methods. Of the 1762 disease genes contained in OMIM, we could analyze 1426, whose associated phenotypes are present in the MimMiner similarity matrix.
The precision obtained obviously decreases when the size of the artificial loci is increased. However, it is interesting to notice that, while for the smallest artificial loci the precision was excellent (about 68% for both networks), even with the largest artificial loci it was still remarkable (37.5% for the Affy and 29.1% for the Stanford networks, respectively). For N90 (artificial loci with a maximum of 181 genes, very close to the median size of 180 genes for OMIM phenotype loci with unknown molecular basis), for example, the leave-one-out validation yielded at least one candidate for 47.8% (Affy) and 12.7% (Stanford) of the disease loci (Figure 3A). In these cases, an average of 3.67 (Affy) and 2.17 (Stanford) candidates were returned (Figure 3B), that contained the true disease gene in 49.3% (Affy) and 43.6% (Stanford) of the cases (Figure 3C). That is, for both networks, when candidates were returned, the very short candidate lists contained the disease-causing gene with a probability of over 40%.
We next determined how our method performs compared to other existing approaches. This comparison was based on the enrichment in correctly identified disease genes (by leave-one-out) with respect to randomized networks. In our case the fold enrichment is defined as the ratio between number of disease genes correctly recalled in the leave-one out procedure and the same number averaged on 100 randomized CCNs. For this evaluation we used the N90 artificial loci (see Materials and Methods), which are closest in size to the actual orphan loci (median size 180). Fold enrichment values were computed for several published methods by Lage et al  (see their Supplementary material). It must be noticed that the exact definition of the fold enrichment depends on the method under consideration, thus comparison must be taken with caution. However, Table 1 shows that our methods compares favorably with many previously published ones: when considering that, as discussed above, the use of gene expression data allows a less biased analysis compared to most other methods, we conclude that our approach provides a significant and original contribution to the identification of disease genes.
The above results indicate that conserved coexpression can be efficiently combined with phenotype correlation data to provide high confidence candidates within genetic disease loci. Therefore, we applied our procedure to 850 OMIM phenotype entries with at least one mapped disease locus but unknown molecular basis. In Table 2 we provide the list of all the 321 candidates (gene-locus pairs) obtained with 10% FDR. We obtained predictions for 81 loci, 67 of which where only from the Affy network, 5 only from the Stanford and 9 from both. Interestingly, in 4 of the latter cases, the list of candidates from the two networks contained at least one common gene (Table 2).
Notably, for three OMIM phenotypes (163000, familial multiple nevi flammei; 268700, saccharopinuria; 300195; AMMECR1) our predictions include the actual disease genes that, although not yet correctly annotated in OMIM, have been found to be mutated in patients (see Table 2).
For 22 loci, at least one of the candidates obtained from either network was already known to be involved in phenotypes similar to those described for the locus. These genes represent the most obvious candidates and our results should be considered as further, independent evidence for their possible involvement in the disease. However, it must be noted that some of them were previously excluded, either by the direct identification of crossovers or by the negative results of mutation screenings. Nevertheless, since mutations have most likely been searched only within the annotated exons, we think that the decision to definitively rule out the involvement of such candidates should be taken cautiously. Moreover, even silent exonic mutations, although often considered innocuous polymorphisms, can have severe effects on proteins by disrupting splicing patterns ,.
In most cases only few candidates are given for a locus, thus providing extremely focused working hypotheses for the identification of the actual disease genes, which in many cases are made even stronger by the available sequence or functional information.
For instance, one of the two candidates provided for the OMIM phenotype entry 607221 (partial epilepsy with pericentral spikes, located on 4p15) corresponds to KCNIP4 (Figure 4). This protein has been show to specifically modulate the activity of Kv4 A-type potassium channels , which are well known regulators of membrane excitability  and have been recently involved in epilepsy . Another interesting example is given by the 605285 phenotype entry (hereditary motor and sensory neuropathy, Russe type, mapped to 10q23.2, a locus comprising only 26 candidate genes). The only prediction for this locus is gamma-synuclein (SNCG), which is a very strong candidate both for the low P-value and for the known role of synucleins in neurodegenerative disorders .
Even when the number of candidates for a particular locus is substantially higher, our results may provide a strong restriction of the experimental search field, which can be further narrowed by additional evidences. For instance, the phenotype with OMIM ID 130080 (Ehlers-Danlos syndrome, type VIII), is mapped to 12p13, containing 277 genes. In this case, the Affy and Stanford networks provide 8 and 4 candidates, respectively. Interestingly, the candidate with the lowest associated P-value is the Alpha-2-macroglobulin precursor (A2M), whose absence was previously reported in a patient with Ehlers-Danlos syndrome . A second interesting protein for this locus is CD9 that is the only candidate provided by both networks and that is known to regulate collagen matrix organization by interacting with Beta1 integrin .
In general, given the highly stringent criteria that we adopted and considering that the starting data underlying the two networks are completely independent, we propose that the 4 common candidates (Table 2) should be considered as those having the highest priority for experimental validation.
Since a recent study has identified high confidence candidate genes by integrating protein-protein interaction with phenotypic information , we evaluated the number of common predictions, and found that a candidate is proposed by both approaches for 7 loci. In 5 cases the candidate proposed by Lage et al. did not overlap with our predictions. The only, remarkable exception was Filamin C (FLNC), which was found as a candidate on chromosome 7q for both the 608423 (limb-girdle muscular dystrophy type 1F) and the 603511 (limb-girdle muscular dystrophy type 1D) OMIM phenotype entries. Interestingly, the FLNC gene has been previously implicated in myopathy, but mutations have not been reported in families mapping to the above loci. Thus, our results can be considered as further supporting evidence, pointing to the actual involvement of this gene in limb-girdle muscular dystrophy.
In the present report we have shown that the integration of massive gene expression data with phenotype similarity maps can allow to efficiently identify high-probability candidates for many orphan human genetic disease loci, even when these comprise hundreds of genes.
The comparison of the two different networks clearly showed that the Affy network is characterized by a much better signal/noise ratio than the Stanford network. Although this may be partially due to the different source material (normal tissues in the first case, mostly tumor cell lines in the second case) we think that this could also depend on the higher technical standards reached by oligonucleotide-based platforms. Nevertheless, it is also important to notice that the analysis of the Stanford network allowed the prediction of many candidates that were not obtained from the Affy network, indicating that different datasets can result in complementary predictions.
Interestingly, in both cases the gene coexpression criterion proved to be much more effective than it could be expected from previous work, where it has been mostly used in combination with other high-throughput information sources. We think that these results strongly underscore two critical points. The first is the importance of using a restrictive filter to select biologically relevant coexpression links, such as our phylogenetic filter selecting links which are under selective pressure and therefore more likely to imply functional relationships. The second is the usefulness of systematic phenotype analysis methods, which may capture disease similarities that could easily escape human operator-based approaches.
Although very significant under the statistical point of view, the overlapping between conserved-coexpression links and physical protein-protein interaction data appeared to be rather limited in absolute terms, strengthening the idea that these criteria may cover partially overlapping subsets of the functional interaction space. Our results strongly suggest that, in most of the cases, requiring the concordance of coexpression data and protein-protein interaction data may worsen, instead of improving, the performances of both methods. Therefore, we envisage the independent use of both types of evidence to predict functional relationships and candidate disease genes. However, in the limited cases for which these approaches provide convergent results, they can be used as strong additive evidence.
In conclusion, we propose that our method and our list of candidates will provide a useful support for the identification of new disease-relevant genes.
Stanford human-mouse conserved coexpression network. Each row contains a link between two Entrez-Gene identifiers.
(0.66 MB TXT)
Affy human-mouse conserved coexpression network. Each row contains a link between two Entrez-Gene identifiers.
(1.89 MB TXT)
The graphical representation of the network was obtained with BioLayout Express 3D (http://www.biolayout.org).
The authors have declared that no competing interests exist.
This work was funded by the Regione Piemonte CIPE Program to LS and by the Italian Ministry of University and Research through the FIRB program to FDC and PP.