In clinical genetics, it is often difficult to decide whether a quantitative variation in the genome is related to the observed phenotype, and predicting the consequences of haploinsufficiency is challenging (Huang et al., 2010). To understand the functional impact of a given CNV region, one must address not only the general issue of pathogenicity but also the question of which genes in the CNV region are associated with which of the phenotypic abnormalities present in the patient. Such information is invaluable for clinical management. Patients with overlapping but differently sized deletions or duplications might present with different phenotypes that correlate with the affected genes. For patient follow-up and screening procedures, knowing that one patient has, for example, a high cancer risk or a risk of developing diabetes or hypertension, whereas another patient does not, might have a substantial impact on individual prognosis and treatment.
In this work, we developed a semantic algorithm for mapping model organism phenotype data to equivalent human phenotypic features. We used this algorithm to address the question of which genes in CNVs are most likely to be causally related to individual phenotypic features seen in the CNV, based on the assumption that an abnormal dosage of a gene is likely to lead to phenotypic abnormalities similar to those caused by a loss- or gain-of-function mutation in the same gene. In this way, we are able to exploit the wealth of phenotypic information available for 5703 genes in model organisms for which phenotypes of mutations in the human orthologs are unknown. For the 27 well-characterized CNV disorders analyzed in this work, we identified a total of 802 phenogram matches, i.e. genes in which a monogenic disease in humans or model organisms is associated with a phenotypic feature that is also seen in (or similar to) one of the features of the CNV disorder. To test the performance of our algorithm, we performed the identical analysis 5000 times on randomized data. On average, only 250 features were identified, and the maximum number of features found in any of the randomized runs was ~350. We performed an extensive literature search for previously reported phenotype associations (supplementary material Table S2); comparison with the results of our algorithm revealed that we identified 456 previously reported associations. Additionally, we found 346 novel phenotype associations that, to the best of our knowledge, have not been previously recognized in the medical literature. Our algorithms might also be valuable for incorporating model organism data into other areas of human genetics, such as the prioritization of variants found in exome sequencing projects.
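The randomization test described above can be illustrated with a minimal Python sketch. It assumes that each randomized run yields a single score (here, a count of matches) and that the empirical P-value is the fraction of randomized scores at least as large as the observed one. The function names, the way random gene sets are drawn, and the add-one correction are illustrative assumptions, not the procedure used in this study.

```python
import random


def empirical_p_value(observed, null_scores):
    """Fraction of randomized scores that are >= the observed score.

    The +1 in numerator and denominator is a common convention that avoids
    reporting P = 0 from a finite number of randomizations. It is an
    assumption of this sketch, not necessarily the formula used in the study.
    """
    n_extreme = sum(1 for s in null_scores if s >= observed)
    return (n_extreme + 1) / (len(null_scores) + 1)


def randomized_scores(score_fn, gene_pool, n_genes, n_runs=5000, seed=0):
    """Score n_runs random gene sets of the same size as the real interval.

    score_fn is a hypothetical callable that maps a list of genes to a score;
    drawing random gene sets is only one possible randomization scheme.
    """
    rng = random.Random(seed)
    return [score_fn(rng.sample(gene_pool, n_genes)) for _ in range(n_runs)]


if __name__ == "__main__":
    # Toy data (hypothetical): 1 marks a gene with a phenotype match, 0 none.
    pool = [1] * 10 + [0] * 90
    null = randomized_scores(sum, pool, n_genes=5, n_runs=1000)
    print(empirical_p_value(observed=5, null_scores=null))
```

If none of the 5000 randomized scores reaches the observed one, this convention gives 1/5001 rather than zero, which reflects the finite resolution of the test.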
Our work has several limitations. Empirical probabilities (P-values) for the phenogram scores (SPG) are given for each of the 27 CNV disorders. In total, 14 of the CNV disorders displayed statistically significant scores (P<0.05), including clinically distinct disorders such as WAGR and Sotos syndrome. There are several possible reasons for the lack of statistical significance for the remaining 13 disorders, which could relate to the limitations of our computational approach, inadequate phenotypic annotations, or lack of knowledge about the genes located within the CNV. An important limitation of the approach as implemented in the current work is that it depends upon the granularity of the phenotype descriptions. Broadly used, nonspecific descriptions of abnormalities, such as autism or intellectual disability, are not flagged as ‘statistically significant’ because they are used so frequently. The calculated P-values are based on the IC of the phenotypic features, and the IC of intellectual disability is very low (IC=3.2) because so many genes (currently 392) are annotated to this term. Indeed, the P-values of the phenogram scores correlate with the granularity of the phenotypic descriptions of the CNV disorders, shown as the average IC of the CNV phenotypes and the IC of the phenogram matches; unsurprisingly, the P-values also correlate with the size of the intervals, measured with respect to the absolute numbers of genes and the numbers of genes with available phenotype information (see figure 3.2 in section 3.9 of supplementary material Table S3). Many recently characterized CNV disorders that have been delineated on the basis of array-CGH (comparative genomic hybridization) screening rather than clinical studies have substantially less-specific clinical pictures. Nonspecific clinical phenotypes and high phenotypic variability complicate diagnosis and could explain why diseases associated with microdeletions or duplications of 3q29 (Ballif et al., 2008) and microdeletions of 15q24 (Andrieux et al., 2009) do not score as well as more distinct CNV disorders. Although the current work concentrated on identifying statistically significant phenotypic matches, an implementation of our method as a clinical decision support system could be designed to show the best match or matches for both specific and less specific phenotypic abnormalities. We also note that the P-value as calculated in this work is not a measure of the probability that the CNV is the cause of the disease phenotype, which is the type of hypothesis testing that one would use in a diagnostic setting. Rather, the test assesses whether the phenotypic abnormalities associated with the individual genes within the CNV match the phenotype of the CNV disorder better than one would expect by chance, which is a conservative way of evaluating the results of semantic phenotype matching. It is to be expected that some degree of phenotypic similarity to CNVs with complex phenotypes exists at many other loci in the genome. For instance, hundreds of distinct CNVs could be associated with phenotypes such as autism (Levy et al., 2011), and indeed there is a high likelihood that a large deletion anywhere in the genome will be pathogenic and result in one or more abnormal phenotypic features (Vermeesch et al., 2007). Therefore, the method presented in this work would need to be extended to include other data, such as previous reports of comparable CNVs in databases such as DECIPHER (Database of Chromosomal Imbalance and Phenotype in Humans Using Ensembl Resources) (Firth et al., 2009) and ISCA (Riggs et al., 2012), to be useful as a clinical differential diagnosis support tool.
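The dependence of IC on annotation frequency, which underlies the granularity limitation discussed above, can be made concrete with a short sketch. It assumes the common definition of IC as the negative natural logarithm of the fraction of annotated genes that carry a term. The logarithm base and the total of 10,000 annotated genes used below are assumptions for illustration only, because neither is stated in this section.

```python
import math


def information_content(n_annotated, n_total):
    """IC of a phenotype term: -ln(fraction of genes annotated to the term).

    Assumes natural logarithm; other bases only rescale the values.
    """
    if not 0 < n_annotated <= n_total:
        raise ValueError("need 0 < n_annotated <= n_total")
    return -math.log(n_annotated / n_total)


if __name__ == "__main__":
    total = 10_000  # hypothetical number of annotated genes
    # A frequently used term (392 genes, as for intellectual disability).
    print(round(information_content(392, total), 2))  # 3.24 with this total
    # A rarely used, specific term (hypothetical count of 5 genes).
    print(round(information_content(5, total), 2))
```

Under these assumptions, a term annotated to 392 genes receives a much lower IC than a term annotated to only a handful of genes, so matches on broad terms contribute little to a score built from IC values.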
It is difficult to provide any direct experimental proof in humans that altered dosage of a specific gene is responsible for a specific phenotypic abnormality in a CNV disorder, unless monogenic lesions also occur in isolation in other patients. Rather, candidate genes are proposed based on the similarity of their single-gene mutation phenotypes to the CNV phenotypes; for instance, haploinsufficiency for ELN was proposed as the cause of supravalvular aortic stenosis observed in individuals with Williams syndrome because point mutations in the ELN gene also give rise to this phenotype. A total of 456 of the 802 associations identified by computational analysis in our study have been previously proposed in the literature, thereby supporting our computational approach (supplementary material Table S2). To the best of our knowledge, 346 of the 802 associations offer novel candidate genes for individual phenotypic features. We examined associations from the literature that were not detected by our approach to determine the extent and possible reasons for such false-negatives. Some associations, such as FZD9 and ‘intellectual disability’ for Williams syndrome (Pober, 2010), fell below the threshold for detection by our method because of their very low IC. Others, such as the association of genetic variants in STX1A and ‘impaired glucose tolerance’ (Pober, 2010; Romeo et al., 2008), were not detected by our method because they were based on human association studies and are so far not reported in OMIM or any of the information sources used in this study. Inclusion of these kinds of data, for example from resources such as the Genetic Association Database (GAD) (Becker et al., 2004) or GWAS Central (Thorisson et al., 2009), in candidate gene prediction algorithms such as ours will be addressed in future projects.
For neurological and neuropsychiatric phenotypes such as intellectual disability, seizures, schizophrenia, mood disorders and autism, genetic heterogeneity and variable expressivity and penetrance are well-known features. Cytogenetic imbalances are the most frequently identified cause of intellectual disability (Aradhya et al., 2007), and CNVs are increasingly being detected by array-CGH in individuals with neurological and neuropsychiatric phenotypes (Akil et al., 2010). For such phenotypes, dysregulation of relevant neural circuits might be caused by disruption of single genes, but combinatorial effects of variations in many genes affecting shared pathways have also been proposed (Shaikh et al., 2011). Similarly, clustering of functionally related genes has been proposed for bovine quantitative trait loci (Salih and Adelson, 2009). We identified clustering of functionally related genes within CNVs as a second important factor for pathogenicity of CNVs in the human genome, not only for neurological phenotypes, but also for various other phenotypic features such as genitourinary, skeletal and metabolic abnormalities. We found evidence that genes involved in pheno-clusters are often functionally related to one another and tend to be near one another in the PPI network.
There is now abundant evidence of functional clustering in all mammalian genomes. Presumably, the phenotypic clustering observed in our study is related to the clustering of functional neighborhoods of genes across chromosomes, which is even partially conserved across species (Al-Shahrour et al., 2010). In some cases, clustering is associated with areas of strong linkage disequilibrium, suggesting that coinheritance of combinations of alleles of genes whose products interact or are associated with the same pathway or function might be the evolutionary driving force. Interestingly, functional clusters shared by different species do not always seem to consist of orthologs, suggesting that evolutionary pressure is exerted upon the cluster's function rather than the individual genes within it (Al-Shahrour et al., 2010; Michalak, 2008; Petkov et al., 2005). To date, there has been no explicit global analysis of the clustering of gene function, location, process, pathway or expression patterns involved in human CNVs, but the possibility of epistatic relationships between these genes would be predicted to be strong. There is, however, some evidence that certain functional gene classes are overrepresented in areas of the genome containing common CNVs (Conrad et al., 2010). Our data provide, for the first time, a breakdown of the ‘phenotypic readout’ from regions involved in CNVs and strongly suggest that they contain functionally clustered genes (Michalak, 2008; Petkov et al., 2005; Stranger et al., 2007). The results of our study shed new light on the pathobiology of human CNVs and provide evidence that the concept of clustering of phenotypically related genes plays an important role in genome pathology.
Another important aspect of our study is that 377 of the 629 genes analyzed did not have any human or model organism phenotype information. Thus, systematic genome-wide phenotyping efforts such as the International Mouse Phenotyping Consortium (Brown and Moore, 2012) and corresponding efforts in zebrafish (Kettleborough et al., 2011; Wang et al., 2007), such as the Zebrafish Mutation Project, have great potential to provide additional insights and candidates for genes involved in human disease. Algorithms such as ours that make use of phenotypic similarities between human and model organisms will facilitate the computational integration of information from these projects, harnessing these increasingly rich resources to help us understand the consequences of human mutation and functionally dissect the human genome. Our algorithms can be adapted to assist with interpretation and understanding of the diagnostic results from array-CGH analyses. Similar algorithms can be developed for interpreting next-generation sequencing data, thereby moving closer to the objective of a personalized genetic approach to medical care.