There are now many published reports of the significant role of rare, de novo CNVs with major phenotypic effects in various human disease populations, including intellectual disabilities, autism spectrum disorders, epilepsy, and schizophrenia, among others. Many of these studies are based on well-phenotyped research cohorts that were originally collected and characterized to optimize the ability to detect small effects in genome-wide association studies. Although positive associations have been identified for a few common diseases through these efforts, a surprising and remarkable finding has been the identification of rare, de novo CNVs with major phenotypic effects, particularly in neurocognitive and behavioral disorders. Because these events are rare, obtaining adequate evidence for their functional role in disease causation requires very large sample sizes as well as large control populations.
An alternative model for assessing the contribution of CNVs to disease, which has been utilized particularly in the study of children with unexplained developmental disabilities and congenital anomalies, has been the reporting of case series from clinical laboratory testing. Most of these published studies have represented CNV data from single laboratories and were based on previous generation targeted array analysis using bacterial artificial chromosome (BAC) genomic clones.5
Compared to analysis of research cohorts of well-phenotyped patients, the amount and quality of phenotypic data associated with clinical laboratory referrals is often quite limited.
For this study, we have combined these two approaches by exploiting a large CNV dataset derived from a consortium of clinical laboratories to explore the frequency and functional significance of rare CNVs. Our analysis of the first 15,749 ISCA cases, one of the largest CNV studies to date, has confirmed the power of this approach. We have defined the frequency (17.1%) of pathogenic CNVs in a cohort of individuals with intellectual and developmental disabilities and performed formal case-control studies of selected recurrent genomic regions whose frequency was sufficient for statistical analysis.
The determination of whether a CNV contributes to an abnormal phenotype depends on many factors, including gene content, previous evidence of pathogenic CNVs in the region, type of CNV (deletion or duplication), inheritance pattern, and frequency in unaffected populations. As such, larger CNVs may be more likely to be classified as pathogenic since they have a higher chance of including a dosage-sensitive gene and/or they include a larger number of genes that cumulatively result in an abnormal phenotype. Our experience, as well as that of other groups,53
has shown that the classification of a previously unreported CNV not associated with known disease genes can vary. To address such discrepancies, we used case-control statistical evidence for 14 selected recurrent CNV regions to objectively determine their significance.
We analyzed deletions and duplications of each region separately, resulting in 28 total recurrent CNV regions. Using this approach, we demonstrated and confirmed the pathogenic nature of 20 recurrent regions. For the 16p11.2 duplications that had previously been reported as uncertain in the literature, we were able to re-classify this CNV region as pathogenic. Overall, we conclude that 21 out of the 28 recurrent CNVs examined should be considered pathogenic and provide a clinical diagnosis for any individual harboring a CNV of these regions.
The statistical approach we used to classify recurrent CNVs and the results we obtained are useful tools for researchers and the clinical community in interpreting whether a CNV has pathologic effects. However, while such statistical analysis is possible for recurrent CNVs, where the frequency is high, this strategy is more difficult for the remaining ~75% of CNVs, which are not mediated by segmental duplications and are individually very rare. Therefore, other approaches need to be explored to address this class of CNVs. One possibility for these highly heterogenous CNVs is to analyze all genomic intervals of a defined size (e.g., 500 kb or 1 Mb) or to use a “sliding-window” analysis to examine overlapping genomic intervals along the length of each chromosome. By comparing structural variation observed in cases to controls, disease-causing regions can be differentiated from those associated with normal variation by using the control data to define regions of the genome where dosage changes can be tolerated without overt phenotypic effects. Since non-recurrent CNVs are very rare events, the collection of data from hundreds of thousands of cases will be needed for this type of analysis to be successful. Continued efforts of the ISCA consortium, as well as other databases such as DECIPHER, will be essential to this process in order to obtain enough overlapping CNVs to provide the power needed for statistical analyses.
The ISCA consortium is continuing to grow and now includes over 150 clinical laboratories from across the world. Given the rapid increase in utilization of this testing on a routine clinical basis, and the ability to recruit an expanding number of collaborating labs contributing data to a central database, the size of this cohort will continue to rapidly grow, providing a highly cost-effective way to obtain very large CNV datasets. In addition, since this data will be publicly available through two NCBI resources, dbGaP (database of Genotypes and Phenotypes) and dbVar, this resource can be readily accessed by researchers as well as the clinical community. Having large datasets from individuals with abnormal phenotypes will foster more objective formal scientific analyses to predict which CNVs will impact human development. Such efforts will make it possible to develop a whole-genome dosage map in humans to determine which genes and regions are subject to haploinsufficiency or triplosensitivity compared to those that are tolerant of dosage changes.