To understand the genetic heterogeneity underlying developmental delay, we compare copy-number variants (CNVs) in 15,767 children with intellectual disability and various congenital defects to 8,329 adult controls. We estimate that ~14.2% of disease in these individuals is due to large CNVs > 400 kbp. We find greater CNV enrichment in patients with craniofacial anomalies and cardiovascular defects than epilepsy or autism. We identify 59 pathogenic CNVs including 14 novel or previously weakly supported candidates. We refine the critical interval for several genomic disorders such as the 17q21.31 microdeletion syndrome and identify 940 candidate dosage-sensitive genes. We also develop methods to opportunistically discover small, disruptive CNVs within the large and growing diagnostic array datasets. This evolving CNV morbidity map combined with exome/genome sequencing will be critical for deciphering the genetic basis of developmental delay, intellectual disability, and autism spectrum disorders.
Large copy number variants (CNVs) are enriched in the aggregate among severe cases of pediatric disease including neurological and congenital birth defects1,2 as well as neuropsychiatric diseases3–5. Clinical interpretation of individual loci has been problematic for several reasons. First, except for CNV “hotspots” flanked by duplications prone to unequal crossing over and elevated de novo mutation rates6,7, disease associations for many individual CNVs remain unclear due to their rarity and the need to screen extraordinarily large sample sizes. Second, even for CNVs with clear pathogenicity, the dosage-sensitive genes that underlie the phenotypes observed have generally not been identified because the CNVs are large and encompass many genes. Finally, considerable variation in expressivity is often observed, with the same lesion contributing to different disease outcomes8–12. Thus, while their disease risk in general is well established, the phenotypic consequences for most large CNVs are not well characterized nor have these effects been fine mapped. Here, we leverage a collection of data from 15,767 children with various developmental and intellectual disabilities and compare them to a CNV map we generated from 8,329 adult controls. We present the first detailed genome-wide morbidity map of developmental delay and congenital birth defects. Striking differences in the CNV landscape are revealed including potentially pathogenic genes, refinement of known disease-causing mutations, and the discovery of potentially novel genes, including the development of methods to opportunistically discover smaller disruptive CNVs from clinical datasets.
We analyzed 15,767 DNA samples from children referred to Signature Genomic Laboratories, LLC, with a general diagnosis of intellectual disability (ID) and/or developmental delay (DD), although we note that this ID/DD cohort also includes a constellation of phenotypes including, but not restricted to, congenital malformation, hypotonia and feeding difficulties, speech and motor deficits, growth retardation, cardiovascular and renal defects, epilepsy, hearing impairment, craniofacial and skeletal features, and behavioral issues. Overall 73% of cases suffer from ID/DD and/or autism spectrum disorder, while 12% of cases were not annotated. The remainder were classified with various congenital abnormalities. Detailed phenotypic information is limited to 48.4% of the cases where specific subclassifications could be made, including 575 cases with cardiovascular defects, 1,776 cases with epilepsy/seizure disorder, 1,379 with autism spectrum disorder, and 3,898 with craniofacial defects (Supplementary Tables 1 & 2).
DNA samples obtained from whole blood were analyzed by customized array comparative genomic hybridization (CGH) at an average probe density of ~97,000 oligonucleotides, sufficient for reliable genome-wide detection of CNVs >300 kbp and for targeted detection of events >40 kbp for approximately one-fourth of the genome13. After filtering, a total of 16,526 rare (< 1% population frequency) autosomal CNV calls were made with an average of 1.05 CNV events per individual (median size 213 kbp). Using a customized higher density microarray and fluorescent in situ hybridization, we validated 402/425 CNVs (precision of 0.945) greater than 150 kbp (Supplementary Note, and Supplementary Table 3). Similarly, manual inspection of calls with low log ratios or z-scores (absolute values of <0.25 and <1.5, respectively) suggests a false discovery rate of 0.0138. For comparison, we identified CNVs from a control set of 8,329 adult samples assayed using multiple Illumina genome-wide single-nucleotide polymorphism (SNP) microarrays. These samples were studied as part of genome-wide association studies (dbGaP) for phenotypes unrelated to neurological disease (e.g. lipid concentration levels, blood pressure, asthma, etc.) (Supplementary Table 4). CNVs were called using a Hidden Markov Model (HMM)–based discovery method14 with an overall precision of 0.892 in identifying large CNVs (>100 kbp) (validation rates of 6/6 (ref. 15) and 19/22 (ref. 16)). From this dataset, we identified 446,736 CNVs with an average of 53.6 events (rare and common) per individual (median size 1.9 kbp). Due to the increased probe density (most >550,000 probes), our control dataset provides increased CNV detection power and resolution when compared to the disease dataset, reducing the potential for spurious CNV enrichments within cases (see Methods).
We compared CNV content between the cases and controls excluding common CNVs (>1% population frequency). Consistent with previous studies of pediatric neurological disease3–5,17,18, we find a significant excess of large CNVs among cases relative to controls. This excess is evident at 250 kbp and becomes more pronounced with increasing CNV size (Figure 1A). For example, at a threshold of 400 kbp, ~25.7% (4,047 cases) of ID/DD children harbor an event of at least this size compared to 11.5% of controls, suggesting that an estimated 14.2% of ID and DD is due to the presence of CNVs >400 kbp in length (OR = 2.7, p = 5.86×10−158). At a threshold of 1.5 Mbp, we identify 1,782 (11.3%) affected individuals versus only 52 (0.6%) controls (OR = 20.3, p = 6.87×10−266) and at a threshold of 3.0 Mbp the odds ratio jumps to 47.7 (p = 1.68×10−197). There is a remarkably strong correlation (R2 = 0.97) with the de novo rate as a function of increasing CNV size, with 50% of events at 1 Mbp reported as inherited (Supplementary Figure 1).
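As a worked check on the burden figures above, the following minimal Python sketch recomputes the carrier fractions and the odds ratio at the 1.5 Mbp threshold from the counts quoted in the text. The function name `odds_ratio` is illustrative and not taken from the study; the p-values come from the authors' own tests and are not reproduced here.

```python
def odds_ratio(case_carriers, n_cases, control_carriers, n_controls):
    """Odds ratio for carrying a CNV above a size threshold (cases vs. controls)."""
    case_odds = case_carriers / (n_cases - case_carriers)
    control_odds = control_carriers / (n_controls - control_carriers)
    return case_odds / control_odds


N_CASES, N_CONTROLS = 15767, 8329  # cohort sizes quoted in the text

# Carrier counts quoted in the text for CNVs larger than 1.5 Mbp.
case_fraction = 1782 / N_CASES
control_fraction = 52 / N_CONTROLS
or_1p5 = odds_ratio(1782, N_CASES, 52, N_CONTROLS)

print(f"cases: {case_fraction:.1%}, controls: {control_fraction:.1%}, OR = {or_1p5:.1f}")
```

The same function applies to any size threshold once the carrier counts in cases and controls are known.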
We find 1,492 CNVs in 1,400 individuals within 45 known genomic disorder regions (Table 1, Supplementary Table 5). Among these, deletions are nearly twice as common as duplications (n = 954 deletions vs. 538 duplications) and show greater average penetrance (96.3%) when compared to duplications (94.3%). We note that “classic,” phenotypically well-defined syndromes known to result from CNVs (e.g. Smith-Magenis, Williams syndrome, etc.) are underrepresented here relative to other cohorts of individuals with similar phenotypes (Supplementary Table 6), suggesting that our estimate of CNV burden in ID/DD is not upwardly biased by ascertainment for known CNV carriers.
Examining the size distribution of CNVs in the context of major subphenotypes shows that the large CNV burden is increased in more severe developmental phenotypes associated with multiple congenital abnormalities. We find, for example, that children also diagnosed with craniofacial and cardiovascular defects show a significantly increased burden of large CNVs when compared to children with autism spectrum disorder (p = 4.99×10−10 and 6.45×10−5, respectively, at >400 kbp) (Figure 1B). Children with an additional diagnosis of epilepsy/severe seizure disorder tend to have a more intermediate CNV burden when compared to individuals with autism or more severe ID (Supplementary Figure 2). These distinctions remain significant even after excluding CNVs larger than 10 Mbp (which would have been detectable by karyotype analysis) and when the CNV burden among the subset of controls screened for psychiatric disease is used as the baseline, demonstrating a role for large CNVs in more severe phenotypic variation.
A comparison of the CNV landscape between cases and controls reveals striking differences and some general genomic architectural features (Figure 2). To ameliorate the effects of breakpoint imprecision and multi-platform comparisons, we contrasted the number of deletions (or duplications) present in cases versus controls in 200 kbp windows along the human genome using a Fisher’s exact test (Supplementary Table 7, Supplementary Figure 3). This analysis identified 80 genomic regions that were at least weakly enriched for CNVs (counting deletions and duplications separately) among cases (at least five windows with p < 0.1), 27 of which exhibit strong evidence for enrichment (p < 0.001). Notably, 27.5% (22/80) of the enriched CNV-loci reside at genomic hotspots flanked by large (>10 kbp) blocks of highly similar (>90%) segmental duplication (SD) and include most known genomic disorders (Supplementary Table 7). An additional 46 enrichments represent large CNVs near telomeres (Supplementary Figure 4). While we observed enrichments at one or both ends of all chromosomes, 12 chromosome ends showed particularly strong (p < 0.001) enrichment. Of the 80 CNV-loci, 15 are novel or are supported only by isolated case reports (Table 2). Additional phenotypic details for CNV carriers at each of these 15 CNV-loci, including ethnicity and inheritance status, are available in Supplementary Table 8, in some cases with comparison to similar CNVs observed in case reports from the DECIPHER database19. We note that one of these 15 (duplications at 10p15.3) appears to be enriched among cases as a consequence of allelic stratification between African and European populations and was eliminated from further consideration (see Methods and Supplementary Note).
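The windowed enrichment test described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: CNV calls are hypothetical tuples of (sample, chromosome, start, end) with half-open coordinates, the test is one-sided (the text does not say whether a one- or two-sided test was used), and the toy cohort sizes are invented.

```python
from collections import defaultdict
from math import comb

WINDOW = 200_000  # window size in bp, as stated in the text


def carriers_per_window(cnvs):
    """Map (chrom, window index) -> set of sample IDs with a CNV overlapping that window."""
    windows = defaultdict(set)
    for sample, chrom, start, end in cnvs:
        for idx in range(start // WINDOW, (end - 1) // WINDOW + 1):
            windows[(chrom, idx)].add(sample)
    return windows


def fisher_enrichment_p(case_hits, n_cases, control_hits, n_controls):
    """One-sided Fisher's exact p-value for an excess of carriers among cases."""
    hits = case_hits + control_hits
    denom = comb(n_cases + n_controls, hits)
    # Hypergeometric tail: probability that >= case_hits of the carriers are cases.
    tail = sum(
        comb(n_cases, k) * comb(n_controls, hits - k)
        for k in range(case_hits, min(hits, n_cases) + 1)
    )
    return tail / denom


# Toy data (hypothetical): five cases share a deletion, no controls carry one.
case_dels = [(f"case{i}", "chr15", 150_000, 450_000) for i in range(5)]
control_dels = []
case_windows = carriers_per_window(case_dels)
control_windows = carriers_per_window(control_dels)

for key in sorted(case_windows):
    n_case = len(case_windows[key])
    n_control = len(control_windows.get(key, ()))
    print(key, fisher_enrichment_p(n_case, 10, n_control, 10))
```

Counting carriers per fixed window, rather than matching exact breakpoints, is what makes the comparison tolerant of breakpoint imprecision across platforms.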
Among the 14 novel ID/DD CNV-loci, we identified a 660 kbp deletion mapping to chromosome 15q25.2 flanked by SDs (69.8 kbp, 98.6% identity) (Figure 3A). The deletion is absent in the controls analyzed here and the Database of Genomic Variants (http://projects.tcag.ca/variation/), but present in five affected individuals (including two siblings) among the ID/DD sample set. Clinical aspects of the probands were variable consisting of neurologic features and DD (Supplementary Table 9); one female had only mild motor delay associated with a congenital myopathy but was otherwise cognitively normal. The two brothers with the deletion both had autism spectrum disorders but additional family members were not tested (Supplementary Note). A previous meta-analysis of patients found this deletion in 4 of 6,860 cases16 with schizophrenia and autism compared to 0 of 5,674 controls (combined with this study, p = 0.037 after excluding one sibling). Thus, while statistical significance remains modest and population stratification cannot be definitively ruled out (see Supplementary Note), these data suggest a potentially new genomic disorder that will be observed at a frequency of 1/3,000 referred cases.
One of the most common genomic hotspots in this study is 15q11.2 (NIPA1), a 292 kbp deletion whose pathogenicity has been considered uncertain4,20. In terms of frequency, the 15q11.2 deletion is second only to VCF/DGS deletion, and our data indicate it is significantly enriched (OR = 2.36, p = 2.5×10−5) albeit at lower penetrance (0.83) than most other genomic disorders. In addition, we find support for the pathogenicity of duplications of obesity-associated 16p11.2 (SH2B1)21,22 and epilepsy-associated 15q13.3 (CHRNA7)23. We also analyzed 111 regions of the human genome predicted to be prone to recurrent microdeletions and microduplications based on the presence of homologous SDs at their flanks in the reference assembly6. Of these potential hotspots, 62 harbored CNVs likely mediated by NAHR between the flanking SDs (“active hotspots”), while the remaining 49 did not. The presence of SDs in direct, as opposed to inverted, orientation is a key distinction between active and inactive hotspots (46/54 direct vs. 16/57 inverted; a 3.04-fold difference in the proportion of active hotspots). We also found that SDs flanking active hotspots are larger and show higher sequence identity compared to inactive hotspots (Kolmogorov-Smirnov test, p = 0.0022) (Supplementary Figure 5). Interestingly, eight regions were identified that showed no evidence of copy number variation in cases or controls despite the presence of large, highly similar, and directly oriented SDs at their flanks (Supplementary Table 10). These may be regions that are mutationally active but in which dosage imbalance is lethal (e.g. 7p14.3 flanked by 19.9 kbp duplications and containing BBS9 and BMPER).
In addition to identifying new potentially pathogenic loci, the large number of cases provides the opportunity to identify atypical deletions (i.e. characterized by noncanonical breakpoints and likely generated by a non-NAHR mutational mechanism) and refine the critical region of known genomic disorders. For example, we identified three individuals with smaller, atypical deletions within the 17q21.31 microdeletion syndrome region18,24,25 (Figure 3B). These patients’ breakpoints contrast with those of 23 patients carrying the canonical 480 kbp deletion mediated by unequal crossover between directly oriented SDs, a genomic architecture largely restricted to individuals of European descent26. Detailed clinical information on two individuals with the atypical deletion (Figure 3C) showed strong phenotypic similarity with the known syndrome, including a pronounced philtrum, epicanthic folds, cupped ears and skeletal defects of the hand (Supplementary Note, Supplementary Table 11). The strong phenotypic similarity refines the dosage-sensitive region to only three genes (Figure 3B), including MAPT, which is disrupted by one of these atypical deletions.
Encouraged by the additional refinement provided by atypical deletion events, we performed a gene-based analysis on the complete ID/DD dataset, as well as on patient subsets partitioned by additional phenotypic data. We identified 615 genes as significantly deleted in any phenotype (Benjamini-Hochberg corrected p < 0.05; Supplementary Table 12), the vast majority of which are associated with known pathogenic loci or subtelomeric alterations. An Ingenuity Pathways Analysis (IPA) (www.ingenuity.com) showed significant enrichment in expected functional categories (e.g. cardiovascular disease, developmental, endocrine system and developmental disorders).
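For readers unfamiliar with the Benjamini-Hochberg correction applied to the per-gene p-values, the following is a minimal sketch of the standard step-up adjustment. The input p-values are invented for illustration and the function name is hypothetical; the per-gene test that produces the raw p-values is described in the Methods and is not reproduced here.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.001, 0.02, 0.03, 0.8]
adjusted = benjamini_hochberg(raw)
significant = [i for i, q in enumerate(adjusted) if q < 0.05]
print(adjusted, significant)
```

Because neighboring genes within one large CNV are tested separately but are not independent, a correction of this kind tends to be conservative, which is the motivation given in the text for also examining nominally significant genes.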
We then expanded our analysis to include candidate associations with nominal significance, as the above analysis is likely to be overly conservative due to the high level of dependence between neighboring genes. An IPA of genes with a nominal p < 0.02 identified the same functional categories as above suggesting that a large proportion of the nominally significant genes are likely relevant to morbidity. In addition to identifying genes within known genomic disorders, this analysis identified genes outside of these intervals. For example, we observed an excess of smaller deletions of SCN1A specifically in patients with epilepsy (p = 0.019), consistent with the literature27. CD44 deletions on 11p13 are significant in craniofacial cases (p = 0.010) and have previously been linked to cleft lip and palate in SNP and expression microarray studies28,29. A region on 9p24 containing five genes is significant in craniofacial cases, with the peak significance focused at SLC1A1 (peak p = 0.00172), a high affinity glutamate transporter previously implicated in multiple neurological conditions30. This peak, specific to SLC1A1, is also significant in neurological, craniofacial and epilepsy cases. A 2q37 deletion immediately proximal to the 2q37 deletion region (Table 1) containing 15 genes is enriched primarily in neurological (modal p = 0.00479) and epilepsy (modal p = 0.00542) phenotypes and contains genes associated with neurodevelopmental and sleep phase disturbances (GBX2 and PER2)31,32. Finally, the deletion of PARD3 is significant in autism (p = 0.01023). PARD3 has been previously associated with bipolar disease33 and is involved in both tight junctions formation and axonal fate determination34.
We also identified 325 duplicated genes (Supplementary Table 12) significantly enriched among the patients (Benjamini-Hochberg corrected p < 0.05). As for deletions, nearly all genes enriched among duplications at this stringent threshold were within known pathogenic duplications and were overrepresented (IPA) in categories that fit well with the expected phenotypic abnormalities (e.g. cardiovascular disease, developmental, endocrine system and developmental disorders). Expanding our analysis to enrichments with nominal significance identified IPA functions identical to the conservative approach as well as several promising candidate gene regions. We observed duplications containing three genes (SH3YL1, ACP1 and FAM150B) on chromosome 2p in cases with craniofacial disorders (p = 0.01032). Notably, large 2p distal duplications have been associated with facial dysmorphism in multiple case reports35,36. Similarly, we observed duplication of two genes (RSPO4 and PSMF1) on distal chromosome 20p in cases with cardiac defects (p = 0.01195), and larger duplications of 20p have been associated with cardiac defects37. The results suggest a potential role for these small subtelomeric regions in disease. Finally, we observed duplication of proximal 8p extending to include two genes in cases with neurological disorders (p = 0.00479), one of which (FNTA) has been shown to be more highly expressed in schizophrenia38.
While the data suggest that as much as 14.2% of DD may be explained by large CNVs, many causal mutations remain to be identified. We sought to determine if novel, smaller CNVs could be identified among these patients assuming that breakpoints would not necessarily be recurrent and individually relevant events would be rare (<0.1%); such variants may, in principle, identify novel candidate genes, refine the molecular basis for the phenotypic consequences of larger CNVs, and broaden the predictive power of a given microarray experiment. Therefore, we conducted a directed search for small, exon-affecting CNVs, reasoning that such variants are more likely to have disease relevance and be amenable to follow-up. For each consensus coding sequence (CCDS) exon39, we determined the average intensity for the three closest probes (termed a “cassette”) in each sample and, in turn, identified cassettes exhibiting outlier intensities that may be indicative of deletions (see Methods, Supplementary Figure 6). Note that because this strategy is exon-centric, it is partially platform and breakpoint independent. We analyzed 186,014 autosomal coding exons using 65,704 cassettes (multiple exons are often targeted by the same cassette), excluding exons within known common CNVs16,40,41. After a series of data normalization and quality-control steps, we identified 829 cassettes in which a small (10–100) set of samples exhibited probe intensities that clustered well below the population-wide mean. Each of these was manually reviewed to eliminate artifacts and select for genes with greater potential for disease involvement; 19 were selected for follow-up and organized into two subjectively defined tiers of quality (Table 3).
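The cassette strategy can be sketched in a few lines. This is an illustration under assumptions, not the authors' implementation: probe distance is measured to the exon midpoint, outliers are flagged with a robust z-score based on the median absolute deviation, and the cutoff of -5 is arbitrary. The actual normalization and quality-control steps are described in the Methods. All names and values below are hypothetical.

```python
from statistics import mean, median


def build_cassette(exon_start, exon_end, probe_positions, size=3):
    """Indices of the `size` probes closest to the exon midpoint."""
    mid = (exon_start + exon_end) / 2
    ranked = sorted(range(len(probe_positions)),
                    key=lambda i: abs(probe_positions[i] - mid))
    return sorted(ranked[:size])


def cassette_intensity(sample_log_ratios, cassette):
    """Average log ratio of one sample across the probes in a cassette."""
    return mean(sample_log_ratios[i] for i in cassette)


def low_outliers(intensity_by_sample, z_cutoff=-5.0):
    """Samples whose cassette intensity lies far below the cohort median."""
    values = list(intensity_by_sample.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)
    scale = 1.4826 * mad or 1e-9  # guard against a zero MAD
    return sorted(s for s, v in intensity_by_sample.items()
                  if (v - center) / scale <= z_cutoff)


# Toy example: five probes, one exon, seven samples.
probes = [100, 900, 1500, 2100, 5000]
cassette = build_cassette(1000, 1400, probes)
deleted_sample = cassette_intensity([0.0, -0.9, -0.8, -0.85, 0.01], cassette)
intensities = {"s1": 0.02, "s2": -0.01, "s3": 0.00, "s4": 0.03,
               "s5": -0.02, "s6": 0.01, "del1": deleted_sample}
print(cassette, low_outliers(intensities))
```

Keying the analysis on exons rather than on called segments is what lets a single low-intensity probe contribute evidence for a deletion.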
Among the “first tier” of predicted deletions, we found that 55 of 58 individual (i.e. sample-level) predictions validated, with at least one validated event for all 10 examined genes, and for the “second tier,” we found that 25 of 40 predictions validated across seven of the nine examined genes. A total of 44 of the validated deletions spanned only a single probe on the originally used array (Supplementary Figure 7). Deletion events at three genes were determined to be polymorphisms42–44. Interestingly, we found PARK2 to contain at least six distinct exon-affecting deletions ranging in size from 118 to 315 kbp (Figure 4, Supplementary Note, Supplementary Figure 8). However, there is no evidence for CNV enrichment at this locus among cases as this phenomenon also holds true for control samples (Supplementary Figure 9), suggesting that PARK2 is a fragile gene prone to recurrent deletion events. We also identified small deletions in TBX5, a gene known to cause Holt-Oram syndrome45 (a disorder characterized by upper limb abnormalities and congenital heart defects; OMIM #142900). We found that 7 of 15 samples predicted to harbor a TBX5 event were fetal samples, a rate significantly greater than the background proportion of fetal samples (13.4%, p = 0.0019), consistent with the observations that TBX5 mutations can result in prenatal abnormalities detectable by ultrasound46.
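The fetal-sample enrichment for TBX5 can be checked with a simple calculation. The text does not state which test was used; the sketch below assumes a one-sided exact binomial test of 7 fetal samples among 15 against the 13.4% background proportion, which gives a value close to the reported p = 0.0019.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# 7 of 15 predicted TBX5 carriers were fetal samples; background rate 13.4%.
p_value = binomial_upper_tail(7, 15, 0.134)
print(f"one-sided binomial p = {p_value:.4f}")
```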
We present one of the largest studies investigating the role of rare CNVs in ID and DD, analyzing data from 15,767 affected individuals and 8,329 controls. These data quantify the massive contribution of large CNVs to pediatric disease, with 25.7% of affected individuals harboring CNVs >400 kbp in contrast with only 11.5% of controls. Disease risk increases steadily in relation to CNV size, with an odds ratio >20 for carriers of CNVs larger than 1.5 Mbp and nearly 50 at a threshold of 3 Mbp. We find that the CNV burden differs significantly depending on the nature of the primary clinical referral, with craniofacial abnormalities and structural defects of the heart being especially enriched for large CNVs relative to epilepsy and autism spectrum disorder (Figure 1, Supplementary Figure 2). As has been observed in model organisms and predicted based on theory47,48, haploinsufficiency appears more common and penetrant than triplosensitivity for severe developmental phenotypes. While this cohort does not represent a random sampling of individuals with ID/DD and includes some individuals without ID or DD, our estimates are likely applicable to ID/DD in general. For example, the average CNV burden across 15 genome-wide studies of ID/DD (combined sample size of 1,021) was estimated to be ~13.7%, similar to our estimate of 14.2%, in a literature survey by Miller et al.49 (note that this estimate was derived by averaging the diagnostic yields for all studies with a genome-wide resolution of 1 Mbp or better as indicated in Table 2 of Miller et al.). Furthermore, the observed enrichment for many loci known to contribute to ID/DD risk (Table 1) and individual genes previously identified to be disrupted among affected individuals (Supplementary Table 12) clearly supports the applicability of the inferences generated here for both ID/DD specifically and neurological disease (e.g. schizophrenia, autism, etc.) in general.
Practically, these data serve as a clinical resource useful in diagnostics (Tables 1 and 2). The large number of controls and cases provides estimates of penetrance for 59 pathogenic CNVs (accounting for ~10% of cases) and sheds light on either ambiguous or previously unknown pathogenic variants, including 14 novel or previously marginally supported CNV loci that collectively represent ~0.7% (112 of 15,767, Table 2 and Supplementary Note) of the individuals studied here. We note that while one CNV-locus (10p15.3 duplications) appeared to be enriched among cases as a result of ancestry differences between cases and controls, the aggregate ethnic composition of the 14 loci in Table 2 matched closely our control dataset (see Supplementary Note, Supplementary Figures 10 and 11), suggesting that population stratification for rare variants is unlikely to explain the enrichment at these loci. The size distribution (median of 940 kb), inheritance rate (15 of 34 tested CNVs are de novo, with at least 1 de novo variant observed in 6 of the 14 loci), and overlap with DECIPHER entries further support the disease risk for these CNV-loci.
Among these potentially novel CNVs, we provide additional support for a genomic disorder mapping to 15q25.2, which we find in five affected individuals (including two affected siblings) and zero controls (Supplementary Figure 12). Our results combined with earlier studies of schizophrenia and autism (four cases vs. zero controls)16 implicate this CNV as a high-risk allele for pediatric neurological disease with variable outcomes (Supplementary Note, Supplementary Table 9) as well as neuropsychiatric disease (p = 0.037). In addition, our data support the pathogenicity of CNVs at 2q13 whose significance was uncertain because they were observed in a small number of control samples50. In our study, we observed 12 deletions (p = 0.032) and 9 duplications (p = 0.022) on chromosome 2q13 in patients but only one deletion in controls. We furthermore find an enrichment of the deletion in cardiovascular cases (peak p = 0.012) and the duplication in cases with craniofacial features (peak p = 0.010). These results are consistent with two previously reported deletion cases with multiple heart defects and two duplication patients with various facial and skeletal features50. Additionally, our data support the pathogenicity of duplications at 16p11.2 (SH2B1), duplications at 15q13.3 (BP3-BP5; CHRNA7), and deletions at 15q11.2 (BP1-BP2; NIPA1). The latter are present in ~1 in 167 affected individuals studied here and, although incompletely penetrant (0.83), are likely strong risk factors for DD in addition to schizophrenia4,51.
Finally, the discovery of atypical and smaller deletions among patients with virtually identical phenotypes helps to refine the smallest region of overlap for known syndromes. The atypical deletions of 17q21.31 exclude deletions of CRHR1 as playing a role in this syndrome (although deletions of long-range regulatory elements that change CRHR1 expression cannot be ruled out) and narrow the likely candidates to three genes, including MAPT, which is disrupted by proximal breakpoints in two cases (Figure 3B). Overall, we identified 615 deleted genes and 325 duplicated genes significantly enriched in cases when compared to controls. The dosage sensitivity of these genes should not be considered proven; rather, they are candidates with a higher prior probability of dosage sensitivity for future studies. It is encouraging that this set includes a number of previously hypothesized and novel associations between genes and particular traits (Supplementary Table 12). In addition, our data show that even older, low-resolution microarray data afford discovery opportunities for CNVs that have not previously been detectable. Indeed, we successfully identified and confirmed dozens of small deletion events, several of which have plausible disease roles (e.g. TBX5 deletions and Holt-Oram syndrome), including many detected by only a single probe in the original microarray experiment. As the underlying raw data from diagnostic laboratories are released, there will be great potential for finding additional exon-altering deletions. Further validation of these and other novel candidates will yield new insights into the specific phenotypes affected by the loss or gain of individual genes.
While most arrays cannot robustly capture the small deletions we identified, such as those adjacent to exons of FGF9 and LYST (associated with Chediak-Higashi Syndrome), control screening using PCR or other targeted high-throughput assays may be used to follow up individually interesting candidates (Supplementary Note).
We predict that this map of CNVs and potentially dosage-sensitive genes will be invaluable for both clinical and research purposes in the future. For example, Boone et al. used an exon-targeted microarray to identify a number of individual gene disruptions in individuals with ID/DD of plausible but uncertain pathogenicity given their rarity. We find support for a number of these genes, including two (CREBBP and SLC1A1) that are significantly enriched among individuals here with similar phenotypes to those previously described (Supplementary Note). As genomic discovery efforts, especially exome sequencing, expand, the results described here should prove increasingly important to clinicians and researchers faced with the challenges of linking rare disruptive mutations to pediatric diseases.
Samples from individuals with ID/DD and related phenotypes were submitted to Signature Genomic Laboratories, LLC, mostly from the U.S. and Canada, for clinical microarray-based CGH; a total of 15,767 samples were analyzed and 16,526 rare autosomal CNV calls were detected (Supplementary Table 1) and deposited into dbVar (dbVar study accession nstd54)52. Informed consent was obtained to publish clinical information and photographs and to further characterize the CNVs present in the individuals with detailed information presented in this paper, using a protocol approved by the Institutional Review Board. Although not a random set of children with ID/DD, the presentations are representative of those observed in a clinical diagnostic setting. The majority of the individuals have an ID/DD phenotype; however, clinical features such as craniofacial and skeletal features, growth retardation, cardiovascular and renal defects, hypotonia, speech and motor deficits, hearing impairment, epilepsy, and behavioral problems were also documented. We identified 575 cases with cardiovascular defects, 1,776 cases with epilepsy/seizure disorder, 1,379 cases with autism spectrum disorder, 3,898 cases with craniofacial defects, and 8,772 cases with general neurological defects; many individuals had multiple subclassifications (Supplementary Table 2). Self-reported ethnicity was available for 144 individuals, with 75% (95% CI 67.3–81.4%; 108/144) reporting Caucasian (primarily European descent), 6.9% (95% CI 3.8–12.3%; 10/144) African American, and 18.1% (95% CI 12.6–25.1%; 26/144) as other. These samples were analyzed across nine custom array-CGH platforms, with most tested on an Agilent array with ~97,000 probes (Supplementary Figure 13).
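The text does not state how the 95% confidence intervals for self-reported ethnicity were computed. The reported bounds are consistent with the Wilson score interval, as the following minimal sketch shows; the choice of method is an inference from the numbers, not a statement from the Methods.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Counts quoted in the text, out of 144 individuals with self-reported ethnicity.
for label, count in [("Caucasian", 108), ("African American", 10), ("other", 26)]:
    low, high = wilson_interval(count, 144)
    print(f"{label}: {count / 144:.1%} (95% CI {low:.1%} to {high:.1%})")
```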
Controls were not ascertained specifically for neurological disorders, but all were obtained from adult samples providing informed consent so developmental disorders should be exceedingly rare. Of individuals with known ethnicity, 81.2% are Caucasian (primarily European descent), 2% are African/African American, and 16.5% are other/mixed ancestry. Due to the slight enrichment of African-American cases compared to our control samples, we modeled the potential impact of large CNV stratification and found no evidence for an overall enrichment of unique large CNVs in the African cohort (Supplementary Figure 10). DNA was obtained from cell lines and blood-derived samples generated for association studies of various phenotypes. Datasets are detailed in Supplementary Table 4. Data were obtained from the following sources: HGDP16,53; NINDS (dbGaP accession no. phs000089)16,54; PARC/PARC2 (refs. 55, 56); London (parents of asthmatic children)15; FHCRC (pre-release data provided courtesy of Aaron Aragaki, Charles Kooperberg, and Rebecca Jackson as part of an ongoing genome-wide association study to identify genetic components of hip fracture in the Women's Health Initiative); InCHIANTI (data provided by InCHIANTI study of aging; http://www.inchiantistudy.net)15,57; and WTCCC2 (NBS)58. Control CNV arrays were analyzed as described previously16. Briefly, a Hidden Markov Model (HMM) based on both allele frequencies and total intensity values was used to identify putative alterations, followed by manual inspection of large CNVs (>100 probes and >1 Mbp) in conjunction with user-guided merging of nearby calls (<1 Mbp apart for arrays with <1 million probes and <200 kbp apart for arrays with >1 million probes), which represent a single region broken up by the HMM, or gaps.
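The merging step can be illustrated with a short sketch that proposes merge candidates using the gap thresholds quoted above. In the study the merging was user-guided and followed manual inspection, so this automated version is only an approximation; the call format (chromosome, start, end, type) and the function names are hypothetical.

```python
def merge_gap(n_probes):
    """Maximum gap (bp) between calls considered for merging, by array density."""
    return 1_000_000 if n_probes < 1_000_000 else 200_000


def merge_nearby_calls(calls, n_probes):
    """Merge same-chromosome, same-type calls separated by less than the gap threshold."""
    max_gap = merge_gap(n_probes)
    merged = []
    # Sort by chromosome, CNV type, then start so mergeable calls are adjacent.
    for chrom, start, end, kind in sorted(calls, key=lambda c: (c[0], c[3], c[1])):
        if merged:
            p_chrom, p_start, p_end, p_kind = merged[-1]
            if (p_chrom, p_kind) == (chrom, kind) and start - p_end < max_gap:
                merged[-1] = (p_chrom, p_start, max(p_end, end), p_kind)
                continue
        merged.append((chrom, start, end, kind))
    return merged


# Toy calls: the first two are 300 kbp apart, the third is far away.
calls = [
    ("chr1", 1_000_000, 2_000_000, "del"),
    ("chr1", 2_300_000, 3_500_000, "del"),
    ("chr1", 9_000_000, 9_400_000, "del"),
]
low_density = merge_nearby_calls(calls, n_probes=550_000)
high_density = merge_nearby_calls(calls, n_probes=1_200_000)
print(low_density)
print(high_density)
```

On the lower density array the first two calls are joined into one region, while on the 1.2 million probe array the tighter threshold leaves all three separate.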
All samples on arrays with densities <1M probes were filtered by a maximal genome-wide LogR standard deviation of 0.25, while the high density 1.2 million probe WTCCC2 data was filtered using an increased standard deviation cut-off of 0.37. Large alterations with noncanonical allele frequencies indicative of mosaics were excluded due to the high likelihood of these resulting from cell culture immortalization. For the two datasets where the Illumina array mapping corresponded to build35 (NHGRI), we utilized the autosomal calls generated previously16 and mapped the coordinates to build36 using the UCSC LiftOver tool (http://genome.ucsc.edu).
Microarray platform heterogeneity may yield false CNV enrichment signals as a function of differential detection power related to probe density, data quality, analysis methods, etc. We made a number of efforts to control for such potential effects and believe our study design is robust to this source of error for several reasons. First, the control data for this study were generated on higher resolution platforms (317,000 to 1,200,000 probe Illumina SNP arrays, with 88% of controls being profiled on 550,000 probe or higher density platforms) compared to the case data (median array is ~97,000 probes, highest density is ~130,000). As a result, our CNV detection power is substantially higher for controls than cases; notably, such differences will tend to manifest as false positive enrichments for CNVs in controls, whereas we are focused exclusively on enrichments within cases. Second, we rigorously eliminated potential sources of errors in the case CNV data with a combination of both manual and automated filters, including calls with low probe counts, high degrees of overlap with segmental duplications in the reference assembly, and likely reference-sample CNVs. Third, for the sliding window enrichment tests we eliminated all CNVs in cases that spanned fewer than 10 probes on the lowest resolution (HH317K) control SNP array. Fourth, we have validated 402/425 CNVs and determined the precision in cases to be high in general (0.945) and higher in cases relative to controls (0.892). Fifth, we specifically analyzed the 14 potentially pathogenic CNVs (Table 2) for control SNP microarray performance. Eleven of the 14 loci harbored small CNV calls within the region of interest from multiple control studies; as CNV calling algorithms tend to demonstrate increased sensitivity to larger alterations, we consider this to indicate sufficient control sensitivity within these loci to detect large CNVs.
The remaining three loci are split between the minimal common region (MCR) on 1q24.3, which demonstrates a single 72 kbp CNV in controls (again suggesting detectability of larger events), and two loci that harbor very small CNVs detectable only on the highest resolution 1.2M probe arrays. These two regions have high probe coverage on the 550K control array (46 probes within the smallest 6p22.3 Signature call and 40 probes in the MCR of 2q24.3). Further, all of these regions demonstrate de novo CNVs in our samples, supporting the hypothesis that these are pathogenic loci and not simply common copy number variants that we failed to detect with SNP platforms.
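The probe-count filter applied before the sliding window tests (removing case CNVs that span fewer than 10 probes on the lowest resolution control array) can be sketched as follows. The sketch assumes a sorted list of probe positions for a single chromosome and inclusive CNV boundaries; the function names and the evenly spaced toy probe map are hypothetical and do not reproduce the real HH317K layout.

```python
from bisect import bisect_left, bisect_right
from typing import Sequence


def probes_spanned(sorted_probe_positions: Sequence[int], start: int, end: int) -> int:
    """Count probes with start <= position <= end on one chromosome."""
    return bisect_right(sorted_probe_positions, end) - bisect_left(sorted_probe_positions, start)


def keep_case_cnv(sorted_probe_positions: Sequence[int], start: int, end: int,
                  min_probes: int = 10) -> bool:
    """Retain a case CNV only if the control array could plausibly detect it."""
    return probes_spanned(sorted_probe_positions, start, end) >= min_probes


# Toy control array: one probe every 10 kbp along a 1 Mbp chromosome.
toy_probes = list(range(0, 1_000_000, 10_000))
n_large = probes_spanned(toy_probes, 100_000, 250_000)
keep_large = keep_case_cnv(toy_probes, 100_000, 250_000)
keep_small = keep_case_cnv(toy_probes, 100_000, 150_000)
```

Requiring a minimum probe count on the sparsest control platform restricts the comparison to events that every control array had a realistic chance of detecting.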
Control CNVs were merged into CNVRs by comparing each CNV to all of its overlapping partners and merging those with 50% reciprocal overlap. These CNVRs were then analyzed in the context of sliding 300 kbp genomic windows to identify regions of high variability (Supplementary Figure 9, Supplementary Table 13). Regions of high SNP diversity were obtained from Kidd et al.44 and used to identify regions where the breakpoint variability is likely to result from general sequence variation (such as the HLA locus on 6p). To perform a gene-based search for highly variable loci, we first generated a merged RefSeq list that combined overlapping splice variants into a single, large gene definition. We then analyzed these loci in the context of overlapping gain and loss CNVs that either contained the entire gene, overlapped the transcript (gene breaking or exon hits), or were contained within an intron. Finally, we analyzed each gene in the context of the number of unique CNVRs that overlapped the gene space (exonic or intronic).
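A minimal sketch of merging CNVs into CNVRs by 50% reciprocal overlap is given below. The text does not specify how chains of overlapping calls were resolved, so the single-linkage grouping used here (via a small union-find) is an assumption, as are the function names and the half-open coordinate convention. Intervals are assumed to lie on one chromosome and to have positive length.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end) on one chromosome, end exclusive


def reciprocal_overlap(a: Interval, b: Interval) -> float:
    """Smaller of the two fractions of each interval covered by the other."""
    shared = min(a[1], b[1]) - max(a[0], b[0])
    if shared <= 0:
        return 0.0
    return min(shared / (a[1] - a[0]), shared / (b[1] - b[0]))


def merge_into_cnvrs(cnvs: List[Interval], min_ro: float = 0.5) -> List[Interval]:
    """Group CNVs linked by >= min_ro reciprocal overlap; report each group's span."""
    parent = list(range(len(cnvs)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(cnvs)):
        for j in range(i + 1, len(cnvs)):
            if reciprocal_overlap(cnvs[i], cnvs[j]) >= min_ro:
                parent[find(i)] = find(j)

    regions: Dict[int, Interval] = {}
    for i, (start, end) in enumerate(cnvs):
        root = find(i)
        lo, hi = regions.get(root, (start, end))
        regions[root] = (min(lo, start), max(hi, end))
    return sorted(regions.values())


cnvrs = merge_into_cnvrs([(0, 100), (20, 110), (90, 300), (1000, 1100)])
```

In this toy input the first two calls share 80 bp, which is at least half of each, so they collapse into one CNVR. The third call overlaps them only marginally and remains its own region.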
For a subset of 11,529 samples, we identified for each coding exon39 the three closest probes, requiring at least one probe on each side within 100 kbp of the exon. We required that all probes map within 200 kbp, yielding 65,704 unique cassettes targeting 186,014 autosomal coding exons. We then determined the average cassette intensity for each sample and normalized this by array type. Subsequently, we filtered cassettes by the following criteria: 10–100 samples with scores at least 5 standard deviations below average; samples scoring at least 5 standard deviations below average compose at least 10% of samples scoring at least 3 standard deviations below average (a measure of cluster separation); and no overlap of the target exon (note, individual probes were not filtered given the heterogeneity of platforms and potential for atypical CNVs) with common copy number polymorphisms or deletions seen in multiple control individuals16,42,43,59. This yielded 829 candidates for follow-up, each of which was manually reviewed to eliminate cassettes in which all candidate deletions clustered within a single array type (suggestive of a batch artifact) and noisy cassettes resulting from probes embedded within segmental duplications (SDs) (for examples, see Supplementary Figure 6). Subsequently, 19 cassettes were chosen for validation and manually divided into two qualitative tiers based on the totality of the evidence (follow-up potential of the affected gene, visual analysis of probe intensity distributions, etc.). We designed a custom NimbleGen oligonucleotide array spanning each of the 19 genes and their flanks at very high density (Supplementary Note) and performed CGH on 98 samples, chosen by cassette score and availability, that were predicted to carry a deletion at one of the 19 genes.
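The two score-based cassette criteria can be sketched in Python as follows. The sketch assumes that each sample's cassette intensity has already been converted to a z-score after normalization by array type, and it reads the cluster-separation rule as "samples at or below -5 SD make up at least 10% of samples at or below -3 SD". The third criterion (no overlap with common copy number polymorphisms) and the manual review are not modeled, and the function name and defaults are hypothetical.

```python
from typing import Sequence


def cassette_passes(z_scores: Sequence[float],
                    min_carriers: int = 10,
                    max_carriers: int = 100,
                    min_fraction: float = 0.10) -> bool:
    """Apply the carrier-count and cluster-separation criteria to one cassette."""
    below5 = sum(1 for z in z_scores if z <= -5.0)  # candidate deletion carriers
    below3 = sum(1 for z in z_scores if z <= -3.0)  # all low-intensity samples
    if not (min_carriers <= below5 <= max_carriers):
        return False
    # below3 >= below5 >= min_carriers > 0, so the division is safe.
    return below5 / below3 >= min_fraction


well_separated = [0.0] * 1000 + [-6.0] * 12 + [-3.5] * 20   # 12 of 32 low samples
too_few = [0.0] * 1000 + [-6.0] * 5                         # only 5 carriers
smeared = [0.0] * 1000 + [-6.0] * 12 + [-3.5] * 200         # 12 of 212 low samples

results = [cassette_passes(z) for z in (well_separated, too_few, smeared)]
```

The fraction test rejects cassettes whose low scores form a continuous noisy tail rather than a distinct cluster, which is the behavior expected of a measure of cluster separation.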
We thank Niklas Krumm, Maika Malig, Laura Vives, and Jason Luu for assistance in validation experiments. We also thank Megan Dennis, Can Alkan, Emre Karakoc and Tonia Brown for useful discussions and editing the manuscript. B.P.C. is supported by a fellowship from the Canadian Institutes of Health Research. This study makes use of data generated by the Wellcome Trust Case-Control Consortium. A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk. Funding for the project was provided by the Wellcome Trust under awards 076113 and 085475. We would also like to thank Aaron Aragaki, Charles Kooperberg, and Rebecca Jackson for access to SNP data (FHCRC control dataset) generated as part of an ongoing Genome-wide Association Study to Identify Genetic Components of Hip Fracture in the Women's Health Initiative. This work was supported by NIH HD065285 to E.E.E. E.E.E. is an Investigator of the Howard Hughes Medical Institute.
All CNV calls have been submitted to dbVar under accession nstd54.
AUTHOR CONTRIBUTIONS
This study was designed by G.M.C., B.P.C., S.G., E.E.E., J.A.R., B.C.B., and L.G.S. L.G.S. supervised array-CGH experiments at Signature Genomics. J.A.R. and B.C.B. coordinated clinical data collection. G.M.C. and B.P.C. performed data analysis and curated control CNV data. S.G. curated genomic disorders data. S.G., T.V., and C.B. performed array CGH and PCR validations. C.W., H.S., R.H., V.H., H.A.H., P.B., E.M., D.N., K.L., H.T., M.H., N.A., J.G., J.K., V.S., K.J., and C.R. provided clinical information. G.M.C., B.P.C., S.G. and E.E.E. wrote the manuscript. All authors have read and approved the final version of the manuscript.
COMPETING FINANCIAL INTERESTS
E.E.E. is a member of the Scientific Advisory Board of Pacific Biosciences. J.A.R., B.C.B., and L.G.S. are employees of PerkinElmer.