Given prior evidence for the contribution of rare copy number variations (CNVs) to autism spectrum disorders (ASD), we studied these events in 4,457 individuals from 1,174 simplex families, composed of parents, a proband and, in most kindreds, an unaffected sibling. We find significant association of ASD with de novo duplications of 7q11.23, where the reciprocal deletion causes Williams-Beuren syndrome, featuring a highly social personality. We identify rare recurrent de novo CNVs at five additional regions including two novel ASD loci, 16p13.2 (including the genes USP7 and C16orf72) and Cadherin13, and implement a rigorous new approach to evaluating the statistical significance of these observations. Overall, we find large de novo CNVs carry substantial risk (OR=3.55; CI =2.16-7.46, p=6.9 × 10−6); estimate the presence of 130-234 distinct ASD-related CNV intervals across the genome; and, based on data from multiple studies, present compelling evidence for the association of rare de novo events at 7q11.23, 15q11.2-13.1, 16p11.2, and Neurexin1.
Autism spectrum disorders (ASD) are defined by impairments in reciprocal social interaction, communication, and the presence of stereotyped repetitive behaviors and/or highly restricted interests. A genetic contribution is well established from twin studies (Bailey et al., 1995; Lichtenstein et al., 2010) in which the very large difference between the monozygotic and dizygotic concordance rates is consistent with the contribution of de novo mutation and/or complex inheritance. In addition, the over-representation of ASD in monogenic developmental disorders (Klauck et al., 1997; Smalley et al., 1992), gene discovery in families with Mendelian forms of the syndrome (Morrow et al., 2008; Strauss et al., 2006), and long-standing evidence for an increased burden of gross chromosomal abnormalities (Bugge et al., 2000; Veenstra-Vanderweele et al., 2004; Vorstman et al., 2006; Wassink et al., 2001) all point to the importance of genetic risks.
Over the last several years, dramatic advances have emerged from the study of copy number variants (CNVs) such as submicroscopic chromosomal deletions and duplications (Iafrate et al., 2004; Sebat et al., 2004). Sebat et al. (2007) first noted that “large” (mean size of 2.3Mb), rare (<1% frequency in the general population), de novo events were more frequent in ASD probands from families with only a single affected child (i.e. simplex families) than in controls, as well as in comparison to probands from families with more than one affected individual (i.e. multiplex families).
The over-representation of large de novo CNVs in ASD has been replicated in studies ranging from 60 to 393 simplex trios (Itsara et al., 2010; Marshall et al., 2008; Pinto et al., 2010), and two of the three studies (Marshall et al., 2008; Pinto et al., 2010) have confirmed a greater abundance in simplex versus multiplex ASD families. The burden of rare de novo CNVs in simplex probands (i.e. the percentage of individuals carrying ≥1 rare events) has ranged from 5.0-11% (Table S1). Rare structural variants, both transmitted and de novo, also show varying degrees of evidence for association with ASD. These include deletions and/or duplications at specific loci, including: 1q21.1, 15q11.2-13.1, 15q13.2-13.3, 16p11.2, 17q12, and 22q11.2 as well as recurrent structural variations involving one or a small number of genes, including: Neurexin 1 (NRXN1), Contactin 4 (CNTN4), Neuroligin 1 (NLGN1), Astrotactin 2 (ASTN2), and the contiguous genes Patched Domain Containing 1 (PTCHD1) and DEAD box Protein 53 (DDX53) (Bucan et al., 2009; Glessner et al., 2009; Kumar et al., 2008; Marshall et al., 2008; Moreno-De-Luca et al., 2010; Noor et al., 2010; Pinto et al., 2010; Weiss et al., 2008).
To date, the number of definitive replicated findings from these studies has remained small and all evidence has pointed to a highly heterogeneous allelic architecture as no risk variant is present in more than ~1% of affected individuals. In addition, examples of incomplete penetrance (not all mutation carriers have disease), and affected siblings not sharing the same risk variant, have been the rule rather than the exception. Moreover, remarkably diverse outcomes have been identified for apparently identical CNVs. For example, chromosome 16p11.2 deletions and duplications have been found in individuals with ASD and intellectual disability (ID) (Weiss et al., 2008), seizure disorder (Mefford et al., 2009), obesity (Bochukova et al., 2010), macrocephaly, and schizophrenia (McCarthy et al., 2009). These complexities suggest that the use of association strategies to demonstrate an excess of specific de novo CNVs will play an important role in definitively implicating loci in ASD.
We have conducted a genome-wide analysis focusing primarily on the study of rare de novo CNVs in 4,457 individuals comprising 1,174 simplex ASD families from the Simons Simplex Collection (Fischbach and Lord, 2010), nearly three-fold larger than previously reported simplex cohorts. Each family is extensively phenotyped, with a single affected offspring, unaffected parents, and, in the majority of cases, at least one unaffected sibling. This ascertainment strategy is designed to enrich for rare de novo risk variants, and the family-based case-control comparisons mitigate a wide range of technical and methodological confounders that have plagued association study designs (Altshuler et al., 2008). We have also developed a new rigorous approach to evaluating the genome-wide significance of recurrent rare de novo events. Consequently, the scale and design of this study provide an extraordinary opportunity to investigate the relative contributions of rare de novo and rare transmitted variants in simplex families, to identify novel ASD risk loci, to evaluate the relationship between rare structural variation and social and intellectual disability (ID), and to place these findings in the context of previous ASD data, particularly with regard to rare de novo CNVs.
A total of 4,457 individuals from 1,174 families were included in the study. Data from 1,124 families passed all quality control; 872 families were quartets that included two unaffected parents, a proband, and one unaffected sibling; 252 families were trios that included two unaffected parents and a proband (Figure 1).
The male to female ratio for probands was 6.2:1. All had confirmed ASD diagnoses based on well-accepted research criteria (Risi et al., 2006), including autism: 1,006 (89.5%), Pervasive Developmental Disorder-Not Otherwise Specified: 96 (8.5%), and Asperger Syndrome: 22 (2%). The mean age at inclusion was 9.1 years for probands (4-18 years) and 10.0 years (3.5-26 years) for siblings. The mean (±95% CI) full-scale IQ in probands was 85.1±1.5, although the range was considerable (<20-167, Figure 3); the mean verbal IQ was 81.9±1.7 and the mean non-verbal IQ was 88.4±1.4. Self-reported ancestry was as follows: White, non-Hispanic: 74.5%; Mixed: 9.3%; Asian: 4.3%; White, Hispanic: 4.0%; African American: 3.8%; Other: 4.2%. Additional phenotypic data may be found in recent publications (Fischbach and Lord, 2010) and at www.sfari.org/simons-simplex-collection.
DNA samples derived from whole blood (N=4,381), cell lines (N=68), or saliva (N=8) were genotyped on the Illumina IMv1 (334 families) or Illumina IMv3 Duo Bead-arrays (840 families), which share 1,040,853 probes in common. CNV prediction was performed by PennCNV (PN) (Wang et al., 2007), QuantiSNP (QT) (Colella et al., 2007), and GNOSIS (GN), (www.CNVision.org) (Figure 1). 115 predicted rare (≤50% of the span of the event found at >1% in the database of genomic variation (DGV)) CNVs were evaluated by quantitative polymerase chain reaction (qPCR). A higher positive predictive value was observed for CNVs predicted by PN and QT, with or without GN (PPV=97% with GN, PPV=83% without) than for other combinations of algorithms, irrespective of the number of probes mapping within the structural variation (Table S2, Figure S1); these “high-confidence” criteria were subsequently used to identify all rare transmitted CNVs.
However, given the likely importance of de novo variation, and the relative challenge of accurately detecting these CNVs (Lupski, 2007), we focused on this small subset of predictions and further optimized our detection strategy using the first 585 quartets with complete genotyping data (Figure 1). We predicted putative de novo events from the group of rare high-confidence CNVs based on the combination of within-family intensity and genotypic data and used a blinded qPCR confirmation process (Figure S1). 53% of de novo predictions based on ≥20 probes (N=94) were confirmed compared with 2.6% with <20 probes (N=430). 82% of failures were false-positive predictions in offspring, 18% were false-negatives in parents. The data from this experiment were used to refine de novo prediction thresholds (supplementary materials). In addition, given the large number of predictions of small CNVs, and the low yield of true positives in the pilot data set (Figure S1), we elected to restrict all further statistical analysis to rare de novo events encompassing ≥20 probes that were also confirmed by qPCR in whole-blood DNA (Figure S1).
At the conclusion of our study, we were able to evaluate this threshold further via a comparison of confirmed de novo CNVs identified here with those reported in 1,340 overlapping offspring (probands or siblings) using the Nimblegen 2.1M array, as described by Levy and colleagues in this issue (Levy et al., 2011). At a threshold defined by the presence of ≥20 Illumina probes within a genomic interval, a combined total of 58 rare de novo CNVs were identified across both studies, with each array type identifying 95% (n=55) of the total events. This suggests that both arrays have high sensitivity for such events at or above this threshold, and that the combined results are very likely to represent the complete set of large de novo CNVs present in the SSC. This situation is reversed below 20 probes: a total of 31 small rare de novo CNVs were identified between the two groups with approximately twice as many found using the 2.1M Nimblegen array vs. the IM Illumina array (23 CNVs vs. 12 CNVs respectively) as would be expected for the increased probe resolution. Of these 31 events only 13% (n=4) were identified by both groups, suggesting that the sensitivity for small de novo events was low for both arrays and that, as expected, there is a pool of small de novo structural events that were not considered in our analyses.
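The overlap figures in the preceding paragraph follow from inclusion-exclusion, which the short Python sketch below checks. The function name is illustrative and does not come from the study; only the counts are taken from the text.

```python
def found_by_both(total_events, found_by_a, found_by_b):
    """Events detected on both arrays, by inclusion-exclusion."""
    return found_by_a + found_by_b - total_events

# Counts reported in the text for events spanning >=20 Illumina probes.
large_total, large_illumina, large_nimblegen = 58, 55, 55
large_shared = found_by_both(large_total, large_illumina, large_nimblegen)

# Counts reported in the text for events spanning <20 Illumina probes.
small_total, small_illumina, small_nimblegen = 31, 12, 23
small_shared = found_by_both(small_total, small_illumina, small_nimblegen)

print(f"large: {large_shared}/{large_total} shared, "
      f"each array found {large_illumina / large_total:.0%}")
print(f"small: {small_shared}/{small_total} shared "
      f"({small_shared / small_total:.0%})")
```

Under these counts, 52 of the 58 large events would have been seen on both arrays, against 4 of the 31 small events, which is the contrast the paragraph draws.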
In light of strong prior evidence for an increased burden of de novo CNVs in simplex autism (Itsara et al., 2010; Marshall et al., 2008; Pinto et al., 2010; Sebat et al., 2007), we investigated these events in probands versus their unaffected siblings in all 872 quartets in this study (Figure 1). A total of 28,610 rare, high-confidence CNVs were identified; 97 were classified as rare and likely de novo, and 83 events were confirmed to be rare de novo CNVs by qPCR in whole-blood DNA (Table S4).
Rare de novo CNVs were significantly more common among probands than siblings. Overall 5.8% of probands (N=51 out of 872) had at least one rare de novo CNV compared with 1.7% of their unaffected siblings (N=15 out of 872) yielding an odds ratio (OR) of 3.55 (CI =2.16-7.46, p=6.9 × 10−6, Fisher's exact test) (Table 1 and Figure 2). When we considered the proportion of individuals carrying at least one rare de novo CNV that also contains known genes (genic CNVs), the OR increased to 4.02 (50 in probands vs. 12 in siblings; CI =1.98-6.36, p=2 × 10−6). These results remained consistent regardless of whether we analyzed total numbers of CNVs as opposed to the proportion of individuals with at least one (Figure 2), or increased the stringency of the threshold for “rarity” (supplementary materials).
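As a rough illustration of how the headline odds ratio can be recomputed from the carrier counts above, the following Python sketch builds the 2×2 table (51 of 872 probands versus 15 of 872 siblings) and runs a two-sided Fisher's exact test using only the standard library. The function names are hypothetical, and the two-sided rule used here (summing all tables no more probable than the observed one) is an assumption; the study's exact procedure may differ, so the p-value need not match the reported figure digit for digit.

```python
from math import comb

def odds_ratio(a, b, c, d):
    """Sample odds ratio for the table [[a, b], [c, d]]."""
    return (a * d) / (b * c)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test: sum the hypergeometric
    probabilities of every table that is no more likely than
    the observed one (margins held fixed)."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def pmf(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = pmf(a)
    # Small relative tolerance guards against float round-off ties.
    return sum(p for p in (pmf(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

# Carriers / non-carriers of >=1 rare de novo CNV, from the text.
probands = (51, 872 - 51)
siblings = (15, 872 - 15)

or_estimate = odds_ratio(*probands, *siblings)
p_value = fisher_exact_two_sided(*probands, *siblings)
print(f"OR = {or_estimate:.2f}, two-sided p = {p_value:.2g}")
```

The sketch treats probands and siblings as independent groups; a matched-pairs analysis would be an alternative design choice for quartet data.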
Given the strong male predominance and increased rates of ASD in monogenic X-linked intellectual disability syndromes, we paid particular attention to rare de novo CNVs on the X chromosome but found only 2 events: one genic deletion present in a male at the gene DDX53 and a duplication involving 6 genes in a female sibling (Xq11.1). This small number precluded meaningful group comparisons. Importantly, neither these, nor any subsequent statistical results reported here were substantively altered by the exclusion of 15 confirmed rare de novo CNVs identified during our detection optimization experiments that did not then meet our minimum probe criteria (Table S4). Of note, however, one of these was an exonic deletion of NLGN3 on chromosome X in a male proband (Table S4).
This burden of rare de novo CNVs in simplex families is remarkably similar to previously published results (Table S1) despite varying CNV discovery approaches and array densities from 85,000 (Sebat et al., 2007) to 1 million probes (Pinto et al., 2010). We reasoned that this was likely due to the particular importance of large de novo events, as their detection would be least sensitive to differences in probe number and distribution. Indeed, we found that rare de novo CNVs in probands tended to be larger than in siblings (mean 1.6Mb vs. 0.7Mb) (Figure 2, Figure S2) and to include a greater number of genes (16-fold increase in probands, and a 29-fold increase considering only deletions).
In fact, we found that de novo CNVs in probands were both larger and contained a greater number of genes when these measures were considered independently. We fit a series of stepwise linear models that increased in complexity from individual predictors to an analysis of covariance model, with size and affected status as predictors, to a three-term model that included the interaction of size and affected status. We confirmed a significant difference between probands and siblings with regard to the number of genes within CNVs (estimated β=11.1 more genes in a proband's de novo CNV; p=0.025) even after accounting for the strong effect of the size of the event (estimated β=6.8 genes per Mb; p=1.1 × 10−9) (Figure 3A). Considering deletions and duplications separately did not alter these findings. In summary, the burden of rare de novo CNVs is greater in probands with regard to number, size, and gene content.
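The two richest models described above can be written out as follows. The notation is assumed here for illustration and is not taken from the source: $G_i$ is the number of genes in de novo CNV $i$, $S_i$ its size in Mb, and $A_i$ an indicator equal to 1 for probands and 0 for siblings.

```latex
\begin{align}
  G_i &= \beta_0 + \beta_{\text{size}}\, S_i + \beta_{\text{aff}}\, A_i + \varepsilon_i
      && \text{(analysis of covariance)} \\
  G_i &= \beta_0 + \beta_{\text{size}}\, S_i + \beta_{\text{aff}}\, A_i
         + \beta_{\text{int}}\, S_i A_i + \varepsilon_i
      && \text{(three-term model with interaction)}
\end{align}
```

In this notation the reported estimates correspond to $\beta_{\text{aff}} \approx 11.1$ genes and $\beta_{\text{size}} \approx 6.8$ genes per Mb.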
Our interest in identifying specific regions of the genome contributing to ASD led us to next investigate whether multiple overlapping de novo events were present in probands and then to compare these findings to siblings. In total, 23 probands were found to carry recurrent de novo CNVs in 6 separate regions of the genome. Each of these intervals contains from 2 to 11 de novo CNVs in unrelated probands; no de novo CNVs overlapping these regions were found in siblings. In contrast only a single recurrent de novo event was observed in siblings (16p13.11 in 2 unrelated siblings) and one CNV overlapping the region was also found in a proband (Figure 4).
The 6 regions found in probands included 7 deletions and 4 duplications at chromosome 16p11.2, 4 duplications at 7q11.23 (the Williams-Beuren syndrome region), and 2 CNVs each at 1q21.1 (2 duplications), 15q13.2-q13.3 (1 deletion, 1 duplication), 16p13.2 (2 duplications), and disrupting the gene Cadherin 13 (CDH13) at 16q23.3 (5Mb deletion and an overlapping 34kb exonic deletion).
The presence of multiple regions showing overlapping rare de novo CNVs restricted to probands, and the absence of similar findings in their sibling controls, is striking. However, in contrast to genome-wide common variant association studies, there is no widely accepted statistical approach or threshold to formally evaluate these results. Consequently, we set out to develop a rigorous method to assess the genome-wide significance of de novo events (methods). To do so, we determined the null expectation for recurrent rare de novo CNVs based on our data from unaffected siblings and then used this expectation to evaluate the p-value for finding multiple recurrences in probands.
Using this approach, the probability of finding 2 rare de novo CNVs at the same position in probands is 0.53. However, the observations of 4 recurrent de novo duplications at 7q11.23 (p= 7 × 10−6) and 11 recurrent de novo CNVs at 16p11.2 (p= 6 × 10−23) are both highly significant. In addition, we found that 16p11.2 deletions (N=7; p=2 × 10−14) and duplications (N=4; p=7 × 10−6) are strongly associated with ASD when considered independently (Figure S3).
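To give some intuition for why a single pair of overlapping events is unremarkable while larger pile-ups are not, the Python sketch below uses a deliberately simplified stand-in for the null model: events fall independently and uniformly into a fixed number of equally likely genomic bins (a "birthday problem"). This is not the authors' method, which derives the null expectation from the sibling data. The function name and all bin counts are hypothetical and chosen only for illustration.

```python
def prob_any_shared_bin(n_events, n_bins):
    """P(at least two of n_events land in the same bin),
    assuming independent, uniformly distributed events."""
    if n_events > n_bins:
        return 1.0  # pigeonhole: a collision is certain
    p_all_distinct = 1.0
    for i in range(n_events):
        p_all_distinct *= (n_bins - i) / n_bins
    return 1.0 - p_all_distinct

# Sanity check against the classic birthday problem.
print(f"23 people, 365 days: {prob_any_shared_bin(23, 365):.3f}")

# Hypothetical bin counts; 67 is the number of proband de novo
# CNVs mentioned later in the text.
for bins in (1_000, 3_000, 10_000):
    print(f"67 events, {bins} bins: {prob_any_shared_bin(67, bins):.2f}")
```

Even with thousands of bins, some coincidence among a few dozen events is likely under this toy model, which is why pairs alone carry little evidential weight.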
Prior studies have often found a combination of rare transmitted and de novo CNVs at ASD risk regions. In our data, we observed 8 loci at which rare transmitted CNVs, present only in probands, overlapped one of the 51 regions in probands containing at least one rare de novo CNV. Conversely, in siblings we did not observe any cases in which a rare transmitted CNV, restricted to siblings, overlapped one of the 16 regions showing de novo events. Interestingly, the 8 regions in probands showing overlapping rare de novo and rare transmitted CNVs include 5 of the 6 intervals with recurrent rare de novo variants, 1q21.1, 15q13.3, 16p13.2, 16p11.2, and 16q23.3 (Figure 4), and 3 additional genomic segments with 1 rare de novo event: 2p15, 6p11.2, and 17q12.
While the use of matched sibling controls should preclude any confound of population stratification, we explored whether genotype data from the parents of probands with 16p11.2 or 7q11.23 CNVs suggested unusual ancestral clustering (Crossett et al., 2010; Lee et al., 2009) pointing to a particular haplotype that might increase the frequency of de novo events. We found no evidence for this. In addition, given the very large number of 16p11.2 CNVs in this study and the widespread attention afforded previous findings at this locus, we considered the possibility of ascertainment bias. A review of medical histories obtained at the time of recruitment revealed that parents had prior knowledge of a 16p11.2 CNV in 2 instances (1 de novo duplication, 1 transmitted deletion). Nonetheless, with these events removed, association of both deletions and duplications remains significant (p=3 × 10−19 all de novo events (N=10); p=2 × 10−14 deletions (N=7); p=0.002 duplications (N=3)) (Figure S4).
The identification of multiple recurrent de novo events restricted to probands, and the absence of similar observations in siblings, led us to consider what these findings might indicate about the overall number of CNV-mediated ASD risk loci that are present in the genome. Consequently, we used the distribution of 67 de novo CNVs identified in SSC probands to calculate the number of regions likely to be contributing large rare de novo risk variants and estimated 130 loci (methods).
We then evaluated the implications of this likely genomic architecture for the planned second phase of genotyping and CNV analysis in the SSC, which is currently underway. We used the estimated number of predicted ASD loci to guide a simulation experiment (supplementary methods) and found that the most likely outcome of a second SSC cohort of similar composition and size to that reported here will be further confirmation of 7q11.23 and 16p11.2 and the identification of 2-3 additional regions of significant association. These were most likely to emerge at the intervals already identified containing recurrent de novo events in phase 1, namely 1q21.1, 15q13.2-13.3, 16p13.2, and the CDH13 locus.
Given highly reliable phenotypic data and long standing interest in the role of sex in ASD risk and resilience, we investigated whether males or females carried quantitatively different types of rare de novo events and what impact rare de novo CNVs had on intellectual and social functioning in both groups.
We found little evidence for larger or more gene rich de novo CNVs in males versus females. By fitting a series of stepwise linear models, we evaluated whether the number of genes within a de novo CNV tended to differ after accounting for a critical covariate, CNV size. Neither sex (p=0.20) nor the interaction of size and sex (p=0.06) was a significant predictor of gene number. These results should be viewed with some caution, given the trend toward significance and a relatively small sample size (Figure 3B).
In contrast, we found that male intellectual functioning was relatively more vulnerable to the effects of rare de novo CNVs. Again using a series of stepwise linear models we evaluated the relationship between intellectual functioning, sex, and the number of genes within rare de novo CNVs. For males, there was a significant relationship between IQ and number of genes (p=0.02), with the model predicting a decrease of 0.42 IQ points for each additional gene. In contrast, for females the estimated effect was ten-fold less and did not approach significance (Figure 3D).
To evaluate whether low IQ predicted if a proband carried a de novo CNV, we fit a logistic regression model with de novo CNV status for probands as the outcome and full-scale IQ as the predictor. We found the accuracy of prediction was quite low (Nagelkerke pseudo R2=0.014). Overall, while the odds of carrying a de novo CNV varied three-fold for those with the lowest versus the highest IQ, the odds were never large (0.111 at IQ=30, 0.063 at IQ=80, and 0.036 at IQ=130). This relationship did not differ significantly by sex (interaction of IQ and sex, p=0.12).
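The three odds quoted above can be turned into probabilities and an implied log-odds slope with a few lines of Python. The sketch uses only the reported odds; the function names are illustrative, and the slope is back-calculated from the rounded published values rather than taken from the fitted model.

```python
from math import log

# Odds of carrying a de novo CNV at three IQ values, from the text.
reported_odds = {30: 0.111, 80: 0.063, 130: 0.036}

def odds_to_probability(odds):
    """Convert odds p/(1-p) back to a probability p."""
    return odds / (1.0 + odds)

def log_odds_slope(iq_a, iq_b, odds):
    """Change in log-odds per IQ point between two IQ values."""
    return (log(odds[iq_b]) - log(odds[iq_a])) / (iq_b - iq_a)

for iq, odds in reported_odds.items():
    print(f"IQ {iq}: odds {odds:.3f} -> probability "
          f"{odds_to_probability(odds):.3f}")

fold_change = reported_odds[30] / reported_odds[130]
slope_low = log_odds_slope(30, 80, reported_odds)
slope_high = log_odds_slope(80, 130, reported_odds)
print(f"fold change in odds, IQ 30 vs 130: {fold_change:.2f}")
print(f"log-odds slope per IQ point: {slope_low:.4f}, {slope_high:.4f}")
```

The two slope estimates agree closely, as expected for a model that is linear in log-odds, and the highest implied probability is only about 10%, consistent with the statement that the odds were never large.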
Finally, we investigated the relationship between IQ, sex, and number of genes within rare de novo CNVs to determine if any of the models significantly predicted ASD severity (measured by the ADOS combined severity score (CSS)); of these only full-scale IQ predicted ASD severity (p=0.02).
Overall, the data show a strong effect of large rare genic de novo CNVs on affected status, but do not support either IQ or ASD severity as useful predictors for probands carrying large rare de novo risk variants in the SSC (Figure 3C). We did observe a trend toward more gene rich de novo CNVs in females and found females to be less vulnerable to the reduction in IQ associated with rare de novo CNVs.
We next investigated whether individuals with recurrent CNVs at 16p11.2 or 7q11.23 showed distinctive behavioral or cognitive profiles compared with probands who were not carrying rare de novo events. For each proband carrying a de novo CNV at 16p11.2 or 7q11.23, five other probands were selected as controls based on hierarchical matching criteria: first age, then sex, genetic distance, ascertainment site, and whether the sample was from a quartet or trio.
Our primary analysis focused on 4 variables: full-scale IQ, categorical diagnosis, severity of autism, and body mass index (BMI) (Table 2), with the latter motivated by multiple reports that 16p11.2 deletions (Bijlsma et al., 2009; Walters et al., 2010) contribute to obesity and the recent observation that duplications have the opposite impact on weight (Reymond et al., 2010). We then pursued a broader exploratory study of additional phenotypic variables, 10 of which are presented in Table 2 and the remainder in Table S5.
We found that probands carrying a 16p11.2 or 7q11.23 de novo CNV were indistinguishable from the larger group with regard to IQ, ASD severity, or categorical autism diagnosis (Table 2). However, we did find a relationship between body weight and 16p11.2 deletions and duplications. When we treated copy number as an ordinal variable (1, 2, and 3 copies), and used the matched controls as the diploid sample, BMI diminished as 16p11.2 copy number increased (estimated β=−3.1kg/m2 for each extra copy, p=0.02).
The extensive phenotypic data available on the SSC sample provides great potential to undertake fine-grained analyses of genotype-phenotype relationships. At present the limiting factor with regard to recurrent de novo CNVs is the small sample size, even for 16p11.2 duplications and deletions in this dataset. However, we undertook an exploratory analysis of a range of phenotypic features and found several that yielded significant p-values. While none would survive correction for multiple comparisons, we report them here (Table 2, Table S5) in the interest of generating hypotheses for future studies. For example, individuals with 16p11.2 duplications had higher hyperactivity scores, compared to matched control probands, while probands carrying 7q11.23 duplications showed significantly more behavioral problems (ABC total), but less severe social and communication impairment during ADOS administration.
Given the very strong association of rare de novo CNVs, we were somewhat surprised to find that rare transmitted CNVs were not present in greater numbers in affected individuals or in a greater proportion of probands versus siblings. As prior publications have shown an increased burden of specific subsets of CNVs in neuropsychiatric disorders including autism and schizophrenia, we considered multiple subcategories of rare transmitted events as well, including genic, exonic, brain-expressed, and ASD-related, and did not find a statistically significant result that survived correction for multiple comparisons (Figure 5).
These findings stood in contrast to a recent rigorous large-scale CNV study undertaken by the Autism Genome Project (AGP) (Pinto et al., 2010). Their sample included both simplex and multiplex families and identified a significantly higher burden of genic and ASD-related CNVs in cases versus unrelated controls. Notably they did not differentiate between transmitted and de novo events in this analysis. We reanalyzed our data using the identical criteria detailed in their manuscript and found similar results (Table S6). However, when we again restricted the analysis of our sample to rare transmitted CNVs by removing confirmed rare de novo events, there was no significant difference between probands and siblings, suggesting that the excess burden in the SSC sample was entirely driven by rare de novo events.
We pursued this analysis further given strong evidence that certain rare transmitted CNVs carry ASD risk, as well as reports of particularly significant effects for maternal transmission of rare CNVs to male probands (Zhao et al., 2007). Consequently, we investigated whether mothers were more likely than fathers to transmit a rare CNV to an affected offspring. We also asked whether there were a greater number of maternally transmitted CNVs in probands versus their unaffected siblings. Neither analysis showed a significant result after correction for multiple comparisons despite considering combinations of the following variables: deletions, duplications, size, exonic, brain-expressed, and ASD-related. In addition, based on the possibility that risk might be confined to only the rarest transmitted events, presumably under the strongest purifying selection, we evaluated “singleton” CNVs, i.e. those observed in only one parent and transmitted to only one proband or sibling. In this case, we found a modest, non-significant excess of maternally transmitted CNVs in probands: 344 maternal autosomal singletons are transmitted to probands vs. 303 transmissions to siblings (OR= 1.14; p=0.059 one-sided; p=0.12 two-sided). For fathers, there was no similar trend (OR= 1.03; p=0.37 one-sided).
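The singleton transmission counts above lend themselves to a quick check. The source does not state which test produced the quoted p-values; the Python sketch below assumes an exact binomial test against equal transmission to probands and siblings, and takes the quoted OR to be the simple ratio of transmission counts. Both are assumptions made for illustration, and the function name is hypothetical.

```python
from math import comb

def binomial_upper_tail(k, n):
    """Exact P(X >= k) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

# Maternal autosomal singleton transmissions, from the text.
to_probands, to_siblings = 344, 303
n_transmissions = to_probands + to_siblings

ratio = to_probands / to_siblings
p_one_sided = binomial_upper_tail(to_probands, n_transmissions)
p_two_sided = min(1.0, 2 * p_one_sided)  # symmetric null, so double

print(f"ratio = {ratio:.2f}")
print(f"one-sided p = {p_one_sided:.3f}, two-sided p = {p_two_sided:.3f}")
```

Under these assumptions the ratio rounds to 1.14 and the p-values fall in the same range as those reported, which supports reading the result as a modest, non-significant excess.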
We asked similar questions regarding transmission of rare X-linked CNVs from mothers to male probands and obtained similar results. In a group of 353 male probands and 353 matched male siblings, we found, contrary to expectations, that more siblings carried maternally transmitted rare CNVs than probands (14% probands vs. 18% siblings, OR=0.76, p=0.11), though this difference was not significant. The result did not change when we evaluated the various subcategories of rare X-linked CNVs including exonic, deletions, duplications, size, brain-expressed, or ASD-associated.
We hypothesized that the absence of evidence of association for rare transmitted CNVs might be a consequence of the inability to differentiate functional from neutral variants. Consequently we looked to pathway analyses to help address this question, reasoning that if the specific genic content of CNVs contributed to disease risk, we would find a greater enrichment of biological pathways in probands versus siblings.
We used two gene ontology and pathway analysis tools, MetaCore from GeneGo Inc. and DAVID (Dennis et al., 2003; Huang et al., 2009), to analyze 1,516 genes within CNVs exclusive to probands and 1,357 genes exclusive to siblings. The total number and size of rare, transmitted CNVs used to determine these gene sets were highly similar in probands and siblings (Figure 5). GeneGo Networks identified 22 pathways showing significant enrichment in probands versus only 4 enriched pathways among siblings. This difference was significant based on 100 permutations of the dataset (p=0.04). DAVID yielded consistent results with 59 pathways enriched in probands and 19 in siblings (p=0.01, permutation analysis) (Figure 6).
For the current analysis, we elected to restrict our evaluation of pathways to the general question described here. A manuscript describing a more extensive pathway analysis is in preparation focusing on both structural and gene expression data from the SSC.
We next examined all rare CNVs in the SSC in light of previously reported findings, comparing our data to the list of ASD regions included in the recent AGP analysis (Pinto et al., 2010). We also considered recent common variant findings, including SEMA5A (Weiss et al., 2009), MACROD2 (Anney et al., 2010), CDH9 and CDH10 (Wang et al., 2009), the MET oncogene (Campbell et al., 2006), EN2 (Gharani et al., 2004), and selected schizophrenia loci (ISC, 2008; McCarthy et al., 2009; Millar et al., 2000; Stefansson et al., 2008; Walsh et al., 2008; Xu et al., 2008) (Table 3). We identified multiple regions in which rare transmitted and/or rare de novo events corresponded to previously characterized loci in both ASD and schizophrenia.
Finally, we looked for evidence of association for any other CNVs in the SSC sample, evaluating all high-confidence autosomal CNVs together with all confirmed de novo CNVs. In this instance, we did not use a frequency cutoff to define a set of rare transmitted events. A total of 3,667 recurrent regions were identified; 6 showed relative enrichment in probands and 5 in siblings. No result reaches significance after correction for multiple comparisons (Table S7, Figure 7C). The region showing the greatest difference in probands compared to siblings was 16p11.2 (p=0.001).
Our approach to assessing the genome-wide significance of rare recurrent de novo CNVs allows for a statistical evaluation of events observed in cases without requiring additional matched control samples. Consequently, we were able to conduct a cumulative analysis across multiple studies in search of additional associated ASD loci. We included 4 other large-scale ASD CNV studies (Itsara et al., 2010; Marshall et al., 2008; Pinto et al., 2010; Sebat et al., 2007) meeting 4 criteria: standardized diagnosis, genome-wide detection, confirmed de novo structural variations, and sufficient information to permit the identification of duplicate samples.
These datasets catalogued 228 confirmed, rare de novo CNVs from a total of 3,816 individuals (Table S1). We found 6 regions that exceeded the threshold for significance (methods). Given prior evidence, and our own data, that reciprocal deletions and duplications at certain loci both contribute to the ASD phenotype, we evaluated significance for combined events at an interval and also calculated probabilities for deletions and duplications separately (Table 4, Figure S3).
The most frequent recurrent de novo CNV identified across all studies was 16p11.2 with 19 identified probands (14 deletions, 5 duplications) showing extremely strong evidence for association with ASD (2 × 10−55 combined; 5 × 10−29 for deletions; 2 × 10−5 for duplications). The proximal long arm of chromosome 15 showed two contiguous intervals: the first corresponds to the region 15q11.2-13.1 or BP2-BP3 (7 duplications; 4 × 10−9) (Figure 7A), long cited as the most common cytogenetic abnormality identified in idiopathic ASD (Cook et al., 1997). We also found evidence of association for the interval mapping to 15q13.2-13.3 or BP4-BP5 (5 duplications and 1 deletion; 1 × 10−4 combined, 2 × 10−5 for duplications) (Figure 7B). Rare deletions and duplications in this region have previously been associated with intellectual disability and ASD, and deletions with schizophrenia and epilepsy (Figure 7). It is important to note, however, that considering only events restricted to 15q13.2-13.3 (i.e. removing 3 overlapping isodicentric chromosome 15 events) we do not find significance (0.53 combined; 0.88 for duplications). This suggests either that the result is an incidental finding due to the proximity to a true ASD risk locus, or, alternatively, that the smaller 15q13.2-13.3 CNVs might point to a minimum region of overlap mapping to one or more ASD-related genes.
Recurrent de novo CNVs exceeding the significance threshold in the combined sample were also present at 7q11.23 (4 duplications; 0.003), the 22q11.2 region (3 deletions and 2 duplications; 0.002 combined; 0.11 for deletions; 0.88 for duplications), and at the locus coding for the gene NRXN1. For NRXN1 there were 5 de novo events: 1 intronic deletion, 3 exonic deletions, and 1 exonic duplication (0.002 combined, 0.004 for deletions).
Finally, we used the observed number and distribution of de novo CNVs in the combined proband data set to estimate the likely number of CNV regions contributing to ASD. From the total of 219 confirmed de novo events, we derived an estimate of 234 distinct genomic regions contributing to large ASD-related de novo structural variations (methods).
Our results highlight the importance of rare CNVs for simplex ASD. We confirm an over-representation of rare de novo events in probands versus siblings with an odds ratio of 3.55 for all variants and 4.02 for rare de novo genic variants. Using a novel approach to assessing significance specifically for recurrent de novo CNVs, we find very strong evidence for the contribution of duplications at 7q11.23, showing, for the first time, genome-wide association in a case-control study. Moreover, we identify four additional rare recurrent de novo events restricted to probands. Two of these, 16p13.2 and the CDH13 locus, are novel ASD loci and two, 1q21 and 15q13.2-13.3, have been previously implicated in neuro-developmental disorders including ASD. Each of these four regions also shows rare transmitted CNVs exclusive to probands. Finally, we find compelling evidence confirming the association of both 16p11.2 duplications and deletions.
It is striking that while we replicate findings of elevated rates of rare de novo CNVs in simplex families (5.8% of probands versus 1.7% in siblings), the percentage of the cohort carrying these events is of the same magnitude as that seen previously. This is despite the intensive focus on the ascertainment of simplex quartets and the 10-fold increase in probe density since the earliest studies of ASD. We believe these results are best explained by the particular contribution of large genic de novo variants given the results of our analysis of gene number, CNV size, and affected status (Figure 3), and the observation of generally consistent results over time despite steadily increasing detection resolution.
While it may not seem surprising that large de novo events carry the greatest risk for developmental disorders, it is interesting to note that we did not find evidence that ASD diagnosis or severity was mediated by intellectual disability (ID). It has been argued that ASD in the presence of ID may reflect an epiphenomenon, in which a non-specific impairment of brain functioning unmasks and/or exacerbates limitations in an individual's capacity for social reciprocity (Skuse, 2007). It has also been widely held that the detection of large de novo CNVs will be enhanced by the ascertainment of ASD samples with greater intellectual disability. Our data show that large de novo CNVs confer substantial risk for ASD in the SSC, but they are only modestly correlated with lower IQ and largely independent of ASD severity.
These data suggest both that this study has identified bona fide high-risk variants for autism spectrum disorders and that many of these loci also confer liability to a range of complex neurobehavioral phenotypes. They also suggest a more complex relationship of IQ and large de novo events than is often supposed: for example the relatively high rates of 16p11.2 and 7q11.23 CNVs and low rates of 15q11.2-13.1 duplications seen in this study compared to others may reflect particular subpopulations of rare de novo risk CNVs that are more readily ascertained in cohorts with higher mean IQ.
The results further show that the risk associated with large de novo events is related to their greater genic content, even after controlling for larger size. This observation points to two countervailing hypotheses: first, that the greater gene number is a surrogate for the increased chance of disrupting one particular gene or regulatory region due to the involvement of a larger segment of the coding genome; or second, that it is the contribution of multiple genes and/or regulatory regions simultaneously within these CNVs that increases risk.
Our data do not allow us to resolve this issue. Nonetheless we suspect that if many deletions or duplications encompassing small numbers of genes were as highly penetrant as multigenic events, we would have begun to show more evidence for this either in the form of an overall increased burden for smaller de novo variations and/or association of specific de novo events. However, it is important to note that despite higher resolution than some prior studies, we nonetheless have a clear ascertainment bias for detection of larger CNVs. It is likely that the combination of high-throughput sequencing, larger patient cohorts, and increasingly sophisticated approaches to evaluating combinations of risk variants will begin to shed light on this issue, with regard to both sequence and structural variation.
Our findings with regard to recurrent de novo events in the SSC sample point to 6 putative ASD loci: two of these, 7q11.23 and 16p11.2, show clear evidence for genome-wide association. Moreover, our simulation analysis suggests that the most likely outcome of the ongoing Phase 2 SSC study will be confirmation of 2-3 of the remaining 4 intervals already showing recurrent de novo events, namely 1q21.1, 15q13.2-13.3, 16p13.2, and 16q23.3 (CDH13).
Our findings at 7q11.23 point to extraordinary opportunities to illuminate the molecular mechanisms of social development. Duplications in this interval have previously been described in developmental disorders, including ASD (Berg et al., 2007; Van der Aa et al., 2009), though these have been restricted to case reports or series, with the attendant difficulties in controlling for ascertainment bias. The identification of clear association of duplications in this controlled study of ASD is particularly striking given that the reciprocal deletion results in a developmental syndrome characterized in part by an empathic, gregarious, and highly social personality (Pober, 2010). Moreover, several lines of evidence, including atypical deletions (Antonell et al., 2010), mouse models (Fujiwara et al., 2006; Hoogenraad et al., 2002; Meng et al., 2002; Sakurai et al., 2010), and gene expression x phenotype studies (Gao et al., 2010; Korenberg et al., 2000) have already identified CAPGLY domain containing linker protein 2 (CLIP2), LIM domain kinase 1 (LIMK1), General transcription factor II, i (GTF2i), and Syntaxin 1A (STX1A) as the leading candidates among the 22 genes within the region for involvement in the cognitive and social phenotypes. The characterization of this single region in which opposite changes in copy number contribute to contrasting social phenotypes promises to set the stage for a range of interesting studies of the role of gene dosage within this interval and the genesis of social mechanisms.
The strong replication of findings at 16p11.2 also highlights emerging opportunities for translational neuroscience. Firstly, the region is sufficiently circumscribed to interrogate using molecular biological and model systems. Secondly, though we cannot quantify an odds ratio from our data, given the absence of events in siblings, there is clear evidence from this and prior studies (McCarthy et al., 2009) that 16p11.2 CNVs carry much larger effects than any common variant contributing to complex common disorders. Thirdly, the 1% allele frequency allows for prospective studies of natural history, neuroimaging, and treatment response, as, for example, in the recently launched Simons Variation in Individuals Project (https://sfari.org/simons-vip). Finally, the entirety of the data now strongly supports a role for both duplications and deletions of 16p11.2 in social disability. Together, these suggest that cross-disciplinary approaches can begin to address the means by which a single locus leads to a range of psychiatric outcomes previously conceptualized as distinct and to address the critical role that dosage sensitivity plays in the unfolding of these neuro-developmental phenotypes.
The notion that the 4 remaining recurrent de novo regions represent true ASD variants is supported by multiple lines of additional evidence. For example, they are among only 8 rare de novo CNVs that overlap with rare transmitted events restricted to probands. Moreover, 2 of the remaining 3 loci, 2p15 and 17q12, have been previously implicated in ASD (Liang et al., 2009; Moreno-De-Luca et al., 2010). In addition, rare 1q21.1 and 15q13.2-13.3 CNVs have been identified in developmental and neuropsychiatric syndromes, with deletions found in ASD (Miller et al., 2009; Shen et al., 2010), schizophrenia (ISC, 2008; Stefansson et al., 2008), idiopathic epilepsy (Helbig et al., 2009), and recurrent duplications reported here. CDH13 has not previously been noted to be a risk variant, but the larger family of proteins has been implicated in ASD pathogenesis through CNV studies (Glessner et al., 2009), homozygosity mapping (Morrow et al., 2008), common variant findings (Wang et al., 2009), and our pathway analysis (Figure 6). The 16p13.2 region contains four genes, the most immediately notable of which are C16orf72, coding for a protein of unknown function, recently identified in a schizophrenia CNV study (Levinson et al., 2011), and Ubiquitin Specific Peptidase 7 (USP7), which has been shown to have a role in oxidative stress response, histone modification and regulation of chromatin remodeling (Khoronenkova et al., 2010). Both would represent novel ASD risk genes; however, for the latter, these biological processes, and the ubiquitin pathway in particular, have been previously implicated in ASD pathogenesis (Glessner et al., 2009).
The fact that the family-based design used in our study played a key role in allowing us to identify association presents an important contrast to the prevailing wisdom with regard to genome-wide association studies of common variants, in which there is a tendency to rely on unrelated case-control designs, given the relative ease of generating very large sample sizes. It is notable that the statistical power afforded by the low probability of observing multiple recurrent rare de novo events by chance more than compensated for the comparatively small sample reported here. This is particularly striking with respect to 16p11.2. Based on a traditional case-control comparison, the most significant finding in this sample, 14 events in probands and 0 in siblings (p=0.001, Fisher's exact test), did not provide evidence sufficient to withstand correction for multiple comparisons, while the analysis based on the null expectation for de novo recurrence clearly detected association. It is certain that the SSC sample ascertainment process enhanced certain findings and attenuated others. There is little question that restricting the comparison group to matched siblings limited power to identify association of specific rare recurrent transmitted events; our assessment of significance for de novo CNVs was based on conservative assumptions and may have excluded true risk loci; the filtering for rare de novo CNVs and the small sample size curtailed the assessment of multi-hit hypotheses; the generally older parental age may have obscured the relationship between age and de novo variation (Figure S3) and, as noted, poor specificity at the lower bound of detection limited our assessment of small de novo structural variations.
However, despite these limitations, the manner in which the design mitigated important confounds, and preserved sufficient power to detect association of recurrent de novo events, yielded clear benefits, unambiguously replicating prior findings and identifying novel risk loci. Moreover, this report considers less than half of the Simons Simplex Collection: phase 2 of this study is currently underway, as is high-throughput sequencing of the collection, also focusing on de novo events. Together these endeavors promise to further illuminate the genomic architecture of simplex autism and to provide additional critical points of traction for efforts toward elaborating the molecular mechanisms and developmental neurobiology underlying ASD.
All members of each family were analyzed on the same array version: either the Illumina 1Mv1 (334 families) or Illumina 1Mv3 Duo (840 families) Bead-array. These share 1,040,853 probes in common (representing 97% of probes on the 1Mv1 and 87% of probes on the 1Mv3). 824 of the 872 quartet families (94.5%) had all members hybridized and scanned simultaneously on the Illumina iScan in an effort to minimize batch effects and technical variation.
Genotyped samples were analyzed using Plink (Purcell et al., 2007) to identify incorrect sex, Mendelian inconsistencies, and cryptic relatedness by assessing identity-by-descent (IBD); 11 families were removed as a result.
CNV detection was performed using three algorithms: 1) PennCNV Revision 220, 2) QuantiSNP v1.1, and 3) GNOSIS. PennCNV and QuantiSNP are based on the Hidden Markov Model (HMM). GNOSIS uses a continuous distribution function (CDF) to fit the intensity values from the HapMap data and determine thresholds for significant points in the tails of the distribution that are used to detect copy number changes. Analysis and merging of CNV predictions was performed with CNVision (www.CNVision.org), an in-house script.
Specific genotyping and CNV parameters are detailed in the supplementary methods. 5% of the samples failed and were rerun; 39 families were removed due to repeated failures.
A CNV was classified as rare if ≤50% of its length overlapped regions present at >1% frequency in the Database of Genomic Variants (DGV) March 2010.
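The rarity rule can be illustrated with a minimal Python sketch. It assumes that the common regions (those present at >1% frequency) have already been extracted for the relevant chromosome as half-open (start, end) intervals. The function names and the coordinates in the example are hypothetical and are not taken from the study's pipeline.

```python
def overlap_fraction(cnv, common_regions):
    """Fraction of a CNV's length covered by common regions on the same chromosome.

    Intervals are (start, end) with end exclusive. Overlapping common regions
    are effectively merged, so no base is counted twice.
    """
    start, end = cnv
    if end <= start:
        raise ValueError("CNV must have positive length")
    # Clip every common region to the CNV and discard those that miss it.
    clipped = sorted(
        (max(s, start), min(e, end))
        for s, e in common_regions
        if s < end and e > start
    )
    covered = 0
    cursor = start  # right edge of the bases already counted
    for s, e in clipped:
        s = max(s, cursor)
        if e > s:
            covered += e - s
            cursor = e
    return covered / (end - start)


def is_rare(cnv, common_regions, max_overlap=0.5):
    """Apply the rule: rare if at most 50% of the CNV overlaps common regions."""
    return overlap_fraction(cnv, common_regions) <= max_overlap


if __name__ == "__main__":
    # Hypothetical coordinates, for illustration only.
    cnv = (100, 200)
    common = [(50, 120), (110, 140), (190, 300)]
    print(overlap_fraction(cnv, common), is_rare(cnv, common))
```

Merging the clipped intervals before summing matters because database entries often overlap one another; summing them naively would overstate the covered fraction.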
Burden analyses were performed on the matched set of 872 probands and siblings. Typically, three outcomes were assessed: proportion of individuals with ≥1 CNV matching the criteria (p-value calculated with Fisher's exact test); number of CNVs matching the criteria (p-value calculated with sign test); number of RefSeq genes within or overlapping CNVs matching the criteria (p-value calculated with Wilcoxon paired test). Where burden was assessed for unequal numbers of probands and siblings (e.g. by sex) the sign test and Wilcoxon paired test were replaced with the Wilcoxon test.
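Two of the three burden tests can be sketched with the Python standard library alone. This is an illustration and not the analysis code used in the study: the function names are hypothetical, the sign test is assumed to be two-sided with tied pairs dropped, and the counts in the example are invented for demonstration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows could be probands and siblings; columns, carriers and non-carriers.
    The p-value sums the probabilities of all tables with the same margins
    that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x carriers in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def sign_test_two_sided(n_pos, n_neg):
    """Two-sided sign test on matched pairs, ties already excluded.

    n_pos: pairs in which the proband has more CNVs than the sibling.
    n_neg: pairs in which the sibling has more.
    """
    n = n_pos + n_neg
    k = min(n_pos, n_neg)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Invented counts, for illustration only.
    print(fisher_exact_two_sided(3, 1, 1, 3))
    print(sign_test_two_sided(5, 0))
```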
To determine the probability of finding multiple rare de novo CNVs at the same location in probands, we first estimated how many likely positions in the genome were contributing to the observed de novo CNVs in siblings. As there are widely varying mutation rates for structural variation across the genome (Fu et al., 2010), some positions are more likely to result in de novo CNVs observed in our sample than others. Consequently, the likely number of positions is much smaller than the total possible number of positions. We refer to the likely CNV regions as eCNVRs (effective copy number variable regions) and calculate their quantity “C” using the so-called “unseen species problem” which uses the frequency and number of observed CNV types (or species) to infer how many species are present in the population. Based on the observed de novo CNVs in the control sibling group, we apply the formula (Bunge and Fitzpatrick, 1993) C = c/u + g^2 × d × (1 − u)/u, in which c = the total number of distinct species observed; c1 = the number of singleton species; d = total number of CNVs observed; g = the coefficient of variation of the fractions of CNVs of each type, and u = 1 − c1/d. (In this calculation, due to the small number of observations, we assume that g equals 1.) For the de novo events in siblings, c1 = 14, c = 15, d = 16 and C = 232. This calculation is performed in the siblings because the observed rare de novo CNVs in this group are assumed to be predominantly non-risk variants and consequently represent the null distribution.
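As a worked check of this estimator, the following Python sketch evaluates the formula for the sibling counts given above. The function name is hypothetical, and g is fixed at 1 by default, matching the stated assumption.

```python
def estimate_ecnvr(c, c1, d, g=1.0):
    """Unseen-species estimate of the number of eCNVRs.

    c  : number of distinct CNV types ("species") observed
    c1 : number of types observed exactly once (singletons)
    d  : total number of CNVs observed
    g  : coefficient of variation of the type frequencies (assumed 1 here)
    """
    u = 1.0 - c1 / d  # estimated sample coverage
    if u <= 0:
        raise ValueError("all observations are singletons; estimate undefined")
    return c / u + g ** 2 * d * (1.0 - u) / u


if __name__ == "__main__":
    # Sibling de novo CNVs as reported in the text: c1=14, c=15, d=16.
    print(estimate_ecnvr(c=15, c1=14, d=16))  # 232.0
```

With 14 of 16 events being singletons, the coverage term u is only 0.125, which is why the estimate is so much larger than the 15 regions actually observed.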
Next, we calculate the chance that two de novo events match at any one of “C” eCNVRs in probands, using methods from the classic “birthday problem” which assess the likelihood of seeing at least one pair of matching birthdays among a given number of people. Our interest was in seeing >2 matches (m) in probands under the null hypothesis of no association with ASD. This calculation is performed empirically by distributing “d” events at random among “C” eCNVRs and then counting the maximum number of CNVs falling in the same location. Repeating this experiment many times, we obtained an estimate of the probability of finding “≥m” counts for ≥1 eCNVR under the null hypothesis.
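The empirical procedure described here can be sketched as a small Monte Carlo simulation in Python. This is an illustration under stated assumptions rather than the study's implementation: events are placed uniformly at random among the C eCNVRs, the function name, trial count, and seed are hypothetical choices, and the values of d and m in the example are illustrative.

```python
import random


def prob_max_recurrence(d, C, m, trials=10000, seed=0):
    """Estimate P(at least one eCNVR receives >= m events) under the null.

    d : number of de novo events to distribute
    C : number of equally likely eCNVRs
    m : recurrence count of interest
    """
    if d < 0 or C < 1:
        raise ValueError("need d >= 0 and C >= 1")
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        counts = {}
        largest = 0
        for _ in range(d):
            region = rng.randrange(C)
            counts[region] = counts.get(region, 0) + 1
            if counts[region] > largest:
                largest = counts[region]
        if largest >= m:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Illustrative inputs: 67 events among C=232 eCNVRs, recurrence of 4 or more.
    print(prob_max_recurrence(d=67, C=232, m=4))
```

Very small probabilities need correspondingly more trials to resolve: with 10,000 trials, nothing much below 1 in 10,000 can be distinguished from zero.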
Given the importance of the estimate of eCNVRs in unaffected populations for the determination of significance, we re-calculated “C” based on a combined set of confirmed de novo CNVs in controls described in the literature and obtained a highly similar result (C=242) (supplementary materials). Moreover, we determined that the results reported here remain significant under the plausible range of estimates for “C” (supplementary materials).
The unseen species problem was used to predict the total number of ASD risk loci based on the distribution of de novo CNVs in probands. This required identification of the de novo CNVs that confer risk; to identify such CNVs we estimated that 75% of de novo CNVs in probands confer risk ([67 de novo CNVs in probands − 16 de novo CNVs expected in siblings] / 67 de novo CNVs in probands) and assumed that recurrent de novo CNVs were most likely to be associated with risk and should be included within this 75%. The remainder of the 75% is made up of 27 single occurrence de novo CNVs (though we do not identify which ones) leading to an estimate of the total number of risk-conferring loci as 130 (c1=27, c=33, d=51). A similar approach was applied to all de novo CNVs in 3,816 probands (count derived from the literature), leading to an estimate of 234 risk-conferring loci (c1=59, c=88, d=158).
Predictors were examined in a logical order, e.g. to evaluate the relationship between gene number (G), CNV size (L), and affection status (A, proband vs. sibling), we fit a series of increasingly complex linear models in the following steps: (1) regress response G on predictor L, regress G on A; (2) if ≥1 term was significant, and assuming L had the best predictive power, we regressed G on L and A; (3) assuming L and A were significant jointly, we regressed G on L, A and L*A (L interacting with A). The latter term permits the slope of the relationship between G and L to differ for probands vs. siblings. In each step, we determined if the newest term was significant, given the terms already in the model. We also fit the model using backward elimination, starting with the full model and simplifying it one term at a time.
All parents were projected onto a five-dimensional ancestry map using eigenvector decomposition (Crossett et al., 2010; Lee et al., 2009). Euclidean distances were measured for the parent-of-origin. The mean and median distances between these pairs of parents were calculated and were evaluated relative to the remainder of the sample using a bootstrap procedure (supplementary methods).
For each sample with a 16p11.2 deletion (8 samples) or duplication (6 samples) or 7q11.23 duplication (4 samples) 5 control probands were selected based on a matching hierarchy: age (100% of control probands matched), sex (100%), genetic distance (91%, based on five-dimensional ancestry map), collecting site (46%), and quartet/trio family (34%). Probands with de novo CNVs or CNVs in regions previously associated with ASD were removed prior to matching; each control proband was only included once.
For continuous variables each stratum of a “case” proband matched to 5 “control” probands was treated as a block and the data analyzed as a randomized block design by using analysis of covariance. Thus mean values were allowed to vary across blocks and to be altered by case-control status. The difference due to the presence of the CNV of interest was assessed with an F-test with N, M degrees-of-freedom (N is the number of CNVs of interest and M is the residual degrees-of-freedom after accounting for model terms). Because IQ is known to affect many behavioral measures associated with ASD, it was treated as a covariate in models for outcomes besides itself and Body Mass Index (BMI). For diagnostic status, matching was taken into account by using a conditional logit model.
We are most grateful to all of the families at the participating SFARI Simplex Collection (SSC) sites. This work was supported by a grant from the Simons Foundation (SFARI 124827). C.A.W. and R.P.L. are Investigators of the Howard Hughes Medical Institute. We wish to thank: the SSC principal investigators: (A. Beaudet, R. Bernier, J. Constantino, E. Cook, E. Fombonne, D. Geschwind, D. Grice, A. Klin, D. Ledbetter, C. Lord, C. Martin, D. Martin, R. Maxim, J. Miles, O. Ousley, B. Peterson, J. Piggot, C. Saulnier, M. State, W. Stone, J. Sutcliffe, C.A. Walsh, E. Wijsman); the coordinators and staff at the SSC sites; the SFARI staff (M. Greenup and S. Johnson); R. Smith and Z. Galfayan at Microangelo Associates for bioinformatics support; Prometheus Research; the Yale Center of Genomic Analysis (YCGA) staff, in particular S. Umlauf and C. Castaldi; T. Brooks-Boone and M. Wojciechowski for their help in administering the project at Yale; and J. Krystal, G.D. Fischbach, A. Packer, J. Spiro, and M. Benedetti for their suggestions throughout and very helpful comments during the preparation of this manuscript. Approved researchers can obtain the SSC population dataset described in this study by applying at https://base.sfari.org.