To characterize structural variation in ASD multiplex families and unrelated controls, we typed individuals at 561,466 SNP markers using Illumina HumanHap550 version 3 arrays. After excluding samples that failed to meet QC thresholds (see Table S1
), we obtained array data on 3832 individuals from 912 multiplex families enrolled in the Autism Genetic Resource Exchange (AGRE) 
, 1070 disease-free children from the Children's Hospital of Philadelphia (CHOP), and 418 neurologically normal adults and seniors from the National Institute of Neurological Disorders and Stroke (NINDS) control collection 
. Using the PennCNV software 
, we detected CNVs with a mean size of 59.9 Kb and mean frequency of 24.3 events per individual (see Table S2
). Sensitivity compares favorably with previous BAC array-based 
and SNP-based methods 
, in which mean resolution was observed to be in the range of Mbs and hundreds of Kbs, respectively.
As a first step towards validation of genotyping accuracy, we examined the inheritance of CNVs in the AGRE cohort. Consistent with high-quality calls, 96.2% of CNV calls made in children were also detected in a parent. To explore the issue of genotyping accuracy further, we generated CNV calls for an independently generated data set in which an overlapping set of 2,518 AGRE samples were genotyped using the Affymetrix 5.0 platform 
. For CNVs (>500 kb) in known ASD regions (e.g. 15q11–13, 16p11.2, and 22q11.21) 
, we observed 100% correspondence between the two platforms for individuals genotyped on both platforms. For further confirmation of CNV calls, we compared de novo
variants identified here to those highlighted in previous analyses of AGRE families. We identified all five de novo
CNVs reported by Sebat et al 
, three of the five de novo
CNVs reported by Szatmari et al 
, one de novo
CNV within A2BP1
reported by Martin et al 
, and all five 16p11.2 de novo
deletions reported by Weiss et al 
and Kumar et al 
. Of the two of thirteen de novo
CNVs reported by Szatmari et al
not detected as de novo
in our study, one was very small (2 SNPs, 180 bp on 8p23.2), and the second clearly appears to be inherited (469 SNPs, 1.4 Mb on 17p12). Thus, our data are concordant with several other studies, and provide a more comprehensive picture of de novo
CNVs in multiplex autism families. To further evaluate the quality of these data on another independent platform, we used Taqman to determine relative copy number at 12 previously unreported de novo
CNVs identified in AGRE probands, confirming 11/12 loci ( and Table S3
). Together these results suggest that the CNV calls we report are consistent and reliable.
CNVs (>500 kb) on 16p11, 15q11–13, and 22q11 are present in a subset of AGRE families.
TaqMan experiments validate large de novo CNV calls.
We therefore undertook additional analyses to identify specific loci in which structural variants were enriched in cases versus controls. Because the majority of such variants were intronic or intergenic, we sought to prioritize CNVs most likely to interfere with the molecular function of specific genes. We first filtered CNV calls to include only exonic deletions (eDels) observed to overlap with a RefSeq gene. Overall, such eDels were observed at similar frequencies in AGRE cases, first-degree relatives of AGRE cases, and unrelated controls (CHOP and NINDS cohorts), with an average of ~2 such variants per person (Table S2
). To identify events related to the ASDs, we then looked for genes harboring eDels in at least one case but no unrelated controls. Among the 284 genes that met these criteria (Table S4
) we observed several known ASD or mental retardation genes including: ASPM 
, DPP10 
, CNTNAP2 
, PCDH9 
, and NRXN1.
To enrich for genes most likely to contribute to ASD risk, we used family-based calling to evaluate which of these genes carried eDels in three or more cases from at least two unrelated families (Table S5
). This stringent filtering resulted in 72 genes at 55 loci, including NRXN1
. This is notable, given that eleven distinct disease-linked NRXN1
variants have been identified 
. Neurexin family members are known to interact functionally with ASD-related neuroligins 
, and likewise play an important role in synaptic specification and specialization 
. eDels in more recently identified candidates, including DPP10
, were likewise retained. Similarly, recovery of RNF133
within intron 2 of CADPS2 
highlights additional complexity at this locus. Although CNV breakpoints cannot be mapped precisely using SNP data alone, it is possible to determine overlap with protein coding exons and use these data to predict impact on gene function. Consistent with perturbation of function, distinct alleles at the loci highlighted here are predicted to eliminate or truncate the corresponding protein products.
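The gene-level filtering described above (eDels in at least one case but no unrelated controls, then three or more cases from at least two unrelated families) can be illustrated with a minimal Python sketch. The function name, the record layout `(gene, sample_id, family_id, status)`, and the toy gene labels are assumptions made for illustration only; they do not reproduce the pipeline actually used in this study.

```python
from collections import defaultdict

def case_specific_recurrent_genes(calls, min_cases=3, min_families=2):
    """Return genes with eDels in >= min_cases cases from >= min_families
    families and in no unrelated control.

    calls: iterable of (gene, sample_id, family_id, status) tuples, where
    status is 'case', 'relative', or 'control' (hypothetical layout).
    """
    cases = defaultdict(set)      # gene -> affected carriers
    families = defaultdict(set)   # gene -> families with an affected carrier
    seen_in_controls = set()      # genes with an eDel in any control
    for gene, sample_id, family_id, status in calls:
        if status == "control":
            seen_in_controls.add(gene)
        elif status == "case":
            cases[gene].add(sample_id)
            families[gene].add(family_id)
        # carriers who are unaffected relatives neither count nor exclude
    return sorted(
        gene for gene in cases
        if gene not in seen_in_controls
        and len(cases[gene]) >= min_cases
        and len(families[gene]) >= min_families
    )

# Toy data (entirely made up): only GENE_A passes both filters.
calls = [
    ("GENE_A", "s1", "fam1", "case"),
    ("GENE_A", "s2", "fam1", "case"),
    ("GENE_A", "s3", "fam2", "case"),
    ("GENE_A", "s4", "fam2", "relative"),
    ("GENE_B", "s5", "fam3", "case"),      # three cases, but one family
    ("GENE_B", "s6", "fam3", "case"),
    ("GENE_B", "s7", "fam3", "case"),
    ("GENE_C", "s8", "fam4", "case"),      # recurrent, but seen in a control
    ("GENE_C", "s9", "fam5", "case"),
    ("GENE_C", "s10", "fam6", "case"),
    ("GENE_C", "c1", None, "control"),
    ("GENE_D", "s11", "fam7", "case"),     # single case only
]
recurrent = case_specific_recurrent_genes(calls)
```

Setting `min_cases=1` and `min_families=1` corresponds to the first, less stringent filter (at least one case, no controls).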
Rare exonic deletions (eDels) in NRXN1 and novel candidate genes alter predicted protein structures.
Importantly, CNVs at a majority of these eDel loci show unique breakpoints in different families and/or result in the loss of distinct exons, demonstrating that they are independent. Moreover, because it is well established that CNVs at a subset of loci show identical breakpoints in unrelated individuals 
, this result is likely to underestimate the extent to which variants described here arose independently. Results from multi-dimensional scaling are likewise consistent with the interpretation that variants we highlight arose independently (Figure S1).
Given the large number of variants identified, it was critically important to confirm, in an independent case-control analysis, how many of these eDels were truly overrepresented in cases, as opposed to being potentially attributable to Type I error. To address this concern, we sought to determine eDel frequency in these same genes in a replication dataset comprising 859 independently ascertained ASD cases and 1051 unrelated control subjects from the Autism Case Control cohort (ACC, see Description in Methods
). One third of the loci identified in the discovery phase were observed in one or more ACC controls (18/55; 32.7%), suggesting that while rare, eDels at these loci are not limited to ASD cases and family members. In contrast, and providing evidence for formal replication, 14 separate loci encompassing 22 genes were observed to carry eDels in both AGRE and ACC cases, but none of 2539 controls (Table S2).
Our replication data lend strong support to the involvement of specific loci in the ASDs. However, to ensure that these results were not observed by chance alone, we performed 10,000 permutation trials on data from the replication cohort by permuting case/control status across individuals. In each permuted dataset, we maintained the same numbers of cases and controls as in the original data, and calculated the number of genes harboring CNVs exclusively in cases. None of the 10,000 permutation trials gave results comparable to experimental observations for replicated case-specific loci (n = 14; p<0.0001). In contrast, findings comparable to those for non-replicated loci (highlighted as case-specific in the discovery phase but subsequently seen in replication controls) were seen in controls in 246/10,000 trials (n
0.02; Figure S2
). Although additional experimental work in independent cohorts will be required to determine if variation in any of the genes highlighted here does in fact impact ASD risk, no more than 5 replicated loci would be predicted to be observed by chance alone.
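The label-permutation procedure described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the code used for the analysis: the function names, the `carriers` mapping (gene to the set of sample identifiers carrying an eDel), and the toy cohort are all hypothetical.

```python
import random

def count_case_only_genes(carriers, case_ids):
    """Number of genes whose carriers are all cases (none are controls)."""
    return sum(1 for samples in carriers.values()
               if samples and samples <= case_ids)

def permutation_p_value(carriers, case_ids, control_ids, n_trials=10000, seed=0):
    """Shuffle case/control labels, keeping group sizes fixed, and report how
    often a permuted dataset yields at least as many case-only genes as the
    observed data."""
    rng = random.Random(seed)
    everyone = sorted(case_ids | control_ids)
    n_cases = len(case_ids)
    observed = count_case_only_genes(carriers, set(case_ids))
    hits = 0
    for _ in range(n_trials):
        # Draw a random "case" group of the original size; the rest are controls.
        permuted_cases = set(rng.sample(everyone, n_cases))
        if count_case_only_genes(carriers, permuted_cases) >= observed:
            hits += 1
    return observed, hits / n_trials

# Toy cohort (made up): 20 cases, 20 controls, 5 genes each carried by 3 cases.
case_ids = {f"case{i}" for i in range(20)}
control_ids = {f"ctrl{i}" for i in range(20)}
carriers = {f"GENE_{g}": {f"case{3 * g + j}" for j in range(3)} for g in range(5)}

observed, p_empirical = permutation_p_value(carriers, case_ids, control_ids, n_trials=2000)
```

When no permuted dataset matches the observation, the empirical estimate is zero and is conventionally reported as an upper bound of one over the number of trials, which is how a result of 0/10,000 corresponds to p<0.0001.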
Table 2 A subset of eDel loci were observed to harbor rare variants in both discovery and replication cohorts, but none of 2539 controls. eDel: exonic deletion; ACRD: autism chromosome rearrangement database (http://projects.tcag.ca/autism/).
Observed replication unlikely to be attributable to chance alone.
Despite the challenges associated with obtaining statistical support for individually rare events 
we next sought to assign P
values for replicated eDel loci. We were able to obtain support for each of the following loci: BZRAP1
at 17q22 (p
at 2p16.3 (p
at 14q21.3 (p
at 19q13 (p
), and a three gene locus at 15q11 (p
). CNV calls at each of 15q11 and 19p13 are highly error-prone, suggesting that results here be interpreted with caution (see footnotes C and F in ). Recovery of NRXN1
, however, provides confidence for involvement of additional loci that were likewise replicated. Benzodiazepine receptor (peripheral) associated protein 1 (BZRAP1
, alternatively referred to as RIMBP1
), is an adaptor molecule thought to regulate synaptic transmission by linking vesicular release machinery to voltage gated Ca2+ channels 
. Identification of this synaptic component here, in a hypothesis-free manner, is particularly satisfying and also provides additional support for synaptic dysfunction in the ASDs 
. Less is known about MDGA2 
, although comparison of the predicted protein to all others within GenBank by BLASTP indicated an unexpectedly high similarity to Contactin 4 (24% identity over more than 500 amino acids; Expect
). Given previous reports of hemizygous loss of CNTN4
in individuals with mental retardation 
and autism 
, similarity between MDGA2 and CNTN4, surpassed only by resemblance to MDGA1, is notable. Likewise intriguing in light of the suggestion that common variation in cell adhesion molecules may contribute to autism risk 
is the structural likeness of MDGA2 to members of this family of molecules.
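To illustrate why statistical support is hard to obtain for individually rare events, the sketch below computes a one-sided Fisher's exact test for carrier counts in cases versus controls. The choice of Fisher's exact test is an assumption made for illustration, since the test used to assign P values is not specified in this section. The cohort sizes (1,771 cases and 2,539 controls) are the combined totals reported in this study, whereas the carrier counts are hypothetical.

```python
from math import comb

def fisher_one_sided(case_carriers, n_cases, control_carriers, n_controls):
    """One-sided Fisher's exact test: probability of seeing at least
    `case_carriers` carriers among cases, given the total number of carriers
    and the cohort sizes (hypergeometric null)."""
    total = n_cases + n_controls
    carriers = case_carriers + control_carriers
    denom = comb(total, carriers)
    upper = min(carriers, n_cases)
    tail = sum(
        comb(n_cases, k) * comb(n_controls, carriers - k)
        for k in range(case_carriers, upper + 1)
    )
    return tail / denom

# Hypothetical locus: 6 carriers among 1,771 cases, none among 2,539 controls.
p_six = fisher_one_sided(6, 1771, 0, 2539)

# Hypothetical locus: 2 carriers among cases, none among controls.
p_two = fisher_one_sided(2, 1771, 0, 2539)
```

With cohorts of this size, a deletion seen in only two cases and no controls cannot reach nominal significance under this test (`p_two` is roughly 0.17), which illustrates the difficulty noted above.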
Although some published analyses emphasize the greater contribution of gene deletion events in autism pathogenesis 
, there are also clear examples of duplications that strongly modulate ASD risk 
. We therefore conducted a parallel analysis of duplications, distinguishing between events involving entire genes (gDups) which might increase dosage and those restricted to internal exons (eDups) which could give rise to a frameshift or map to a chromosomal region distinct from the reference gene. For gDups, we identified 449 genes that were duplicated in at least one AGRE case but no CHOP/NINDS controls (Table S4
). Of those, 200 genes at an estimated 63 loci, including genes at 15q11.2 
, met the more stringent criteria of being present in three or more cases from at least two independent families (Table S5
). Of these, 11.5% (23/200) were also seen in ACC controls, whereas 24.5% (49/200) were case-specific in the replication cohort. Strong statistical support was obtained for established loci (e.g. p
and other genes in the PWS/AS region at 15q11–q13), and nominal evidence was observed for the following novel loci: CD8A
at 2p11.2 (p
at 4p16.3 (p
0.028), and CARD9/LOC728489
at 9q34.3 (p
For eDups, we reasoned that duplication of one or more internal exons could serve to disrupt the corresponding open reading frame and be predicted to impair gene function as a result. Despite the caveat that observed copy number gains need not map to the wild-type locus, known ASD genes including TSC2 
and RAI1 
within the Potocki-Lupski Syndrome critical interval were amongst the 159 loci observed in at least one AGRE case, but no CHOP/NINDS controls (Table S4
). Such events were also seen in one family at the NLGN1
locus, which is of interest given previous support for NLGN3
and NLGN4 
. Filtering of these results, using the more stringent criteria employed above in consideration of eDels, limited this set of events to 76 loci observed in at least three cases from two separate families (Table S5
). Interestingly, BZRAP1
, reported above to harbor eDels at significantly higher frequencies in AGRE and ACC cases versus controls (p
), was amongst these, with eDups observed here in four unrelated AGRE cases (screening p
0.021). Eight other genes, including the voltage gated potassium channel subunit KCNAB2
) remained absent from ACC controls and were also replicated in the independent case cohort. Although eDups at BZRAP1
were not detected in ACC cases, eDels at this locus were replicated, underscoring the importance of variation here. When considering eDels and eDups at the BZRAP1
locus together, the likelihood of such an observation occurring by chance alone is small (p
Although none of the variants we highlight were observed in any of 2539 unrelated controls, key events, including eDels at NRXN1
, and MDGA2
were observed in both cases and non-autistic family members. This is in keeping with previous work suggesting that haploinsufficiency at NRXN1
may contribute to the ASDs 
, but is insufficient to cause disease. Such data are also consistent with the well established finding of the “broader autism phenotype”, such as subclinical language and social impairment in first degree relatives of cases with an ASD, which supports a multi-locus model 
. We were also surprised to see that key variants at these loci appear to be transmitted to only a subset of affected individuals in some families. These observations parallel findings at other major effect loci including 16p11.2 
and DISC1 
and are consistent with a model in which multiple variants, common and rare, act in concert to shape clinical presentation 
. Results are also consistent with the idea that true risk loci are likely to show incomplete penetrance and imperfect segregation with disease 
, a reality that will complicate gene finding efforts. Relatedly, substantial effort will be required to determine whether rare alleles of moderate effect act independently on distinct aspects of disease (endophenotype model) or together to undermine key processes in brain development (threshold model). How distinct alleles may interact to shape presentation is yet another question that will require larger cohorts along with multigenerational families to resolve.
Exonic deletions, although enriched in cases versus controls, show imperfect segregation with disease in multiplex families.
By limiting CNV calls to include only exonic deletions (eDels) and duplications (eDups and gDups), we have attempted to enrich for variants most likely to impact gene function and, in doing so, improve the signal-to-noise ratio, similar to work in other complex diseases 
. At the same time, like other gene-based strategies, we preserve our ability to consider eDels involving the same transcriptional unit as separate but equivalent. Given that such events appear rare, this is an important consideration.
Pathway analysis by DAVID 
found support for overrepresentation of cell adhesion molecules amongst recurrent eDel genes (uncorrected p
), although it should be noted that this analysis does not adjust for gene size and may favor larger genes. Nevertheless, aside from SPON2
no eDels in these genes were observed in any of the controls interrogated. In contrast, no evidence for such overrepresentation was observed for genes in the ubiquitin degradation pathway and neither term was highlighted as overrepresented amongst eDups or gDups. Given that this study focused only on events encompassing RefSeq exons, differences from Glessner and colleagues 
are to be expected.
Despite the large cohorts interrogated at each phase of our investigations, only a minority of loci (established or novel) were replicated between AGRE and ACC cases. For example, variants at each of the following previously reported loci were observed multiple times in AGRE cases but not once amongst ACC probands: PCDH10 and DPP10 (eDels), RAI1 and TSC2 (eDups), and DIDO1 (gDups). This suggests that even with current numbers, the present experiments are underpowered to obtain replication for a subset of recurrent variants. Because events seen only in single cases collectively account for a substantial fraction of observed variation, even larger cohorts will be required for a thorough understanding of the genetic basis of complex disorders like the ASDs.
In summary, we have performed a high resolution genome-wide analysis to characterize the genomic landscape of copy number variation in ASDs. Through comparison of structural variation in 1,771 ASD cases and 2,539 controls and prioritization of events encompassing exons we identified more than 150 loci harboring rare variants in multiple probands but no control individuals. For each class of structural variant interrogated, the recovery of known loci serves to validate the methods employed and results obtained. Greatest confidence should be placed in loci harboring variants in multiple unrelated cases but no controls and also recovered in both screening and replication cohorts. Amongst novel genes, best support was obtained for BZRAP1 and MDGA2, intriguing candidate genes for which additional study is warranted.