While it is apparent that rare variation can play an important role in the genetic architecture of autism spectrum disorders (ASDs), the contribution of common variation to the risk of developing ASD is less clear. To produce a more comprehensive picture, we report Stage 2 of the Autism Genome Project genome-wide association study, adding 1301 ASD families and bringing the total to 2705 families analysed (Stages 1 and 2). In addition to evaluating the association of individual single nucleotide polymorphisms (SNPs), we also sought evidence that common variants, en masse, might affect the risk. Despite genotyping over a million SNPs covering the genome, no single SNP shows significant association with ASD or selected phenotypes at a genome-wide level. The SNP that achieves the smallest P-value from secondary analyses is rs1718101. It falls in CNTNAP2, a gene previously implicated in susceptibility for ASD. This SNP also shows modest association with age of word/phrase acquisition in ASD subjects, of interest because features of language development are also associated with other variation in CNTNAP2. In contrast, allele scores derived from the transmission of common alleles to Stage 1 cases significantly predict case status in the independent Stage 2 sample. Although the prediction was significant, the variance explained by these allele scores was small (Vm < 1%). Based on results from individual SNPs and their en masse effect on risk, as inferred from the allele score results, it is reasonable to conclude that common variants affect the risk for ASD but their individual effects are modest.
Genetic analyses over the past two decades have connected a plethora of rare mutations with the risk for autism spectrum disorder (ASD) (1–9). The rate of discovery has accelerated remarkably with the use of genome-wide screening tools capable of identifying sub-microscopic structural DNA variation (10). Many of the identified risk variants occur de novo and implicate many genes. From the distribution of de novo copy number variants (CNVs) in affected individuals and their unaffected siblings, Sanders et al. (11) conclude that there are hundreds of ASD risk loci in the human genome. This magnitude for risk genes is also supported by the distribution of de novo sequence events in ASD probands (12–15). Indeed, over the past year evidence supporting rare variation affecting risk, both de novo and inherited, continues to accumulate through both copy number and sequence-based studies (11–20).
In contrast, genome-wide evidence for common variation affecting the risk for ASD is limited. In a recent study by this group, the Autism Genome Project (AGP), we reported on a genome-wide association study (GWAS) of almost 1400 strictly defined ASD families genotyped for one million single nucleotide polymorphisms (SNPs) (21) (herein referred to as Stage 1). We performed four primary association analyses based on ancestry and clinical thresholds and identified a single, marginally significant association for SNP rs4141463 located within MACROD2. Other large GWAS also highlighted significant associations at single loci, namely SNPs at 5p14.1 (22) and 5p15.2 (23). In our previous Stage 1 GWAS, we did not find evidence to support either of these earlier associations (21). Likewise, a subsequent study in an independent collection of 1170 individuals of European ancestry with ASD and 35307 non-ASD controls finds no support for the MACROD2 marker rs4141463 (24).
The objective of the AGP has been to characterize the genetic architecture of ASD, and therefore we sought to determine whether both common variants and rare CNVs account for risk and, if so, by how much. Thus far, we can conclude that rare variation plays a substantial role (9,25), complementing many other studies (26,27). What we can say about common variation is less definitive. Based on the analysis of published GWAS, Devlin et al. (28) concluded that common genomic variants having a substantial impact on risk (relative risk >1.5) do not exist; there could be a small number of common risk variants, as yet unidentified, having an intermediate impact on risk (relative risk 1.2–1.5); yet there could be many common variants of modest impact (relative risk <1.2). To consistently and reliably detect common variants of modest effect will require samples in the tens of thousands (29).
To produce a more comprehensive picture of how common variants affect the risk for ASD, we report the second stage of the AGP GWAS, amounting to an additional 1301 ASD families, as a combined mega-analysis of Stage 1 and Stage 2. In addition to the single SNP analyses, we use the two-stage design to evaluate the SNPs, en masse, for evidence that a large number of common variants exert at least weak effects on risk. The allele-score method, as previously described by Purcell et al. (30), calculates an accumulated score for each individual based on the presence and count of associated alleles (for family-based association, the counted allele is the over-transmitted allele). With their data, the International Schizophrenia Consortium found that those common variants assessed (M) could explain up to 3.18% of the additive genetic variance (VM) for schizophrenia (30). To estimate how much of the variance for ASD can be explained by common variants, the allele score derived from the Stage 1 data (21) was used to predict case status in the Stage 2 data and calculate the VM.
Properties of the Stage 1 sample are reported elsewhere (21). When combined with the Stage 2 data, we report the largest family-based sample of ASD to date (Table 1). Ancestry analysis, estimated from numerous genetic markers by using standard techniques (31–33) and mapped back onto European recruitment sites to anchor the common ancestry of subjects, showed that 94.9% of the probands are of European ancestry (Supplementary Material, Fig. S1). The primary analyses target four overlapping data sets, based on diagnosis (spectrum versus strict) and ancestry (only European versus any ancestry). Of these four samples, the most inclusive allows for a spectrum diagnosis for subjects of any ancestry (Table 1), whereas the least inclusive allows for only a strict diagnosis of autism for subjects of European ancestry (see Supplementary Material, Fig. S2 for experimental design).
Association analyses of these four classes do not identify any single SNP to be significantly associated with autism as judged by an accepted threshold for genome-wide significance, P < 5 × 10⁻⁸. Instead the distributions of test statistics are consistent with that expected by chance (Supplementary Material, Fig. S3). Only four genomic regions show association signals at P < 10⁻⁶, two falling in genes (Table 2; Supplementary Material, Figs S4 and S5). When we evaluated the corresponding Stage 2 results with SNPs highlighted by our Stage 1 study (21), only one shows evidence of a modest association, namely rs4150167, which is a non-synonymous variant in TAF1C (Spectrum, All ancestries; rs4150167-A; Stage 1 OR = 0.37, P = 7.87 × 10⁻⁷; Stage 2 OR = 0.65, P = 0.014).
Exploratory analyses also produce no significant findings after adjusting for multiple testing (Table 3). A detailed summary of these findings is given in the Supplementary Material (Table S1; Fig. S5). Results from Stage 2 analysis alone are presented in Supplementary Material, Table S2, but are not discussed in the main manuscript. Of the promising results from other GWAS (22,23,34), only rs1328244 (13q33.3) and rs6646569 (Xq24) described by Wang et al. (22) garner support from the combined AGP sample, although it is important to bear in mind that the studies overlap somewhat in samples/families and thus correlated statistics are expected (Table 4).
Under the hypothesis that individual common variants exert only a limited effect on risk and that many such variants are present and exert these effects independently, we would expect these effects to be detectable in an analysis of common variants en masse. To address this question, allele scores were generated for all AGP probands based on the four Stage 1 primary analyses with the goal of determining whether the score derived from the Stage 1 results predicted the Stage 2 case versus control status. Allele scores based on markers associated at 10 significance thresholds ranging from P< 0.5 to P< 0.00001 were evaluated. Used as a positive control for the method, the Stage 1 scores showed high to perfect predictive value for case status in the Stage 1 subjects (data not shown). When examined against Stage 2 individuals, the Stage 1 scores were significant predictors of case status (Fig. 1) and thus explain a significant portion of variance in case and control status of Stage 2 samples. In general, the variance explained increases with an increased number of markers in the model. Still the markers explain only a small proportion of the variance—always <1%, with the greatest signal observed in the smallest yet most homogeneous group, namely European ancestry individuals with a Strict diagnosis of autism (Vm = 0.78%; Empirical P< 0.001). Analyses stratifying by quintile of minor allele frequency showed that most of the variance explained accrues to the quintile 0.2–0.3 (Supplementary Material, Fig. S6). Although these results were noisy (Supplementary Material, Fig. S6), they suggest that many common variants affect risk.
Importantly the allele-score analysis assumes that the distribution of liability for pseudo controls is similar to that of the general population. Various factors could skew this distribution and thus bias downwards the estimated Vm, a notable one being an excess of multiplex families relative to population frequencies (35,36). Multiplex families comprise at least 33% of the families studied here (47.5% simplex, 19.5% unknown), and thus could substantially diminish the estimated Vm. In addition, as pointed out by Ripke et al. (37), the use of a larger sample size to develop the predictor (i.e. model from the Stage 1 sample) should produce a more accurate allele score and thus increase the amount of variance in the Stage 2 case and control status explained.
For this report the AGP genotyped 1301 additional families for almost one million SNPs, bringing the total analysed herein to 2678 families. From this large data set, we observed no genome-wide significant association with specific common variants. Four independent signals in the primary analyses yielded uncorrected P-values <1 × 10⁻⁶ and 23 independent signals in the exploratory analyses crossed this threshold (Tables 2 and 3). Some of these SNPs fall in intriguing genes, including TAF1C from the primary analyses and CNTNAP2 from the exploratory analyses. TAF1C encodes the TATA box-binding protein-associated factor RNA polymerase I subunit C. We previously highlighted and discussed TAF1C and the corresponding SNP (rs4150167) (21). Of interest is that the common allele for rs4150167 was over-transmitted and is represented on roughly 98% of the chromosomes. The same pattern, although not as extreme, is present in the Stage 2 sample and again cannot be traced to a genotyping error (Supplementary Material, Fig. S7). If this were a true causal effect, it would suggest that the minor allele at rs4150167 is somehow protective. The protective allele results in a non-synonymous glycine to arginine change at amino acid position 523 and has also been implicated in alternate splicing (38), both of which offer a starting point for investigation of the impact of this variation on gene function. On the other hand, the association should be viewed cautiously because the power of this study to detect a protective effect is small. Statistically, therefore, the posterior probability that this observation is a false positive is large.
The strongest uncorrected association observed across all analyses was for rs1718101 in European individuals with higher IQ (P-value = 7.8 × 10⁻⁹; OR = 2.13). This SNP is not in substantial linkage disequilibrium (LD) with other genotyped SNPs and thus no other SNP genotyped by this study supports its association. Examination of the genotype intensity plots, however, showed good quality clustering (Supplementary Material, Fig. S7). The SNP resides within CNTNAP2 (7q35), which encodes the Contactin-associated protein-like 2, a member of the neurexin family thought to play a role in axonal differentiation and guidance. CNTNAP2 is one of the largest genes in the human genome, encompassing ~1.5% of chromosome 7, and has previously been implicated in ASD, in part because it resides within regions of suggestive linkage (39). CNVs falling in other neurexin genes including NRXN1 have been implicated in ASD (4,8,9,25,40,41). More directly, rare variation in CNTNAP2, including CNVs, other structural disruptions and deleterious sequence variation, has been identified in subjects with autism, epilepsy, intellectual disability, Pitt-Hopkins-like syndrome, schizophrenia and Tourette syndrome (42–48), albeit with differing degrees of evidence for effect on these traits. Although the most compelling evidence is for recessively inherited Pitt-Hopkins-like syndrome, the other studies are more consistent with dominant or additive effects. Common variants in CNTNAP2 have also been associated with ASD (49–51) and specific-language impairment (52). In light of the literature, post hoc we identified the common SNPs reported in the early studies (49,50,52) to be associated with the risk for ASD or language impairment (Table 5) and evaluated their association with ASD and with age-at-first-word and -phrase for all three inheritance patterns. Of the three SNPs reported in the literature, one (rs2710102) shows very modest association with the risk for ASD.
For the language outcomes, rs1718101 shows modest, significant association with age-at-first-phrase, acting either additively or dominantly but not recessively; rs17236239 shows modest association with age-at-first-phrase under a recessive model (Table 5).
In our previous study, we highlighted association with SNPs within MACROD2. The most highly associated MACROD2 SNP from primary Stage 1 analyses, rs4141463 (P-value = 2.40 × 10⁻⁸; OR = 0.65), obtained by analyses of subjects with a strict diagnosis of autism and of European ancestry, showed little if any signal in the Stage 2 sample (P-value = 0.206, OR = 0.91). Thus, for the mega-analysis reported here, the association is less compelling (P-value = 1.2 × 10⁻⁶; OR = 0.77) than previously reported (21). Within this large gene, three SNPs show substantial but still non-significant over-transmission of specific alleles from mothers (Table 3). The region encompassed by these three SNPs captures the 3′ region of a putative antisense RNA, specifically CR596518, and one SNP falls in its transcribed region (rs14135). Because antisense RNA can play a role in the mechanics of imprinting (53), this observation is of some note. Nonetheless, without additional biological or statistical evidence, the result is unconvincing. Thus, whether SNPs in MACROD2 or in its intragenic antisense RNA genes play a role in the risk for ASD remains an open question. If they do, however, there can now be little doubt that their effect size is modest on the basis of our results and others (24).
That no individual common variants are significantly associated with the risk for ASD in our data was anticipated by earlier analyses (28), based on results from three GWAS and statistical theory regarding the relationships among sample size, effect size and power. These analyses predicted that few if any common variants confer a relative risk exceeding 1.2 (or below its inverse). Our results from >2700 trios, together with the results from other published GWAS (22,23,34), bear out this prediction. The analyses laid out in Devlin et al. (28) cannot be precise when predicting the number of loci with common variants of modest impact on risk (0.8 < relative risk < 1.2); the modelling is consistent with a range of loci from zero to many thousands.
Secondary approaches are being developed and applied to GWAS data to explore further the role of common variation within ASD. The AGP and others have applied diverse analytic approaches to determine whether there is evidence of enriched association in specific genes and groups of genes (54–57), whether there are regions of extended homozygosity within affected individuals that may implicate regions harbouring putatively recessive alleles (58) or to explore more discrete trait-based phenotypes within genome-wide data (59–61). To seek evidence for or against common variants having this modest impact on risk, we approached the problem by constructing an allele score, based on the transmission properties of SNP alleles in Stage 1 data, and asking whether these composite scores from putative risk alleles could predict case and pseudo-control status in Stage 2. In other words, do common variants over-transmitted in Stage 1 also tend to be over-transmitted in Stage 2? If so, this would provide evidence that common variants affect the risk for ASD. Moreover, combined with the published GWAS results and the theory in Devlin et al. (28), such a result would illuminate how common variants affect risk: individually they have very small effects, but en masse they exert a detectable impact. Our results close this logical circle: allele scores derived in Stage 1 do indeed predict case and pseudo-control status in Stage 2, making the case that common variants affect the risk for ASD. The score cannot account for much of the variance, however: it explains <1%, only about a third of that recently explained for schizophrenia (30). Thus, while the existence of common variants affecting the risk of ASD is almost assured, their individual effects are modest and their collective effects could be smaller than that for rare variation.
On the other hand, complementing the allele-score analysis, gene-set enrichment and related analyses for AGP and other ASD data sets have found significant evidence for common variants affecting risk (54–57), suggesting that common variants account for a non-trivial proportion of risk and that many true positive associations of small effect could be buried in the noise of stochastic variation. Qualitatively, recent studies of rare variants also find it challenging to distinguish risk variation from the background stochastic noise (12–15). Given these challenges, it is reasonable to conjecture that even if all of the samples analysed here were also sequenced at the whole genome level, it would still be impossible to discern how much risk accrues from common versus rare variation, at least from current knowledge. Thus a precise estimate of the relative contribution of rare and common variation to risk will require further studies.
Participants in this AGP study were recruited at centres in North America and Europe. Subjects with known karyotypic abnormalities and fragile X mutations were typically excluded. Likely affected individuals were assessed using the Autism Diagnostic Interview-Revised (62) and Autism Diagnostic Observation Schedule (63). Cognitive functioning was established for the majority of subjects using a range of cognitive measures from which a categorical classification of intellectual capacity was derived for the analyses, as described below and elsewhere (21).
The Stage 1 sample was genotyped using the Illumina Infinium 1M-single SNP microarray; the Stage 2 sample was genotyped on either the Illumina Infinium 1M-single or the Illumina 1M-duo microarray. All quality control (QC) procedures were consistent across data sets and are described elsewhere (21). For subjects inferred to be of European ancestry, markers showing Fst >0.02 across recruitment sites were excluded from analyses. A total of 947 233 SNPs passed QC for the combined data sets.
Ancestry for both Stage 1 and Stage 2 samples was estimated by using 5176 SNPs that had a genotype completion rate of ≥99.9% and a minor allele frequency (MAF) ≥0.05 and were separated by at least 500 kb. The initial dacGem (33) analysis identified three dimensions of ancestry and separated the data into five ancestry groups (Supplementary Material, Fig. S1). Groups A through C, illustrated in Supplementary Material, Figure S1, delineate 1285, 771 and 486 families of European ancestry as determined by the representation of European recruitment sites. All other families were assigned to be non-European for purposes of association analysis (described below). By using genotypes from the same panel of 5176 SNPs, we searched the data set for duplicate samples between Stages 1 and 2, as well as within the Stages, and found none.
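The marker selection described above can be illustrated with a minimal Python sketch. The text gives the three criteria (completion rate, MAF and spacing) but not the procedure used to enforce the spacing, so the greedy left-to-right thinning below is an assumption, as are the `Snp` record and the function name `select_ancestry_snps`.

```python
from typing import NamedTuple


class Snp(NamedTuple):
    # Hypothetical record layout, used only for this illustration.
    name: str
    chrom: str
    pos: int          # base-pair position
    call_rate: float  # genotype completion rate across samples
    maf: float        # minor allele frequency


def select_ancestry_snps(snps, min_call_rate=0.999, min_maf=0.05,
                         min_gap_bp=500_000):
    """Filter on completion rate and MAF, then thin so that retained SNPs on
    the same chromosome are at least min_gap_bp apart (greedy; an assumption)."""
    passing = [s for s in snps
               if s.call_rate >= min_call_rate and s.maf >= min_maf]
    passing.sort(key=lambda s: (s.chrom, s.pos))
    kept = []
    last_pos = {}  # chromosome -> position of the last retained SNP
    for s in passing:
        prev = last_pos.get(s.chrom)
        if prev is None or s.pos - prev >= min_gap_bp:
            kept.append(s)
            last_pos[s.chrom] = s.pos
    return kept
```

A greedy pass keeps the first qualifying SNP on each chromosome and then the next one at least 500 kb further along; other thinning strategies would give a different but equally valid panel.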
As described in Anney et al. (21), our primary analyses grouped families into two nested diagnostic classes: strict, in which affected individuals met both ADI-R and ADOS criteria for autism; and spectrum, combining individuals meeting strict criteria with individuals with a broader ASD diagnosis as determined by responses to the ADI-R and ADOS. As described above, we also grouped individuals on the basis of ancestry. To evaluate the effect of transmitted alleles on the risk of ASD, the primary analyses assume an additive model and evaluated four overlapping data sets: two levels of diagnosis (spectrum versus strict diagnosis) by two levels of ancestry (all ancestries versus European ancestry).
Secondary analyses (Supplementary Material, Fig. S2) were also performed based on parent-of-origin, verbal/non-verbal status, IQ, age-at-first-words and age-at-first-phrases. The parent-of-origin analysis treated paternal- and maternal-specific transmissions separately for both strict and spectrum diagnostic classes. Stratification by verbal/non-verbal status as described in Liu et al. (64) was conducted in the total sample. IQ-based analyses included data from subjects whose IQ >80 (higher-IQ) and those with 25 < IQ < 70 (lower-IQ). IQ categorization was determined using available measures of verbal, non-verbal (performance) and full-scale IQ assessments. A score >80 for any of these three measures resulted in classification in the higher IQ group; otherwise, individuals classified into the lower IQ group had IQ estimates on at least two measures in the range 25–70. The age-at-first-word and the age-at-first-phrase were analysed as quantitative variables (distributions shown in Supplementary Material, Figs S8 and S9).
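The IQ categorization rule can be written out as a short Python sketch. The function name `classify_iq` is hypothetical, and two details are assumptions because the text leaves them open: the endpoints of the 25–70 range are treated as inclusive, and subjects who meet neither rule are returned as unclassified (`None`).

```python
def classify_iq(verbal=None, nonverbal=None, full_scale=None):
    """Return "higher", "lower" or None for one subject.

    Each argument is an IQ estimate, or None when that measure is unavailable.
    """
    scores = [s for s in (verbal, nonverbal, full_scale) if s is not None]
    # Any single measure above 80 places the subject in the higher-IQ group.
    if any(s > 80 for s in scores):
        return "higher"
    # Otherwise at least two measures must fall in the 25-70 range
    # (inclusive endpoints are an assumption of this sketch).
    in_low_range = [s for s in scores if 25 <= s <= 70]
    if len(in_low_range) >= 2:
        return "lower"
    return None  # not included in either IQ-based analysis
```

Under this reading, a subject with scores between 70 and 80, or with only one available measure in the lower range, would fall into neither group.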
Association analyses were performed as described elsewhere (21). Briefly, family-based analyses were performed using FBAT (65) and an in-house programme written for family-based association (66) that implements methods described by Cordell et al. (67). For parent-of-origin analyses, only the latter was used. A summary of the sample size for analyses involving Stage 2 and both Stages is given in Table 1.
Four allele scores were defined based on the GWA signals from the four primary association analyses from Stage 1 described above. For each SNP, the risk allele was defined as the over-transmitted allele. The SNP-specific component of the score was calculated as the dosage of the allele multiplied by the corresponding log of the odds ratio, and was computed for all autosomal SNPs. The individual-specific score was calculated by summing over the SNP-specific component values without mean imputation for missing data, using PLINK version 1.07 (68). Ten scores were created for each individual using the following P-value thresholds for association: <0.00001, <0.0001, <0.001, <0.01, <0.05, <0.1, <0.2, <0.3, <0.4 and <0.5. The allele-score approach requires controls as well as cases; to account for the family-based design, we generated a pseudo control from the untransmitted parental alleles using PLINK v1.07. For any particular SNP, ‘pseudo control’ is the term used for the three genotypes that could have been formed by the parental alleles, but were not; i.e. if parents have genotypes a1a2 and a3a4, and their offspring is a1a3, then the pseudo controls are a1a4, a2a3 and a2a4. These controls form the foundation for likelihood calculations for family-based association analysis.
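The score construction and the pseudo-control definition can be illustrated with a small Python sketch. This is not the PLINK implementation used for the analyses; the function names, the dictionary inputs and the convention that the child genotype is given as (paternal allele, maternal allele) are all assumptions made for illustration.

```python
import math

# The ten P-value thresholds listed in the text.
THRESHOLDS = [0.00001, 0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5]


def allele_score(dosages, odds_ratios, p_values, p_threshold):
    """Sum dosage * log(OR) over SNPs whose Stage 1 P-value is below the threshold.

    dosages:     SNP id -> count (0, 1 or 2) of the over-transmitted allele,
                 or None when the genotype is missing
    odds_ratios: SNP id -> Stage 1 odds ratio for that same allele
    p_values:    SNP id -> Stage 1 association P-value
    Returns (score, number of non-missing SNPs that contributed).
    """
    score = 0.0
    n_used = 0
    for snp, p in p_values.items():
        if p >= p_threshold:
            continue
        dosage = dosages.get(snp)
        if dosage is None:
            continue  # missing genotypes are skipped, not mean-imputed
        score += dosage * math.log(odds_ratios[snp])
        n_used += 1
    return score, n_used


def pseudo_controls(father, mother, child):
    """Return the three parental allele combinations that were not transmitted.

    father, mother: (allele, allele) pairs
    child:          (allele inherited from father, allele inherited from mother)
    """
    combos = [(f, m) for f in father for m in mother]
    combos.remove(child)  # raises ValueError if child is inconsistent with parents
    return combos
```

For the example in the text, `pseudo_controls(("a1", "a2"), ("a3", "a4"), ("a1", "a3"))` returns the pairs a1a4, a2a3 and a2a4. The count of contributing SNPs is returned alongside the score because the number of non-missing genotypes is needed as a covariate in the regression step.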
Logistic regression of Stage 2 case status on mean score was performed, with covariates being the number of non-missing genotypes for all SNPs used to calculate the score and four dimensions of ancestry defined using PLINK v1.07. An approximation of the additive genetic variance explained by the observed markers was calculated using the difference in Nagelkerke's pseudo R-squared between a model including the score and covariates and a model including only the covariates. Empirical significance of these findings was estimated based on 1000 permutations through case–control randomization within each data set.
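The variance-explained calculation can be sketched in Python as follows, assuming the log-likelihoods of the fitted logistic models are already available from whatever fitting routine is used. The function names are hypothetical, Nagelkerke's pseudo R-squared is written in its standard form (the Cox and Snell value divided by its maximum), and the add-one convention for the empirical P-value is a common choice rather than something stated in the text.

```python
import math


def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke's pseudo R-squared from log-likelihoods.

    ll_null:  log-likelihood of the intercept-only model
    ll_model: log-likelihood of the fitted model
    n:        number of observations
    """
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_r2 = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_r2


def variance_explained(ll_null, ll_covariates, ll_full, n):
    """Difference in pseudo R-squared: (score + covariates) minus covariates only."""
    return nagelkerke_r2(ll_null, ll_full, n) - nagelkerke_r2(ll_null, ll_covariates, n)


def empirical_p(observed, permuted):
    """Share of permuted statistics at least as large as the observed one.

    The +1 in numerator and denominator is an assumed convention that avoids
    reporting an empirical P-value of exactly zero.
    """
    exceed = sum(1 for value in permuted if value >= observed)
    return (exceed + 1) / (len(permuted) + 1)
```

Here `permuted` would hold the variance-explained values obtained after randomizing case and control labels within a data set, one value per permutation.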
The authors gratefully acknowledge the families participating in the study.
Conflict of Interest statement. None declared.
The authors thank the main funders of the AGP: Autism Speaks (USA), the Health Research Board (HRB; Ireland; AUT/2006/1, AUT/2006/2, PD/2006/48), The Medical Research Council (MRC; UK), Genome Canada/Ontario Genomics Institute and the Hilibrand Foundation (USA). Additional support for individual groups was provided by the US National Institutes of Health (NIH grants: HD035469, HD055748, HD055751, HD055782, HD055784, MH52708, MH55284, MH057881, MH061009, MH06359, MH066673, MH080647, MH081754, MH66766, NS026630, NS042165, NS049261), the Canadian Institutes for Health Research (CIHR), Assistance Publique - Hôpitaux de Paris (France), Autism Speaks UK, Canada Foundation for Innovation/Ontario Innovation Trust, Deutsche Forschungsgemeinschaft (grant: Po 255/17-4) (Germany), EC Sixth FP AUTISM MOLGEN, The National Childrens Research Centre (Ireland), Fundação Calouste Gulbenkian (Portugal), Fondation de France, Fondation FondaMental (France), Fondation Orange (France), Fondation pour la Recherche Médicale (France), Fundação para a Ciência e Tecnologia (Portugal), the Hospital for Sick Children Foundation and University of Toronto (Canada), INSERM (France), Institut Pasteur (France), the Italian Ministry of Health (convention 181 of 19.10.2001), the John P Hussman Foundation (USA), McLaughlin Centre (Canada), Ontario Ministry of Research and Innovation (Canada), the Seaver Foundation (USA), the Swedish Science Council, The Centre for Applied Genomics (Canada), the Utah Autism Foundation (USA) and the Wellcome Trust core award 075491/Z/04 (UK). D.P. is supported by fellowships from the Royal Netherlands Academy of Arts and Sciences (TMF/DA/5801) and the Netherlands Organization for Scientific Research (Rubicon 825.06.031). S.W.S. holds the GlaxoSmithKline-CIHR Pathfinder Chair in Genetics and Genomics at the University of Toronto and the Hospital for Sick Children (Canada). 
Computational infrastructure for RJLA was supported by the Trinity Centre for High Performance Computing (http://www.tchpc.tcd.ie/) funded through grants from Science Foundation Ireland. Funding to pay the Open Access publication charges for this article was provided by Autism Speaks.