Both rare and common genetic variants underlie risk for almost any complex disease. Over the past few years, genome-wide association (GWA) has become a standard tool for identifying common risk variants. Our analyses focus on results from GWA studies targeting common variants affecting risk for autism spectrum disorders (ASD). Thus far three large GWA studies have been published, each of which highlights a single, non-overlapping risk locus. Evaluation of these studies suggests that combining their data would diminish the evidence for all of these loci, so that none would remain significant. Despite this paucity of findings, statistical theory can be used to infer a plausible distribution of effect sizes for SNPs affecting risk for ASD. We lay out this theory, calculate plausible distributions, and discuss the results in the context of results from GWA studies for schizophrenia.
Autism is a neurodevelopmental disorder manifesting in early childhood. Children diagnosed with autism show impairments in social communication and a pattern of repetitive behavior and restricted interests (Lord et al., 1994, 2000). About 15–20 per 10,000 children receive this diagnosis (Fombonne, 2009) and 60–100 per 10,000 receive the broad diagnosis of autism spectrum disorders or ASD (Fernell and Gillberg, 2010; Baron-Cohen et al., 2009). Family studies suggest that ASD is substantially heritable (Bailey et al., 1995; Hurley et al., 2007; Constantino and Todd, 2005). Thus far more of the genetics of risk traces to rare variation (Jamain et al., 2003; Durand et al., 2006; Sebat et al., 2007; Marshall et al., 2008; Weiss et al., 2008; Kumar et al., 2008; Glessner et al., 2009; Fernandez et al., 2010; Pinto et al., 2010).
Here we focus on genome-wide association (GWA) studies and what they tell us about the impact of single nucleotide polymorphisms (SNP) with common variation on risk for ASD. There have been three large studies to date, none of them completely independent (Table 1). Unfortunately each of the studies highlights a single, non-overlapping risk locus: 5p14.1, between the neuronal cadherin loci CDH9 and CDH10 (Wang et al., 2009); 5p15.2, between the semaphorin (SEMA5A) and bitter taste receptor (TAS2R1) genes (Weiss et al., 2009); and within MACROD2 at 20p12.1 (Anney et al., 2010).
In the sequel we will first review the GWA findings. Then we will use statistical theory to derive, based on the empirical data, a plausible distribution of effect sizes for SNPs affecting risk for ASD. These results are discussed in the context of results from GWA studies for schizophrenia.
Wang et al. (2009) first examined data from the Autism Genetic Research Exchange (AGRE), a sample of 780 ASD families with 3101 subjects, and the Autism Case–Control cohort (ACC cohort), a sample of 1204 subjects with ASD and 6491 control subjects of genetically inferred European ancestry, for the purpose of discovering common genetic variants associated with ASD. Both cohorts were genotyped on the Illumina HumanHap550 BeadChip with over 550,000 SNPs. Each cohort was put through quality control and then examined separately, but no genome-wide significant associations were observed. However, when the cohorts were combined in a meta-analysis, one SNP located in the 5p14.1 region, rs4307059, reached genome-wide significance at P=3.4×10⁻⁸, with five other SNPs in the same locus also showing P<1×10⁻⁴.
Findings were replicated using data from the Collaborative Autism Project (CAP) cohort (Ma et al., 2009), which had 1390 subjects from 447 autism families genotyped on the Illumina HumanHap1M BeadChip with approximately one million SNPs, and the Center for Autism Research and Treatment (CART) cohort, which consisted of 108 ASD cases and 540 genetically matched controls genotyped on the Illumina HumanCNV370 BeadChip with over 300,000 SNP markers. The CAP cohort yielded P values ranging from 0.01 to 2.8×10⁻⁵ in the same direction of association, while in the CART cohort most of the SNPs yielded P<0.05 and also maintained the same direction of association. P values from the combined analysis on all four datasets ranged from 7.4×10⁻⁸ to 2.1×10⁻¹⁰. The odds ratio associated with rs4307059 was estimated to be 1.19.
Weiss et al. (2009) used families from two sources, the AGRE and National Institute of Mental Health (NIMH) databases. The AGRE sample included 780 multiplex autism families with ~3000 subjects genotyped on the Affymetrix 5.0 platform, which includes over 500,000 SNPs. The NIMH sample included 341 multiplex nuclear families with a total of 1233 subjects, genotyped on the Affymetrix 5.0 and 500K platforms at the same SNP markers genotyped in the AGRE sample.
Before combining the datasets, quality control filters were designed to identify robust SNPs. The combined dataset comprised 1031 nuclear families and a total of 1553 probands. The transmission disequilibrium test (TDT) was used for association analysis across all SNPs passing quality control in the complete family data set. Although no genome-wide significant associations were discovered, an excess of independent regions was observed at P<10⁻⁵ and P<10⁻⁴, which suggested that common variants in autism did exist but that the initial scan did not have sufficient power to identify them.
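As a rough illustration of the statistic behind the TDT, the following Python sketch computes the standard McNemar-type chi-square from the numbers of heterozygous parents who do and do not transmit a given allele to an affected child. The function name and the counts are hypothetical and are not taken from any of the studies reviewed here.

```python
import math


def tdt_statistic(transmitted: int, untransmitted: int) -> tuple[float, float]:
    """Return (chi-square, p-value) for a biallelic TDT.

    `transmitted` / `untransmitted` count heterozygous parents who did or
    did not pass the allele of interest to an affected offspring.
    Only heterozygous parents are informative for this test.
    """
    informative = transmitted + untransmitted
    if informative == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi2 = (transmitted - untransmitted) ** 2 / informative
    # Survival function of a chi-square with 1 df:
    # P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts, for illustration only.
chi2, p = tdt_statistic(60, 40)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

Because only heterozygous parents enter the statistic, the number of informative transmissions shrinks quickly as the minor allele becomes rare.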
Case–control association analysis was also performed using 90 independent and unrelated cases with 1476 NIMH control samples, which were also genotyped on the Affymetrix 500K platform. The cases were originally excluded from TDT analysis because of missing parental data. These results, combined with the TDT results, generated eight SNPs in seven independent regions with association at P<10⁻⁵. In particular, SNPs falling between SEMA5A and TAS2R1 (5p15.2) showed association signal; rs10513025 had an odds ratio of 0.55.
To verify the findings, replication was evaluated using two additional sets of autism family samples. The first consisted of 318 trios genotyped on the Affymetrix 5.0/500K arrays that were collected by investigators in the Autism Consortium and in Montreal, and the second included 1755 trios from independent Autism Genome Project (AGP) families, a set of Finnish families, and a set of Iranian trios. The same quality control and methods were applied to these data. In this independent replication attempt, only rs10513025 was found to retain P<0.01.
In combined analysis on both the scan and replication datasets, only rs10513025 met genome-wide significance criteria defined by linkage disequilibrium (LD) and permutation analyses (P<2.5×10⁻⁷). Imputation analysis was then used to increase coverage of the region, filling in missing genotypes and SNPs that had been excluded due to quality control restrictions. Several other promising SNPs appeared in these analyses and were confirmed by direct genotyping, notably rs10513026 (P=4.5×10⁻⁶) and rs16883317 (P=7.2×10⁻⁵).
Anney et al. (2010) report results from the AGP, which represents many centers in North America and Europe. For the first stage of a two-stage GWA, 1369 ASD families comprising 1385 ASD probands were genotyped on the Illumina Human 1M-single Infinium BeadChip. Four principal analyses were performed, corresponding to data partitions along axes of diagnosis and ancestry: ASD versus autism and European versus all ancestries. Of the 842,348 SNPs passing quality control filters, the largest associations arose for the most homogeneous sample, namely autism diagnosis in families of European ancestry. Association results for a cluster of SNPs falling in a 300-kb intronic region of MACROD2 were noteworthy, with the strongest association occurring for rs4141463 (P=2.1×10⁻⁸).
The AGP combined their data with data from independent simplex/multiplex ASD families in the AGRE database to perform a “mega-analysis” in 2179 families. (Note that the AGRE data are a subset of the data used in the other two GWA studies.) This “mega-analysis,” performed on all markers, weakened the evidence for association. At rs4141463 the estimated odds ratio changed from 0.56 (95% CI, 0.47–0.67) to 0.65 (95% CI, 0.57–0.75) for the strict diagnosis, European ancestry, resulting in P=4.7×10⁻⁸. The AGP also combined results from the family-based analysis with allele frequencies from 1842 controls from the Study on Addiction: Genetics and Environment (SAGE; Bierut et al., 2010), also genotyped with the Illumina Human 1M-single Infinium BeadChip. Combining the AGP family-based transmission data with control data, using techniques reported in Crossett et al. (2010), yielded no new loci. When all three data sets were analyzed together, the significance level for rs4141463 was P=3.7×10⁻⁸ (OR=0.73; 95% CI, 0.66–0.82) for strict diagnosis, European ancestry.
Comparisons are challenging for Weiss et al. (2009) versus Wang et al. (2009). These studies overlap because both analyzed the AGRE sample; however, they genotyped this sample using quite different arrays, and only a small fraction of the SNPs are in common across arrays. SNPs showing GWA significance on the Affymetrix 5.0 array (Weiss et al., 2009) are not well covered, in terms of LD, by the SNPs genotyped on the Illumina HumanHap550 array (Wang et al., 2009), and vice versa.
Comparison of Anney et al. (2010) with the other two studies is more straightforward. Weiss et al. (2009) find association at three SNPs, rs10513025, rs10513026, and rs16883317, with the strongest association at rs10513025. Its minor allele frequency is 0.04, and it is the minor allele that is under-transmitted to affected offspring. Neither this SNP nor the other two are genotyped by the Illumina 1M array. According to HapMap data, however, there is perfect LD between rs2234235, which was genotyped by the Illumina 1M array, and rs10513025. Unfortunately, there is no evidence for association in the data from Anney et al. (2010) for this proxy SNP. In fact the opposite allele, the more common one, is slightly under-transmitted. Note that the pattern of transmission (the minor allele is uncommon and under-transmitted) is cause for concern because statistical theory and empirical observation show this pattern can be generated by genotyping error (Yang et al., 2008). Weiss et al. (2009) did evaluate this issue carefully, including re-genotyping numerous samples by a different technology. Because samples assessed by different methods had identical genotypes, genotyping error appears to be an unlikely confounder.
The SNPs from the Wang et al. (2009) study, rs7704909, rs1896731, rs10038113, rs12518194, rs4307059, rs4327572, also do not show the same association for risk in the AGP data. The strongest association in the Wang study occurs at rs4307059. In the AGP data the P-value for association (biased transmission) at this SNP is 0.86, reflecting almost equal transmission of alleles. Results are similar for all six loci (range of P-values is 0.26–0.86); any allele showing modest over-transmission in the AGP data is opposite that reported by Wang et al. (2009). Note that a small subset of the AGP data overlaps with that of the Wang study, and its pattern of association in the AGP data agrees with those in Wang et al. (2009). Thus, genotyping error is an unlikely source of the difference between the two studies. It is important to note that most GWA studies report results from analyses using the additive model for mode of inheritance. However, most have also investigated models with recessive and dominant modes of inheritance, and these analyses have not resulted in additional loci.
It is worth noting that each of these three loci could be either a false or a true positive. If one were to combine the data for all studies, it appears that none of the three association statistics would cross the threshold for genome-wide significance. On the other hand, unbiased estimates of the effect sizes for these loci are modest: for the Wang and Anney studies, they are on the order of 1.1 to 1.2 for the odds ratio (or its inverse); for the Weiss study, the odds ratio is more extreme, 0.55, but power is low because the minor allele is so uncommon. (For family-based association, only heterozygous parents are informative, and their frequency is a function of the minor allele frequency.) Even if the results from the original studies were true positives, the power to replicate findings is small unless sample sizes for replication are much larger than those for discovery. In addition, the bias inherent in the winner's curse could contribute to the lack of replication. Because GWA studies search over many loci and discovery datasets tend to be relatively small, this bias can generate substantially inflated estimates of risk. Because the actual effect is much smaller than the estimate, even larger studies are then required to replicate it. Therefore failure to garner supportive evidence in relatively small samples (small in terms of power to detect such effects) is not surprising.
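The winner's curse can be illustrated with a small simulation. The Python sketch below draws noisy estimates of a single true log odds ratio, keeps only those that pass a genome-wide threshold, and compares their average with the truth. All parameter values (true effect, standard error, number of simulations) are hypothetical choices made for illustration, not quantities estimated from the ASD studies.

```python
import random
from statistics import NormalDist, mean


def winners_curse_demo(true_beta=0.1, se=0.03, alpha=5e-8,
                       n_sim=100_000, seed=0):
    """Simulate estimates of one true effect and keep the 'discoveries'.

    Each simulated study estimates the log odds ratio as
    true_beta + Normal(0, se) noise. A study counts as a discovery when
    its z-score exceeds the two-sided critical value for `alpha`
    in the direction of the true effect.
    """
    z_crit = -NormalDist().inv_cdf(alpha / 2.0)
    rng = random.Random(seed)
    estimates = (rng.gauss(true_beta, se) for _ in range(n_sim))
    significant = [b for b in estimates if b / se > z_crit]
    return z_crit, significant


z_crit, hits = winners_curse_demo()
print(f"discoveries: {len(hits)}")
print(f"true log OR: 0.1, mean log OR among discoveries: {mean(hits):.3f}")
```

Every retained estimate must exceed `z_crit * se`, so when power is low the average among discoveries sits well above the true value, and a replication sample sized for the inflated estimate will be underpowered for the real one.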
GWA studies have had varying levels of success in identifying susceptibility loci for complex traits. The typical effect sizes of these loci vary from moderate to weak, depending on the trait. Based on the distribution of the effect sizes of discovered susceptibility loci, Park et al. (2010) show how it is possible to predict the number and effect size of loci not yet identified. Using Crohn's disease and cancers of breast, prostate and colorectum, these authors estimate the sample size required to discover additional loci that could explain at least 15–20% of the heritability of these traits.
For ASD the story is more challenging. From the results described above it is reasonable to conclude that no common variants are convincingly associated with ASD at GWA significance for all reported data. The question arises, will larger studies yield signals? What is the plausible number of variants yet to be discovered and what effect sizes might they possess? Extending the techniques utilized by Park et al. (2010), we use statistical methods to answer these questions.
Assuming a particular range of effect size, we wish to estimate n, the number of susceptibility loci yet to be discovered. Traditional estimators for n, such as maximum likelihood and the method of moments, both yield 0, which is not useful for predicting the outcome of larger, more powerful studies that might be undertaken in the future. For a range of effect sizes, we solve for the median value of n, medn, and the largest plausible value, upn, of the number of susceptibility loci consistent with observing no significant signals in a GWA study of a particular sample size. These two estimates correspond to the upper bounds of a 50% and a 95% confidence interval for n, respectively. The former provides a useful point estimate for n and the latter a natural upper bound. Power to detect a signal in a study varies with minor allele frequency f, log odds ratio β, targeted significance level (for our purposes, 5×10⁻⁸), and LD structure. Assuming perfect LD, the effect size is es = 2β²f(1 − f). The power is a direct function of es and the other variables given above. We can compute the power of a given study for any es and call it pow. Let us assume that there are n true effects, each with power pow. Our goal is first to determine how big n is likely to be, given that a study was conducted that yielded X=0 significant results.
We form an exact (1–α)·100% confidence interval for n, assuming we observe X=0 successes, when the success probability (or power) is pow. Clearly this interval includes 0, but what is the biggest plausible n consistent with X=0? If the n loci are treated as independent, each detected with probability pow, the probability of observing 0 significant hits is (1 − pow)^n. We find the upper limit of the confidence interval as the value of n at which this probability falls to α, that is, (1 − pow)^n = α, or n = log(α)/log(1 − pow). Setting α=0.5 gives medn and α=0.05 gives upn.
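A minimal Python sketch of this bound follows, assuming the n loci are detected independently, each with probability pow, so that the chance of zero hits is (1 − pow)^n. The function name is hypothetical, and rounding down to the largest integer n that keeps this probability at or above α is a choice made here, not something stated in the text. Tabulated medn and upn values depend on unrounded power, so inputs such as pow=0.02 are illustrative only.

```python
import math


def largest_plausible_n(power: float, alpha: float) -> int:
    """Largest integer n with (1 - power)**n >= alpha.

    alpha = 0.5 gives a median-type estimate (medn);
    alpha = 0.05 gives a 95% upper bound (upn).
    """
    if not (0.0 < power < 1.0 and 0.0 < alpha < 1.0):
        raise ValueError("power and alpha must lie strictly between 0 and 1")
    return math.floor(math.log(alpha) / math.log(1.0 - power))


# Illustrative inputs only.
for pow_ in (0.02, 0.10, 0.50):
    medn = largest_plausible_n(pow_, 0.5)
    upn = largest_plausible_n(pow_, 0.05)
    print(f"pow={pow_:.2f}: medn={medn}, upn={upn}")
```

The ratio upn/medn is close to log(0.05)/log(0.5) ≈ 4.3 whatever the power, which is why low-power settings leave room for a large number of undetected loci.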
Let us apply this theory to a sample of similar magnitude to the AGP + AGRE + SAGE controls, as described above (Anney et al., 2010). Results are given in Table 2. Based on the outcome of our calculations and the AGP study, it seems quite likely that there are no common SNP variants mapping onto an odds ratio of 1.5 or more. At best a handful of common SNPs mapping onto an odds ratio of 1.3 might still await discovery, and these SNPs are likely to be relatively uncommon. With a larger sample size (≥7500 cases) numerous SNPs of effect size in the range of 1.2 might be discovered: medn= 20, 5, and 2 for f=0.15, 0.25, and 0.5, respectively. Of course one cannot rule out discovering no such loci. With a much larger study (≥10,000 cases), many of the variants (medn=41, upn=177) corresponding to power=0.02, are likely to be discovered if they exist.
Given the relatively small samples analyzed for ASD GWA, it could be instructive to look at the results from GWA for schizophrenia. While there have been a substantial number of GWA studies for schizophrenia, two studies are large and noteworthy for our purposes, Stefansson et al. (2009) and Shi et al. (2009), which themselves are not independent. The Stefansson study was a two-stage GWA. In Stage 1 genotypes from 2663 individuals diagnosed with schizophrenia were contrasted with genotypes from 13,498 individuals taken as controls; in Stage 2 there were 4999 individuals diagnosed with schizophrenia and 15,555 controls. The study yields three significant findings, one locus involving the major histocompatibility (MHC) region, another intronic in TCF4 and the third upstream of NRGN. Odds ratios for these loci fall close to 1.2.
Results from the Shi study, of interest here, involve the meta-analysis of three large cohorts (including some of the samples reported in the Stefansson study), which in total yielded 8008 genotyped subjects diagnosed with schizophrenia and 19,077 controls. The Shi study found a single significant locus, specifically the MHC region, and the odds ratios of associated MHC SNPs were on the order of 1.15.
Let us use the results from the Shi study, the theory in Park et al. (2010) and what we have developed here to answer the following question. Given that you did find one variant with odds ratio 1.15 in a study of size 8000 cases and 20,000 controls, how many variants of various effect sizes are likely to be discovered in the future? Using odds ratio=1.15, f=0.15 and X=1, we obtain medn=1 and upn=5, suggesting that at most a handful of additional loci with comparable effect size remain undiscovered. But for a slightly smaller odds ratio (1.1), additional loci might yet be revealed: for X=0, medn=2, 4, or 17 and upn=7, 17 or 73, respectively, depending on allele frequencies f=0.5, 0.25 or 0.15. A few things are of interest from these calculations. The Stefansson et al. (2009) study is of similar sample size to the Shi study. While not perfectly independent, it is notable that this study finds two new loci in the range of odds ratio 1.15, in ballpark agreement with the prediction above. For a larger study, we would expect about 4 new SNPs to be discovered that map onto an odds ratio of 1.1, and more are possible for smaller allele frequencies. Finally, there are likely to be quite a few discoveries in the future if researchers continue to amass samples and assess common variant SNPs for association, but their effect sizes will be quite small. For these odds ratios of 1.15 and 1.1 and for allele frequency of 0.15, 80% power is achieved for roughly 10,500 and 23,000 cases and proportionately larger sets of controls. The way power works into the calculations is as follows: if the next sample had 16,000 cases and 40,000 controls, you would expect to discover about 80% of the loci with odds ratio=1.1 (i.e., 80% power to detect a locus of effect size 1.1).
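The text does not spell out how the interval was computed when X=1. One natural extension, assumed here purely for illustration, is to keep the binomial model and take as the upper limit the largest n for which observing X or fewer hits still has probability at least α. In this sketch n counts all loci of the given effect size, including any already found. The function names and inputs are hypothetical, and the sketch makes no attempt to reproduce the medn and upn values quoted above.

```python
from math import comb


def binom_cdf(x: int, n: int, p: float) -> float:
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p ** k * (1.0 - p) ** (n - k) for k in range(x + 1))


def upper_limit_n(x: int, power: float, alpha: float, n_max: int = 100_000) -> int:
    """Largest n >= x with P(X <= x | n, power) >= alpha.

    ASSUMPTION: loci are detected independently with equal power.
    With x = 0 this reduces to the largest n with (1 - power)**n >= alpha.
    """
    n = x  # with n == x loci, P(X <= x) is 1, so n = x is always plausible
    while n < n_max and binom_cdf(x, n + 1, power) >= alpha:
        n += 1
    return n


# Illustrative inputs only.
print(upper_limit_n(x=1, power=0.5, alpha=0.05))
print(upper_limit_n(x=0, power=0.02, alpha=0.05))
```

Because the cumulative probability falls as n grows, a simple upward search is enough; the `n_max` cap only guards against very small power values.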
We posed a question in our title: “Do common variants play a role in risk for autism?” Based on our review of the genome-wide association studies for autism, the short answer is “we cannot be certain there is a role for common variants in autism risk.” There are, in our judgment, no definitive, replicated results. Our statistical analyses can put plausible bounds on what risk variation might exist, but the intervals always include zero, that is, no common risk variants. On the other hand, we find zero implausible on the basis of comparisons to our analyses of schizophrenia GWA studies. Several common risk variants for schizophrenia have been discovered (Stefansson et al., 2009; Shi et al., 2009). Our analyses suggest more await discovery. ASD and schizophrenia show some striking overlap in rare risk variants (Cook and Scherer, 2008), and it will not be too surprising to find they share some common ones as well. More to the point, it will not be surprising at all if the two disorders are similar in the distribution of effect sizes, once samples evaluated for autism reach the current status for schizophrenia (about 18,000 cases in large-scale genetic analyses).
Each of the three loci identified by large GWA studies has compelling features. For example, there is evidence the SNP identified by Anney et al. (2010) regulates expression of PLD2 (Duan et al., 2008), which is known to regulate axonal outgrowth (Kanaho et al., 2009) and metabotropic glutamate receptor signaling (Dhami and Ferguson, 2006), features of relevance to ASD.
Our statistical results for ASD were obtained using methods similar to those developed in Park et al. (2010); however the inferences require somewhat more speculation. Park et al. (2010) estimate the number of undiscovered variants, based on the nature of the variants already in hand. For ASD, no susceptibility loci have been reliably replicated over GWA studies. From this outcome and our calculations, we can conclude that there are not likely to be any discoveries of variants with moderate effect size. Nevertheless, based on our calculations, it seems plausible that many variants of weak effect could yet be discovered. Incremental increases in sample size will not reveal these effects. Based on our calculations it seems likely that tens of thousands of cases will be required to detect common SNP susceptibility variants causing ASD.
Some researchers would conclude, based on our calculations, that searching for common variants affecting risk for ASD is not worth the cost. We do not see it that way. It is reasonable to conclude that common variation dredged from simple GWA analyses will not explain a substantial fraction of the heritability of ASD. However, it is unreasonable to conclude much beyond that simple statement. It is entirely possible that findings from common variant studies will identify new pathways to risk or treatment, highlight genes also containing rare risk variants, and point to gene–gene interactions of significance to our understanding of ASD.
We gratefully acknowledge Su Hee Chu for her editorial work on the manuscript. This work was supported by grant MH057881 (BD, KR) and MH077930 (NM) from the National Institute of Mental Health.