Studies of the distribution of de novo copy number and sequence variants in ASD and control subjects consistently find elevated rates of damaging de novo events in ASD subjects. These studies also consistently find relatively little convergence of de novo events on particular loci in the human genome. These results are consistent with only one conclusion about the genetic architecture of ASD, namely that there are hundreds of genes in the genome that can affect liability, possibly more. Indeed, various statistical analyses of the data support this conclusion.
Another common theme of ASD studies is that, while de novo events are rare, they can successfully identify ASD liability genes; in general the distribution of rare variation has been a key tool for gene discovery. By contrast, common variation has not yet proven an effective tool for discovering replicable ASD genes, although there are tantalizing findings.
With these observations in mind, the AASC has implemented a study of rare variation in ASD based on WES. Here we report on data from almost 2000 ASD subjects and controls. We find the distribution of rare variation between cases and controls to be remarkably similar, indicating that ASD risk genes cannot be identified in a case-control sample of this size. Indeed, even known ASD genes showed little association in this study. This finding is in keeping with other rare-variant studies of quite different phenotypes, supporting the conjecture that rare-variant association studies require large samples. With respect to the genetics of ASD, the results are also consistent with the inference from de novo studies that there must be hundreds of genes affecting liability to ASD. These results underscore the scale of the challenges ahead in our effort to discover ASD genes: large samples must be amassed and assessed, and effective study designs implemented.
To gain insight into the limited power of this study, consider three scenarios: (A) the gene has 15 variants, each with MAF, for which all have odds ratio of 4; (B) the gene has 20 variants, each with MAF, for which 10 have odds ratio of 3; and (C) the gene has 40 variants, each with MAF, for which 30 have odds ratio of 2. We list the required sample sizes for each scenario in Table S6 to achieve a power of 50% and 80% per gene (with a p-value threshold of ). For our study, the power of mega-analysis under these scenarios is only 0.31, 0.11 and 0.06, respectively; yet if the scenarios were realistic, even this power would have been sufficient to discover some fraction of the large number of ASD genes present in the genome. Because no gene reached significance, we conclude that these scenarios do not describe likely models for risk genes in ASD.
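To make the power reasoning concrete, the sketch below approximates per-gene power for a collapsed (burden-style) rare-variant test by comparing carrier frequencies between cases and controls. The MAF value, the gene-level collapsing, and the two-proportion approximation are all illustrative assumptions, not the calculation actually used for Table S6.

```python
# Illustrative per-gene power approximation for a collapsed rare-variant
# test. The MAF value and the two-proportion normal approximation are
# assumptions for illustration, not the study's actual calculation.
from statistics import NormalDist

_nd = NormalDist()

def burden_power(n_variants, n_risk, maf, odds_ratio, n_per_group,
                 alpha=2.5e-6):
    """Approximate power to detect a difference in the fraction of
    subjects carrying at least one rare allele in the gene."""
    # probability a control carries >= 1 rare allele anywhere in the gene
    p0 = 1.0 - (1.0 - 2.0 * maf) ** n_variants
    # cases: the n_risk risk variants are enriched roughly by the odds
    # ratio (a reasonable approximation only because the variants are rare)
    p1 = 1.0 - ((1.0 - 2.0 * maf) ** (n_variants - n_risk)
                * (1.0 - 2.0 * maf * odds_ratio) ** n_risk)
    pbar = (p0 + p1) / 2.0
    se_null = (2.0 * pbar * (1.0 - pbar) / n_per_group) ** 0.5
    se_alt = ((p0 * (1 - p0) + p1 * (1 - p1)) / n_per_group) ** 0.5
    z_crit = _nd.inv_cdf(1.0 - alpha / 2.0)
    return _nd.cdf((abs(p1 - p0) - z_crit * se_null) / se_alt)

# Scenario (A): 15 variants, all with odds ratio 4; the MAF is a guess.
power_a = burden_power(15, 15, maf=5e-4, odds_ratio=4.0, n_per_group=1000)
```

Varying the hypothetical MAF and sample size in this sketch shows how quickly power collapses when effect sizes and allele frequencies are both small.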
As with GWAS, to amass large samples and gain power, multiple studies must be combined. In the analysis of samples from multiple studies, meta-analysis based on Z-scores has become the norm for most genetic investigations. This form of meta-analysis has power equal to mega-analysis for single-variant tests, and because it more easily accommodates heterogeneity across studies, it is often assumed to be generally superior to mega-analysis. A notable result from our study is that this equivalence does not carry over to gene-based tests such as SKAT. In that setting mega-analysis has considerably more power than meta-analysis, because mega-analysis assesses the concordance of association for each variant across all sites and then combines information across all variants within a gene. In this way the method separates true signals from false ones and attains a greater signal-to-noise ratio. In contrast, meta-analysis combines information across studies at the gene level and hence cannot assess the pattern of signals at the variant level across sites.
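The Z-score meta-analysis described above is typically a weighted Stouffer combination; a minimal sketch follows, with weights proportional to the square root of study sample size (an assumption about the exact weighting scheme used):

```python
import math

def stouffer_meta(z_scores, sample_sizes):
    """Combine per-study Z-scores for a single variant, weighting each
    study by the square root of its sample size (Stouffer's method)."""
    weights = [math.sqrt(n) for n in sample_sizes]
    num = sum(w * z for w, z in zip(weights, z_scores))
    den = math.sqrt(sum(w * w for w in weights))
    return num / den

# Two equally sized studies with the same signal reinforce each other:
z = stouffer_meta([2.0, 2.0], [1000, 1000])  # -> 2 * sqrt(2) ≈ 2.83
```

Because this combination happens one variant (or one gene-level statistic) at a time, it cannot examine the variant-by-variant pattern of concordance across sites that mega-analysis exploits.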
A drawback of mega-analysis is the challenge of combining datasets collected across multiple studies, which can differ in many respects, such as the sequencing platforms and protocols used. These differences lead, for instance, to differential coverage by exon and different alignment errors. Even the best laboratory process has measurement error, and such errors are exacerbated when they differ across batches of samples, particularly if they differ between cases and controls. For these reasons caution must be exercised if one is to reap the benefits of mega-analysis. Indeed, even after careful filtering, heterogeneity between sites could account for the modest inflation in the association test statistics and the genomic control factor after combining sites via meta- and mega-analysis.
In this study we construct extra filters to ensure that the distribution of rare variation of the WES data is similar for the two centers. We find good results filtering called variants by fraction of missing data, read depth, and balance of alternative to reference reads. Ideally a filter is tuned by measuring some individuals on multiple platforms. We tune our filters using subjects measured twice. If such data are unavailable, however, we find that another promising approach is to compare minor allele counts (based on rare variants) per gene. A good filter is one that aims to equilibrate these quantities.
Even with the most minimal filtering we observe no excess of positive signals for association within the individual data sets, but for mega-analysis without filtering we observe a great number of positive associations. These false discoveries are diminished, however, after filtering. Mega-analysis is also more susceptible than meta-analysis to the impact of differential missingness across platforms and across case/control status: without filtering, mega-analysis produces many false discoveries whereas meta-analysis does not. Using filtered data, however, we find that mega-analysis is quite robust to differences in missingness rates across platforms and case/control status, although we recognize that this robustness could fail for more extreme heterogeneity of missingness. Still, our study has some differences in missingness and yet does not produce detectable false discoveries. From our analyses we conjecture that filtering that removes variants with missingness (per data set) is largely effective.
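A per-variant filter of the kind described above might look like the following sketch; the threshold values and the variant record layout are hypothetical, not the study's actual cutoffs.

```python
# Hypothetical per-variant QC filter mirroring the criteria discussed:
# fraction of missing genotypes, read depth, and balance of alternative
# to reference reads. All thresholds are illustrative only.
def passes_qc(variant, max_missing=0.05, min_depth=10,
              min_ab=0.3, max_ab=0.7):
    """variant: dict with keys 'missing_rate' (fraction of samples with
    no call), 'mean_depth' (mean reads covering the site), and
    'allele_balance' (alt reads / total reads among het carriers)."""
    return (variant["missing_rate"] <= max_missing
            and variant["mean_depth"] >= min_depth
            and min_ab <= variant["allele_balance"] <= max_ab)

good = {"missing_rate": 0.01, "mean_depth": 35, "allele_balance": 0.48}
bad = {"missing_rate": 0.20, "mean_depth": 35, "allele_balance": 0.48}
```

Applying such a filter separately within each data set, then comparing per-gene minor allele counts across centers, is one way to check that the surviving variants are comparably ascertained.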
When combining data sets, the effects of population substructure on association are also a concern, due to clustering of rare variants in ancestry space. Even though our case-control samples are approximately pair-matched by ancestry in the study design, we find weak evidence of population structure confounding the test of association. In our data these effects could be mitigated by regressing out principal components of ancestry computed from common variants or low-frequency variants. This result supports some previous findings but is contrary to other predictions. Thus, although rare variants tend to be younger, and therefore distinctly clustered in populations, in our sample estimates of ancestry derived from common variants capture the major features of the distribution of rare variants in ancestry space.
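The ancestry adjustment can be sketched as residualizing on principal components. The example below regresses out a single PC by ordinary least squares; in practice several PCs would enter a logistic model as covariates, so the single-PC linear version is purely illustrative.

```python
# Illustrative sketch: remove the linear effect of one ancestry
# principal component from a vector of values (e.g. genotype dosages)
# by simple least squares. Real analyses use several PCs as covariates.
def regress_out_pc(values, pc):
    """Return `values` with the linear effect of one ancestry PC
    removed (OLS on a single covariate, intercept included)."""
    n = len(values)
    mean_v = sum(values) / n
    mean_p = sum(pc) / n
    cov = sum((v - mean_v) * (p - mean_p) for v, p in zip(values, pc))
    var = sum((p - mean_p) ** 2 for p in pc)
    beta = cov / var
    return [v - mean_v - beta * (p - mean_p) for v, p in zip(values, pc)]

# After adjustment the residuals are uncorrelated with the PC.
resid = regress_out_pc([0.0, 1.0, 0.0, 2.0], [-1.0, 0.5, -0.5, 1.0])
```

The key property is that the residuals carry no linear ancestry signal, so any remaining case-control difference cannot be attributed to the adjusted component.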
In conclusion we find that WES data on nearly 2000 samples collected for a case-control study are insufficient to discover novel liability genes for ASD, even after applying efficient methods like mega-analysis and controlling for ancestry effectively. These results demonstrate that much larger samples will be required for effective gene discovery and lend further support to the prediction that there are hundreds of genes that impact ASD liability in the human genome.