To test the common variant hypothesis more comprehensively, we performed an unbiased genome-wide association study (GWAS) of common variation, using as the discovery dataset the Caucasian autistic families from the Collaborative Autism Project (CAP). We validated our findings using an independent, publicly available family-based GWAS dataset from the Autism Genetics Resource Exchange (AGRE) (Autism Genetics Resource Exchange 2008). Quality-control (QC) procedures were applied to the more than 1,000,000 single nucleotide polymorphisms (SNPs) in the discovery dataset and the 550,000 SNPs in the validation dataset.
After applying QC filters, 775,311 common autosomal SNPs remained in the discovery dataset with an average genotyping rate of 99.80%, and 500,100 common autosomal SNPs remained in the validation dataset with an average genotyping rate of 99.82%. To account for possible population stratification, we excluded families in which the values of the top two principal components for either of the proband's parents were more than 4 standard deviations from the core Caucasian cluster generated in EIGENSTRAT (Patterson, Price & Reich 2006). The final datasets included 1,390 samples from 438 autistic families in the discovery dataset and 2,390 samples from 457 autistic families in the validation dataset. For any SNP of interest in the discovery dataset that was not directly genotyped in the validation dataset, genotypes were imputed in the validation dataset using the program IMPUTE (Marchini et al. 2007). The Pedigree Disequilibrium Test (PDT) (Martin et al. 2000, Martin, Bass & Kaplan 2001) was used for all association analyses. The distribution of p-values in the discovery dataset closely matched that expected under the null hypothesis, except at the extreme tail of low p-values. This is expected if there is little residual error in the data and common variants of modest effect size are acting in autism. In the discovery dataset, none of the p-values met the stringent and overly conservative Bonferroni correction for genome-wide significance.
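The stratification filter described above can be illustrated with a minimal sketch. It assumes that each parent has two principal-component scores and that the center and spread of the core cluster are estimated from a reference set of samples. The function names, the data layout, and the choice to flag a parent when either component exceeds the threshold are illustrative assumptions and are not taken from EIGENSTRAT.

```python
from statistics import mean, stdev

def is_outlier(pcs, center, sd, threshold=4.0):
    """True if any PC score lies more than `threshold` SDs from the cluster center."""
    return any(abs(x - c) > threshold * s for x, c, s in zip(pcs, center, sd))

def families_to_keep(parent_pcs, core_pcs, threshold=4.0):
    """Return the IDs of families in which no parent is a PC outlier.

    parent_pcs: {family_id: [(pc1, pc2), ...]}, one tuple per parent (assumed layout).
    core_pcs:   list of (pc1, pc2) for the samples defining the core cluster (assumed).
    """
    columns = list(zip(*core_pcs))           # one column of scores per component
    center = [mean(col) for col in columns]  # cluster center per component
    sd = [stdev(col) for col in columns]     # sample SD per component
    return [
        family
        for family, parents in parent_pcs.items()
        if not any(is_outlier(p, center, sd, threshold) for p in parents)
    ]
```

A family is dropped as soon as one parent falls outside the threshold, which mirrors the "either of the proband's parents" wording of the exclusion rule.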
Quantile-Quantile (Q-Q) plot of PDT p-values for the discovery dataset
Genome-wide plot of association p-values in the discovery dataset
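As a rough illustration of how a Q-Q comparison of observed and expected p-values can be prepared, the sketch below pairs each sorted p-value with its expected value under a uniform null distribution. The plotting position (i + 0.5)/n is one common convention and is an assumption here; the source does not state which convention was used, and the function name is hypothetical.

```python
import math

def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs, most significant first.

    Under the null hypothesis p-values are uniform on (0, 1), so the i-th
    smallest of n p-values (i counted from 0) is expected near (i + 0.5) / n.
    All p-values must be strictly greater than 0.
    """
    observed = sorted(p_values)
    n = len(observed)
    return [
        (-math.log10((i + 0.5) / n), -math.log10(p))
        for i, p in enumerate(observed)
    ]
```

Points lying on the diagonal indicate agreement with the null distribution; departures confined to the largest -log10(p) values correspond to the extreme tail of low p-values.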
Examination of the 651 SNPs in the CNTNAP2 gene (Arking et al. 2008, Bakkaloglu et al. 2008) in our discovery dataset revealed only eight genotyped SNPs that were nominally significant (p-values = 0.002-0.04). The results did not improve significantly in male-only families (data not shown). The tagging SNP, rs270102, reported by Alarcon et al. (Bakkaloglu et al. 2008), was not significant in either the overall or the male-only family dataset. SNP rs7794745, which showed linkage in the study by Arking et al. (Arking et al. 2008), was not genotyped in our dataset, and association of imputed genotypes for this SNP was not significant (p=0.62). None of the tested markers met gene-wide (CNTNAP2) significance after correction (data not shown).
Although no association reached genome-wide significance, 96 SNPs showed strongly suggestive association with autism risk (p<0.0001) and met our initial criteria for follow-up. Among these 96 top hits, 2 SNPs residing in 5p14.1 had improved p-values in the joint analysis and also showed a nominally significant association signal in the validation dataset, which encouraged us to examine this region in more detail. We therefore examined every SNP (n=46) genotyped in this region (25,830 kb to 26,100 kb) in both datasets, regardless of initial p-value. These analyses revealed a cluster of 19 SNPs, including 8 imputed SNPs, showing nominally significant association (p<0.05) in the validation dataset (data not shown). Eight SNPs on chromosome 5p14.1 showed improved association signals in the joint dataset. Risk was associated with the same allele for these eight SNPs in both datasets, and the p-values became more significant in the joint analysis (p-values: 3.24E-04 to 3.40E-06), with the most significant p-value coming from rs10038113, one of the top 96 hits. The odds ratios for the major alleles ranged from 0.75 to 1.32.
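The follow-up criteria described above (suggestive in discovery, nominally significant in validation, same risk allele, improved joint p-value) can be expressed as a simple filter. The sketch below is illustrative only: the record layout, the function name, and all SNP labels and p-values in the example are invented for demonstration and are not results from this study.

```python
from dataclasses import dataclass

@dataclass
class SnpResult:
    snp: str
    p_discovery: float
    p_validation: float
    p_joint: float
    risk_allele_discovery: str
    risk_allele_validation: str

def validated_snps(results, p_disc=1e-4, p_val=0.05):
    """Keep SNPs that pass all four follow-up criteria."""
    return [
        r.snp
        for r in results
        if r.p_discovery < p_disc                                # suggestive in discovery
        and r.p_validation < p_val                               # nominal in validation
        and r.risk_allele_discovery == r.risk_allele_validation  # same risk allele
        and r.p_joint < r.p_discovery                            # joint analysis improves
    ]

# Invented example records (hypothetical labels and values).
example = [
    SnpResult("snp_a", 5e-5, 0.01, 4e-6, "A", "A"),  # passes all criteria
    SnpResult("snp_b", 8e-5, 0.30, 2e-4, "G", "G"),  # fails validation
    SnpResult("snp_c", 2e-5, 0.04, 1e-5, "C", "T"),  # risk alleles disagree
]
print(validated_snps(example))
```

Requiring the same risk allele in both datasets guards against counting a signal as replicated when the direction of effect is reversed.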
Association statistics for validated SNPs on chromosome 5p14.1:
To determine whether we might have missed a strong signal by using only the CAP dataset for discovery, we also reversed the roles of the discovery and validation datasets and applied the same two-stage approach. Twenty-one SNPs had p-values < 0.0001 in the AGRE dataset, but none of them replicated in the CAP dataset, even at a nominal significance level of p<0.05. We computed the power of the transmission disequilibrium test (TDT) in 438 triad families, which approximates a lower bound for the power of the PDT in our discovery sample. Given a prevalence of autism of 0.0066 (Chakrabarti, Fombonne 2005) and a SNP in LD (D'=1) with a risk allele frequency of 0.6, we expect 84% power to detect an association at p=0.0001 under a recessive model (GRR_AA=1) and 33% under an additive model (GRR_AA=1.5). These are consistent with the allelic GRRs estimated for the chromosome 5 region. The power to detect an association at Bonferroni-corrected genome-wide significance (p = 0.05 / 775,311 SNPs = 6.4×10⁻⁸) drops to 30% and 2.5% for the recessive and additive models, respectively.
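The Bonferroni threshold quoted above follows directly from dividing the family-wise error rate by the number of SNPs tested. The sketch below reproduces that arithmetic and, for orientation, also shows the standard TDT statistic for a single biallelic marker, (b - c)^2 / (b + c), where b and c are the numbers of heterozygous parents transmitting and not transmitting a given allele. The function names and the transmission counts in the example are hypothetical, and the sketch does not reproduce the power calculation itself.

```python
import math

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that controls the family-wise error at alpha."""
    return alpha / n_tests

def tdt_statistic(transmitted, untransmitted):
    """TDT chi-square statistic (1 df) from heterozygous-parent transmission counts."""
    b, c = transmitted, untransmitted
    return (b - c) ** 2 / (b + c)

def chi2_1df_p_value(statistic):
    """Upper-tail p-value of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))

threshold = bonferroni_threshold(0.05, 775_311)
print(f"{threshold:.1e}")  # 6.4e-08

# Invented counts: 60 transmissions versus 40 non-transmissions.
stat = tdt_statistic(60, 40)
print(stat, chi2_1df_p_value(stat) < threshold)  # 4.0 False
```

With these invented counts the test is nominally significant (p of about 0.046) but falls far short of the genome-wide threshold, which illustrates why power drops so sharply under the Bonferroni correction.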