|Home | About | Journals | Submit | Contact Us | Français|
Exome sequencing is emerging as a popular approach to study the effect of rare coding variants on complex phenotypes. The promise of exome sequencing is grounded in theoretical population genetics and in empirical successes of candidate gene sequencing studies. Many projects aimed at common diseases are underway, and their results are eagerly anticipated. In this Perspective, using exome sequencing data from 438 individuals, we discuss several aspects of exome sequencing studies that we view as particularly important. We review processing and quality control of raw sequence data, evaluate the statistical properties of exome sequencing studies, discuss rare variant burden tests to detect association to phenotypes, and demonstrate the importance of accounting for population stratification in the analysis of rare variants. We conclude that enthusiasm for exome sequencing studies of complex traits should be combined with the caution that thousands of samples may be required to reach sufficient statistical power.
Next-generation sequencing1–5 coupled with efficient DNA capture6–8 enable exome sequencing as a new approach to study the genetic basis of human phenotypes. A number of genes underlying Mendelian diseases have been mapped using this approach6, 9–15. Exome sequencing has also been applied to tumors16–20, where sample purity, read-mapping, and chromosomal rearrangements are critical and form a very distinctive set of issues. In this Perspective, we restrict our attention to complex traits. In complex trait genetics, exome sequencing studies bring to light rare coding variants that are undetected by microarray-based genome-wide association studies (GWAS). The promise of exome sequencing studies of complex traits is based on the success of candidate gene studies21–26 and has firm roots in population genetic theory27–35.
Large-scale GWAS of complex traits have consistently demonstrated that, with few exceptions, common variants have modest effects, often requiring tens of thousands of samples for their detection. Exome sequencing provides a complementary approach by comprehensively assessing the role of all coding variation, both common and rare. With incessant mutations occurring in each protein-coding gene (at a rate of ~10−5 per gene for non-synonymous variants36–39) and fitness loss of less than 1% 29–31, 34 for most novel non-synonymous mutations, almost every gene is expected to harbor functionally important variants that can be tested through sequencing, even if these variants are rare. Therefore, the strong interest in exome sequencing stems from three factors: the potential to identify many genes underlying complex traits, straightforward functional annotation of coding variation, and cost being substantially lower (around 5 times) than whole-genome sequencing.
In this Perspective, we evaluate the extent of rare coding variation in empirical data, discuss data processing and quality control of raw sequence data, review analytical methods for detecting genotype-phenotype associations, their expected statistical power, and the potential for confounding due to population stratification. To illustrate our arguments, we used empirical whole-exome sequence data from 184 individuals from the International HIV Controllers Study40 and 254 control individuals from Schizophrenia (SCZ) exome sequencing study.
Exome sequencing data contain an abundance of rare coding variation and indicate that a large fraction of this variation is functional. Not only are there many more rare variants than common ones, but sequencing additional samples continues to uncover additional rare variants. In fact, as sample size increases, the number of observed variants grows much faster than predicted by the neutral model of constant population size41, 42 (Figure 1). This relative excess of rare variants can be, in part, attributed to recent population expansion43–45, but is also likely due to purifying selection. As a consequence, rare variation is enriched for evolutionarily deleterious, and thus functional, variants. Additionally, the proportion of non-synonymous variants is higher among rare than among common variants45. Finally, among rare variants, missense variants predicted46 to be damaging are more prevalent than variants predicted to be benign (Figure 1). These findings are consistent with studies that demonstrated that rare variants in protein-coding regions are under purifying selection35, 47–51. Because sequencing larger samples continuously uncovers functionally relevant variants, exome sequencing studies enable direct identification of causal variants (in contrast to GWAS that use linkage-disequilibrium patterns between common markers).
An exome sequencing study starts with exome capture and sequencing of DNA samples followed by the identification of sequence variants. Exome capture may be realized on many platforms (e.g., Illumina HiSeq, Roche 454, ABI SOLiD) and through a variety of probe definitions (e.g., Agilent SureSelect, Nimblegen SeqCap EZ). Recent advances have enabled sequencing an entire exome or even several exomes at deep coverage in a single run of the sequencing instrument. However, exome capture technologies differ in what they target, how much they capture, and how consistently they do so8. Moreover, only 80-90% of the targeted regions are covered above 10×, which may leave 4–8Mb (or 1000–2000 genes) without sufficient coverage for variant detection.
Exome sequencing coverage has tremendous regional variation8. Some regions may be over-covered, representing true structural variation (e.g., segmental duplications for which only one copy of the region exists in the reference genome), or technical artifacts (e.g., greater abundance of capture probes, or overlapping probe definitions resulting in “double-capturing”). Similarly, some areas may be under-covered for biological reasons (e.g., segmental duplications where more than one copy exists in the reference sequence, preventing the aligner from placing the read uniquely) or for technical reasons (e.g., high GC content or density of variation, which impairs hybridization of probes). Furthermore, some “near-target” regions within 50 bp of the target boundary can have sufficient coverage to warrant inclusion in variant calling. Critically, whichever capture technology is used, either all samples should be processed using the same technology or the variability should be accounted for, e.g., by stratifying the study by technology (see section on population stratification).
For this Perspective, we generated whole-exome data (targeting 28 Mbp) from 438 samples (see Methods section) using a two-stage approach52, 53. First, we applied the data processing and variant calling protocol described previously54. Second, we applied post-SNP-calling Quality Control (QC) filters.
For quality control of the resulting SNPs, we used population-genetic statistics and properties of human genetic variation. Using those statistics helps to identify true variants because the properties of the mutational process37, 55 are different from errors of the sequencing technology.
We compared statistics computed in the 438-sample data set and in 37 whole-genome data sets released by Complete Genomics Inc. (CGI, see URLs), focusing only on the same genomic regions as the exome data. The CGI whole-genome dataset serves as a good comparison because whole-genome sequencing is not dependent on exome-capture technology. We further stratified these per-sample statistics into classes that are biologically interesting (functional class and CpG status) but may also exhibit different rates of technical artifacts. Table 1 shows that filtering is critical for achieving high-quality calls. Before filtering, the metrics show significant deviations from their expected values, which may indicate a high false-positive rate. After filtering, the statistics converge to those in the CGI dataset. The effectiveness of the filters is evident also from the comparison with human-chimpanzee divergence55.
The number of novel variant sites (defined here as not present in dbSNP 129) is another metric of SNP-call quality (Supplementary Table 1). Most novel variants have low frequency, and are especially enriched in the singletons and doubletons. Singletons and doubletons are particularly important to distinguish from false positives because technical artifacts or errors in data processing can easily manifest themselves as novel variation.
Statistics such as transition/transversion ratio (Ti/Tv) and the number of novel variants are useful as gross guides to the quality of the dataset and enable comparison of two sets of calls from the same dataset. However, precise expectations of these statistics are unknown because they depend on many factors, including uneven coverage, variability in DNA quality, or other sources of technical bias such as machine error. Therefore, interpreting small differences from expectation in these statistics is nontrivial. Genotyping validation provides an additional measure of callset quality, independent of the population-genetics statistics. Comparing genotyping data to sequencing data enables directly measuring callset quality by calculating the non-reference sensitivity (“NRS” — the rate at which non-reference sites in the genotyping data are recovered in the sequencing data) and non-reference discrepancy rate (“NRD” — the rate at which genotypes from sequencing and genotyping data differ). A genotyping assay should include sites at various allele frequencies, especially low frequencies (~1%). When available, family data, particularly trios, can also be useful to assess callset quality.
Comparing our callset with GWAS data in the same samples at overlapping sites suggested high sensitivity for common variants (98.6% NRS). To assess quality of low frequency variant calls in a comparable sequencing dataset we compared CGI data to Omni chip from the 1000 Genomes project (Supplementary Table 2). This comparison resulted in 95.65% NRS, 1.79% NRD and 1.12% NRD for novel variants.
Despite stringent QC, genotyping and sequencing errors are still present. Unfortunately, when stratifying variants based on putative functional consequences, the class of variation that is annotated to be most deleterious is also more heavily enriched for errors56. This underscores the importance of rigorous quality control.
It is critical that great care is taken to prevent technical biases and confounding in sequencing to avoid distorting association results. For instance, differences in how (rare and precious) case samples are handled compared to control samples may lead to systematic false-positives that masquerade as interesting associations. Likewise, simultaneous multi-sample variant calling on only cases or only controls may lead to differential detection of variants across batches, negatively impacting the accuracy of the allele frequency estimates and association analyses. Many other, often poorly understood or hidden, technical confounders (e.g., DNA preparation, exome-capture technology, machine type, read length, depth of coverage, SNP calling algorithm, QC filters) may influence the properties of exome-sequencing data. Therefore, although the use of shared controls (e.g., from the 1000 Genomes project) has been helpful in “filtering” approaches applied to Mendelian disorders6, 9, it is not likely to be applicable to association analysis of complex diseases.
Analysis of rare variants requires statistical methods that are fundamentally different from association statistics used for testing common variants. There are two reasons for this. First, rare variants have to be combined in a gene (or pathway) for an association test to reach sufficient power57. For example, a causal SNP at a frequency of 1 in 500 and genotype relative risk of 10 in a sample of 200 cases and 200 controls, has 0.2% power to be detected at a conventional significance threshold for GWAS (P < 5 × 10−8). Second, functional and population genetics information can be added to the testing approach because exome sequencing comprehensively captures variation that can be annotated with such information.
Early candidate-gene sequencing studies for complex traits were based on the comparison of numbers of non-synonymous alleles exclusive to cases or controls (or samples at the extremes of the trait distribution)21, 26. This approach has limited power because it ignores common and low-frequency polymorphisms, as most such variants would be present in cases and controls. Recently, a number of statistical tests have been designed for rare-variant analysis. The Combined Multivariate and Collapsing (CMC) test58 jointly assesses the role of rare and common variation. For the common variants, traditional regression-based association is applied. For rare variation, an individual's predictor in a regression model is defined as 1 if the individual possesses at least one rare variant in the region (e.g., gene) and 0 otherwise. The Weighted-Sum Statistic (WSS) test59, creates a composite genotype score for all individuals. This score is the sum of alternate alleles weighted by the inverse of the binomial variance. A rank sum test is then performed on the genotype scores between phenotypic groups. The Kernel-Based Adaptive Cluster (KBAC) test60 also uses a weighting scheme that reflects apparent effect sizes of individual variants. An alternative approach to combine rare variants into a single test is to select an allele frequency threshold based on the observed data. This variable threshold (VT) approach61 was motivated by population genetic simulations that showed that there is no single optimal weighting scheme or allele frequency threshold. There are numerous other statistical tests for rare variants in complex traits (reviewed in refs.62–65).
In simulation studies64, most tests behave similarly in many situations. However, the results may depend on assumptions used in simulated data. The relative power to detect association depends on factors such as the number and proportion of causal variants, their population frequency, and their effect sizes, as well as directionality of effects, the number of genes contributing to the trait, and fraction of causal genetic variation located in the exome. Statistical tests were developed with various combinations of these factors in mind and therefore are likely to be sensitive to different disease architectures. For example, the simulation framework used in development of the WSS test assumes effect size proportional to 1/x(1 − x) (where x is the population frequency of the causal allele), while Sequence Kernel Association Test (SKAT)66 simulation framework uses effect size proportional to −log(x), and the VT test simulations uses a demographic history model with a range of possible values of strength of selection leading to different relationships between effect size and x. These simulations were designed to demonstrate the strengths of each methods under different effect-size distributions: the WSS is designed for effect sizes proportional to 1/x(1 − x), SKAT is designed for effect sizes proportional to beta density β(x; a1, a2) for prespecified a1 and a2, the C-alpha67 test is designed for effects going in opposite directions in the same region, while the VT test makes no assumptions about effect-size distributions.
When combining rare variants, all functional variants may either be assumed to influence the trait in the same direction, or some may be allowed to have opposite directions of effects. A biochemical argument can be made that most of non-synonymous variants are loss-of-function hypomorphs, while gain-of-function variants are infrequent. However, some genes (e.g., PCSK968) have variants of both kinds. Several tests allow for rare variants to have opposite effects on the trait (e.g., Step-up69, C-alpha, replication-based test70, SKAT). These tests are based either on the analysis of over-dispersion or on explicit linear models that determine the contribution of a variant to a score based on the direction of effect observed in the data.
Rare-variant tests can benefit from stratifying or weighting rare alleles by functional significance, as evidenced by simulations and sequencing studies of candidate genes61,64, 71–73. The power of rare-variant tests is strongly influenced by the fraction of causal variants among all variants analyzed and using functional information is an effective way to give greater weight to likely causal variants. For example, nonsense variants should be prioritized higher than nonconserved missense variants. Similarly, missense variants should be prioritized higher than synonymous variants. Functional consequences of variants can be predicted by examining effects of amino-acid changes using comparative sequence analysis and protein structure analysis. Many computational prediction and conservation74, 75 methods are available (reviewed in refs.76–79). The accuracy of those methods is around 80%80 and it is likely highest for rare variants (truly functional variants are most likely deleterious and kept at low frequencies by purifying selection, and so common variants are most likely neutral and nonfunctional). Therefore, using prediction methods enriches for functional variants and thus boosts the power of association tests. Because such predictions are not perfect, however, they should be used quantitatively by weighing variants, rather than qualitatively by filtering out variants. A number of tests allow including prediction scores into test statistics, e.g., VT test, KBAC, SKAT, Rare variant Weighted Aggregate Statistic (RWAS)72, Likelihood Ratio Test (LRT)73. The PLINK/SEQ suite includes precomputed PolyPhen-246 prediction scores for all possible missense changes in humans, which makes these scores readily applicable.
An important consideration for exome sequencing studies is selecting the significance threshold that accounts for multiple testing. A simple way is to adopt a Bonferroni correction for 20,000 independent tests (one test per each gene), which, for an experiment- wide significance of 0.05 gives a p-value threshold of 2.5 × 10−6 per gene. However, such a threshold may be overly conservative because it assumes that each tested gene has sufficient variation to achieve the asymptotic properties for the test statistic. For example, if only 2 individuals carry non-synonymous variants in a given gene, the difference between cases and controls never exceeds 2 total observations, and so the most significant p-value that can be achieved is around 0.25 assuming that these 2 variants are independent. Therefore, unless the study is large, association p-values will be generally less significant than expected under the null hypothesis. Figure 2a demonstrates this effect on the 438 whole exomes. The PLINK/SEQ suite computes from data the so-called i-stat, which is an estimate of the minimal achievable p-value for a gene. The i-stat can be used by setting a threshold (e.g., 10−3) and only correcting for the number of genes that have the i-stat below the threshold following the idea that for the genes with i-stat above the threshold there is no power to find an association. Another way to correct for multiple testing is to compute an experiment-wide significance threshold by permutations of phenotype labels, create the empirical distribution of minimal p-values for all genes across permutations, and compare the minimal p-value from the real data to that distribution (Figure 2b). This approach efficiently controls Type-I error and is less conservative than the Bonferroni correction. Importantly, the p-value threshold computed by permutations is dependent on both the study and on the statistical test. However, the experiment-wide correction via permutation is not robust to confounding and it is essential to assess the quality of the distribution of test statistics, for those genes that have i-stats less than the threshold, to ensure appropriate calibration of the distribution. Nevertheless, with increasing sample sizes, the dimensionality of the tests will also increase, and studies will be assessing close to 20,000 tests. Therefore, for large studies we consider the Bonferroni threshold to be preferable.
Power of an exome-sequencing study is limited by the amount of variation in a gene. Therefore, power is higher for genes with more variants, for example longer genes or genes in regions of elevated mutation rate. Additionally, genes in which most variants are causal are easier to identify than those in which few variants are causal. In individual candidate gene sequencing studies estimates of this proportion ranged from 30% to 70%21, 22, 26. Consequently, the effect size is not only a property of an individual variant, but rather a reflection of the distribution of effects coupled with how those effects are interpreted via the test. Some statistical tests explicitly account for differences in power when evaluating evidence of association81.
Given the sample sizes, the likely effect sizes and frequencies of causal variants, and the proportion of causal variants in a gene, do current exome sequencing studies have sufficient power to detect genes underlying complex phenotypes? The enthusiasm about exome sequencing studies stems, in part, from successful candidate gene sequencing studies, and so we sought to test whether exome sequencing would be expected to have sufficient power to detect genes discovered by the candidate gene approach. So far, no published candidate gene study reported p-values that would be significant on the background of the complete exome (Table 2). This is particularly striking because some candidate gene studies used much larger sample sizes (thousands of individuals) than ongoing exome sequencing studies (hundreds of individuals). This demonstrates that current exome sequencing studies are underpowered to detect genes with the allelic distribution and effect sizes similar to the published examples. Indeed, extrapolation of effect sizes and frequencies from published studies shows (Figure 3) that thousands of individuals are required to reach acceptable statistical power. This analysis is consistent with an earlier study based on population genetic simulations that concluded that as many as 10,000 individuals at phenotypic extremes would be needed to achieve satisfactory power30. The very first GWAS82–84 were also highly underpowered but the combination of falling costs and combining studies in meta-analyses enabled rapid creation of well-powered studies and many discoveries. Similarly, with the falling cost of sequencing and targeted enrichment85, exome sequencing will soon be affordable to many research groups, and we expect that consortia will form to facilitate pooling of exome sequencing data, thus enabling better powered studies and a new wave of discoveries.
To discover robust associations, replication in exome sequencing studies will be critical. Because small early studies will be inevitably underpowered, no gene may achieve exome-wide statistical significance. In such cases, unless strict correction for multiple tests is performed, researchers should resist the temptation to apply a battery of statistical tests, each with various weighing schemes and variant selection. We strongly argue that an association can only be considered real if it has been replicated. A reasonable replication strategy is to select a few genes (e.g., 10), based on the strength of association86 from the discovery stage and prior biological plausibility. Sequencing and rare-variant associations must then be performed on new samples, using multiple-test correction threshold applied only to the (smaller) set of candidate genes.
Population stratification—systematic ancestry differences between cases and controls—is a well-studied confounder in genetic association studies87. In GWAS, commonly used approaches to correct for stratification include stratifying by population cluster (Structured Association), principal components analysis (PCA) and mixed models87–90. Genomic Control may also be applied, but it is generally more useful for assessing stratification than correcting for stratification87, 91.
An important question is whether population stratification can confound exome sequencing studies, and if so, how to correct for stratification in this context. Although excess-of-rare-variant tests are fundamentally different than single-variant tests, the possibility of stratification still exists because different ancestries within a structured population sample (e.g., African and European ancestry in African Americans, or northern European and southern European ancestry in European Americans) may have different allele frequency spectra due to their different demographic histories. For example, in an exome sequencing study in African Americans in which disease cases have more African ancestry than controls, one expects to see an excess of rare variants in cases, because African chromosomes carry more rare variants92.
We created a hypothetical case-control exome sequencing study involving real sequencing data and simulated phenotype data using 438 individuals, split in two populations (see Methods). To induce population stratification, we assigned case-control status to each sample randomly with a bias to take more cases from one population, and more controls from the other population. Association tests indicated inflated rates of spurious statistically significant p-values. We corrected for population stratification by modifying the permutation scheme to account for subpopulations. This correction was effective at controlling Type-I errors in all association tests.
Our simulations demonstrate that exome-sequencing studies can be affected by population stratification, which may produce spurious associations. We have shown that a simple permutation scheme is sufficient to correct for population stratification when discrete clusters corresponding to genome-wide ancestry are known or can be inferred by applying standard methods to GWAS chip data88, 89, 93. The permutation scheme is appealing in that it generalizes most burden of multiple rare variants tests, however, some tests may also be amenable to the use of PCA covariates in instances in which population structure is best described by continuous clines rather than discrete clusters89.
Exome sequencing studies bring the promise of comprehensive testing of coding variation in an unbiased manner. However, we expect that initial studies will be underpowered, and we have highlighted a number of technical issues that could bias the interpretation and analysis of rare variant data, especially novel variants. We expect that thousands of exomes are going to be required to achieve sufficient statistical power to robustly detect associations of rare variation with complex traits. Issues we discussed in this Perspective are also relevant to future whole-genome sequencing studies, in which the analysis of protein-coding variation will remain the same as in the case of exomes.
Focusing exclusively on exome is an especially serious limitation in complex trait genetics, where noncoding genetic variation is believed to play a larger role than in Mendelian genetics or in somatic cancer genetics. However, there are clear reasons to start with exomes. First, statistical approaches combining multiple rare variants are problematic in non-coding regions because there is no easily identifiable set of sites harboring variants with unidirectional phenotypic effects. Second, variants in regulatory regions are likely to have smaller effect sizes. In contrast, protein coding genes provide a well-defined and interpretable target for mutations in the locus. These mutations create variants that, in a well-powered study, highlight association of the locus with the trait. Thus, although focusing on the exome is unlikely to explain all of heritability, it has the potential to highlight genes involved in complex traits.
Despite challenges discussed in this Perspective, the observation that a large trove of functionally significant coding variants exists in the human population brings hope that the exome sequencing approach will ultimately help identify many loci important for complex traits and diseases.
To calculate the discovery rate of novel variants for increasing numbers of samples, first all exome samples are arranged in a random order. Then, samples are analyzed sequentially, starting with the first sample, and the cumulative set of identified variants is computed. For every subsequent sample, a variant site is considered novel if that site has not been identified as variant in the cumulative set of preceding samples. The fold-increase over baseline (where the baseline for each class is the number of variants discovered in the first sample) is plotted in Figure 1. To avoid sampling bias, random resampling is performed and the overall mean is calculated. Nonsense, Missense, and Synonymous classes are based on RefSeq annotations. The Missense class is further divided into “Probably damaging”, “Possibly damaging”, and “Benign” subclasses according to PolyPhen-2 predictions46. The “Theoretical” line plots the expected number of segregating sites under a neutral model of evolution in a population of constant size41.
Reads were aligned to the reference genome using Burrows-Wheeler Aligner (BWA)94, PCR duplicate reads were removed using Picard (see Web Resources), base quality scores were recalibrated using the Genome Analysis Toolkit (GATK), and alignments near putative indels were refined using GATK. The resulting data was run through the GATK to discover and genotype SNP candidates.
We used the following QC filters: a (1) quality-score-vs.-depth filter, which excludes variants whose depth-normalized discovery confidence does not exceed 2.0; (2) a homopolymer-run filter, which excludes variants that have an alternate allele that matches the allele in an immediately adjacent homopolymer-run of length greater than 5; (3) a strand-bias filter, which excludes variants whose alternate allele is preferentially found on one of the two available read orientations at the site, and (4) an indel-mask filter, which excludes variants discovered at sites that overlap with indels.
Case/control status was assigned randomly and a T5 test for burden of rare variants was executed on all genes (T5 is a variant of the CMC test58 that considers only non-synonymous variants with minor allele frequencies below 5%, uses the total count of alternative minor alleles in cases as the test statistic, and assigns significance by permuting phenotype labels). The overall deflation in significant p-values (i.e., there are fewer genes associated at any significance level than expected by chance, is due to low counts of variants in genes. Results were similar for T1 version of CMC, as well as for WSS59, and VT tests61. This pattern is expected in studies with small sample sizes (below around 1000 individuals). Whole-exome permutations can be used to establish exome-wide significance in such cases.
Phenotype labels of full exomes were permuted 1,000,000 times, i.e., permuted phenotype affected all genes in an individual. In each permutation, the lowest exome-wide p-value was computed. It took fewer than 1000 computing hours to run 8 statistical tests on the 1 million whole-exome permutations of 15,122 genes in 438 individuals. The computation is very easy to parallelize and thus quite affordable using cluster or cloud computing.
Data was extrapolated from results from five candidate-genes and one obesity gene set from published studies (Table 2). Fisher's exact test was used to calculate p-values after sample size extrapolations.
We induced population stratification in a hypothetical exome sequencing study involving real sequencing data and simulated phenotype data using 184 individuals from the HIV and 254 individuals from Schizophrenia (SCZ) exome sequencing studies. We observed that there were exome-wide differences in allele frequencies between the populations, which we quantified by estimating the FST between HIV and SCZ samples using exome sequencing data95. FST was estimated using the EIGENSOFT software. Using variants with minor allele frequencies at least 5%, we observed an FST value of 0.003, which is consistent with the different European ancestries of the HIV (European-American) and SCZ (Swedish) samples and with previous estimates of genetics distances between European populations96. We considered the possibility that the observed differences between HIV and SCZ samples could be due to differential bias resulting from differences in sample collection, sequencing, or data processing97, but view this as unlikely because we applied identical data processing and QC procedures to both sample sets and because QC metrics revealed no systematic differences between the sample sets.
To induce population stratification, we randomly assigned 80% of samples from HIV samples and 20% of SCZ samples as cases, and remaining samples as controls. We then used case-control labels to run four association tests: fixed-threshold approach (T1 and T5 versions of the CMC test58), WSS59, and VT test61. We quantified the evidence of population stratification by considering the most significant p-value (of 15,122 genes) and the proportion of p-values < 0.05, and < 0.01. As seen in the null distribution (Figure 2), it is expected that, due to low counts, p-values will have a deficiency of statistically significant signals. Before correction for population stratification, however, our metrics indicate an excess of statistically significant signals. For example, for T5, the most significant p-value was <0.000001, and the proportions of p-values were 0.0595 at level 0.05, and 0.0136 at level 0.01. Results were similar for the other statistical tests, and for other proportions of HIV samples assigned as cases (we experimented with 90%, 80%, and 70%, as well as 30%, 20%, and 10%). We note that when the proportion of HIV individuals assigned as cases was above 50% the induced inflation was higher than when the proportion was below 50%, which could be due to a population genetic excess of rare variants in Swiss and European-American samples as compared to Swedish samples.
To correct for stratification, we modified the script that implements association tests (see Web resources) to employ a permutation scheme in which case/control status was permuted within each population (HIV and SCZ), assuming known population labels. This permutation scheme does not change the computational cost of the study. The results show that the permutation procedure adequately controlled for population stratification, removing the excess of significant signals. For example, for T5, the most significant p-value after correction was 0.0001 and the proportions of p-values were 0.0340 at level 0.05, and 0.0060 at level 0.01. As mentioned previously, the deficiency of statistically significant signals is due to low counts and is consistent with the null distribution (Figure 2). Results were similar for the other statistical tests, and for other proportions of HIV samples assigned as cases. These results show that the permutation-based correction was effective at controlling Type-I errors.
This work was made possible, in part, by National Institutes of Health Grant 5R01MH084676, and in part, by the International HIV Controllers Study, supported by the Collaboration for AIDS Vaccine Discovery of the Bill and Melinda Gates Foundation (to P.I.W.d.B.), and the AIDS Clinical Trials Group, supported by NIH grants AI069513, AI34835, AI069432, AI069423, AI069477, AI069501, AI069474, AI069428, AI69467, AI069415, Al32782, AI27661, AI25859, AI28568, AI30914, AI069495, AI069471, AI069532, AI069452, AI069450, AI069556, AI069484, AI069472, AI34853, AI069465, AI069511, AI38844, AI069424, AI069434, AI46370, AI68634, AI069502, AI069419, AI068636, RR024975, and AI077505. The sequencing of the SCZ control individuals was funded by 3 sources: NIH grant RC2MH089905, the Herman Foundation, and the Stanley Medical Research Institute. N.O.S. was supported in part by National Institutes of Health Training Grant T32-HL07604-25, Brigham and Women's Hospital, Division of Cardiovascular Medicine. B.M.N. was supported by 1R01MH089208-01. The authors are grateful to Samuela Pollack for assistance with EIGENSOFT. The views expressed in this presentation do not necessarily represent the views of the NIMH, NIH, HHS, or the United States Government.
Competing Interests: The authors declare that they have no competing financial interests.
Author contributions: A.K. developed the computational analysis pipeline and analyzed data, K.G. performed upstream quality control and analysis of sequencing data, R.D. performed the power analysis, N.O.S. performed the assessment of rare variants in empirical data, B.M.N. contributed to statistical analyses, P.J.M. assisted with data analysis. T.L. participated in designing the study. P.S., P.F.S., J.L.M., C.M.H., P.L., P.M., P.I.W.D.B., N.G. and S.M.P. contributed data. A.L.P., P.I.W.D.B., and S.R.S. conceived and designed the study. A.L.P., P.I.W.D.B., S.M.P., and S.R.S. supervised the work. A.K., K.G., R.D., N.O.S., B.N.M., Y.Y.S., A.L.P., P.I.W.D.B., and S.R.S. wrote the manuscript. All authors approved the manuscript.
Data access. The SCZ control data can be accessed via the Database of Genotypes and Phenotypes (dbGAP), accession code phs000473.v1.p1. To access the HIV data, investigators can submit a brief concept sheet detailing their study design, research questions and other needs. The concept sheet with detailed instructions can be downloaded from: http://cfar.globalhealth.harvard.edu/fs/docs/icb.topic938249.files/Harvard%20CFAR%20Concept%20Sheet%20Template%20.docx
Please e-mail completed forms to Pamela Richtmyer (prichtmyer/at/partners.org). Requests will be reviewed on the basis of scientific merit, feasibility and overlap with ongoing concept sheets/investigations.