The Simulation Model Matters
Comparison of empirically obtained results with that of a known simulation model that reflects the true state of nature is a powerful tool for evaluating methods and approaches in statistical genetics. Simulation studies can provide insight into the statistical validity, the type I error rate and power, and the large- and small-scale sample properties of new statistical methods. However, it is important to realize that building a simulation model that mimics large portions of the genome is problematic. Because the characteristics of the genome are neither fully measured nor understood, particularly in terms of all the correlations and interactions between millions of SVs, the most prudent approach is to use actual sequence data as the basis for the underlying model. With this approach, the known (and unknown) structure in the genome can be incorporated without having to create a model with assumptions that may not accurately reflect the true state of nature. As described by
Almasy et al. [2011], complex trait phenotypes based on actual SVs can then be modeled as desired. However, in the simulated data, replications based on the same underlying sequence data were not independent, and the genotypes were identical over all the replicated samples. Thus the replicates were completely correlated with respect to genotype, and because at least some of the variation inherent in the phenotypes was generated on the basis of causal SVs, the phenotypes were correlated across replicates as well.
Cost and Sample Size Matter
The cost for next-generation sequencing is decreasing on a continuing basis. At present, a high-density SNP panel costs about $500, a whole-exome sequence costs about $2,500, and a whole-genome sequence with 30× coverage costs about $5,000. The cost to genotype one replicate of 697 individuals with a high-density SNP panel would be about $348,500; the cost to perform whole-exome sequencing for one replicate of 697 individuals would be about $1,742,500, and the cost for all 200 replicates would be about $350 million. Although the sample size of the data provided for GAW17 was fairly typical of what can be obtained or afforded at this point in time in terms of sequencing data, it was quite small compared to sample sizes for genome-wide association studies. Until the cost of sequencing approaches the cost of a GWAS, the detectable effect size for rare SVs that contribute to trait variation will remain high. Samples large enough to detect small effect sizes may be prohibitively expensive in the near future, although this will be less of a problem as sequencing costs continue to decline. Several contributors to GAW17 pooled replicates to increase the sample size, but at least in the near term, because of the cost to replicate even one sample, caution must be used in making generalizations about methods that pool replicates to expand the sample size. However, it is important to note that for most of the causal rare SVs, the estimated locus-specific heritabilities were so small that even pooling replicates did not substantially increase the power of these tests.
Minor Allele Frequency and Effect Size Matter
Rare or common, SVs have two main attributes: the frequency of the allele(s) of the variant and the size of the effect of the variant allele(s). Although much debate has focused on common versus rare alleles, what is most important is the size of the allelic effect of the variant and at what level the effect is measured (the individual, the family, or the population). The variant for familial hypercholesterolemia, for example, has a large effect in individuals with two copies of the variant, less of an effect in relatives who carry only one copy of the variant, and virtually no effect at the population level because the allele is so rare. Unless population studies are very large, rare causal variants with large effect size will be difficult to detect in population-based studies. However, in a family-based sample, because of the increased presence of a specific causal SV in relatives, the effect size will no longer be overwhelmed by its low population frequency and the ability to detect a rare variant will be amplified.
Data Quality Matters
The quality of the SV data is critical to the interpretation of the results. A number of data quality issues were raised at GAW17, some of which were identical to those for a GWAS and some of which were unique to next-generation sequence data. In terms of concordance with known SNP genotyping,
Stram [2011] and
Hemmelmann et al. [2011] noted that the concordance rate was only 88.5% for SVs shared by both the mini-exome data and HapMap genotyping calls for those individuals genotyped on both platforms. Furthermore, some SVs in the simulated mini-exome sequence data exhibited strong correlations with SVs on different chromosomes [
Thomas et al., 2011;
Tintle et al., 2011]. Again, these interchromosomal correlations may be due to sequence misalignment or chance. When the data contain a large number of rare SVs and a relatively small number of individuals, correlation between SVs can be substantial, and sometimes even complete correlation occurs simply by chance alone.
The amount of missing data and the call rates are critical measures of data quality in a GWAS. With well-characterized high-density SNP genotyping platforms, markers are often dropped if the proportion of missing data exceeds a specified threshold. Individuals can be dropped if the proportion of genotyping calls fails to reach some minimal threshold as well. This is considerably more difficult with next-generation sequence data, because, by definition, the rare SVs can be quite rare. In this situation, it is difficult to distinguish between a rare SV and a sequencing error. Family-based samples provide the opportunity for Mendelian consistency checking, which can help to distinguish between a real segregating variant and a sequencing error. Finally, cryptic relatedness (unknown relatives in the sample) remains an issue for both high-density SNP genotyping and sequence variants.
Nonindependence of Sequence Variants Matters
It is important to note that the lack of independence between SVs is a larger issue in next-generation sequencing studies than in a GWAS because of the increased density of the SVs. As in a GWAS, nonindependence between SVs in linkage disequilibrium blocks is common, although in next-generation sequence data the density of SVs is increased and thus the number of highly correlated SVs is increased as well. In these data, however, nonindependence in SVs across linkage disequilibrium blocks also occurred. This was a key finding in both the Group 9 and Group 7 summaries [
Thomas et al., 2011;
Tintle et al. 2011, respectively]. These summaries noted that failure to address correlations between SVs increased the type I error rate over a wide variety of methods. Unlike typical analyses in a GWAS,
Thomas et al. [2011] noted that single-SNP tests are not reliable for association.
Enrichment for Phenotype and Genotype Matters
At GAW17, investigators presented a number of strategies to enrich the samples to be analyzed for the presence of rare SVs. On the phenotyping side these strategies included selection for discordant relative pairs, extreme phenotypes, and distributional extremes. For the population-based samples,
Bailey-Wilson [2011], in the Group 14 summary, reported that the power to detect association was increased when subsamples were selected for extreme phenotypes [
Lamina, 2011]. On the genotyping side the strategies included the use of families and rare variant counts in a case-control framework. In the Group 10 summary,
Melton and Pankratz [2011] described methods that used the joint analysis of multiple correlated phenotypes to reduce the phenotypic dimensionality and methods that reduced the genotypic dimensions.
Families Matter
It is important to realize that the use of families is an enrichment strategy for both phenotype and genotype. The selection of highly loaded families focuses on those families with a high proportion of affected individuals or large trait variation. The selection of extended families will overrepresent some rare causal variants if they are present in the founders or in individuals who marry into the family, and then segregate throughout the extended family. By the same token, some rare causal variants will be lost if they are not present in the set of founders or individuals who marry into the family or if they segregate out of the family in subsequent meioses. The amplification of a rare SV was particularly relevant in Family 7, in which a rare causal SV in the VEGFC pathway was present in the founders and then segregated throughout the entire extended family. A number of contributors were able to identify causal SVs in the VEGFC and VEGFA genes because the estimated locus-specific heritabilities were higher in the family samples than in the population-based samples.
Family data can also provide other relevant information, such as estimates of locus-specific heritability and candidate regions based on linkage analysis. In the Group 11 summary,
Hinrichs and Suarez [2011] noted that information from family-based samples could be used to identify candidate regions, identify individuals for sequencing, and provide weights for intrafamilial tests of association.
How You Treat Rare Variants Matters
When the SVs are rare, it becomes necessary to collapse rare variants into new derived variables in order to analyze them. Many different methods were used to collapse or aggregate SVs at GAW17, and these methods are summarized by
König et al. [2011],
Melton and Pankratz [2011],
Sun et al. [2011],
Sung et al. [2011],
Tintle et al. [2011], and
Ye and Engelman [2011]. In general, the analysis of collapsed variants had somewhat better power than that of uncollapsed variants, although the performance of collapsing and aggregating methods depended to a large extent on the underlying genetic structure and the use of unrelated individuals or family data. Summarizing a somewhat different approach,
Kent [2011] noted that using common variants to tag rare variants can work, but not very reliably.
The performance of the collapsing methods is tempered somewhat by the fact that all the causal variants in the underlying simulation model had allelic effects in the same direction, which is most likely not the case in the real world [
Bickeböller et al., 2011].
The methods described by
Ye and Engelman [2011] and by
Tintle et al. [2011], for example, illustrate a number of collapsing schemes, some previously proposed and some novel. For the most part, these methods focused on the sample of unrelated individuals. Not unexpectedly, results from these methods identified genes containing SVs with the highest locus-specific heritability, for example, SVs in
FLT1 and
KDR in trait Q1.
Ye and Engelman [2011] noted that the number of SVs in the derived collapsed variants was problematic because some genes had many SVs, whereas others only had one. Taking this one step further,
Cantor and Wilcox [2011] reported a variety of methods for aggregating SVs into haplotypes and multiallelic genotypes in the Group 13 summary.
Results Matter
Based on the estimated locus-specific heritabilities for Q1 ( and ), it was not surprising that many methods found causal SVs in the
FLT1 and
KDR genes in most replicates in both the population-based and the family-based study designs. The estimated locus-specific heritabilities for SVs in these genes were greater than 0.03 and 0.02 in the population- and family-based samples, respectively.
Melton and Pankratz [2011] and
Bailey-Wilson et al. [2011], for example, reported that most of the methods considered were able to detect the SVs in
FLT1 and
KDR in the population-based sample, but, as in other methods, a substantial increase in type I error over the nominal rate was observed. In the family-based sample, however, SVs in
VEGFC and
VEGFA had the highest locus-specific heritabilities (>0.2) followed by SVs in
FLT1 and
KDR (). This was primarily due to founder effects and the subsequent segregation of the SVs in the families, both largely the result of chance. The locus-specific heritabilities for most of the other causal variants were quite low, generally less than about 0.003, because either the allele frequency was low or the allelic effect size was low, or both.
In Q2, the estimated locus-specific heritabilities were nearly all less than 0.01 in the population-based sample and, except for SVs in SIRT1 and VLDLR, all less than 0.001 in the family-based sample. In general, SVs with locus-specific heritabilities less than 0.01 were not detectable with any substantial degree of power. There were no causal SVs in Q4.
Power and Type I Error Matter
Many of the proposed new methods were evaluated by counting the number of times a true causal variant was identified as a causal variant over all replicates. It should be emphasized that, because of the nonindependence of the samples, this estimate was a surrogate for power (i.e., the proportion of samples that identified the variant as causal) rather than a true estimate of the power of the test. Furthermore, the confidence intervals on these estimates for 200 replicates were quite large, making comparisons across methods problematic, and the confidence intervals would be even larger for smaller critical values.
Similarly, just as the estimate of the proportion of samples that identified a true causal variant as a causal variant was taken to be a surrogate for power, the proportion of samples that identified a noncausal variant as a causal variant was taken to be a surrogate for type I error. There are, however, two null hypotheses that can be tested. Under the null hypothesis of no genetic component, no variants influence the phenotype; that is, the phenotype is essentially a normally distributed random variable (i.e., trait Q4). Under the alternative null hypothesis, no causal variants are among the set of variants considered; however, other unknown causal variants do, in fact, influence the phenotypic distribution (e.g., Q1, Q2, and the underlying liability of the discrete trait). These variants affect the correlation structure of the phenotypes and can be correlated with other known or unknown variants.
In general, the estimate of the type I error rate was elevated over a broad range of methods and in both the family and unrelated individuals data for traits Q1 and Q2. Even stringent critical values resulted in more observed type I errors than expected at the corresponding nominal critical value. This is relevant for the estimates of power as well, because the power of a test is related to its type I error rate (i.e., the power of the test at the null hypothesis is, in fact, the type I error rate). This may be due to the alternative null hypothesis problem, with SV correlations inflating both the type I error and power.
König et al. [2011],
Sun et al. [2011], and
Sung et al. [2011] noted that the expected type I error rate for Q4, a trait with no causal variants, was close to its expected nominal rate. When there was no genetic component in the simulated model, most of the methods appeared to be statistically valid such that the estimated type I error rate was close to the nominal rate. This was clearly not the case for traits Q1 and Q2. This suggests that the inflated type I errors are due to correlations in the data when there is any underlying genetic component in the model. These correlations may be due to chance, the quality of the sequence alignment, linkage disequilibrium, gametic disequilibrium, or other unknown structural genetic relationships.
Bailey-Wilson et al. [2011],
Cantor and Wilcox [2011],
Sun et al. [2011],
Tintle et al. [2011], and
Ye and Engelman [2011] all noted that in the population-based samples, the power to detect associations was not high, even for those SVs with the highest locus-specific heritabilities (e.g.,
FLT1 and
KDR in trait Q1), and that the type I error rate was inflated, sometimes markedly so.
Tintle et al. [2011] postulated that the elevation in type I error rates could be due to either population stratification or correlations between SVs on different chromosomes. They noted that principal components analysis helps to ameliorate population stratification, although it also resulted in the loss of rare SVs. In general, methods that analyzed the population-based samples found some of the causal SVs, often those with the higher locus-specific heritabilities, but the power to detect associations with rare SVs was quite low, and the proportion of type I errors was substantially elevated over the nominal levels.
Study Design Matters
Because the simulation study provided both population- and family-based samples, the GAW17 participants used a wide range of sampling strategies. Study designs ranged from population-based case-control studies for the disease to loaded family-based designs with extended families for the quantitative traits. Methods of analysis ranged from simple linear regression, to intrafamilial tests of association in parent-offspring trios and in nuclear and extended families, to linkage analysis in nuclear and extended families.
Kazma and Bailey [2011] presented the difference in results obtained from different sampling strategies and methods of analysis. Although both population- and family-based studies found SVs in
FLT1 and
KDR for trait Q1, only the family-based intrafamilial tests of association and the linkage method found SVs in
VEGFC and
VEGFA. The estimated locus-specific heritabilities in the family-based sample were about 0.2 for both
VEGFC and
VEGFA and were considerably larger than for the SVs in
FLT1 and
KDR; these are, in fact, the top seven SVs in . It is clear that family-based designs are more effective for detecting rare variants with large effects in
families.
Incorporating Prior Information Matters
Not unexpectedly, including prior information when it is, in fact, involved in the etiology of a trait helps to identify causal SVs. This information includes relevant covariates, correlations among SVs, and external knowledge of genes and pathways involved in the expression of the trait. As in a GWAS, incorporating information about population substructure and linkage disequilibrium can also help to identify causal SVs. For example, information about coding regions for proteins and nonsynonymous versus synonymous base substitutions can be used to help formulate collapsing strategies that could increase the power to detect rare variants. Similarly, information from noncoding elements, alternative splice sites, and gene pathways can help to identify functional domains or gene sets that can be used to increase the power to detect causal variants. In the Group 3 summary,
Chen et al. [2011] reported that considering only nonsynonymous SVs, for example, increased the power of the test. Both
Chen et al. [2011] and
Namkung et al. [2011] reported that using different types of information improved the detection of some of the causal SVs and that in at least some situations, the power to detect associations was somewhat better for genes and pathways than for individual SVs.
Multiple Testing Still Matters
As noted by
König et al. [2011], problems associated with multiple testing are magnified in next-generation sequence data because the number of SVs greatly exceeds the number of SNPs in a GWAS. The GAW17 contributors considered several approaches to adjust for multiple tests, including the traditional Bonferroni test, resampling methods, and reduction of dimensionality. Other contributors focused on model fitting (e.g., machine learning approaches) rather than on adjusting for multiple tests. Although some of the methods were quite innovative, no single method performed substantially better than the others or better than the methods used in genome-wide association studies.
Computational Issues Matter
It should be acknowledged that there is substantial additional computational burden for methods that use families, maximum-likelihood estimation, any form of penalized regression, and machine learning. Any method that uses an iterative process, a process with high dimensionality, or permutation testing will increase the computational time, perhaps only marginally for each SV, but the cumulative computational burden over millions of SVs may make the analysis impractical with present-day hardware and software [
Bickeböller et al., 2011]. Extremely computationally intensive methods may not be feasible with today’s current infrastructure, and analysis of whole-exome sequence data will have to be able to be completed in hours, days, or weeks, not years. Other bottlenecks include data storage capacity and data transfer rates. Care must be taken to ensure that proposed new methods are computationally tractable, and there may have to be a trade-off in terms of the appropriateness of the method and the computational time required to analyze all the sequence variants in the entire genome.