|Home | About | Journals | Submit | Contact Us | Français|
We summarize the work done by the contributors to Group 13 at Genetic Analysis Workshop 17 (GAW17) and provide a synthesis of their data analyses. The Group 13 contributors used a variety of approaches to test associations of both rare variants and common single-nucleotide polymorphisms (SNPs) with the GAW17 simulated traits, implementing analytic methods that incorporate multiallelic genotypes and haplotypes. In addition to using a wide variety of statistical methods and approaches, the contributors exhibited a remarkable amount of flexibility and creativity in coding the variants and their genes and in evaluating their proposed approaches and methods. We describe and contrast their methods along three dimensions: (1) selection and coding of genetic entities for analysis, (2) method of analysis, and (3) evaluation of the results. The contributors consistently presented a strong rationale for using multiallelic analytic approaches. They indicated that power was likely to be increased by capturing the signals of multiple markers within genetic entities defined by sliding windows, haplotypes, genes, functional pathways, and the entire set of SNPs and rare variants taken in aggregate. Despite this variability, the methods were fairly consistent in their ability to identify two associated genes for each simulated trait. The first gene was selected for the largest number of causal alleles and the second for a high-frequency causal SNP. The presumed model of inheritance and choice of genetic entities are likely to have a strong effect on the outcomes of the analyses.
The Genetic Analysis Workshop is conducted every two years to encourage statistical geneticists to develop and evaluate statistical approaches to the data analysis questions that confront geneticists. Data are made available by the organizers to interested participants, who meet to discuss and compare their work. We report here a summary of the work from one group of participants at Genetic Analysis Workshop 17 (GAW17). The data consisted of 24,487 single-nucleotide polymorphisms (SNPs) and rare variants in 3,205 genes generated for 697 individuals with data from the 1000 Genomes Project. One binary and three quantitative traits were simulated to be associated with subsets of the detected variants. Two data sets were provided. In one set, data were generated from a core sample of individuals to form eight families from those sequenced, and in the second set the individuals in those families were treated as independent, as described more fully by Almasy et al. .
We provide our synthesis of the analyses of these data reported by the contributors to Group 13; the analyses themselves are published in BMC Proceedings [Ayers et al., 2011; Bueno Filho et al., 2011; Cherkas et al., 2011; Christensen and Lambert, 2011; Kraja et al., 2011; Sykes et al., 2011; Wilson et al., 2011; Yan et al., 2011]. The eight work groups considered associations of both rare variants and common SNPs to investigate a variety of statistical methods for detecting associations of traits with genes characterized primarily by their rare variants. All work groups analyzed multiallelic genotypes or haplotypes; however, there was a remarkable amount of flexibility and creativity in how the variants and their genes were coded and analyzed and in how the proposed approaches and methods were evaluated. We describe and contrast these contributions along three dimensions: (1) selection of genetic or genomic entities for analysis and coding of the data, (2) method of analysis, and (3) approach to selecting and evaluating results of the analyses.
A common theme exhibited by the eight work groups was that each analysis was conducted with an understanding that association testing of single rare variants (frequency < 1%) lacked power. Given that most of the GAW17 variants were rare, all the methods and approaches were selected and designed to maximize power. Aggregation of SNPs and rare variants into multiallelic genotypes or haplotypes was the central feature, although the approaches taken to accomplish this were diverse. All methods required a prior definition of a genetic or genomic entity encompassing something larger than a single rare variant and a rule for coding the SNPs and variants. These coding choices are summarized for each of the contributions in Table I.
Four of the work groups preserved the individual variant genotypes across the genome using multilocus coding into a variety of entities [Ayers et al., 2011; Bueno Filho et al., 2011; Wilson et al., 2011; Yan et al., 2011]. Perhaps the most general approach was taken by Ayers et al. , who used a set of 36,612 gene or sliding window multilocus genotypes defined by the variants within the 3,205 genes and their sliding windows. Ayers and colleagues preserved the order of the variants and their heterozygosity but did not assign haplotypes. Each one of the multiallelic sliding windows was assessed by penalized regression using the full set of information on the variants in each individual. Ayers and co-workers looked at quantitative trait Q2, which had 72 causal variants in 13 genes that operated in an additive fashion to increase the value of Q2.
Although they used a completely different analytic approach, Wilson et al.  preserved the multilocus genotypes as a string of 0′s and 1′s, coding the homozygotes as 0 and the heterozygotes as 1 in the 3,205 genes. Each gene was assessed separately using tree-based clustering derived from the 0′s and 1′s and the trait values. Traits Q1, Q2, and Q4 were adjusted for Age, Smoking, Sex, and principal component loadings, and standardized residuals from the 200 replicates were analyzed with the genotypes.
Bueno Filho et al.  also preserved the multilocus genotypes across the 3,205 genes by nesting the variants within the genes and conducting hierarchical Bayesian analysis for the dichotomous trait. To satisfy the assumptions of the method, 576 variants were removed because of collinearity, leaving 23,961 variants for analysis. Low-frequency variants were up-weighted, and adjustments for covariates were conducted. Analyses focused on the associated genes and the associated variants within the genes.
Yan et al.  retained all the genotype data and coded it in a similar fashion using a Bayesian regression approach. All 24,487 variants were coded as 0, 1, or 2 copies for each individual in the sample and were assessed simultaneously in a single analysis for association with Q1, Q2, and Q4. The complete data were also used to estimate the number of associated SNPs for each trait, which was used as a threshold for selecting the associated variants.
Three additional work groups treated the variants differently by reducing the amount of data for analysis. The first limited the genes tested by requiring the presence of rare variants. Christensen and Lambert  focused on nonsynonymous variants and reasoned that for them to combine effectively to contribute to the value of a trait, they should be on the haplotypes of opposite DNA strands. Christensen and Lambert haplotyped all the variants, and if a nonsynonymous variant was seen on both haplotypes in a gene, the gene was coded as B/B for that individual. Those with zero or one nonsynonymous variant in the gene were coded as A/A and A/B, respectively. Christensen and Lambert then analyzed 2,196 genes separately using logistic regression to identify associations with the binary phenotype and multiple regression to identify associations with the quantitative phenotypes, whose values were estimated for each individual by averaging over the 200 replicates. Principal components were included to adjust for variation in ethnicity. Multiple testing for association was addressed by applying a Bonferroni correction to the level of significance.
The second contribution that selected entities for testing had a much more stringent requirement for the genes to be tested. Sykes et al.  reduced the multilocus genotypes to those residing in genes with one or more associated SNPs. Thus only genes with at least one common SNP associated with one of the traits were tested for association with the rare variants. The rare variants that were associated had to contribute to the established association beyond that accounted for by the SNP used to include the gene. The genes were also limited to those exhibiting between 2 and 16 rare variants, resulting in 829 genes available for testing. This reduced set of genes, in conjunction with the high correlation between SNPs, allowed Sykes and colleagues to use a less stringent multiple testing criterion of α (0.001) than those applied in the other studies. Before all genetic analyses, Sykes and colleagues assessed covariates using logistic regression and estimated significance of the covariates by conducting the analysis in 200 replicates. Coding per gene was a series of 0′s and 1′s, as was done by Wilson et al. . Sykes and colleagues tried many different combinations of SNPs to assess the types of associations they could detect and viewed this work as a proof of principle regarding the inclusion of common variants to reveal the importance of rare variants in associations.
The third contribution that selected entities for testing limited the variants tested based on the study design. When using family data, Kraja et al.  tested genome-wide association of the individual SNPs with trait Q2 using a linear mixed-effects model in the eight families. In contrast, for the independent individuals, they used sliding windows of sizes 2 and 3 SNPs to assess the Q2-associated genes identified by the initial study of the families. Kraja and colleagues also used the tagging SNPs to define a Bonferroni-corrected significance threshold for the association tests.
Cherkas et al.  used several methods to explore potential associations of genes, sliding windows, and pathways with the binary trait. They extended the p-value results of these association tests derived from a collapsing method [Dering et al., 2011] in two ways to include multiallelic studies. The first way included associated genotypes in a multiallelic analysis of genes binned into functional pathways identified through an established database. The pathways were then ranked using the mean log p-value approach, described in previous work. The second way detected the most important genes using a tree-based method. The trait was adjusted for covariates and ethnicity, as assessed by clustering. The initial elements of analysis were 22,615 variants grouped into 5-kb-size windows of multiallelic genotypes.
Thus the variety in the nature and number of entities tested by each group illustrates the wide latitude researchers have in investigating the role of rare variants using multiallelic analyses. Competing factors for choice of entities are coverage and focus. The greater the coverage, the more likely it will be to identify an association, although the required level of significance may have a substantial impact on power. However, conducting the analyses with a specific hypothesis that targets specific genetic entities can provide an important focus, if the hypotheses are correct. Testing genes where there is prior knowledge of a SNP association with the trait limits the sample space considerably, and requiring a rare causal variant on both strands of DNA within a gene allows only genes relevant to that biological model to be tested.
The Group 13 contributors applied a variety of analytic tests to the GAW17 data, and Table II provides a summary of the approaches taken by the eight work groups. Four of the groups used regression-based models [Ayers et al., 2011; Christensen and Lambert, 2011; Sykes et al., 2011; Yan et al., 2011], and they ranged from logistic regression and a least absolute shrinkage and selection operator (LASSO) approach to a Bayesian approach [Dasgupta et al., 2011]. Mixed-effects models, mixture models, tree-based methods, and pathway analyses were used by the other groups. All contributors permitted the simultaneous inclusion of multiple markers and covariates within their analyses. A number of these methods are presented in greater detail by Dasgupta et al.  and Dering et al. , and the full descriptions of the approaches used by the members of Group 13 are given in the original papers. Here, we briefly summarize the approaches, organizing them by whether the analyzed phenotypes were binary or quantitative.
Three groups examined the binary phenotype alone, and their approaches differed substantially. Sykes et al.  tested combinations of multiple rare variants with one common variant within each gene using logistic regression and assuming a dominant inheritance model. Covariates (Age, Smoking status) were included in the model based on a backward selection procedure. Analyses were restricted to genes with fewer than 16 rare SNPs (829 genes). The analyses were repeated in each of the 200 replicates. The search space was narrowed by retaining common variants with p < 0.10 across all replicates. Sykes and colleagues then fitted a new logistic regression using an indicator to represent the presence or absence of any minor allele within a gene. Finally, they removed one rare SNP at a time from their models to reduce any noise arising from too many rare variants.
Bueno Filho et al.  used a Bayesian hierarchical mixture model to identify genes associated with the binary phenotype. An additive genetic model was assumed. The genotype design matrix was weighted by allele frequencies in two ways. First, SNPs that were linearly dependent with SNPs already in the model were omitted. The second approach standardized the matrix of SNP effects to represent a population drawn with random mating. This group included Age, Sex, and Smoking status as covariates in the model estimation process. Metropolis-Hastings and Gibbs sampling were used. Bueno Filho and colleagues have made their algorithms available in R.
Cherkas et al.  used a pathway-based approach and an ensemble tree-based method in an examination of the binary trait. They began by using collapsing methods [Dering et al., 2011] to identify and code rare variants and then used pathway methods to combine this information for multiple genes. The first method examined was the mean log p-value, a method that takes into account information about SNP function and ontologic pathway. In essence, SNPs (genes) are grouped by functional implication. The ranking of the sets of genes, or pathways, was determined using permutation tests of the gene p-values obtained from the SNP and rare variant p-values. Cherkas and colleagues incorporated a second empirical method for examining pathway structure: stochastic gradient boosting. This method is a sampling scheme that samples both the rows and columns of the data matrix. Each successive tree is conceptually fitted on the residuals from the prior trees. Models are evaluated on performance on a test or hold-out sample (e.g., jackknife). Cherkas and co-workers used an iterative strategy to identify SNPs associated with the binary trait and placed the SNPs in genes and the genes in pathways for evaluation.
Two other groups analyzed the binary trait and at least one quantitative trait. Christensen and Lambert  used logistic regression to examine the association between the binary phenotype and the compound genotypes they constructed. The first model was estimated assuming an additive inheritance pattern with known phase. The model was estimated a second time using a dominance model to assess the effect of collapsing rare variants within genes without considering phase. Christensen and Lambert continued their investigation using additive inheritance linear regression models to investigate the association of the three quantitative traits with compound genotypes from each of two methods for estimating phase. They used a dominant inheritance linear regression to assess the quantitative traits without phase information. All models were adjusted for population stratification using the first three principal components of the model. No other covariates were included in their models.
Ayers et al.  used a sliding window multimarker genotype approach in a penalized regression framework for the binary phenotype. The penalty had three components: (1) a group LASSO in which multimarker genotypes within a gene tended to be included or excluded in the model as a group, (2) an allele-sharing penalty in which multimarker genotypes within a SNP window had similar coefficients, and (3) a penalty that shrank the size of coefficients while performing model selection, resulting in overall sparseness. This approach was compared with a single-marker analysis and a gene-based sparse group LASSO using the quantitative trait Q2 and Age, Smoking status, Sex, and reported ethnicity as covariates. The analyses were conducted in all 200 replicates taken separately.
The other three groups focused on the quantitative traits alone. Yan et al.  used a Bayesian regression method that was designed to estimate the contributions of all common and rare variants to a trait simultaneously and to provide an estimate of the number of associated variants for that trait. That is, a single analysis was conducted on all the data, and in addition to a parameter for each variant, additional information on posterior probabilities provided information regarding the number of variants contributing to the trait. Q1, Q2, and Q4 were analyzed separately. Bayesian regression models have been used in animal breeding experiments, providing a theoretical background for Yan and colleagues’ approach.
Kraja et al.  analyzed quantitative trait Q2 in both the large pedigree and the independent individual data sets using prior knowledge of the GAW17 answers. The effect of Age was removed by computing residuals, and an indicator variable for Sex was included in the modeling as a covariate. This work group used a linear mixed-effects single-marker approach in eight large pedigrees, assuming an additive genetic effect. They then used a sliding window haplotype association test in the unrelated individuals for support of their original associations.
Wilson et al.  used an extension of lexical trees for quantitative traits and multilocus genotypes. They analyzed both the raw quantitative traits (Q1, Q2, and Q4) and the residuals, accounting for Age, Smoking status, Sex, and Ethnicity as represented by principal components analysis. Presence or absence of a rare variant was coded. They conducted their preliminary analyses in the first replicate and assessed statistical power using all 200 samples. Trees were used to elucidate the genotype structure and to identify patterns of rare variation. Z scores were used to assess the difference between trait values in the nodes of the tree compared with the entire sample. The statistic for the tree was the maximum Z over the entire tree. This group has made R code available for their proposed method.
Group 13 contributors focused their statistical analyses on multiallelic genotypes, and one might expect that comparing their power and type I errors would be straightforward. However, a number of factors hampered this goal. The variation in multiallelic entities that were tested was confounded with power of the individual methods. The variety in entities resulted in substantial differences in criteria for declaring an association among the eight work groups. In addition, there was a range of exome coverage, and for two work groups, the genes that were analyzed were limited by specific rules that also eliminated some of the associated genes. Perhaps more critical is the fact that the GAW17 data resulted in analyses that suffered, almost universally, from substantial type I statistical errors. Although efforts have gone toward diagnosing this problem, assessing a false-positive rate is a poor criterion for comparison of methods, and the large number of observed associations made it difficult for the contributors to report them comprehensively. Compounding these factors, the contributors evaluated their methods using a wide variety of evaluation criteria, and they referred to the GAW17 answers, which reported how the simulated traits were constructed from the genotypes, at differing times during the analytic process. Some groups assessed sensitivity and specificity, whereas others used a false discovery rate or type I and type II errors. The final column of Table II provides a brief overview of how evaluation was approached by the work groups. More details are provided in the following paragraphs, and the individual approaches are discussed more extensively in the papers themselves.
Ayers et al.  did not focus on the GAW17 answers provided initially but instead calibrated the work on the basis of results from other methods or various data configurations tested using their own methods. More specifically, they used the data set to compare several penalized regression approaches with each other and in contrast to the analysis of single variants. They also compared the penalized regression approach with a sparse group LASSO on single SNPs, with multimarker genotypes, and with an allele-sharing penalty as well as their proposed penalized regression method. Unfortunately, after comparing their results with the GAW17 answers, Ayers and colleagues reported that all methods had low power to detect the associations simulated in the GAW17 data, with power for the proposed method generally below 0.04.
Sykes et al.  used a one-sided paired t test with a threshold p-value of 0.001 for detection of associated rare variants along with an associated SNP. Signal strength improved with addition of rare SNPs to the common variant and with deletion of certain rare SNPs from a large combined group. Using their specific criteria regarding the presence of one associated SNP and fewer than 16 rare variants to define the set of genes available for consideration, Sykes and colleagues found 24 “real” and 802 “false” candidates. The two “real” and nine “false” candidates they detected had poor sensitivity (8.3%) but high specificity (98.9%), which correctly identified genes PTK2B and VNN1. The criteria for gene selection might be considered somewhat arbitrary, and it would be interesting to see whether they can be applied more generally.
Christensen and Lambert  performed tests on 2,196 genes, using a Bonferroni correction to assess significance (2.27 × 10−5). Focusing on a few genes, they correctly associated FLT1 with the binary phenotype across their analytic strategies. FLT1 was also associated with Q1. VNN1 showed a suggestive level of association with Q2 but did not meet the level of significance set by Christensen and Lambert. They repeated their analyses in the remaining 199 replications and identified more genes associated with Q2 than with any of the other traits examined. These results were accompanied by a number of false-positive findings and a large number of false negatives.
Kraja et al.  implemented a mixed-effects model in the large pedigrees followed by a sliding windows haplotype approach in the unrelated individuals; this strategy was effective for common variants but not for rare variants. The criterion for significance was a Bonferroni correction using the number of tag SNPs identified in the sample of independent individuals (4.3 × 10−6). Sensitivity and specificity were reported for each set of analyses. Analyses of the LPL and VNN1 genes had sensitivities greater than 30%. The sensitivity for the remainder of the simulated causative genes was substantially lower. These results clearly illustrate that methods developed for common variants are inefficient for detecting rare variation.
The Bayesian hierarchical mixture model implemented by Bueno Filho et al.  identified 58 genes associated with the binary phenotype; however, only one gene, PIK3C2B, was a true positive. The calculated sensitivity for Bueno Filho and colleagues’ approach on this data set was 2.8%.
Wilson et al.  used z scores in the context of multilocus genetic trees to assess the association of genes with the quantitative traits as given and the residuals accounting for Age, Sex, Smoking status, and Ethnicity as identified in two principal components. Significance was assessed using permutation tests within each of the seven identified ethnic populations and assessed at a Bonferroni-corrected p-value of 5 × 10−5. Wilson and colleagues reported that their method was computationally efficient for 100,000 permutations of the 3,205 genes. As was the case with other work groups, Wilson and co-workers were most successful at finding the FLT1 and KDR contributions to Q1 and the VNN1 association with Q2. There were several false-positive associations with Q4. A plot of sensitivity against specificity indicated that the power was about 0.2 for Q1 and 0.05 for Q2 at an uncorrected type I error rate of 0.01.
The pathway-based method used by Cherkas et al.  showed association with the VEGF, hypoxia, and nitric oxide pathways used to simulate the data. The results from the stochastic gradient boosting approach were not described by pathway but by accuracy of SNP identification. As was the case with many other work groups, at the SNP level some true positives were identified among a larger set of false-positive associations.
Yan et al.  compared their top-ranked variants with the GAW17 answers and found that many of the variants they detected were in the simulated trait models. However, their model shrunk the regression coefficients of the rare variants toward 0, so their estimations for the number of associated SNPs for Q1 and Q2 missed many rare variants. Moreover, the analysis of Q4 indicated six associated variants, and given that no genetic determinants were used in the simulation of Q4, this was of concern. However, Yan and colleagues found that there was a −0.29 correlation between Q1 and Q4 that could account for this false-positive rate.
As shown, a direct comparison of the power and type I errors of the eight multiallelic approaches was not a straightforward task for Group 13. However, we were able to draw some useful conclusions regarding the analysis of rare variant data with multiallelic approaches. The first point is that a key contributor to success in detecting associations is using a method that matches the genetic model of inheritance of the trait under analysis. For the GAW17 data, this model was primarily polygenic. The values of quantitative traits Q1 and Q2 were based on the contributions of numerous rare variants, although some common SNPs were also causal. The binary trait was derived from these two quantitative traits. The models are described much more extensively by Almasy et al. . Thus we thought that we could assess the success of the multiallelic methods in the context of those models if we examined the Group 13 detection rate for the two types of genes contributing to the simulated traits. Genes rather than variants were selected because all the methods were designed for multiallelic rather than individual variant data, although the gene was not used consistently as the unit of analysis. For each trait with a genetic contribution, we selected the gene with the largest number of causal rare variants and the gene with the causal common SNP. Both criteria were likely to provide the best chance for detection.
When comparisons were possible, we were pleased to see that detection of the six genes (three traits with two genes each) was fairly consistent across the methods. For the genes with the largest number of rare variants, FLT1 was detected for Q1 by all three work groups that analyzed it, but only one of the four work groups analyzing Q2 detected BCHE (Wilson and co-workers, who used a clustering method). For the four groups analyzing the binary trait, only Bueno Filho et al. , using a Bayesian hierarchical mixture model, reported PIK3C2B. Unfortunately, that group had a large false-positive rate, making their finding less remarkable and more consistent with the others. To address this, Bueno Filho and colleagues suggest that frequency weighting factors in their Bayesian mixture models might become clearer once methods to implement penalties for overfitting are incorporated.
Regarding the three genes with common SNPs for Q1, KDR was selected by Wilson et al.  and Yan et al. , indicating that both methods may be sensitive to SNPs with higher allele frequencies; the method of Christensen and Lambert  did not detect KDR, possibly because of the model they used to select genes for testing. All groups detected VNN1 for Q2, and none of the groups detected HSP90AA1 for the binary trait. This consistency in detection is applicable only to the GAW17 models, and the true model of how rare variants actually contribute to complex traits currently remains an open question. It will be interesting to see how the models used in the GAW17 simulations match what is observed in the analysis of rare variants as a substantial amount of sequence data become available over the next few years.
An additional lesson derived from the work of Group 13 is that decisions regarding the entities to be analyzed should be considered as seriously as the method of analysis. The entities can vary widely and should be constructed to match the presumed models of inheritance and the methods that will be used. As an example of a biological model used by Group 13, Christensen and Lambert  applied biological insight regarding the relevance of haplotyping when they proposed a model of inheritance that required risk genotypes (rare variants on opposite haplotypes) rather than risk alleles (rare variants on the same haplotype) for testing a gene. Although their ability to capitalize on this model could not be addressed with the polygenic GAW17 data, it is reasonable to consider their model when analyzing sequence data. We expect that insights into the best entities will evolve over time as more work is done to investigate the role of rare variants.
Other decisions regarding the definition of the entities (e.g., genes, sliding windows, or pathways) will also affect the results. In its most simple form, the number of entities will require a particular level of significance, thus affecting the adjustment for multiple testing. The decision will also affect coverage of the genes and genome, which can influence the power to detect associations. Perhaps Sykes et al.  are correct in suggesting that to boost power, we should filter genes to select those with associated common SNPs in order to better assess the role of rare variants if there is not adequate power to detect associations with the rare variants alone using multiallelic approaches. Pathways, as illustrated by Cherkas et al. , are beginning to be considered and may provide a useful strategy for rare variant aggregation based on functional implication.
Thus the overarching lesson is that creativity in modeling the genetic and biologic process with the genetic entities and statistical methods is an important factor in the detection of causal rare variants, and the process is not simple.
The Genetic Analysis Workshops are supported by National Institutes of Health (NIH) grant R01 GM031575. RMC is supported in part by the Database and Statistics Core of NIH grant HL-28481. We thank the members of Group 13 for their creative contributions to the workshop and their lively discussions at the meeting.