Advances in DNA sequencing technologies have greatly facilitated the discovery of rare genetic variants in the human genome, many of which may contribute to common disease risk. However, evaluating their individual or even collective effects on disease risk requires very large sample sizes, which involves study designs that are often prohibitively expensive. We present an alternative approach for determining genotypes in large numbers of individuals for all variants discovered in the sequence of relatively few individuals. Specifically, we developed a new imputation algorithm that utilizes whole-exome sequencing data from 25 members of the South Dakota Hutterite population, and genome-wide SNP genotypes from >1400 individuals from the same founder population. The algorithm relies on identity-by-descent sharing of phased haplotypes, a different strategy from the linkage disequilibrium approach used by most imputation methods. We imputed genotypes discovered in the sequence data to, on average, ~77% of chromosomes among the 1400 individuals. Median R² between imputed and directly genotyped data was > 0.99. As expected, many variants that are vanishingly rare in European populations have risen to larger frequencies in the founder population, and would be amenable to single-SNP analyses.
Exome sequencing; association study; IBD calculation; complex pedigrees; imputation
We present the most comprehensive comparison to date of the predictive benefit of genetics in addition to currently used clinical variables, using genotype data for 33 single-nucleotide polymorphisms (SNPs) in 1,547 Caucasian men from the placebo arm of the REduction by DUtasteride of prostate Cancer Events (REDUCE®) trial. Moreover, we conducted a detailed comparison of three techniques for incorporating genetics into clinical risk prediction. The first method was a standard logistic regression model, which included separate terms for the clinical covariates and for each of the genetic markers. This approach ignores a substantial amount of external information concerning effect sizes for these Genome Wide Association Study (GWAS)-replicated SNPs. The second and third methods investigated two possible approaches to incorporating meta-analysed external SNP effect estimates: one via a weighted prostate cancer (PCa) ‘risk’ score based solely on the meta-analysis estimates, and the other incorporating both the current and prior data via informative priors in a Bayesian logistic regression model. All methods demonstrated a slight improvement in predictive performance upon incorporation of genetics. The two methods that incorporated external information showed the greatest increase in the area under the receiver operating characteristic curve (AUC), from 0.61 to 0.64. The value of our methods comparison is likely to lie in observations of performance similarities, rather than differences, between three approaches of very different resource requirements. The two methods that included external information performed best, but only marginally despite substantial differences in complexity.
prostate cancer; genetic clinical risk prediction; genetic scores; Bayesian logistic regression; predictive assessment
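As a rough illustration of the weighted ‘risk’ score mentioned in the abstract above, the sketch below sums risk-allele counts weighted by the log of externally estimated odds ratios. The function name, the three-SNP example, and the odds ratios are hypothetical and are not taken from the REDUCE analysis; the exact form of the score used in the study is not given in the abstract.

```python
import math

def weighted_risk_score(allele_counts, odds_ratios):
    """Sum of risk-allele counts (0, 1 or 2) weighted by log odds ratios.

    The odds ratios stand in for external meta-analysis estimates;
    the values used below are made up for illustration.
    """
    if len(allele_counts) != len(odds_ratios):
        raise ValueError("need one odds ratio per SNP")
    return sum(c * math.log(o) for c, o in zip(allele_counts, odds_ratios))

# Hypothetical individual genotyped at three SNPs.
example_counts = [0, 1, 2]
example_odds_ratios = [1.10, 1.25, 1.05]
example_score = weighted_risk_score(example_counts, example_odds_ratios)
print(round(example_score, 4))
```

A score of this kind can then enter a clinical model as a single covariate, which is one reason it is cheaper to apply than a model with a separate term per SNP.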
Accurate genetic association studies are crucial for the detection and the validation of disease determinants. One of the main confounding factors that affect accuracy is population stratification, and great efforts have been expended over the past decade to detect and to adjust for it. We now have efficient solutions for population stratification adjustment for single-SNP (single-nucleotide polymorphism) inference in genome-wide association studies, but it is unclear whether these solutions can be effectively applied to rare variation studies and in particular gene-based (or set-based) association methods that jointly analyze multiple rare and common variants. We examine here, both theoretically and empirically, the performance of two commonly used approaches for population stratification adjustment, genomic control and principal component analysis, when used on gene-based association tests. We show that, unlike in single-SNP inference, genes with diverse compositions of rare and common variants may suffer from population stratification to varying extents. The inflation in gene-level statistics could be affected by the number and the allele frequency spectrum of SNPs in the gene, and by the gene-based testing method used in the analysis. As a consequence, using a universal inflation factor as a genomic control should be avoided in gene-based inference with sequencing data. We also demonstrate that caution needs to be exercised when using principal component adjustment because the accuracy of the adjusted analyses depends on the underlying population substructure, on the way the principal components are constructed, and on the number of principal components used to recover the substructure.
sequencing studies; gene-based association test; genomic control; principal component analysis; C-alpha test; burden test
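For readers unfamiliar with genomic control, the sketch below shows the standard single-SNP form of the adjustment: a single inflation factor is estimated as the median observed 1-df chi-square statistic divided by the theoretical median, and every statistic is divided by it. This is the "universal inflation factor" that the abstract above cautions against for gene-based tests; the function names and the five example statistics are hypothetical.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def inflation_factor(chi2_stats):
    """Ratio of the observed median statistic to its null expectation."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN

def genomic_control(chi2_stats):
    """Divide every statistic by one shared factor (never below 1)."""
    lam = max(1.0, inflation_factor(chi2_stats))
    return lam, [s / lam for s in chi2_stats]

# Hypothetical 1-df chi-square statistics from five SNPs.
example_stats = [0.2, 0.5, 0.91, 1.4, 3.0]
lam, adjusted = genomic_control(example_stats)
print(round(lam, 3), [round(s, 3) for s in adjusted])
```

Because the same factor is applied to every test, the correction cannot reflect gene-to-gene differences in the number or frequency spectrum of variants, which is the limitation the abstract describes.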
We describe statistical methods that extend the application of admixture mapping from unrelated individuals to nuclear pedigrees, allowing existing pedigree-based collections to be fully exploited. Computational challenges have been overcome by developing a fast algorithm that exploits the factorial structure of the underlying model of ancestry transitions. This has been implemented as an extension of the program ADMIXMAP. We demonstrate the application of the method to a study of sarcoidosis in African Americans that has previously been analyzed only as an admixture mapping study restricted to unrelated individuals. Although the ancestry signals detected in this pedigree analysis are generally similar to those detected in the earlier analysis of unrelated cases, we are able to extract more information and this yields a much sharper exclusion map; using the classical criterion of an LOD score of minus 2, the pedigree analysis is able to exclude a risk ratio of 2 or more associated with African ancestry over 96% of the genome, compared with only 83% in the earlier analysis of unrelated individuals only. Although the pedigree extension of ADMIXMAP can use ancestry-informative markers only at relatively low density, it can use imputed ancestry states from programs such as WINPOP or HAPMIX that use dense SNP marker genotypes for admixture mapping. This extends both the efficiency and the range of application of this powerful gene mapping method.
admixture; ancestry; pedigrees; linkage; sarcoidosis; African American; hidden Markov models
In a commentary on the evolution of the field of genetic epidemiology over the past 10 years in this issue, Khoury et al. highlight several important developments, including the emergence of evaluation of genetic discoveries for their translational utility and of standards for reporting genetic findings. In this companion to their article, I reflect on some of these trends and speculate about the direction of the field in the future. In particular, I emphasize the opportunities posed by novel technologies like next-generation sequencing and the biological insights emerging from integrative genomics, but I also question the utility of large consortia. The basic principles of population-based research and the importance of taking account of the environment remain important to the field.
environment; genome-wide association studies; consortia; next-generation sequencing; epigenetics; genetic testing; translational research; integrative genomics
Population stratification leads to a predictable phenomenon—a reduction in the number of heterozygotes compared to that calculated assuming Hardy-Weinberg Equilibrium (HWE). We show that population stratification results in another phenomenon—an excess in the proportion of spouse-pairs with the same genotypes at all ancestrally informative markers, resulting in ancestrally related positive assortative mating. We use principal components analysis to show that there is evidence of population stratification within the Framingham Heart Study, and show that the first principal component correlates with a North-South European cline. We then show that the first principal component is highly correlated between spouses (r=0.58, p=0.0013), demonstrating that there is ancestrally related positive assortative mating among the Framingham Caucasian population. We also show that the single nucleotide polymorphisms loading most heavily on the first principal component show an excess of homozygotes within the spouses, consistent with similar ancestry-related assortative mating in the previous generation. This nonrandom mating likely affects genetic structure seen more generally in the North American population of European descent today, and decreases the rate of decay of linkage disequilibrium for ancestrally informative markers.
population stratification; non-random mating; Hardy-Weinberg equilibrium
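The heterozygote deficit described in the abstract above can be checked with a small calculation. The sketch below pools subpopulations that are each in Hardy-Weinberg equilibrium and compares the heterozygote frequency expected from the pooled allele frequency with the frequency actually present in the pooled sample. The function names and the two-subpopulation example (allele frequencies 0.2 and 0.8) are hypothetical and unrelated to the Framingham data.

```python
def hwe_heterozygosity(p):
    """Expected heterozygote frequency under HWE for allele frequency p."""
    return 2.0 * p * (1.0 - p)

def pooled_heterozygosity(freqs, sizes):
    """Return (expected, observed) heterozygote frequency in a pooled sample.

    expected: computed from the pooled allele frequency, assuming HWE.
    observed: weighted average of within-subpopulation HWE heterozygosity.
    """
    total = float(sum(sizes))
    weights = [n / total for n in sizes]
    p_bar = sum(w * p for w, p in zip(weights, freqs))
    expected = hwe_heterozygosity(p_bar)
    observed = sum(w * hwe_heterozygosity(p) for w, p in zip(weights, freqs))
    return expected, observed

# Two equally sized subpopulations with different allele frequencies.
expected_het, observed_het = pooled_heterozygosity([0.2, 0.8], [500, 500])
deficit = expected_het - observed_het
print(round(expected_het, 3), round(observed_het, 3), round(deficit, 3))
```

The deficit equals twice the variance of the allele frequency across subpopulations, so it vanishes only when the subpopulations share the same frequency.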
Genomic regions with replicated linkage to asthma-related phenotypes likely harbor multiple susceptibility loci with relatively minor effects on disease susceptibility. The 11q13 chromosomal region has repeatedly been linked to asthma, and five genes residing in this region have reported replicated associations. Cortactin, an actin-binding protein encoded by the CTTN gene in 11q13, constitutes a key regulator of cytoskeletal dynamics and contractile cell machinery, events facilitated by interaction with myosin light chain kinase, encoded by MYLK, a gene we recently reported as associated with severe asthma in African Americans. To evaluate potential association of CTTN gene variation with asthma susceptibility, CTTN exons and flanking regions were re-sequenced in 48 non-asthmatic multiethnic samples, leading to selection of nine tagging polymorphisms for case-control association studies in individuals of European and African descent. After ancestry adjustments, an intronic variant (rs3802780) was significantly associated with severe asthma (odds ratio [OR]: 1.71; 95% confidence interval [CI]: 1.20-2.43; p = 0.003) in a joint analysis. Further analyses provided evidence of independent and additive effects of CTTN and MYLK risk variants for severe asthma susceptibility in African Americans (accumulated OR: 2.93, 95% CI: 1.40-6.13, p = 0.004). These data suggest that CTTN gene variation may contribute to severe asthma and that the combined effects of CTTN and MYLK risk polymorphisms may further increase susceptibility to severe asthma in African Americans harboring both genetic variants.
CTTN; MLCK; cytoskeleton; SNP; asthma
Complex genetic disorders are a result of a combination of genetic and non-genetic factors, all potentially interacting. Machine learning methods hold the potential to identify multi-locus and environmental associations thought to drive complex genetic traits. Decision trees, a popular machine learning technique, offer a computationally low complexity algorithm capable of detecting associated sets of SNPs of arbitrary size, including modern genome-wide SNP scans. However, interpretation of the importance of an individual SNP within these trees can present challenges.
We present a new decision tree algorithm denoted as Bagged Alternating Decision Trees (BADTrees) that is based on identifying common structural elements in a bootstrapped set of ADTrees. The algorithm is of order nk², where n is the number of SNPs considered and k is the number of SNPs in the tree constructed. Our simulation study suggests that BADTrees have higher power and lower type I error rates than ADTrees alone, and comparable power with lower type I error rates compared to logistic regression. We illustrate the application of the method using simulated data as well as data from the Lupus Large Association Study 1 (7822 SNPs in 3548 individuals). Our results suggest that BADTrees hold promise as a low computational order algorithm for detecting complex combinations of SNP and environmental factors associated with disease.
Machine Learning; Genetic Association; Gene-Gene Interaction; Multi-locus Models
Genetic variants on the X-chromosome could potentially play an important role in some complex traits. However, development of methods for detecting association with X-linked markers has lagged behind that for autosomal markers. We propose methods for case-control association testing with X-chromosome markers in samples with related individuals. Our method, XM, appropriately adjusts for both correlation among relatives and male-female allele copy number differences. Features of XM include: (1) it is applicable to and computationally feasible for completely general combinations of family and case-control designs; (2) it allows for both unaffected controls and controls of unknown phenotype to be included in the same analysis; (3) it can incorporate phenotype information on relatives with missing genotype data; and (4) it adjusts for sex-specific trait prevalence values. We propose two other tests, Xχ and XW, which can also be useful in certain contexts. We derive the best linear unbiased estimator of allele frequency, and its variance, for X-linked markers. In simulation studies with related individuals, we demonstrate the power and validity of the proposed methods. We apply the methods to X-chromosome association analysis of (1) asthma in a Hutterite sample and (2) alcohol dependence in the GAW 14 COGA data. In analysis (1), we demonstrate computational feasibility of XM and the applicability of our robust variance estimator. In analysis (2), we detect significant association, after Bonferroni correction, between alcohol dependence and SNP rs979606 in the MAOA gene, where this gene has previously been found to be associated with substance abuse and antisocial behavior.
X-linked; score test; quasi-likelihood; pedigrees; GWAS
Although comorbidity among complex diseases (e.g., drug dependence syndromes) is well documented, genetic variants contributing to the comorbidity are still largely unknown. The discovery of genetic variants and their interactions contributing to comorbidity will likely shed light on underlying pathophysiological and etiological processes, and promote effective treatments for comorbid conditions. For this reason, studies to discover genetic variants that foster the development of comorbidity represent high-priority research projects, as manifested in the behavioral genetics studies now underway. The yield from these studies can be enhanced by adopting novel statistical approaches with the capacity to consider multiple genetic variants and possible interactions. For this purpose, we propose a bivariate Mann-Whitney (BMW) approach to unravel genetic variants and interactions contributing to comorbidity, as well as those unique to each comorbid condition. Through simulations, we found BMW outperformed two commonly adopted approaches in a variety of underlying disease and comorbidity models. We further applied BMW to datasets from the Study of Addiction: Genetics and Environment, investigating the contribution of 184 known nicotine dependence (ND) and alcohol dependence (AD) single nucleotide polymorphisms (SNPs) to the comorbidity of ND and AD. The analysis revealed a candidate SNP from CHRNA5, rs16969968, associated with both ND and AD, and replicated the findings in an independent dataset with a P-value of 1.06 × 10⁻³.
forward selection; high-order interaction; CHRNA5; nicotine dependence; alcohol dependence
This paper presents a projection regression model (PRM) to assess the relationship between a multivariate phenotype and a set of covariates, such as a genetic marker, age and gender. In the existing literature, a standard statistical approach to this problem is to fit a multivariate linear model to the multivariate phenotype and then use Hotelling’s T² to test hypotheses of interest. An alternative approach is to fit a simple linear model and test hypotheses for each individual phenotype and then correct for multiplicity. However, even when the dimension of the multivariate phenotype is relatively small, say 5, such standard approaches can suffer from low statistical power in detecting the association between the multivariate phenotype and the covariates. The PRM generalizes a statistical method based on the principal component of heritability for association analysis in genetic studies of complex multivariate phenotypes. The key components of the PRM include an estimation procedure for extracting several principal directions of multivariate phenotypes relating to covariates and a test procedure based on a wild bootstrap method for testing for the association between the weighted multivariate phenotype and explanatory variables. Simulation studies and an imaging genetic dataset are used to examine the finite sample performance of the PRM.
imaging genetics; multivariate phenotype; projection regression model; single nucleotide polymorphism; wild bootstrap
Next generation sequencing is widely used to study complex diseases because of its ability to identify both common and rare variants without prior single nucleotide polymorphism (SNP) information. Pooled sequencing of implicated target regions can lower costs and allow more samples to be analyzed, thus improving statistical power for disease-associated variant detection. Several methods for disease association tests of pooled data and for optimal pooling designs have been developed under certain assumptions of the pooling process, e.g. equal/unequal contributions to the pool, sequencing depth variation, and error rate. However, these simplified assumptions may not portray the many factors affecting pooled sequencing data quality, such as PCR amplification during target capture and sequencing, reference allele preferential bias, and others. As a result, the properties of the observed data may differ substantially from those expected under the simplified assumptions. Here, we use real datasets from targeted sequencing of pooled samples, together with microarray SNP genotypes of the same subjects, to identify and quantify factors (biases and errors) affecting the observed sequencing data. Through simulations, we find that these factors have a significant impact on the accuracy of allele frequency estimation and the power of association tests. Furthermore, we develop a workflow protocol to incorporate these factors in data analysis to reduce the potential biases and errors in pooled sequencing data and to gain better estimation of allele frequencies. The workflow, Psafe, is available at http://bioinformatics.med.yale.edu/group/.
pooled sequencing; allele frequency estimation; next-generation sequencing; disease association tests
Detecting uncommon causal variants (minor allele frequency (MAF) < 5%) is difficult with commercial single-nucleotide polymorphism (SNP) arrays that are designed to capture common variants (MAF > 5%). Haplotypes can provide insights into underlying linkage disequilibrium (LD) structure and can tag uncommon variants that are not well tagged by common variants. In this work, we propose a wei-SIMc-matching test that inversely weights haplotype similarities with the estimated standard deviation of haplotype counts, to boost the power of similarity-based approaches for detecting uncommon causal variants. We then compare the power of the wei-SIMc-matching test with that of several popular haplotype-based tests, including four other similarity-based tests, a global score test for haplotypes (global), a test based on the maximum score statistic over all haplotypes (max), and two newly proposed haplotype-based tests for rare variant detection. With systematic simulations under a wide range of LD patterns, the results show that wei-SIMc-matching and global are the two most powerful tests. Among these two tests, wei-SIMc-matching has reliable asymptotic P values, whereas global needs permutations to obtain reliable P values when the frequencies of some haplotype categories are low or when the trait is skewed. Therefore, we recommend wei-SIMc-matching for detecting uncommon causal variants with surrounding common SNPs, in light of its power and computational feasibility.
Haplotype; Similarity; Linkage disequilibrium; Rare variants
Unraveling the nature of genetic interactions is crucial to obtaining a more complete picture of complex diseases. It is thought that gene-gene interactions play an important role in the etiology of cancer, cardiovascular and immune-mediated disease. Interactions among genes are defined as phenotypic effects that differ from those observed for independent contributions of each gene, usually detected by univariate logistic regression methods. Using a multivariate extension of linkage disequilibrium, we have developed a new method, based on distances between sample covariance matrices for groups of SNPs, to test for interaction effects of two groups of genes associated with a disease phenotype. Since a disease-associated interacting locus will often be in linkage disequilibrium with more than one marker in the region, a method that examines a set of markers in a region collectively can offer greater power than traditional methods. Our method effectively identifies interaction effects in simulated data, as well as in data on the genetic contributions to the risk for graft-versus-host disease following hematopoietic cell transplantation.
genetic association studies; epistasis; multivariate analysis
For many clinical studies in cancer, germline DNA is prospectively collected for the purpose of discovering or validating single nucleotide polymorphisms (SNPs) associated with clinical outcomes. The primary clinical endpoints for many of these studies are time-to-event outcomes, such as time to death or disease progression, which are subject to censoring mechanisms. The Cox score test can be readily employed to test the association between a SNP and the outcome of interest. In addition to the effect size, sample size, and censoring distribution, the power of the test will depend on the underlying genetic risk model and the distribution of the risk allele. We propose a rigorous account of power and sample size calculations under a variety of genetic risk models without resorting to the commonly used contiguous alternative assumption. Practical advice, along with an open-source software package to design SNP association studies with survival outcomes, is provided.
censoring; pharmacogenomics; Cox score test; genetic risk; SNP association study
Association analyses may follow an initial linkage analysis for mapping and identifying genes underlying complex quantitative traits and may be conducted on unrelated subsets of individuals where only one member of a family is included. We evaluate two methods to select one sibling per sibship when multiple siblings are available: 1) one sibling with the most extreme trait value (EV); and 2) one sibling using a combination score statistic based on extreme trait values and identity-by-descent sharing information (IBD-EV). We compare their type I error and power. Furthermore, we compare these selection strategies with a strategy that randomly selects one sibling per sibship and with an approach that includes all siblings, using both a simulation study and an application to fasting blood glucose in the Framingham Heart Study. When the genetic effect is homogeneous, we find that using the combination score can increase power by 30 to 40% compared to a random selection strategy, and loses only 8 to 13% of power compared to the full sibship analysis, across all additive models considered, while saving at least 50% of genotyping costs. In the presence of genetic heterogeneity, the score offers a 50% increase in power over a random selection strategy, but there is substantial loss compared to the full sibship analysis. In the application to fasting blood glucose, two SNPs are found in common for the selection strategies and the full sample among the 10 highest ranked SNPs. The EV strategy tends to agree with the IBD-EV strategy and the analysis of the full sample.
linkage analysis; association study; linkage disequilibrium; identity-by-descent (IBD)
Genome-wide association studies (GWAS) of complex traits have generated many association signals for single nucleotide polymorphisms (SNPs). To understand the underlying causal genetic variant(s), focused DNA resequencing of targeted genomic regions is commonly used, yet the current cost of resequencing limits sample sizes for resequencing studies. Information from the large GWAS can be used to guide choice of samples for resequencing, such as the SNP genotypes in the targeted genomic region. Viewing the GWAS tag-SNPs as imperfect surrogates for the underlying causal variants, yet expecting that the tag-SNPs are correlated with the causal variants, a reasonable approach is a two-phase case-control design, with the GWAS serving as the first-phase and the resequencing study serving as the second-phase. Using stratified sampling based on both tag-SNP genotypes and case-control status, we explore the gains in power of a two-phase design relative to randomly sampling cases and controls for resequencing (i.e., ignoring tag-SNP genotypes). Simulation results show that stratified sampling based on both tag-SNP genotypes and case-control status is not likely to have lower power than stratified sampling based only on case-control status, and can sometimes have substantially greater power. The gain in power depends on the amount of linkage disequilibrium between the tag-SNP and causal variant alleles, as well as the effect size of the causal variant. Hence, the two-phase design provides an efficient approach to follow-up GWAS signals with DNA resequencing.
DNA resequencing; Horvitz-Thompson estimate; inverse sampling fraction weights; two-phase sampling
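The keywords above refer to inverse sampling fraction weights. As a minimal sketch of that idea, assuming a second phase that samples a known number of subjects from each first-phase stratum, the code below weights each sampled subject by the stratum size divided by the number sampled and forms a Horvitz-Thompson estimate of a population mean. The function name, the two strata, and all counts are hypothetical; the abstract does not describe the estimator in this detail.

```python
def horvitz_thompson_mean(strata):
    """Weighted estimate of a population mean from a stratified subsample.

    strata maps a stratum label to (stratum_size, sampled_values).
    Each sampled value is weighted by stratum_size / number_sampled,
    the inverse of the sampling fraction in its stratum.
    """
    weighted_total = 0.0
    population_size = 0
    for stratum_size, sampled_values in strata.values():
        weight = stratum_size / len(sampled_values)
        weighted_total += weight * sum(sampled_values)
        population_size += stratum_size
    return weighted_total / population_size

# Hypothetical first phase: 100 tag-SNP carriers and 900 non-carriers.
# Second phase resequences 10 subjects per stratum; 1 marks a carrier
# of the causal variant, 0 a non-carrier.
example_strata = {
    "tag_carrier": (100, [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]),
    "tag_noncarrier": (900, [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
}
ht_estimate = horvitz_thompson_mean(example_strata)
naive_estimate = sum(sum(v) for _, v in example_strata.values()) / 20
print(round(ht_estimate, 3), round(naive_estimate, 3))
```

The unweighted mean overstates the carrier frequency here because tag-SNP carriers were deliberately oversampled; the weights undo that oversampling.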
Sequencing studies using whole-genome or exome scans are still more expensive than genome wide association studies (GWAS) on a per-subject basis. As a result, subsets of subjects from a larger study are often selected for sequencing. To perform an agnostic investigation of the entire genome, subjects may be selected that capture independent ancestral lineages, i.e. founder genomes, and thus avoid redundant information from regions that were inherited identical by descent (IBD) from a common ancestor. We present SampleSeq2 which can be used to select a subset of optimally unrelated subjects with minimal IBD sharing. It also can be used to estimate the number, GT, of founder chromosomes in a sample or select the minimum number of subjects that will carry a target GT. We evaluated SampleSeq2 compared to a random draw of a small number of subjects both by simulation and using the Anabaptist genealogy. SampleSeq2 provided an increase in GT relative to a random draw across a range of small sample sizes. This increase in founder chromosomes improves the power of association tests, mitigates the effect of cryptic relatedness on parameter estimates, increases the total yield of alleles from sequencing, and minimizes the average size of regions shared IBD around disease alleles in cases.
resequencing; ancestry; cryptic relatedness; study design; subject selection
Searching for rare genetic variants associated with complex diseases can be facilitated by enriching for diseased carriers of rare variants by sampling cases from pedigrees enriched for disease, possibly with related or unrelated controls. This strategy, however, complicates analyses because of shared genetic ancestry, as well as linkage disequilibrium among genetic markers. To overcome these problems, we developed broad classes of “burden” statistics and kernel statistics, extending commonly used methods for unrelated case-control data to allow for known pedigree relationships, for autosomes and the X chromosome. Furthermore, by replacing pedigree-based genetic correlation matrices with estimates of genetic relationships based on large-scale genomic data, our methods can be used to account for population-structured data. By simulations, we show that the type I error rates of our developed methods are near the asymptotic nominal levels, allowing rapid computation of P-values. Our simulations also show that a linear weighted kernel statistic is generally more powerful than a weighted “burden” statistic. Because the proposed statistics are rapid to compute, they can be readily used for large-scale screening of the association of genomic sequence data with disease status.
burden test; kernel statistic; rare variants; pedigree data; genome sequence data
Biological plausibility and other prior information could help select genome-wide association (GWA) findings for further follow-up, but there is no consensus on which types of knowledge should be considered or how to weight them. We used experts’ opinions and empirical evidence to estimate the relative importance of 15 types of information at the single nucleotide polymorphism (SNP) and gene levels. Opinions were elicited from ten experts using a two-round Delphi survey. Empirical evidence was obtained by comparing the frequency of each type of characteristic in SNPs established as being associated with seven disease traits through GWA meta-analysis and independent replication, with the corresponding frequency in a randomly selected set of SNPs. SNP and gene characteristics were retrieved using a specially developed bioinformatics tool. Both the expert and the empirical evidence rated previous association in a meta-analysis or more than one study as conferring the highest relative probability of true association, while previous association in a single study ranked much lower. High relative probabilities were also observed for location in a functional protein domain, while location in a region evolutionarily conserved in vertebrates was ranked high by the data but not by the experts. Our empirical evidence did not support the importance attributed by the experts to whether the gene encodes a protein in a pathway or shows interactions relevant to the trait. Our findings provide insight into the selection and weighting of different types of knowledge in SNP or gene prioritization, and point to areas requiring further research.
Gene prioritization; Genome-wide association studies; Bioinformatics databases
Prioritization is the process whereby a set of possible candidate genes or SNPs is ranked so that the most promising can be taken forward into further studies. In a genome-wide association study, prioritization is usually based on the p-values alone, but researchers sometimes take account of external annotation information about the SNPs, such as whether the SNP lies close to a good candidate gene. Using external information in this way is inherently subjective and is often not formalized, making the analysis difficult to reproduce. Building on previous work that has identified fourteen important types of external information, we present an approximate Bayesian analysis that produces an estimate of the probability of association. The calculation combines four sources of information: the genome-wide data, SNP information derived from bioinformatics databases, empirical SNP weights, and the researchers’ subjective prior opinions. The calculation is fast enough that it can be applied to millions of SNPs, and although it does rely on subjective judgments, those judgments are made explicit so that the final SNP selection can be reproduced. We show that the resulting probability of association is intuitively more appealing than the p-value because it is easier to interpret and it makes allowance for the power of the study. We illustrate the use of the probability of association for SNP prioritization by applying it to a meta-analysis of kidney function genome-wide association studies and demonstrate that SNP selection performs better using the probability of association compared with p-values alone.
replication; prior knowledge; genome-wide studies
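To make the idea of a probability of association concrete, the sketch below combines a SNP's effect estimate and standard error with a prior probability using Wakefield's approximate Bayes factor. This is one well-known approximation chosen here for illustration; the abstract above does not state which approximation the authors use, and the function names, prior standard deviation, and example numbers are all hypothetical.

```python
import math

def approximate_bayes_factor(beta_hat, se, prior_sd):
    """Wakefield-style Bayes factor for the null versus the alternative.

    Values below 1 favour association. The alternative assumes the true
    effect is normally distributed with mean 0 and the given prior_sd.
    """
    v = se ** 2
    w = prior_sd ** 2
    z_squared = (beta_hat / se) ** 2
    shrink = w / (v + w)
    return math.sqrt((v + w) / v) * math.exp(-0.5 * z_squared * shrink)

def probability_of_association(beta_hat, se, prior_sd, prior_prob):
    """Posterior probability of association given a prior probability."""
    prior_odds_null = (1.0 - prior_prob) / prior_prob
    abf = approximate_bayes_factor(beta_hat, se, prior_sd)
    return 1.0 / (1.0 + abf * prior_odds_null)

# Same GWAS evidence, two hypothetical priors: a default prior and a
# larger prior for a SNP with favourable external annotation.
ppa_default = probability_of_association(0.10, 0.02, 0.2, 1e-4)
ppa_annotated = probability_of_association(0.10, 0.02, 0.2, 1e-3)
print(ppa_default < ppa_annotated)
```

Because the standard error enters the Bayes factor directly, two SNPs with the same p-value can receive different probabilities of association when they are estimated with different precision, which is one sense in which such a measure allows for study power.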
Patterns of linkage disequilibrium are often depicted pictorially by using tools that rely on visualizations of raw data or pairwise correlations among individual markers. Such approaches can fail to highlight some of the more interesting and complex features of haplotype structure. To enable natural visual comparisons of haplotype structure across subgroups of a population (e.g. isolated subpopulations or cases and controls), we propose an alternative visualization that provides a novel graphical representation of haplotype frequencies. We introduce Haploscope, a tool for visualizing the haplotype cluster frequencies that are produced by statistical models for population haplotype variation. We demonstrate the utility of our technique by examining haplotypes around the LCT gene, an example of recent positive selection, in samples from the Human Genome Diversity Panel. Haploscope, which has flexible options for annotation and inspection of haplotypes, is available for download at http://scheet.org/software.
Clustering; haplotypes; linkage disequilibrium; software; visualization
Although population differences in gene expression have been established, their impact on differential gene expression studies in large populations is not well understood. We describe the effect of self-reported race on a gene expression study of lung function in asthma. We generated gene expression profiles for 254 young adults with asthma (205 non-Hispanic whites and 49 African Americans), for whom total RNA derived from peripheral blood CD4+ lymphocytes and lung function measurements were obtained concurrently. We identified four principal components that explained 62% of the variance in gene expression. The dominant principal component, which explained 29% of the total variance in gene expression, was strongly associated with self-identified race (P < 10^−16). The impact of these racial differences was observed when we performed differential gene expression analysis of lung function. Using multivariate linear models, we tested whether gene expression was associated with a quantitative measure of lung function: pre-bronchodilator forced expiratory volume in one second (FEV1). Though unadjusted linear models of FEV1 identified several genes strongly correlated with lung function, these correlations were due to racial differences in the distribution of both FEV1 and gene expression, and were no longer statistically significant following adjustment for self-identified race. These results suggest that self-identified race is a critical confounding covariate in epidemiologic studies of gene expression and that, as in genetic studies, careful consideration of self-identified race in gene expression profiling studies is needed to avoid spurious association.
ancestry; gene expression; population stratification; self-identified race
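The confounding mechanism described above can be shown with a small toy example. The Python sketch below uses invented numbers (not study data) for two groups labelled A and B that differ in the mean of both an expression value and FEV1, while the two are uncorrelated within each group. Adjusting for a categorical covariate in a linear model is equivalent to centering both variables within groups before correlating them, which is what the hypothetical helper `center_within_groups` does.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def center_within_groups(values, groups):
    """Subtract each group's mean, i.e. regress out the group label."""
    totals, counts = {}, {}
    for v, g in zip(values, groups):
        totals[g] = totals.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - totals[g] / counts[g] for v, g in zip(values, groups)]


# Synthetic toy data: group B has higher expression AND higher FEV1,
# but within each group the two variables are unrelated.
groups = ["A"] * 4 + ["B"] * 4
expression = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
fev1 = [3.0, 4.0, 4.0, 3.0, 5.0, 6.0, 6.0, 5.0]

unadjusted = pearson(expression, fev1)
adjusted = pearson(
    center_within_groups(expression, groups),
    center_within_groups(fev1, groups),
)
print(unadjusted, adjusted)
```

In this toy example the pooled correlation is strong, yet it disappears entirely once the group label is accounted for. This is the same pattern as the unadjusted versus race-adjusted models described in the abstract.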
The wave of next-generation sequencing data has arrived. However, many questions remain about how best to analyze sequence data, particularly regarding the contribution of rare genetic variants to human disease. Numerous statistical methods have been proposed to aggregate association signals across multiple rare variant sites in an effort to increase statistical power; however, the precise relation between the tests is often not well understood. We present a geometric representation for rare variant data in which rare allele counts in case and control samples are treated as vectors in Euclidean space. The geometric framework facilitates a rigorous classification of existing rare variant tests into two broad categories: tests for a difference in the lengths of the case and control vectors, and joint tests for a difference in either the lengths or angles of the two vectors. We demonstrate that the genetic architecture of a trait, including the number and frequency of risk alleles, directly relates to the behavior of the length and joint tests. Hence, the geometric framework allows prediction of which tests will perform best under different disease models. Furthermore, the structure of the geometric framework immediately suggests additional classes and types of rare variant tests. We consider two general classes of tests that show robustness to non-causal and protective variants. The geometric framework introduces a novel way to assess current rare variant methodology and provides guidelines for both applied and theoretical researchers.
rare variants; sequencing; burden tests
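To make the geometric picture above concrete, the following Python sketch treats rare allele counts at a handful of variant sites as two vectors, one for cases and one for controls, and computes their Euclidean lengths and the angle between them. The counts and function names are hypothetical, equal numbers of cases and controls are assumed so that raw counts are comparable, and the sketch shows only the two geometric quantities, not any specific test statistic from the paper.

```python
import math


def length(counts):
    """Euclidean length of a vector of rare allele counts."""
    return math.sqrt(sum(c * c for c in counts))


def angle_degrees(u, v):
    """Angle between two count vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    cos = dot / (length(u) * length(v))
    cos = max(-1.0, min(1.0, cos))  # guard against rounding just outside [-1, 1]
    return math.degrees(math.acos(cos))


# Scenario 1: cases carry more rare alleles at the same sites.
cases_1, controls_1 = [6, 0, 8, 0], [3, 0, 4, 0]
# Scenario 2: same total burden, but at different sites
# (e.g. a mix of risk and protective variants).
cases_2, controls_2 = [3, 0, 4, 0], [0, 3, 0, 4]

for cases, controls in [(cases_1, controls_1), (cases_2, controls_2)]:
    print(length(cases), length(controls), angle_degrees(cases, controls))
```

In the first scenario the vectors point the same way but differ in length, so a length-based test has something to detect. In the second the lengths are identical and only the angle differs, which is the situation in which a joint test of length and angle would be expected to help.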
Current genome-wide association studies still rely heavily on a single-marker strategy, in which each single nucleotide polymorphism (SNP) is tested individually for association with a phenotype. Although methods and software packages that consider multimarker models have become available, they have been slow to gain wide adoption, and their efficacy in real data analysis is often questioned. Using extensive simulations, we provide further insight into the performance of simple multimarker association tests compared with single-marker tests. The results reveal both the power advantages and the power disadvantages of the two-marker test relative to the single-marker test. Power differentials depend on the correlation structure among tag SNPs, as well as that between tag SNPs and causal variants. A two-marker test performs relatively better than single-marker tests when the correlation between the two adjacent markers is high. However, using HapMap data, two-marker tests tended to have a greater chance of being less powerful than single-marker tests, owing to constraints on the number of haplotypes actually possible in the HapMap data. Yet the average power difference was small whenever the single-marker test was more powerful, whereas there were many situations in which the two-marker test was much more powerful. These findings can help guide the analysis of future studies.
Asymptotic power; single-marker test; two-marker test; genome-wide association
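Since the comparison above turns on the correlation between two adjacent markers, it may help to see how that correlation is commonly quantified. The Python sketch below computes the standard linkage disequilibrium measure r² from counts of the four possible two-marker haplotypes. The function name and the counts are hypothetical, and this is not the authors' simulation or power calculation.

```python
def ld_r2(n_AB, n_Ab, n_aB, n_ab):
    """Squared correlation (r^2) between two biallelic markers.

    Arguments are counts of the four haplotypes formed by alleles A/a
    at the first marker and B/b at the second.
    """
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n
    p_A = (n_AB + n_Ab) / n   # frequency of allele A at marker 1
    p_B = (n_AB + n_aB) / n   # frequency of allele B at marker 2
    d = p_AB - p_A * p_B      # linkage disequilibrium coefficient D
    return d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))


# All four haplotypes observed (hypothetical counts).
four_haplotypes = ld_r2(n_AB=40, n_Ab=10, n_aB=10, n_ab=40)
# Only three haplotypes observed, as can happen for closely spaced markers.
three_haplotypes = ld_r2(n_AB=50, n_Ab=0, n_aB=25, n_ab=25)
print(four_haplotypes, three_haplotypes)
```

When one of the four haplotypes is absent, as in the second call, the pair of markers carries fewer distinct haplotypes than a two-marker model allows for. This is the kind of constraint the abstract mentions for the HapMap data.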