Genome-wide association (GWA) studies have reported susceptibility regions in the human genome for many common diseases and traits; however, these loci explain only a minority of trait heritability. To boost the power of a GWA study, substantial research effort has focused on integrating other available genomic information into the analysis. Advances in high-throughput technologies have generated a wealth of genomic data and made it feasible to combine SNP and gene expression data.
In this paper we propose a novel procedure to incorporate gene expression information into GWA analysis. This procedure uses weights constructed from gene expression measurements to adjust p values from a GWA analysis. Results from simulation analyses indicate that the proposed procedure may achieve substantial power gains while controlling the family-wise type I error rate (FWER) at the nominal level. To demonstrate the implementation of our proposed approach, we apply the weight adjustment procedure to a GWA study of serum interferon-regulated chemokine levels in systemic lupus erythematosus (SLE) patients. The study results can provide valuable insights for the functional interpretation of GWA signals.
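The general idea of adjusting p values with weights can be illustrated with a weighted Bonferroni rule. This is a minimal sketch and not the authors' procedure: the function name `weighted_bonferroni`, the rescaling of weights to mean 1, and the example numbers are assumptions made for illustration, and the rule controls FWER only if the weights are chosen independently of the p values being tested.

```python
def weighted_bonferroni(p_values, weights, alpha=0.05):
    """Return a rejection flag per test under a weighted Bonferroni rule.

    Hypothetical helper for illustration only. Weights are rescaled to
    average 1, and test i is rejected when p_i <= (alpha / m) * w_i.
    """
    m = len(p_values)
    if m == 0 or len(weights) != m:
        raise ValueError("p_values and weights must be non-empty and equal in length")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")

    mean_weight = sum(weights) / m
    scaled = [w / mean_weight for w in weights]  # scaled weights now average to 1
    threshold = alpha / m                        # ordinary Bonferroni threshold

    # A zero weight means the test can never be rejected.
    return [w > 0 and p <= threshold * w for p, w in zip(p_values, scaled)]


# Made-up example: the second SNP receives a large expression-based weight.
p_values = [1e-3, 0.02, 0.5, 0.04]
weights = [1.0, 3.0, 0.0, 0.0]
rejections = weighted_bonferroni(p_values, weights, alpha=0.05)
```

With equal weights the rule reduces to the ordinary Bonferroni correction, so up-weighting some tests necessarily makes the threshold stricter for others.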
The R source code for implementing the proposed weighting procedure is available at http://www.biostat.umn.edu/~yho/research.html
p value weighting; family-wise error rate; statistical power; integrative genomic analysis; SLE
We propose in this paper a set-valued (SV) system model, which is a generalized form of logistic (LG) and probit (Probit) regression, to be considered as a method for discovering genetic variants, especially rare genetic variants in next generation sequencing studies, for a binary phenotype. We propose a new set-valued system identification method to estimate all the underlying key system parameters for the Probit model and compare it with the LG model in the setting of genetic association studies. Across an extensive series of simulation studies, the Probit method maintained Type I error control and had similar or greater power than the LG method, and it was robust to different noise distributions (logistic, normal, or t). Additionally, the Probit association parameter estimate was 2.7- to 46.8-fold less variable than the LG log-odds ratio association parameter estimate. Less variability in the association parameter estimate translates to greater power and robustness across the spectrum of minor allele frequencies (MAFs), and these advantages are most pronounced for rare variants. For instance, in a simulation that generated data from an additive logistic model with an odds ratio of 7.4 for a rare single nucleotide polymorphism with a MAF of 0.005 and a sample size of 2300, the Probit method had 60% power whereas the LG method had 25% power at the α = 10⁻⁶ level. Consistent with these simulation results, the set of variants identified by the LG method was a subset of those identified by the Probit method in two example analyses. Thus, we suggest the Probit method may be a competitive alternative to the LG method in genetic association studies such as candidate gene, genome-wide, or next generation sequencing studies for a binary phenotype.
Set-valued system model; binary phenotype; threshold model; genetic variants; rare variants; next-generation sequencing studies
Culturally-driven marital practices provide a key instance of an interaction between social and genetic processes in shaping patterns of human genetic variation, producing, for example, increased identity by descent through consanguineous marriage. A commonly used measure to quantify identity by descent in an individual is the inbreeding coefficient, a quantity that reflects not only consanguinity, but also other aspects of kinship in the population to which the individual belongs. Here, in populations worldwide, we examine the relationship between genomic estimates of the inbreeding coefficient and population patterns in genetic variation.
Using genotypes at 645 microsatellites, we compare inbreeding coefficients from 5,043 individuals representing 237 worldwide populations to demographic consanguinity frequency estimates available for 26 populations, as well as to other quantities that can illuminate population-genetic influences on inbreeding coefficients.
We observe higher inbreeding coefficient estimates in populations and geographic regions with known high levels of consanguinity or genetic isolation, and in populations with an increased effect of genetic drift and decreased genetic diversity with increasing distance from Africa. For the small number of populations with specific consanguinity estimates, we find a correlation between inbreeding coefficients and consanguinity frequency (r=0.349, P=0.040).
The results emphasize the importance of both consanguinity and population-genetic factors in influencing variation in inbreeding coefficients, and they provide insight into factors useful for assessing the effect of consanguinity on genomic patterns in different populations.
consanguinity; homozygosity; identity by descent; inbreeding; short tandem repeats
The kinship2 package is restructured from the previous kinship package. Existing features are now enhanced and new features have been added for handling pedigree objects.
Pedigree plotting features have been updated to display features on complex pedigrees while adhering to pedigree plotting standards. Kinship matrices can now be calculated for the X chromosome. Other methods have been added to subset and trim pedigrees while maintaining the pedigree structure.
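The kind of calculation behind a kinship matrix can be sketched with the standard recursive definition of autosomal kinship coefficients. This is a minimal Python illustration and not the kinship2 implementation or its API: the function name `kinship_matrix` and the input layout (a list of `(id, father, mother)` tuples with parents listed before their children) are assumptions, and the X-chromosome case mentioned above is not covered.

```python
def kinship_matrix(pedigree):
    """Autosomal kinship coefficients for a pedigree (hypothetical helper).

    `pedigree` is a list of (id, father, mother) tuples in which parents
    appear before their children; unknown parents are given as None.
    Returns a nested dict phi such that phi[a][b] is the kinship of a and b.
    """
    ids = [person for person, _, _ in pedigree]
    phi = {a: {b: 0.0 for b in ids} for a in ids}
    seen = []

    for person, father, mother in pedigree:
        for parent in (father, mother):
            if parent is not None and parent not in seen:
                raise ValueError(f"parent {parent!r} must be listed before {person!r}")

        # Kinship with every earlier individual is the average of the
        # parents' kinship with that individual (0 for unknown parents).
        for other in seen:
            from_father = phi[father][other] if father is not None else 0.0
            from_mother = phi[mother][other] if mother is not None else 0.0
            phi[person][other] = phi[other][person] = 0.5 * (from_father + from_mother)

        # Self-kinship is 1/2 * (1 + inbreeding), where inbreeding is the
        # kinship between the two parents.
        both_known = father is not None and mother is not None
        parental = phi[father][mother] if both_known else 0.0
        phi[person][person] = 0.5 * (1.0 + parental)
        seen.append(person)

    return phi


# Two founders, two full siblings, and a child of the sibling pair.
example_pedigree = [
    ("F", None, None),
    ("M", None, None),
    ("S1", "F", "M"),
    ("S2", "F", "M"),
    ("C", "S1", "S2"),
]
phi = kinship_matrix(example_pedigree)
```

Processing individuals in parents-first order is what makes the single pass valid, because no earlier individual can be a descendant of the one currently being added.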
We make the kinship2 package available for R on the Comprehensive R Archive Network (CRAN), where data management is built-in and other packages can use the pedigree object.
pedigrees; genetic linkage analysis; kinship; graphics
The incorporation of gene-environment interactions could improve the ability to detect genetic associations with complex traits. For common genetic variants, single marker interaction tests and joint tests of genetic main effects and gene-environment interaction have been well established and used to identify novel association loci for complex diseases and continuous traits. For rare genetic variants, however, single marker tests are severely underpowered due to the low minor allele frequency, and only a few gene-environment interaction tests have been developed. We aim to develop powerful and computationally efficient tests for gene-environment interaction with rare variants.
In this paper, we propose interaction and joint tests for testing gene-environment interaction of rare genetic variants. Our approach is a generalization of existing gene-environment interaction tests for multiple genetic variants under certain conditions.
We show in our simulation studies that our interaction and joint tests have correct type I errors, and that the joint test is a powerful approach for testing genetic association, allowing for gene-environment interaction. We also illustrate our approach in a real data example from the Framingham Heart Study.
Our approach can be applied to both binary and continuous traits, and is powerful and computationally efficient.
rare variant analysis; gene-environment interaction; sequence kernel association test; joint test; generalized linear mixed model
A particular approach to visualization of descent of founder DNA copies in a pedigree has been suggested, which helps to understand haplotype sharing patterns among subjects of interest. However, the approach does not provide the information in an ideal format to show haplotype-sharing patterns. Therefore, we aimed to find an efficient way to visualize such sharing patterns, to demonstrate that our tool provides useful information for finding an informative subset of subjects for a sequence study.
The visualization package, SharedHap, computes and visualizes a novel metric, the SharedHap proportion, which quantifies haplotype-sharing among a set of subjects of interest. We applied SharedHap to simulated and real pedigree datasets to illustrate the approach.
SharedHap successfully represents haplotype-sharing patterns that contribute to linkage signals in both simulated and real datasets. Using the visualizations we were also able to find ideal sets of subjects for sequencing studies.
Our novel metric that can be computed using the SharedHap package provides useful information about haplotype-sharing patterns among subjects of interest. The visualization of the SharedHap proportion provides useful information in pedigree studies, allowing for better selection of candidate subjects for use in further sequencing studies.
IBD; Linkage analysis; Gene mapping; Sampling subjects; Sequencing; Sequence
Gene-gene interactions (GxG) are important to study because they are pervasive in biological systems and may help explain the missing heritability of complex traits. In this work, we propose a new similarity-based test to assess GxG at the gene level, which permits the study of epistasis at biologically functional units with amplified interaction signals.
Under the framework of gene-trait similarity regression (SimReg), we propose a gene-based test for detecting gene-gene interactions. SimReg uses a regression model to correlate trait similarity with genotypic similarity across a gene. Unlike existing gene-level methods based on leading principal components (PCs), SimReg summarizes all information on genotypic variation within a gene and can be used to assess the joint/interactive effects of two genes as well as the effect of one gene conditional on another.
Using simulations and a real data application to a warfarin study, we show that the SimReg GxG tests have satisfactory power and robustness under different genetic architectures when compared to existing gene-based interaction tests such as PC analysis or partial least squares (PLS). A genomewide association study with ~20,000 genes may be completed on a parallel computing system in 2 weeks.
Linkage analysis can help determine regions of interest in whole genome sequence studies. However, many linkage studies rely on older microsatellite (MSAT) panels. We set out to determine whether results would change if we regenotyped families using a dense map of SNPs.
We selected 47 Hispanic-American families from the NIMH Repository and Genomics Resource (NRGR) schizophrenia data repository. We regenotyped all individuals with DNA available from the NRGR on the Affymetrix Lat Array. After optimizing SNP selection for inclusion on the linkage map, we compared information content (IC) and linkage results using MSAT, SNP and MSAT+SNP maps.
As expected, SNP provided higher average IC (0.78, s.d. 0.03) than MSAT (0.51, s.d. 0.10), in a direct “apples-to-apples” comparison using only individuals genotyped on both platforms; while MSAT+SNP provided only slightly higher IC (0.82, s.d. 0.03). However, when utilizing all available individuals, including those who had available genotypes on only one platform, IC was substantially increased using MSAT+SNP (0.76, s.d. 0.05) compared to SNP (0.61, s.d. 0.02). Linkage results changed appreciably between MSAT and MSAT+SNP, in terms of magnitude, rank ordering and localization of peaks.
Regenotyping older family data can substantially alter the conclusions of
Linkage analysis; Data repositories; microsatellites; SNPs; Information Content; PPL
The study of rare variants, which can potentially explain a great proportion of heritability, has emerged as an important topic in human gene mapping of complex diseases. Although several statistical methods have been developed to increase the power to detect disease-related rare variants, none of these methods address an important issue that often arises in genetic studies: false positives due to population stratification. Using simulations, we investigated the impact of population stratification on false-positive rates of rare-variant association tests.
We simulated a series of case-control studies assuming various sample sizes and levels of population structure. Using such data, we examined the impact of population stratification on rare-variant collapsing and burden tests of rare variation. We further evaluated the ability of two existing methods (principal component analysis and genomic control) to correct for stratification in such rare-variant studies.
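Of the two correction methods named above, genomic control is the simpler to sketch: the inflation factor λ is estimated as the median observed 1-df chi-square statistic divided by the median of the null chi-square distribution (about 0.4549), and each statistic is divided by λ. The code below is a minimal illustration under assumptions, not the simulation code used in this study: the function names are hypothetical, and leaving statistics unchanged when λ < 1 is a common convention rather than something stated in the source.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation_factor(chi2_stats):
    """Estimate lambda as median(observed statistics) / median(null chi-square, 1 df)."""
    if not chi2_stats:
        raise ValueError("at least one test statistic is required")
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control_adjust(chi2_stats):
    """Divide every statistic by lambda; no adjustment is made when lambda < 1."""
    lam = max(1.0, genomic_inflation_factor(chi2_stats))
    return [stat / lam for stat in chi2_stats]


# Made-up statistics whose median is exactly twice the null median.
inflated_stats = [0.2, 2 * CHI2_1DF_MEDIAN, 3.0]
lam = genomic_inflation_factor(inflated_stats)
adjusted_stats = genomic_control_adjust(inflated_stats)
```

Because a single λ rescales every test by the same amount, the adjustment can be too strict for markers that are little affected by stratification, which is one way a genomic control correction can turn out conservative.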
We found that population stratification can have a significant influence on studies of rare variants especially when sample size is large and the population is severely stratified. Our results showed that principal component analysis performed quite well in most situations while genomic control often yielded conservative results.
Our results imply that researchers need to carefully match cases and controls on ancestry in order to avoid false positives caused by population structure in studies of rare variants, particularly if genome-wide data are not available.
Rare Variants; Population Stratification; Genomic Control; Principal Component Analysis
We propose and compare methods of analysis for detecting associations between genotypes of a single nucleotide polymorphism (SNP) and a dichotomous secondary phenotype (X), when the data arise from a case-control study of a primary dichotomous phenotype (D), which is not rare. We considered both a dichotomous genotype (G) as in recessive or dominant models, and an additive genetic model based on the number of minor alleles present. To estimate the log odds ratio, β1, relating X to G in the general population, one needs to understand the conditional distribution [D∣X,G] in the general population. For the most general model, [D∣X,G], one needs external data on P(D=1) to estimate β1. We show that for this “full model”, maximum likelihood (FM) corresponds to a previously proposed weighted logistic regression (WL) approach if G is dichotomous. For the additive model, WL yields results numerically close, but not identical, to those of the maximum likelihood, FM. Efficiency can be gained by assuming that [D∣X,G] is a logistic model with no interaction between X and G (the “reduced model”). However, the resulting maximum likelihood (RM) can be misleading in the presence of interactions. We therefore propose an adaptively weighted approach (AW) that captures the efficiency of RM but is robust to the occasional SNP that might interact with the secondary phenotype to affect risk of the primary disease. We study the robustness of FM, WL, RM and AW to misspecification of P(D=1). In principle, one should be able to estimate β1 without external information on P(D=1) under the reduced model. However, our simulations show that the resulting inference is unreliable. Therefore, in practice one needs to introduce external information on P(D=1), even in the absence of interactions between X and G.
adaptively weighted; case-control study; genome-wide association study; maximum likelihood; secondary phenotype
Biological pathways provide rich information and biological context on the genetic causes of complex diseases. The logistic kernel machine test integrates prior knowledge on pathways in order to analyze data from genome-wide association studies (GWAS). Here, the kernel converts genomic information of two individuals to a quantitative value reflecting their genetic similarity. With the selection of the kernel one implicitly chooses a genetic effect model. Like many other pathway methods, none of the available kernels accounts for topological structure of the pathway or gene-gene interaction types. However, evidence indicates that connectivity and neighborhood of genes are crucial in the context of GWAS, because genes associated with a disease often interact. Thus, we propose a novel kernel that incorporates the topology of pathways and information on interactions. Using simulation studies, we demonstrate that the proposed method maintains the type I error correctly and can be more effective in the identification of pathways associated with a disease than non-network-based methods. We apply our approach to genome-wide association case control data on lung cancer and rheumatoid arthritis. We identify some promising new pathways associated with these diseases, which may improve our current understanding of the genetic mechanisms.
Kernel Machine Test; Pathways; Networks; Gene-Gene Interactions; Score Test; Generalized Linear Model; Lung Cancer; Rheumatoid Arthritis; Disease Association; Genetic Association Studies
The present study identified genetic predictors of weight change during behavioral weight loss treatment.
Participants were 3,899 overweight/obese individuals with type 2 diabetes from Look AHEAD, a randomized controlled trial to determine the effects of intensive lifestyle intervention (ILI), including weight loss and physical activity, relative to diabetes support and education, on cardiovascular outcomes. Analyses focused on associations of single nucleotide polymorphisms (SNPs) on the Illumina CARe iSelect (IBC) chip (minor allele frequency >5%; n = 31,959) with weight change at year 1 and year 4, and weight regain at year 4, among individuals who lost ≥ 3% at year 1.
Two novel regions of significant chip-wide association with year-1 weight loss in ILI were identified (p < 2.96E-06). ABCB11 rs484066 was associated with 1.16 kg higher weight per minor allele at year 1, whereas TNFRSF11A, or RANK, rs17069904 was associated with 1.70 kg lower weight per allele at year 1.
This study, the largest to date on genetic predictors of weight loss and regain, indicates that SNPs within ABCB11, related to bile salt transfer, and TNFRSF11A, implicated in adipose tissue physiology, predict the magnitude of weight loss during behavioral intervention. These results provide new insights into potential biological mechanisms and may ultimately inform weight loss treatment.
Type 2 diabetes; Obesity; Weight loss; Diet; Genetics
A gene-based genome-wide association study (GWAS) provides a powerful alternative to the traditional single SNP association analysis due to its substantial reduction in the multiple testing burden and possible gain in power due to modeling multiple SNPs within a gene. A gene-based association analysis on multivariate traits is often of interest, but imposes substantial analytical as well as computational challenges to implement it at a genome-wide level.
We have proposed a rapid implementation of a multivariate multiple linear regression approach (RMMLR) in unrelated individuals as well as in families. Our approach allows for covariates. Moreover, the asymptotic distribution of the test statistic is not heavily influenced by the linkage disequilibrium (LD) among the SNPs and hence can be used efficiently to perform a gene-based GWAS. We have developed a corresponding R package to implement such multivariate gene-based GWAS with this RMMLR approach.
We compare several approaches for both single and multivariate traits through extensive simulation. Our RMMLR maintains the correct type I error level even for sets of SNPs in strong LD. It also has a substantial gain in power to detect a gene when it is associated with a subset of the traits. We have also studied the performance of these approaches on the Minnesota Center for Twin Family Research dataset.
In our overall comparison, our RMMLR approach provides an efficient and powerful tool to perform a gene-based GWAS with single or multivariate traits and maintains the type I error appropriately.
Multivariate regression; Gene-based genome-wide association studies; Multivariate trait
The process of the colonization of the New World that occurred centuries ago served as a natural experiment, creating unique combinations of genetic material in newly formed admixed populations. The identification and genotyping of ancestry informative markers (AIMs) have allowed for the estimation of proportions of ancestral parental populations among individuals in a sample through the genetic admixture approach. These admixture estimates have been used in different ways to understand the genetic contributions to individual variation in obesity and body composition parameters, particularly among diverse admixed groups known to differ in obesity prevalence within the United States. Although progress has been made through the use of genetic admixture approaches, future investigations are needed in order to explore the interaction of environmental factors with the degree of genetic ancestry in individuals. A challenge to confront at this time would be to further stratify and define environments in progressively more granular terms, including nutrients, muscle biology, stress responses at the cellular level, and the social and built environments.
Genetic admixture; obesity; body composition; race/ethnicity; Ancestry Informative Markers
Obesity is a major contributor to the global burden of chronic disease and disability, though current knowledge of causal biologic underpinnings is lacking. Through the regulation of energy homeostasis and interactions with adiposity and gut signals, the brain is thought to play a significant role in the development of this disorder. While neuroanatomic variation has been associated with obesity, it is unclear if this relationship is influenced by common genetic mechanisms. In this study, we sought genetic components that influence both brain anatomy and body mass index (BMI) to provide further insight into the role of the brain in energy homeostasis and obesity.
MRI images of brain anatomy were acquired in 839 Mexican American individuals from large extended pedigrees. Bivariate linkage and quantitative analyses were performed in SOLAR.
Genetic factors associated with increased BMI were also associated with reduced cortical surface area and subcortical volume. We identified two genome-wide quantitative trait loci that influenced BMI and ventral diencephalon volume, and BMI and supramarginal gyrus surface area, respectively.
This study represents the first genetic analyses seeking evidence of pleiotropic effects acting on both brain anatomy and BMI. Results suggest that a region on chromosome 17 contributes to the development of obesity, potentially through leptin-induced signaling in the hypothalamus, and that a region on chromosome 3 appears to jointly influence food-related reward circuitry and the supramarginal gyrus.
BMI; obesity; imaging; brain; pleiotropy
Genome-wide association studies (GWAS) have led to the identification of single nucleotide polymorphisms in or near several loci that are associated with the risk of obesity and nonalcoholic fatty liver disease (NAFLD). We hypothesized that missense variants in GWAS and related candidate genes may underlie cases of extreme obesity and NAFLD-related cirrhosis, an extreme manifestation of NAFLD.
We performed whole-exome sequencing on 6 Caucasian patients with extreme obesity [mean body mass index (BMI) 84.4] and 4 obese Caucasian patients (mean BMI 57.0) with NAFLD-related cirrhosis.
Sequence analysis was performed on 24 replicated GWAS and selected candidate obesity genes and 5 loci associated with NAFLD. No missense variants were identified in 19 of the 29 genes analyzed, although all patients carried at least 2 missense variants in the remaining genes without excess homozygosity. One patient with extreme obesity carried 2 novel damaging mutations in BBS1 and was homozygous for benign and damaging MC3R variants. In addition, 1 patient with NAFLD-related cirrhosis was compound heterozygous for rare damaging mutations in PNPLA3.
These results indicate that analyzing candidate loci previously identified by GWAS analyses using whole-exome sequencing is an effective strategy to identify potentially causative missense variants underlying extreme obesity and NAFLD-related cirrhosis.
Extreme obesity; Nonalcoholic fatty liver disease; Cirrhosis; Genome-wide association studies; Whole-exome sequencing
To quantify the extent to which the increase in obesity observed across recent generations of the American population is associated with the individual or combined effects of assortative mating for body mass index (BMI; kg/m²) and differential realized fertility by BMI.
A Monte Carlo framework is formed and informed using data collected from the National Longitudinal Survey of Youth (NLSY). The model has two portions, one that generates childbirth events on an annual basis and another that produces a BMI for each child. Once the model is informed using the data, a reference distribution of offspring BMIs is simulated. We quantify the effects of our factors of interest by removing them from the model and comparing the resulting offspring BMI distributions with that of the baseline scenario.
An association between maternal BMI and number of offspring is evidenced in the NLSY data, as well as the presence of assortative mating. These two factors combined are associated with increased mean BMI (+0.067, C.I. [0.056, 0.078]), increased BMI variance (+0.578, C.I. [0.418, 0.736]) and increased prevalence of obesity (RR 1.032, 95% C.I. [1.023, 1.041]) and BMIs over 40 (RR 1.083, 95% C.I. [1.053, 1.118]) among offspring.
Our investigation suggests that both differential realized fertility and assortative mating by BMI appear to play a role in the increasing prevalence of obesity in America.
Obesity; Body Mass Index; Assortative Mating; Realized Fertility; Monte Carlo Simulation
To test the hypothesis that the statistical effect of obesity-related genetic variants on adulthood adiposity traits depends on birth year.
The study sample included 907 related, non-Hispanic White participants in the Fels Longitudinal Study, born between 1901 and 1986, and aged 25–64.99 years (474 females; 433 males) at the time of measurement. All had both genotype data, from which a genetic risk score (GRS) composed of 32 well-replicated obesity-related common single nucleotide polymorphisms was created, and phenotype data (including body mass index (BMI), waist circumference, and the sum of four subcutaneous skinfolds). Maximum likelihood-based variance components analysis was used to estimate trait heritabilities, main effects of GRS and birth year, GRS-by-birth year interaction, sex, and age.
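A genetic risk score of this kind is often computed as a simple count of risk alleles across the selected SNPs. The sketch below assumes an unweighted count, which the source does not state explicitly; the function name, the SNP identifiers and the genotype values are hypothetical placeholders rather than study data.

```python
def genetic_risk_score(risk_allele_counts, snp_ids):
    """Unweighted GRS: total number of risk alleles carried across `snp_ids`.

    `risk_allele_counts` maps a SNP identifier to 0, 1, or 2 risk alleles.
    Illustrative helper only; it refuses to impute missing genotypes.
    """
    score = 0
    for snp in snp_ids:
        if snp not in risk_allele_counts:
            raise KeyError(f"missing genotype for {snp}")
        count = risk_allele_counts[snp]
        if count not in (0, 1, 2):
            raise ValueError(f"{snp}: risk allele count must be 0, 1, or 2")
        score += count
    return score


# Hypothetical three-SNP panel; a 32-SNP score would range from 0 to 64.
panel = ["snp_a", "snp_b", "snp_c"]
participant = {"snp_a": 2, "snp_b": 0, "snp_c": 1}
grs = genetic_risk_score(participant, panel)
```

Under this definition a one-allele increase in GRS corresponds to carrying one additional risk allele at any SNP in the panel.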
Positive GRS-by-birth year interaction effects were found for BMI (p<0.001), waist circumference (p=0.007), and skinfold thickness (p<0.007). For example, each one-allele increase in GRS was estimated to result in a 0.16 kg/m² increase in BMI among males born in 1930 compared to a 0.47 kg/m² increase among those born in 1970.
These novel findings suggest the influence of common obesity susceptibility variants has increased during the obesity epidemic.
gene; genetic; heritability; risk score; obesity; BMI; adiposity; waist circumference; interaction; gene-by-environment interaction; secular trend; single nucleotide polymorphism (SNP)
Type 2 diabetes (T2DM) is a complex metabolic disease and is more prevalent in certain ethnic groups such as Mexican Americans. The goal of our study was to perform a genome-wide linkage analysis to localize T2DM susceptibility loci in Mexican Americans.
We used the phenotypic and genotypic data from 1,122 Mexican American individuals (307 families) who participated in the Veterans Administration Genetic Epidemiology Study (VAGES). Genome-wide linkage analysis was performed, using the variance components approach. Data from two additional Mexican American family studies, the San Antonio Family Heart Study (SAFHS) and the San Antonio Family Diabetes/Gallbladder Study (SAFDGS), were combined with the VAGES data to test for improved linkage evidence.
After adjusting for covariate effects, T2DM was found to be under significant genetic influence (h² = 0.62, P = 2.7 × 10⁻⁶). The strongest evidence for linkage of T2DM occurred between markers D9S1871 and D9S2169 on chromosome 9p24.2-p24.1 (LOD = 1.8). Because we had previously reported suggestive evidence for linkage of T2DM at this region in SAFDGS, we performed genome-wide linkage analysis of the VAGES data combined with SAFHS and SAFDGS data and found significant and increased linkage evidence (LOD = 4.3, empirical P = 1.0 × 10⁻⁵, genome-wide P = 1.6 × 10⁻³) for T2DM at the same chromosomal region.
Significant T2DM linkage evidence was found on chromosome 9p24 in Mexican Americans. Importantly, the chromosomal region of interest in this study overlaps with several recent genome-wide association studies (GWASs) involving T2DM related traits. Given its overlap with such findings and our own initial T2DM association findings in the 9p24 chromosomal region, high throughput sequencing of the linked chromosomal region could identify the potential causal T2DM genes.
Type 2 diabetes; Linkage; Chromosome 9p24; Mexican Americans; VAGES
Custom genotyping of markers in families with Familial Idiopathic Scoliosis (FIS) was used to fine-map candidate regions on chromosomes 9 and 16 in order to identify candidate genes that contribute to this disorder and prioritize them for next generation sequence analysis.
Candidate regions on 9q and 16p–16q, previously identified as linked to FIS in a study of 202 families, were genotyped with a high-density map of single nucleotide polymorphisms (SNPs). Tests of linkage for fine-mapping and intra-familial tests of association, including tiled regression, were performed on scoliosis as both a qualitative and quantitative trait.
Nominally significant linkage results were found for markers in both candidate regions. Results from intra-familial tests of association and tiled regression corroborated the linkage findings and identified possible candidate genes suitable for follow-up with next generation sequencing in these same families. Candidate genes that met our prioritization criteria included FAM129B and CERCAM on chromosome 9 and SYT1, GNAO1, and CDH3 on chromosome 16.
idiopathic scoliosis; chromosome 9q; chromosome 16; genetic heterogeneity; genetics; association; family-based association study; complex disease
A chronic disease such as asthma is the result of a complex sequence of biological interactions involving multiple genes and pathways in response to a multitude of environmental exposures. However, methods to model jointly all factors are still evolving. Some of the current challenges include how to integrate knowledge from different data types and different disciplines, as well as how to utilize relevant external information such as gene annotation to identify novel disease genes and gene-environment interactions.
Using a Bayesian hierarchical modeling framework, we developed two alternative methods for joint analysis of an epidemiologic study of a disease endpoint and an experimental study of intermediate phenotypes, while incorporating external information.
Our simulation studies demonstrated superior performance of the proposed hierarchical models compared to separate analysis with the standard single-level regression modeling approach. The combined analyses of the Southern California Children's Health Study and challenge study data suggest that these joint analytical methods detected more significant genetic main and gene-environment interaction effects than the conventional analysis.
The proposed prior framework is very flexible and can be generalized for an integrative analysis of diverse sources of relevant biological data.
Bayesian hierarchical modeling; Biological related studies; Data integration; Gene-environment interaction; Joint analysis; Markov-chain Monte Carlo (MCMC) methods; Prior knowledge
Given the increasing scale of rare variant association studies, we introduce a method for high-dimensional studies that integrates multiple sources of data and allows for multiple region-specific risk indices.
Our method builds upon the previous Bayesian risk index (BRI) by integrating external biological variant-specific covariates to help guide the selection of associated variants and regions. Our extension also incorporates a second-level of uncertainty as to which regions are associated with the outcome of interest.
Using a set of study-based simulations, we show that our approach leads to an increase in power to detect true associations in comparison to several commonly used alternatives. Additionally, the method provides multi-level inference at the pathway, region and variant levels.
To demonstrate the flexibility of the method to incorporate various types of information and its applicability to high-dimensional data, we apply our method to a single region within a candidate gene study of second primary breast cancer and to multiple regions within a candidate pathway study of colon cancer.
genetic association studies; Bayesian model uncertainty; Bayes factors; sequence analysis; rare variant analysis
A major concern of resequencing studies is that the pathogenicity of most mutations is difficult to predict. To address this concern, linkage (i.e. co-segregation) is often used to exclude mutations, and to better predict pathogenicity among the candidate mutations that remain. However, when linkage disequilibrium (LD) is present in the population but ignored in the analysis, unlinked regions of high LD can provide false evidence for linkage. As a result, the type 1 error of most linkage tests can be inflated, and thousands of neutral mutations may be mistakenly included in a follow-up resequencing study. To illustrate the need for concern, we simulated data on a sparsely spaced panel of SNPs (average spacing 1.27 cM) using an LD pattern estimated from real data. In our simulations, we find that the type 1 error of the maximum LOD can be as high as 14%. Therefore, to control the type 1 error of linkage tests across a wide range of study designs, we created Haplodrop—a fast and flexible simulation program that generates the haplotypes of founders with LD and then ‘drops’ these haplotypes with recombination to all non-founders in the pedigree. Haplodrop agrees well with existing software, accommodates arbitrary pedigree structures, and scales easily to the whole genome. Moreover, by correctly excluding mutations that lie in unlinked regions of high LD, Haplodrop should help reduce the multiple testing burden of resequencing studies.
Type I error; linkage analysis; next-generation sequencing; linkage disequilibrium
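The gene-dropping step described in the Haplodrop abstract above can be illustrated with a minimal sketch. This is not Haplodrop's implementation: the function names `recombine` and `gene_drop`, the pedigree encoding, and the use of Haldane's map function are assumptions made for illustration. Founder haplotypes are taken as given input here, whereas Haplodrop generates them with LD.

```python
import math
import random


def recombine(hap_a, hap_b, positions_cm, rng):
    """Build one gamete from a parent's two haplotypes.

    A crossover between adjacent markers occurs with probability
    r = 0.5 * (1 - exp(-2d)), d in Morgans (Haldane's map function;
    an assumption of this sketch).
    """
    haps = (hap_a, hap_b)
    current = rng.choice([0, 1])  # haplotype copied at the first marker
    gamete = [haps[current][0]]
    for i in range(1, len(positions_cm)):
        d = (positions_cm[i] - positions_cm[i - 1]) / 100.0
        r = 0.5 * (1.0 - math.exp(-2.0 * d))
        if rng.random() < r:
            current = 1 - current  # switch to the other haplotype
        gamete.append(haps[current][i])
    return gamete


def gene_drop(pedigree, founder_haps, positions_cm, seed=0):
    """Drop founder haplotypes through a pedigree.

    pedigree: list of (id, father_id, mother_id), parents listed before
              their children; founders have None for both parents.
    founder_haps: dict mapping founder id -> (haplotype1, haplotype2).
    Returns a dict mapping every id -> (paternal, maternal) haplotypes.
    """
    rng = random.Random(seed)
    haps = {}
    for ind, father, mother in pedigree:
        if father is None and mother is None:
            haps[ind] = founder_haps[ind]
        else:
            paternal = recombine(*haps[father], positions_cm, rng)
            maternal = recombine(*haps[mother], positions_cm, rng)
            haps[ind] = (paternal, maternal)
    return haps


# Hypothetical trio with four markers spaced 1.27 cM apart.
trio = [("dad", None, None), ("mom", None, None), ("kid", "dad", "mom")]
founders = {
    "dad": (["F1"] * 4, ["F2"] * 4),
    "mom": (["M1"] * 4, ["M2"] * 4),
}
dropped = gene_drop(trio, founders, [0.0, 1.27, 2.54, 3.81], seed=1)
```

Labelling founder alleles by origin, as in the trio above, makes it easy to see where crossovers fell in each transmitted haplotype. In a null simulation for linkage, the founder haplotypes would instead carry real alleles sampled with the desired LD pattern.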
Identifying drivers of complex traits from the noisy signals of genetic variation obtained from high throughput genome sequencing technologies is a central challenge faced by human geneticists today. We hypothesize that the variants involved in complex diseases are likely to exhibit non-neutral evolutionary signatures. Uncovering the evolutionary history of all variants is therefore of intrinsic interest for complex disease research. However, doing so necessitates the simultaneous elucidation of the targets of natural selection and population-specific demographic history.
Here we characterize the action of natural selection operating across complex disease categories, and use population genetic simulations to evaluate the expected patterns of genetic variation in large samples. We focus on populations that have experienced historical bottlenecks followed by explosive growth (consistent with most human populations), and describe the differences between evolutionarily deleterious mutations and those that are neutral.
Genes associated with several complex disease categories exhibit stronger signatures of purifying selection than non-disease genes. In addition, loci identified through genome-wide association studies of complex traits also exhibit signatures consistent with being in regions recurrently targeted by purifying selection. Through simulations, we show that population bottlenecks and rapid growth enable deleterious rare variants to persist at low frequencies just as long as neutral variants, whereas deleterious low-frequency and common variants tend to be much younger than their neutral counterparts. As a result, a large proportion of modern-day rare alleles have a deleterious effect on function and potentially contribute to disease susceptibility.
The key question for sequencing-based association studies of complex traits is how to distinguish between deleterious and benign genetic variation. We used population genetic simulations to uncover patterns of genetic variation that distinguish these two categories, especially derived allele age, thereby providing inroads into novel methods for characterizing rare genetic variation driving complex diseases.
Natural selection; deleterious; simulation; population genetics; rare variants
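The kind of population genetic simulation described in the abstract above (a bottleneck followed by rapid growth, comparing neutral and deleterious mutations by allele age) can be sketched as a forward Wright-Fisher model. This is a minimal illustration, not the authors' simulation: all function names and parameter values are hypothetical, selection is genic with coefficient `s` against the derived allele, and exactly one new mutation is introduced per generation regardless of population size, so the output is qualitative only.

```python
import random


def demography(n_bottleneck, bottleneck_gens, growth_rate, growth_gens):
    """Diploid population size per generation: bottleneck, then exponential growth."""
    sizes = [n_bottleneck] * bottleneck_gens
    n = float(n_bottleneck)
    for _ in range(growth_gens):
        n *= 1.0 + growth_rate
        sizes.append(int(round(n)))
    return sizes


def next_count(count, n_now, n_next, s, rng):
    """One Wright-Fisher generation: selection, then binomial sampling."""
    p = count / (2 * n_now)
    p_sel = p * (1.0 - s) / (1.0 - s * p)  # derived-allele frequency after selection
    # Binomial(2 * n_next, p_sel) drawn as a sum of Bernoulli trials.
    return sum(rng.random() < p_sel for _ in range(2 * n_next))


def segregating_ages(sizes, s, seed=0):
    """Follow one new mutation per generation (independent loci) to the present.

    Returns (age_in_generations, present_frequency) for every mutation
    that is still segregating in the final generation.
    """
    rng = random.Random(seed)
    last = len(sizes) - 1
    results = []
    for birth in range(last):
        count = 1  # a single new copy arises in generation `birth`
        lost_or_fixed = False
        for g in range(birth, last):
            count = next_count(count, sizes[g], sizes[g + 1], s, rng)
            if count == 0 or count == 2 * sizes[g + 1]:
                lost_or_fixed = True
                break
        if not lost_or_fixed:
            results.append((last - birth, count / (2 * sizes[last])))
    return results


# Hypothetical small example: 5 bottleneck generations, then 10% growth for 10.
sizes = demography(n_bottleneck=50, bottleneck_gens=5, growth_rate=0.1, growth_gens=10)
neutral = segregating_ages(sizes, s=0.0, seed=1)
deleterious = segregating_ages(sizes, s=0.05, seed=1)
```

Comparing the ages in `neutral` and `deleterious` within frequency bins, over many replicates and a realistic size history, is the sort of contrast the abstract refers to. A serious analysis would scale mutational input with population size and use a dedicated forward simulator.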