Genome-wide association studies (GWAS) based on linkage disequilibrium (LD) provide a promising tool for the detection and fine mapping of quantitative trait loci (QTL) underlying complex agronomic traits. In this study we explored the genetic basis of variation for the traits heading date, plant height, thousand grain weight, starch content and crude protein content in a diverse collection of 224 spring barleys of worldwide origin. The whole panel was genotyped with a customized oligonucleotide pool assay containing 1536 SNPs using Illumina's GoldenGate technology resulting in 957 successful SNPs covering all chromosomes. The morphological trait "row type" (two-rowed spike vs. six-rowed spike) was used to confirm the high level of selectivity and sensitivity of the approach. This study describes the detection of QTL for the above mentioned agronomic traits by GWAS.
Population structure in the panel was investigated by various methods, and six subgroups were identified that are mainly based on spike morphology and region of origin. We explored the patterns of linkage disequilibrium (LD) in the whole panel for all seven barley chromosomes. Average LD was observed to decay below a critical level (r² value of 0.2) within a map distance of 5-10 cM. Phenotypic variation within the panel was reasonably large for all the traits. The heritabilities calculated for each trait over multi-environment experiments ranged from 0.90 to 0.95. Different statistical models were tested to control spurious LD caused by population structure and to calculate the P-value of marker-trait associations. Using a mixed linear model with kinship for controlling spurious LD effects, we found a total of 171 significant marker-trait associations, which delineate into 107 QTL regions. Across all traits these can be grouped into 57 novel QTL and 50 QTL that are congruent with previously mapped QTL positions.
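The LD decay statement above can be illustrated with a minimal sketch. It assumes phased 0/1 alleles (reasonable for inbred lines) and uses the standard r² definition; the function names, the 5 cM bin width, and the toy data are hypothetical and are not taken from the study.

```python
def ld_r2(hap_a, hap_b):
    """r^2 between two biallelic loci, given 0/1 alleles per haplotype."""
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype vectors must be non-empty and equal length")
    p_a = sum(hap_a) / n                      # frequency of allele 1 at locus A
    p_b = sum(hap_b) / n                      # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denom


def decay_bin(pairs, threshold=0.2, bin_width=5.0):
    """Return (low, high) of the first distance bin whose mean r^2 is below
    the threshold, or None. `pairs` holds (distance_cM, r2) tuples."""
    bins = {}
    for dist, r2 in pairs:
        bins.setdefault(int(dist // bin_width), []).append(r2)
    for idx in sorted(bins):
        if sum(bins[idx]) / len(bins[idx]) < threshold:
            return (idx * bin_width, (idx + 1) * bin_width)
    return None


# Toy marker pairs (illustrative values only).
toy_pairs = [(1.0, 0.6), (3.0, 0.4), (6.0, 0.15), (12.0, 0.05)]
print(ld_r2([0, 0, 1, 1], [0, 0, 1, 1]), decay_bin(toy_pairs))
```

Binning by map distance is only one way to summarize decay; fitting a smooth curve to the pairwise r² values is a common alternative.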
Our results demonstrate that the described diverse barley panel can be efficiently used for GWAS of various quantitative traits, provided that population structure is appropriately taken into account. The observed significant marker trait associations provide a refined insight into the genetic architecture of important agronomic traits in barley. However, individual QTL account only for a small portion of phenotypic variation, which may be due to insufficient marker coverage and/or the elimination of rare alleles prior to analysis. The fact that the combined SNP effects fall short of explaining the complete phenotypic variance may support the hypothesis that the expression of a quantitative trait is caused by a large number of very small effects that escape detection. Notwithstanding these limitations, the integration of GWAS with biparental linkage mapping and an ever increasing body of genomic sequence information will facilitate the systematic isolation of agronomically important genes and subsequent analysis of their allelic diversity.
Genome-wide association studies (GWAS) yielded significant advances in defining the genetic architecture of complex traits and disease. Still, a major hurdle of GWAS is narrowing down multiple genetic associations to a few causal variants for functional studies. This becomes critical in multi-phenotype GWAS where detection and interpretability of complex SNP(s)-trait(s) associations are complicated by complex Linkage Disequilibrium patterns between SNPs and correlation between traits. Here we propose a computationally efficient algorithm (GUESS) to explore complex genetic-association models and maximize genetic variant detection. We integrated our algorithm with a new Bayesian strategy for multi-phenotype analysis to identify the specific contribution of each SNP to different trait combinations and study genetic regulation of lipid metabolism in the Gutenberg Health Study (GHS). Despite the relatively small size of GHS (n = 3,175), when compared with the largest published meta-GWAS (n>100,000), GUESS recovered most of the major associations and was better at refining multi-trait associations than alternative methods. Amongst the new findings provided by GUESS, we revealed a strong association of SORT1 with TG-APOB and LIPC with TG-HDL phenotypic groups, which were overlooked in the larger meta-GWAS and not revealed by competing approaches, associations that we replicated in two independent cohorts. Moreover, we demonstrated the increased power of GUESS over alternative multi-phenotype approaches, both Bayesian and non-Bayesian, in a simulation study that mimics real-case scenarios. We showed that our parallel implementation based on Graphics Processing Units outperforms alternative multi-phenotype methods. Beyond multivariate modelling of multi-phenotypes, our Bayesian model employs a flexible hierarchical prior structure for genetic effects that adapts to any correlation structure of the predictors and increases the power to identify associated variants. 
This provides a powerful tool for the analysis of diverse genomic features, for instance including gene expression and exome sequencing data, where complex dependencies are present in the predictor space.
Nowadays, the availability of cheaper and accurate assays to quantify multiple (endo)phenotypes in large population cohorts allows multi-trait studies. However, these studies are limited by the lack of flexible models integrated with efficient computational tools for genome-wide multi SNPs-traits analyses. To overcome this problem, we propose a novel Bayesian analysis strategy and a new algorithmic implementation which exploits parallel processing architecture for fully multivariate modeling of groups of correlated phenotypes at the genome-wide scale. In addition to increased power of our algorithm over alternative Bayesian and well-established non-Bayesian multi-phenotype methods, we provide an application to a real case study of several blood lipid traits, and show how our method recovered most of the major associations and is better at refining multi-trait polygenic associations than alternative methods. We reveal and replicate in independent cohorts new associations with two phenotypic groups that were not detected by competing multivariate approaches and not noticed by a large meta-GWAS. We also discuss the applicability of the proposed method to large meta-analyses involving hundreds of thousands of individuals and to diverse genomic datasets where complex dependencies in the predictor space are present.
Association mapping is a powerful approach for dissecting the genetic architecture of complex quantitative traits using high-density SNP markers in maize. Here, we expanded our association panel size from 368 to 513 inbred lines with 0.5 million high-quality SNPs using a two-step data-imputation method which combines identity-by-descent (IBD) based projection and the k-nearest neighbor (KNN) algorithm. Genome-wide association studies (GWAS) were carried out for 17 agronomic traits with a panel of 513 inbred lines applying both a mixed linear model (MLM) and a new method, the Anderson-Darling (A-D) test. Ten loci for five traits were identified using the MLM method at the Bonferroni-corrected threshold −log10(P) > 5.74 (α = 1). Using the A-D test at the Bonferroni-corrected threshold −log10(P) > 7.05 (α = 0.05) with 556,809 SNPs, between one and 34 loci per trait (and 107 loci for plant height) were identified for the 17 traits. Many known loci and new candidate loci were only observed by the A-D test, a few of which were also detected in independent linkage analysis. This study indicates that combining IBD based projection and the KNN algorithm is an efficient imputation method for inferring large missing genotype segments. In addition, we showed that the A-D test is a useful complement for GWAS analysis of complex quantitative traits. Especially for traits with abnormal phenotype distribution, controlled by moderate effect loci or rare variations, the A-D test balances false positives and statistical power. The candidate SNPs and associated genes also provide a rich resource for maize genetics and breeding.
Genotype imputation has been used widely in the analysis of genome-wide association studies (GWAS) to boost power and fine-map associations. We developed a two-step data imputation method to meet the challenge of a large proportion of missing genotypes. GWAS have uncovered an extensive genetic architecture of complex quantitative traits using high-density SNP markers in maize in the past few years. Here, GWAS were carried out for 17 agronomic traits with a panel of 513 inbred lines applying both a mixed linear model and a new method, the Anderson-Darling (A-D) test. We intend to show that the A-D test is a complement to current GWAS methods, especially for complex quantitative traits controlled by moderate effect loci or rare variations and with abnormal phenotype distribution. In addition, the trait-associated QTL identified here provide a rich resource for maize genetics and breeding.
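The two Bonferroni thresholds quoted above follow directly from the number of SNPs tested. The short sketch below recomputes them; the function name is hypothetical, and the only inputs are the α values and the SNP count given in the text.

```python
import math


def bonferroni_neg_log10(alpha, n_tests):
    """Bonferroni-corrected significance threshold on the -log10(P) scale."""
    if alpha <= 0 or n_tests <= 0:
        raise ValueError("alpha and n_tests must be positive")
    return -math.log10(alpha / n_tests)


N_SNPS = 556809  # SNP count stated in the abstract

mlm_threshold = bonferroni_neg_log10(1.0, N_SNPS)   # alpha = 1
ad_threshold = bonferroni_neg_log10(0.05, N_SNPS)   # alpha = 0.05
print(f"{mlm_threshold:.3f} {ad_threshold:.3f}")
```

Both values come out just above the quoted cutoffs of 5.74 and 7.05, which suggests the published thresholds are these quantities reported to two decimals.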
Recent advances in high-throughput genotyping and transcript profiling technologies have enabled the inexpensive production of genome-wide dense marker maps in tandem with huge amounts of expression profiles. These large-scale data encompass valuable information about the genetic architecture of important phenotypic traits. Comprehensive models that combine molecular markers and gene transcript levels are increasingly advocated as an effective approach to dissecting the genetic architecture of complex phenotypic traits. The simultaneous utilization of marker and gene expression data to explain the variation in a clinical quantitative trait, known as clinical quantitative trait locus (cQTL) mapping, poses challenges that are both conceptual and computational. Nonetheless, the hierarchical Bayesian (HB) modeling approach, in combination with modern computational tools such as Markov chain Monte Carlo (MCMC) simulation techniques, provides much versatility for cQTL analysis. Sillanpää and Noykova (2008) developed an HB model for single-trait cQTL analysis in inbred line cross data using molecular markers, gene expressions, and marker-gene expression pairs. However, clinical traits generally relate to one another through environmental correlations and/or pleiotropy. A multi-trait approach can improve the power to detect genetic effects and the precision of their estimates. A multi-trait model also provides a framework for examining a number of biologically interesting hypotheses. In this paper we extend the HB cQTL model for inbred line crosses proposed by Sillanpää and Noykova to a multi-trait setting. We illustrate the implementation of our new model with simulated data, and evaluate the multi-trait model performance with regard to its single-trait counterpart.
The data simulation process was based on the multi-trait cQTL model, assuming three traits with uncorrelated and correlated cQTL residuals, with the simulated data under uncorrelated cQTL residuals serving as our test set for comparing the performances of the multi-trait and single-trait models. The simulated data under correlated cQTL residuals were essentially used to assess how well our new model can estimate the cQTL residual covariance structure. The model fitting to the data was carried out by MCMC simulation through OpenBUGS. The multi-trait model outperformed its single-trait counterpart in identifying cQTLs, with a consistently lower false discovery rate. Moreover, the covariance matrix of cQTL residuals was typically estimated to an appreciable degree of precision under the multi-trait cQTL model, making our new model a promising approach to addressing a wide range of issues facing the analysis of correlated clinical traits.
Bayesian multilevel modeling; genetic architecture; linked marker-expression pairs; pleiotropy
In livestock, as in humans, the number of genetic variants that can be tested for association with complex quantitative traits, or used in genomic predictions, is increasing exponentially as whole genome sequencing becomes more common. The power to identify variants associated with traits, particularly those of small effects, could be increased if certain regions of the genome were known a priori to be enriched for associations. Here, we investigate whether twelve genomic annotation classes were enriched or depleted for significant associations in genome wide association studies for complex traits in beef and dairy cattle. We also describe a variance component approach to determine the proportion of genetic variance captured by each annotation class.
P-values from large GWAS using 700K SNP in both dairy and beef cattle were available for 11 and 10 traits respectively. We found significant enrichment for trait associated variants (SNP significant in the GWAS) in the missense class along with regions 5 kilobases upstream and downstream of coding genes. We found that the non-coding conserved regions (across mammals) were not enriched for trait associated variants. The results from the enrichment or depletion analysis were not in complete agreement with the results from variance component analysis, where the missense and synonymous classes gave the greatest increase in variance explained, while the upstream and downstream classes showed a more modest increase in the variance explained.
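One way to picture the enrichment analysis described above is as a comparison between the share of significant SNPs inside an annotation class and the share genome-wide. The sketch below assumes a one-sided hypergeometric test; the abstract does not state which test was used, so the test choice, the function names, and the toy counts are all assumptions made for illustration.

```python
from math import comb


def fold_enrichment(n_class_sig, n_class, n_sig, n_total):
    """Ratio of the significant fraction inside the class to the overall
    significant fraction; values above 1 indicate enrichment."""
    return (n_class_sig / n_class) / (n_sig / n_total)


def hypergeom_upper_tail(k, n_class, n_sig, n_total):
    """P(X >= k), where X counts significant SNPs among n_class SNPs drawn
    without replacement from n_total SNPs, n_sig of which are significant."""
    upper = min(n_class, n_sig)
    favourable = sum(
        comb(n_sig, i) * comb(n_total - n_sig, n_class - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(n_total, n_class)


# Toy counts (illustrative only): 20 SNPs, 5 significant overall,
# an annotation class of 4 SNPs containing 2 of the significant ones.
print(fold_enrichment(2, 4, 5, 20), hypergeom_upper_tail(2, 4, 5, 20))
```

A depletion test would use the lower tail instead. This sketch also ignores LD between neighbouring SNPs, which a real analysis would need to account for.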
Our results indicate that functional annotations could assist in prioritization of variants to a subset more likely to be associated with complex traits; including missense variants, and upstream and downstream regions. The differences in two sets of results (GWAS enrichment depletion versus variance component approaches) might be explained by the fact that the variance component approach has greater power to capture the cumulative effect of mutations of small effect, while the enrichment or depletion approach only captures the variants that are significant in GWAS, which is restricted to a limited number of common variants of moderate effects.
Electronic supplementary material
The online version of this article (doi: 10.1186/1471-2164-15-436) contains supplementary material, which is available to authorized users.
Variance component analysis; Regulatory genome; GWAS prioritization; Enrichment depletion
The limited proportion of complex trait variance identified in genome-wide association studies may reflect the limited power of single SNP analyses to detect either rare causative alleles or those of small effect. Motivated by studies that demonstrate that loci contributing to trait variation may contain a number of different alleles, we have developed an analytical approach termed Regional Genomic Relationship Mapping that, like linkage-based family methods, integrates variance contributed by founder gametes within a pedigree. This approach takes advantage of very distant (and unrecorded) relationships, and this greatly increases the power of the method, compared with traditional pedigree-based linkage analyses. By integrating variance contributed by founder gametes in the population, our approach provides an estimate of the Regional Heritability attributable to a small genomic region (e.g. 100 SNP window covering ca. 1 Mb of DNA in a 300000 SNP GWAS) and has the power to detect regions containing multiple alleles that individually contribute too little variance to be detectable by GWAS as well as regions with single common GWAS-detectable SNPs. We use genome-wide SNP array data to obtain both a genome-wide relationship matrix and regional relationship ("identity by state" or IBS) matrices for sequential regions across the genome. We then estimate a heritability for each region sequentially in our genome-wide scan. We demonstrate by simulation and with real data that, when compared to traditional ("individual SNP") GWAS, our method uncovers new loci that explain additional trait variation. We analysed data from three Southern European populations and from Orkney for two exemplar traits: serum uric acid concentration and height. We show that regional heritability estimates are correlated with results from genome-wide association analysis but can capture more of the genetic variance segregating in the population and identify additional trait loci.
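The construction of regional relationship matrices described above can be sketched as follows. This is a minimal illustration that assumes genotypes coded as 0/1/2 allele counts, a simple mean IBS score per pair, and non-overlapping windows. The function names and the toy genotypes are hypothetical, and the variance-component step that turns each matrix into a regional heritability estimate is not shown.

```python
def ibs_matrix(genotypes, start, stop):
    """Mean identity-by-state sharing over SNPs [start, stop) for every pair
    of individuals. `genotypes[i][s]` is the allele count (0, 1 or 2)."""
    n = len(genotypes)
    width = stop - start
    if width <= 0:
        raise ValueError("window must contain at least one SNP")
    mat = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Each SNP contributes 2, 1 or 0 shared alleles.
            shared = sum(
                2 - abs(genotypes[i][s] - genotypes[j][s])
                for s in range(start, stop)
            )
            mat[i][j] = mat[j][i] = shared / (2 * width)
    return mat


def regional_matrices(genotypes, window=100):
    """One IBS matrix per consecutive window of `window` SNPs."""
    n_snps = len(genotypes[0])
    return [
        ibs_matrix(genotypes, s, min(s + window, n_snps))
        for s in range(0, n_snps, window)
    ]


# Toy data: 3 individuals, 4 SNPs, windows of 2 SNPs (illustrative only).
toy = [[0, 2, 1, 0], [0, 2, 1, 2], [2, 0, 1, 0]]
for region in regional_matrices(toy, window=2):
    print(region)
```

In a full analysis, each regional matrix would typically enter a mixed model alongside the genome-wide matrix, so that the regional variance component is estimated after accounting for background relatedness.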
The exploration of quantitative variation in human populations has become one of the major priorities for medical genetics. The successful identification of variants that contribute to complex traits is highly dependent on reliable assays and genetic maps. We have performed a genome-wide quantitative trait analysis of 630 genes in 60 unrelated Utah residents with ancestry from Northern and Western Europe using the publicly available phase I data of the International HapMap project. The genes are located in regions of the human genome with elevated functional annotation and disease interest including the ENCODE regions spanning 1% of the genome, Chromosome 21 and Chromosome 20q12–13.2. We apply three different methods of multiple test correction, including Bonferroni, false discovery rate, and permutations. For the 374 expressed genes, we find many regions with statistically significant association of single nucleotide polymorphisms (SNPs) with expression variation in lymphoblastoid cell lines after correcting for multiple tests. Based on our analyses, the signal proximal (cis-) to the genes of interest is more abundant and more stable than distal and trans across statistical methodologies. Our results suggest that regulatory polymorphism is widespread in the human genome and show that the 5-kb (phase I) HapMap has sufficient density to enable linkage disequilibrium mapping in humans. Such studies will significantly enhance our ability to annotate the non-coding part of the genome and interpret functional variation. In addition, we demonstrate that the HapMap cell lines themselves may serve as a useful resource for quantitative measurements at the cellular level.
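Two of the three multiple-testing corrections named above can be compared in a few lines. The sketch assumes the Benjamini-Hochberg step-up procedure as the false discovery rate method, which the abstract does not specify. The function names and p-values are illustrative, and permutation-based correction is omitted because it needs the underlying genotype and expression data.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject tests whose p-value is at most alpha divided by the test count."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= alpha * k / m.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            max_rank = rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            reject[idx] = True
    return reject


# Illustrative p-values, not data from the study.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sum(bonferroni_reject(pvals)), sum(benjamini_hochberg_reject(pvals)))
```

On these toy values the FDR procedure rejects more hypotheses than Bonferroni, which is the usual trade-off: fewer missed associations in exchange for tolerating a controlled fraction of false discoveries.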
With the finished reference sequence of the human genome now available, focus has shifted towards trying to identify all of the functional elements within the sequence. Although quite a lot of progress has been made towards identifying some classes of genomic elements, in particular protein-coding sequences, the characterization of regulatory elements remains a challenge. The authors describe the genetic mapping of regions of the genome that have functional effects on quantitative levels of gene expression. Gene expression of 630 genes was measured in cell lines derived from 60 unrelated human individuals, the same Utah residents of Northern and Western European ancestry that have been genetically well-characterized by The International HapMap Project. This paper reports significant variation among individuals with respect to levels of gene expression, and demonstrates that this quantitative trait has a genetic basis. For some genes, the genetic signal was localized to specific locations in the human genome sequence; in most cases the genomic region associated with expression variation was physically close to the gene whose expression it regulated. The authors demonstrate the feasibility of performing whole-genome association scans to map quantitative traits, and highlight statistical issues that are increasingly important for whole-genome disease mapping studies.
Flowering time is a key life-history trait in the plant life cycle. Most studies to unravel the genetics of flowering time in Arabidopsis thaliana have been performed under greenhouse conditions. Here, we describe a study about the genetics of flowering time that differs from previous studies in two important ways: first, we measure flowering time in a more complex and ecologically realistic environment; and, second, we combine the advantages of genome-wide association (GWA) and traditional linkage (QTL) mapping. Our experiments involved phenotyping nearly 20,000 plants over 2 winters under field conditions, including 184 worldwide natural accessions genotyped for 216,509 SNPs and 4,366 RILs derived from 13 independent crosses chosen to maximize genetic and phenotypic diversity. Based on a photothermal time model, the flowering time variation scored in our field experiment was poorly correlated with the flowering time variation previously obtained under greenhouse conditions, reinforcing previous demonstrations of the importance of genotype by environment interactions in A. thaliana and the need to study adaptive variation under natural conditions. The use of 4,366 RILs provides great power for dissecting the genetic architecture of flowering time in A. thaliana under our specific field conditions. We describe more than 60 additive QTLs, all with relatively small to medium effects and organized in 5 major clusters. We show that QTL mapping increases our power to distinguish true from false associations in GWA mapping. QTL mapping also permits the identification of false negatives, that is, causative SNPs that are lost when applying GWA methods that control for population structure. Major genes underpinning flowering time in the greenhouse were not associated with flowering time in this study. Instead, we found a prevalence of genes involved in the regulation of the plant circadian clock. Furthermore, we identified new genomic regions lacking obvious candidate genes.
Dissecting the genetic bases of adaptive traits is of primary importance in evolutionary biology. In this study, we combined a genome-wide association (GWA) study with traditional linkage mapping in order to detect the genetic bases underlying natural variation in flowering time in ecologically realistic conditions in the plant Arabidopsis thaliana. Our study involved phenotyping nearly 20,000 plants over 2 winters under field conditions in a temperate climate. We show that combined linkage and association mapping clearly outperforms each method alone when it comes to identifying true associations. This highlights the utility of combining different methods to localize genes involved in complex trait natural variation. Most candidate genes found in this study are involved in the regulation of the plant circadian clock and, surprisingly, were not associated with flowering time scored under greenhouse conditions. While rapid advances have been made in high-throughput genotyping and sequencing, high-throughput phenotyping of complex traits under natural conditions will be the next challenge for dissecting the genetic bases of adaptive variation in “laboratory” model organisms.
Genome-wide association mapping is highly sensitive to environmental changes, but network analysis allows rapid causal gene identification.
Genome-wide association (GWA) is gaining popularity as a means to study the architecture of complex quantitative traits, partially due to the improvement of high-throughput low-cost genotyping and phenotyping technologies. Glucosinolate (GSL) secondary metabolites within Arabidopsis spp. can serve as a model system to understand the genomic architecture of adaptive quantitative traits. GSL are key anti-herbivory defenses that impart adaptive advantages within field trials. While little is known about how variation in the external or internal environment of an organism may influence the efficiency of GWA, GSL variation is known to be highly dependent upon the external stresses and developmental processes of the plant, making it an excellent model for studying conditional GWA.
To understand how development and environment can influence GWA, we conducted a study using 96 Arabidopsis thaliana accessions, >40 GSL phenotypes across three conditions (one developmental comparison and one environmental comparison) and ∼230,000 SNPs. Developmental stage had dramatic effects on the outcome of GWA, with each stage identifying different loci associated with GSL traits. Further, while the molecular bases of numerous quantitative trait loci (QTL) controlling GSL traits have been identified, there is currently no estimate of how many additional genes may control natural variation in these traits. We developed a novel co-expression network approach to prioritize the thousands of GWA candidates and successfully validated a large number of these genes as influencing GSL accumulation within A. thaliana using single gene isogenic lines.
Together, these results suggest that complex traits imparting environmentally contingent adaptive advantages are likely influenced by up to thousands of loci that are sensitive to fluctuations in the environment or developmental state of the organism. Additionally, while GWA is highly conditional upon genetics, the use of additional genomic information can rapidly identify causal loci en masse.
Understanding how genetic variation can control phenotypic variation is a fundamental goal of modern biology. A major push has been made using genome-wide association mapping in all organisms to attempt to rapidly identify the genes contributing to phenotypes such as disease and nutritional disorders. But a number of fundamental questions have not been answered about the use of genome-wide association: for example, how does the internal or external environment influence the genes found? Furthermore, the simple question of how many genes may influence a trait remains unanswered. Finally, a number of studies have identified significant false-positive and false-negative issues within genome-wide association studies that are not solvable by direct statistical approaches. We have used genome-wide association mapping in the plant Arabidopsis thaliana to begin exploring these questions. We show that both external and internal environments significantly alter the identified genes, such that using different tissues can lead to the identification of nearly completely different gene sets. Given the large number of potential false-positives, we developed an orthogonal approach to filtering the possible genes, by identifying co-functioning networks using the nominal candidate gene list derived from genome-wide association studies. This allowed us to rapidly identify and validate a large number of novel and unexpected genes that affect Arabidopsis thaliana defense metabolism within phenotypic ranges that have been shown to be selectable within the field. These genes and the associated networks suggest that Arabidopsis thaliana defense metabolism is more consistent with the infinite gene hypothesis, according to which there is a vast number of causative genes controlling natural variation in this phenotype. It remains to be seen how frequently this is true for other organisms and other phenotypes.
Despite important advances from Genome Wide Association Studies (GWAS), for most complex human traits and diseases, a sizable proportion of genetic variance remains unexplained and prediction accuracy (PA) is usually low. Evidence suggests that PA can be improved using Whole-Genome Regression (WGR) models where phenotypes are regressed on hundreds of thousands of variants simultaneously. The Genomic Best Linear Unbiased Prediction (G-BLUP, a ridge-regression type method) is a commonly used WGR method and has shown good predictive performance when applied to plant and animal breeding populations. However, breeding and human populations differ greatly in a number of factors that can affect the predictive performance of G-BLUP. Using theory, simulations, and real data analysis, we study the performance of G-BLUP when applied to data from related and unrelated human subjects. Under perfect linkage disequilibrium (LD) between markers and QTL, the prediction R-squared (R²) of G-BLUP reaches trait heritability, asymptotically. However, under imperfect LD between markers and QTL, prediction R² based on G-BLUP has a much lower upper bound. We show that the minimum decrease in prediction accuracy caused by imperfect LD between markers and QTL is given by (1−b)², where b is the regression of marker-derived genomic relationships on those realized at causal loci. For pairs of related individuals, due to within-family disequilibrium, the patterns of realized genomic similarity are similar across the genome; therefore b is close to one, inducing a small decrease in R². However, with distantly related individuals b reaches very low values, imposing a very low upper bound on prediction R². Our simulations suggest that for the analysis of data from unrelated individuals, the asymptotic upper bound on R² may be of the order of 20% of the trait heritability. We show how PA can be enhanced with use of variable selection or differential shrinkage of estimates of marker effects.
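The quantity (1 − b) squared mentioned above can be computed from paired relationship values. The sketch assumes b is the ordinary least-squares slope of marker-derived relationships on causal-locus relationships across pairs of individuals. The function names and the toy relationship values are hypothetical, and how the resulting quantity maps onto a loss in prediction R-squared depends on derivations not reproduced here.

```python
def regression_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need two equal-length vectors with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("x has no variance")
    return s_xy / s_xx


def min_accuracy_decrease(causal_rel, marker_rel):
    """(1 - b)^2, with b the slope of marker-derived relationships on the
    relationships realized at causal loci."""
    b = regression_slope(causal_rel, marker_rel)
    return (1.0 - b) ** 2


# Toy off-diagonal relationship values for four pairs (illustrative only).
causal = [0.0, 0.1, 0.2, 0.3]
marker = [0.0, 0.05, 0.1, 0.15]   # markers track causal loci with slope 0.5
print(min_accuracy_decrease(causal, marker))
```

When the markers track causal loci perfectly the slope is one and the quantity is zero, matching the statement that related individuals, for whom b is close to one, incur only a small decrease.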
Despite great advances in genotyping technologies, the ability to predict complex traits and diseases remains limited. Increasing evidence suggests that many of these traits may be affected by a large number of small-effect genes that are difficult to detect in single-variant association studies. Whole-Genome Regression (WGR) methods can be used to confront this challenge and have exhibited good predictive power when applied to animal and plant breeding populations. WGR is receiving increased attention in the field of human genetics. However, human and breeding populations differ greatly in factors that can affect the performance of WGRs. Using theory, simulation and real data analysis, we study the predictive performance of the Genomic Best Linear Unbiased Predictor (G-BLUP), one of the most commonly used WGR methods. We derive upper bounds for the prediction accuracy of G-BLUP under perfect and imperfect LD between markers and genotypes at causal loci and validate such upper bounds using simulation and real data analysis. Imperfect LD between markers and causal loci can impose a very low upper bound on the prediction accuracy of G-BLUP, especially when data involve unrelated individuals. In this context, we propose and evaluate avenues for improving the predictive performance of G-BLUP.
We have recently developed analysis methods (GREML) to estimate the genetic variance of a complex trait/disease and the genetic correlation between two complex traits/diseases using genome-wide single nucleotide polymorphism (SNP) data in unrelated individuals. Here we use analytical derivations and simulations to quantify the sampling variance of the estimate of the proportion of phenotypic variance captured by all SNPs for quantitative traits and case-control studies. We also derive the approximate sampling variance of the estimate of a genetic correlation in a bivariate analysis, when two complex traits are either measured on the same or different individuals. We show that the sampling variance is inversely proportional to the number of pairwise contrasts in the analysis and to the variance in SNP-derived genetic relationships. For bivariate analysis, the sampling variance of the genetic correlation additionally depends on the harmonic mean of the proportion of variance explained by the SNPs for the two traits and the genetic correlation between the traits, and depends on the phenotypic correlation when the traits are measured on the same individuals. We provide an online tool for calculating the power of detecting genetic (co)variation using genome-wide SNP data. The new theory and online tool will be helpful to plan experimental designs to estimate the missing heritability that has not yet been fully revealed through genome-wide association studies, and to estimate the genetic overlap between complex traits (diseases) in particular when the traits (diseases) are not measured on the same samples.
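The proportionality described above can be turned into a rough planning calculation. The sketch assumes the approximation var(h²) ≈ 2 / (N² · var(A)), where N is the number of individuals and var(A) the variance of off-diagonal SNP-derived relationships. This form is consistent with "inversely proportional to the number of pairwise contrasts" (about N²/2), but the constant and the default var(A) of 2e-5 are assumptions recalled from the wider GREML literature, not values stated in this abstract.

```python
import math


def approx_se_snp_heritability(n_individuals, var_relationship=2e-5):
    """Approximate standard error of a SNP-heritability estimate.

    Assumes var(h2) ~= 2 / (N^2 * var(A)); both the formula constant and
    the default var_relationship are illustrative assumptions.
    """
    if n_individuals <= 1 or var_relationship <= 0:
        raise ValueError("need more than one individual and positive variance")
    return math.sqrt(2.0 / (n_individuals ** 2 * var_relationship))


for n in (2000, 5000, 10000):
    print(n, round(approx_se_snp_heritability(n), 4))
```

Under this approximation the standard error scales as 1/N, so doubling the sample size halves it. The online tool mentioned in the abstract remains the reference for actual study design.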
Genome-wide association studies (GWAS) have identified thousands of genetic variants for hundreds of traits and diseases. However, the genetic variants discovered from GWAS explain only a small fraction of the heritability, raising the question of "missing heritability". We have recently developed approaches (called GREML) to estimate the overall contribution of all SNPs to the phenotypic variance of a trait (disease) and the proportion of genetic overlap between traits (diseases). A frequently asked question is how many samples are required to estimate the proportion of variance attributable to all SNPs and the proportion of genetic overlap with useful precision. In this study, we derive the standard errors of the estimated parameters from theory and find that they are highly consistent with the values observed in published results and those obtained from simulation. The theory together with an online application tool will be helpful to plan experimental design to quantify the missing heritability, and to estimate the genetic overlap between traits (diseases) especially when it is unfeasible to have the traits (diseases) measured on the same individuals.
Understanding how genomes encode complex cellular and organismal behaviors has become the outstanding challenge of modern genetics. Unlike classical screening methods, analysis of genetic variation that occurs naturally in wild populations can enable rapid, genome-scale mapping of genotype to phenotype with a medium-throughput experimental design. Here we describe the results of the first genome-wide association study (GWAS) used to identify novel loci underlying trait variation in a microbial eukaryote, harnessing wild isolates of the filamentous fungus Neurospora crassa. We genotyped each of a population of wild Louisiana strains at 1 million genetic loci genome-wide, and we used these genotypes to map genetic determinants of microbial communication. In N. crassa, germinated asexual spores (germlings) sense the presence of other germlings, grow toward them in a coordinated fashion, and fuse. We evaluated germlings of each strain for their ability to chemically sense, chemotropically seek, and undergo cell fusion, and we subjected these trait measurements to GWAS. This analysis identified one gene, NCU04379 (cse-1, encoding a homolog of a neuronal calcium sensor), at which inheritance was strongly associated with the efficiency of germling communication. Deletion of cse-1 significantly impaired germling communication and fusion, and two genes encoding predicted interaction partners of CSE1 were also required for the communication trait. Additionally, mining our association results for signaling and secretion genes with a potential role in germling communication, we validated six more previously unknown molecular players, including a secreted protease and two other genes whose deletion conferred a novel phenotype of increased communication and multi-germling fusion. Our results establish protein secretion as a linchpin of germling communication in N. crassa and shed light on the regulation of communication molecules in this fungus. 
Our study demonstrates the power of population-genetic analyses for the rapid identification of genes contributing to complex traits in microbial species.
Many phenotypes of interest are controlled by multiple loci, and in biological systems identifying determinants of such complex traits is challenging. Here, we genotyped 112 wild isolates of Neurospora crassa and used this resource to identify genes that mediate a fundamental but poorly-understood attribute of this filamentous fungus: the ability of germinating spores to sense each other at a distance, extend projections toward one another, and fuse. Inheritance at a secretion gene, cse-1, was associated strongly with germling communication across wild strains; this association was validated in experiments showing reduced communication in a cse-1 deletion strain. By testing interacting partners of CSE1, and by assessing additional secretion and signaling factors whose inheritance associated more modestly with germling communication in wild strains, we identified eight other novel determinants of this phenotype. Our population of genotyped wild isolates provides a flexible and powerful community resource for the rapid identification of any varying, complex phenotype in N. crassa. The success of our approach, which used a phenotyping scheme far more tractable than would be required in a screen of the entire N. crassa gene deletion collection, serves as a proof of concept for association studies of wild populations for any organism.
Blood lipid levels including low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol (HDL-C), and triglycerides (TG) are highly heritable. Genome-wide association is a promising approach to map genetic loci related to these heritable phenotypes.
In 1087 Framingham Heart Study Offspring cohort participants (mean age 47 years, 52% women), we conducted genome-wide analyses (Affymetrix 100K GeneChip) for fasting blood lipid traits. Total cholesterol, HDL-C, and TG were measured by standard enzymatic methods and LDL-C was calculated using the Friedewald formula. The long-term averages of up to seven measurements of LDL-C, HDL-C, and TG over a ~30 year span were the primary phenotypes. We used generalized estimating equations (GEE), family-based association tests (FBAT) and variance components linkage to investigate the relationships between SNPs (on autosomes, with minor allele frequency ≥10%, genotypic call rate ≥80%, and Hardy-Weinberg equilibrium p ≥ 0.001) and multivariable-adjusted residuals. We pursued a three-stage replication strategy of the GEE association results with 287 SNPs (P < 0.001 in Stage I) tested in Stage II (n ~1450 individuals) and 40 SNPs (P < 0.001 in joint analysis of Stages I and II) tested in Stage III (n~6650 individuals).
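The phenotype construction described above can be sketched as follows. The Friedewald formula estimates LDL-C as total cholesterol minus HDL-C minus TG/5, with all values in mg/dL. The cutoff of 400 mg/dL for TG, above which the estimate is conventionally considered unreliable, and the function names friedewald_ldl and long_term_mean are assumptions for illustration; the text does not state how missing or out-of-range measurements were handled.

```python
def friedewald_ldl(total_chol, hdl, tg, tg_limit=400.0):
    """Estimate LDL-C (mg/dL) as TC - HDL-C - TG/5.

    Returns None when TG >= tg_limit (assumed conventional cutoff).
    """
    if tg >= tg_limit:
        return None
    return total_chol - hdl - tg / 5.0


def long_term_mean(values):
    """Average the available measurements, skipping missing ones (None)."""
    observed = [v for v in values if v is not None]
    if not observed:
        return None
    return sum(observed) / len(observed)


if __name__ == "__main__":
    # Hypothetical exam values (TC, HDL-C, TG) for one participant.
    exams = [(200.0, 50.0, 150.0), (210.0, 55.0, 450.0), (190.0, 45.0, 100.0)]
    ldl_series = [friedewald_ldl(tc, hdl, tg) for tc, hdl, tg in exams]
    print(ldl_series, long_term_mean(ldl_series))
```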
Long-term averages of LDL-C, HDL-C, and TG were highly heritable (h2 = 0.66, 0.69, 0.58, respectively; each P < 0.0001). Of 70,987 tests for each of the phenotypes, two SNPs had p < 10^-5 in GEE results for LDL-C, four for HDL-C, and one for TG. For each multivariable-adjusted phenotype, the number of SNPs with association p < 10^-4 ranged from 13 to 18 and with p < 10^-3, from 94 to 149. Some results confirmed previously reported associations with candidate genes including variation in the lipoprotein lipase gene (LPL) and HDL-C and TG (rs7007797; P = 0.0005 for HDL-C and 0.002 for TG). The full set of GEE, FBAT and linkage results is posted at the database of Genotype and Phenotype (dbGaP). After three stages of replication, there was no convincing statistical evidence for association (i.e., combined P < 10^-5 across all three stages) between any of the tested SNPs and lipid phenotypes.
Using a 100K genome-wide scan, we have generated a set of putative associations for common sequence variants and lipid phenotypes. Validation of selected hypotheses in additional samples did not identify any new loci underlying variability in blood lipids. Lack of replication may be due to inadequate statistical power to detect modest quantitative trait locus effects (i.e., <1% of trait variance explained) or reduced genomic coverage of the 100K array. GWAS in FHS using a denser genome-wide genotyping platform and a better-powered replication strategy may identify novel loci underlying blood lipids.
Current genome-wide association studies (GWAS) have high power to detect intermediate frequency SNPs making modest contributions to complex disease, but they are underpowered to detect rare alleles of large effect (RALE). This has led to speculation that the bulk of variation for most complex diseases is due to RALE. One concern with existing models of RALE is that they do not make explicit assumptions about the evolution of a phenotype and its molecular basis. Rather, much of the existing literature relies on arbitrary mapping of phenotypes onto genotypes obtained either from standard population-genetic simulation tools or from non-genetic models. We introduce a novel simulation of a 100-kilobase gene region, based on the standard definition of a gene, in which mutations are unconditionally deleterious, are continuously arising, have partially recessive and non-complementing effects on phenotype (analogous to what is widely observed for most Mendelian disorders), and are interspersed with neutral markers that can be genotyped. Genes evolving according to this model exhibit a characteristic GWAS signature consisting of an excess of marginally significant markers. Existing tests for an excess burden of rare alleles in cases have low power while a simple new statistic has high power to identify disease genes evolving under our model. The structure of linkage disequilibrium between causative mutations and significantly associated markers under our model differs fundamentally from that seen when rare causative markers are assumed to be neutral. Rather than tagging single haplotypes bearing a large number of rare causative alleles, we find that significant SNPs in a GWAS tend to tag single causative mutations of small effect relative to other mutations in the same gene. Our results emphasize the importance of evaluating the power to detect associations under models that are genetically and evolutionarily motivated.
Current GWA studies typically explain only a small fraction of heritable variation in complex traits, resulting in speculation that a large fraction of variation in such traits may be due to rare alleles of large effect (RALE). The most parsimonious evolutionary mechanism that results in an inverse relationship between the frequency and effect size of causative alleles is an equilibrium between newly arising deleterious mutations and selection eliminating those mutations. This assumption is not built into many current models of RALE and, as a result, power calculations may be misleading. We use forward population genetic simulations to explore the ability of GWAS to detect genes in which unconditionally deleterious, partially recessive mutations arise each generation. Our model is based on the standard definition of a gene as a region within which loss-of-function mutations fail to complement, consistent with the multi-allelic basis for Mendelian disorders. It predicts that single genes evolving in this way may not uncommonly contribute upwards of 5% to variation in a complex trait, and that such genes could be routinely detected via modified GWAS approaches.
Polymorphisms that affect complex traits, or quantitative trait loci (QTL), often affect multiple traits. We describe two novel methods: (1) for finding single nucleotide polymorphisms (SNPs) significantly associated with one or more traits using a multi-trait meta-analysis, and (2) for distinguishing between a single pleiotropic QTL and multiple linked QTL. The meta-analysis uses the effect of each SNP on each of n traits, estimated in single trait genome wide association studies (GWAS). These effects are expressed as a vector of signed t-values (t), and the error covariance matrix of these t-values is approximated by the correlation matrix of t-values among the traits calculated across SNPs (V). Consequently, t'V⁻¹t is approximately distributed as a chi-squared with n degrees of freedom. An attractive feature of the meta-analysis is that it uses estimated effects of SNPs from single trait GWAS, so it can be applied to published data where individual records are not available. We demonstrate that the multi-trait method can be used to increase the power (numbers of SNPs validated in an independent population) of GWAS in a beef cattle data set including 10,191 animals genotyped for 729,068 SNPs with 32 traits recorded, including growth and reproduction traits. We can distinguish between a single pleiotropic QTL and multiple linked QTL because multiple SNPs tagging the same QTL show the same pattern of effects across traits. We confirm this finding by demonstrating that when one SNP is included in the statistical model the other SNPs have a non-significant effect. In the beef cattle data set, cluster analysis yielded four groups of QTL with similar patterns of effects across traits within a group. A linear index was used to validate SNPs having effects on multiple traits and to identify additional SNPs belonging to these four groups.
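The multi-trait statistic described above can be sketched in a few lines. The example estimates V as the correlation matrix of the t-values across SNPs and then evaluates the quadratic form for one SNP by solving V x = t rather than inverting V explicitly. The function names, the toy t-values and the use of plain Gaussian elimination are illustrative assumptions, not the authors' implementation; the resulting statistic would be compared against a chi-squared distribution with n degrees of freedom.

```python
import math


def correlation_matrix(rows):
    """Correlation matrix of the columns (traits) across the rows (SNPs)."""
    n = len(rows[0])
    cols = [[r[j] for r in rows] for j in range(n)]
    means = [sum(c) / len(c) for c in cols]
    dev = [[x - m for x in c] for c, m in zip(cols, means)]
    norm = [math.sqrt(sum(d * d for d in dv)) for dv in dev]
    return [[sum(a * b for a, b in zip(dev[i], dev[j])) / (norm[i] * norm[j])
             for j in range(n)] for i in range(n)]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("V is singular or nearly singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def multi_trait_stat(t, v):
    """Quadratic form t' V^-1 t for one SNP's vector of signed t-values."""
    x = solve(v, t)
    return sum(ti * xi for ti, xi in zip(t, x))


if __name__ == "__main__":
    # Hypothetical signed t-values: rows are SNPs, columns are traits.
    t_values = [[0.5, 0.2], [-1.1, -0.4], [2.3, 1.9], [0.1, -0.7], [-0.8, 0.3]]
    v = correlation_matrix(t_values)
    for t in t_values:
        print(t, round(multi_trait_stat(t, v), 3))
```

Because only per-SNP t-values enter the calculation, the same sketch applies to published summary statistics where individual records are unavailable.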
We describe novel methods for finding significant associations between a genome wide panel of SNPs and multiple complex traits, and further for distinguishing between genes with effects on multiple traits and multiple linked genes affecting different traits. The method uses a meta-analysis based on estimates of SNP effects from independent single trait genome wide association studies (GWAS). The method could therefore be widely used to combine already published GWAS results. The method was applied to 32 traits that describe growth, body composition, feed intake and reproduction in 10,191 beef cattle genotyped for approximately 700,000 SNPs. The genes found to be associated with these traits can be arranged into 4 groups that differ in their pattern of effects and hence presumably in their physiological mechanism of action. For instance, one group of genes affects weight and fatness in the opposite direction and can be described as a group of genes affecting mature size, while another group affects weight and fatness in the same direction.
Adult height is a classic polygenic trait of high heritability (h2 ∼0.8). More than 180 single nucleotide polymorphisms (SNPs), identified mostly in populations of European descent, are associated with height. These variants convey modest effects and explain ∼10% of the variance in height. Discovery efforts in other populations, while limited, have revealed loci for height not previously implicated in individuals of European ancestry. Here, we performed a meta-analysis of genome-wide association (GWA) results for adult height in 20,427 individuals of African ancestry with replication in up to 16,436 African Americans. We found two novel height loci (Xp22-rs12393627, P = 3.4×10−12 and 2p14-rs4315565, P = 1.2×10−8). As a group, height associations discovered in European-ancestry samples replicate in individuals of African ancestry (P = 1.7×10−4 for overall replication). Fine-mapping of the European height loci in African-ancestry individuals showed an enrichment of SNPs that are associated with expression of nearby genes when compared to the index European height SNPs (P<0.01). Our results highlight the utility of genetic studies in non-European populations to understand the etiology of complex human diseases and traits.
Adult height is an ideal phenotype to improve our understanding of the genetic architecture of complex diseases and traits: it is easily measured and usually available in large cohorts, relatively stable, and mostly influenced by genetics (narrow-sense heritability of height h2∼0.8). Genome-wide association (GWA) studies in individuals of European ancestry have identified >180 single nucleotide polymorphisms (SNPs) associated with height. In the current study, we continued to use height as a model polygenic trait and explored the genetic influence in populations of African ancestry through a meta-analysis of GWA height results from 20,809 individuals of African descent. We identified two novel height loci not previously found in Europeans. We also replicated the European height signals, suggesting that many of the genetic variants that are associated with height are shared between individuals of European and African descent. Finally, in fine-mapping the European height loci in African-ancestry individuals, we found SNPs more likely to be associated with the expression of nearby genes than the SNPs originally found in Europeans. Thus, our results support the utility of performing genetic studies in non-European populations to gain insights into complex human diseases and traits.
Recent advances in high-throughput genotyping technologies have provided the opportunity to map genes using associations between complex traits and markers. Genome-wide association studies (GWAS) based on either a single marker or haplotype have identified genetic variants and underlying genetic mechanisms of quantitative traits. Prompted by the achievements of studies examining economic traits in cattle, and to verify the consistency of these two methods using real data, the current study was conducted to construct the haplotype structure in the bovine genome and to detect relevant genes genuinely affecting a carcass trait and a meat quality trait. Using the Illumina BovineHD BeadChip, 942 genotyped young bulls were used as a reference population to identify the genes in the beef cattle genome significantly associated with foreshank weight and triglyceride levels. In total, 92,553 haplotype blocks were detected in the genome. The regions of high linkage disequilibrium extended up to approximately 200 kb, and the size of haplotype blocks ranged from 22 bp to 199,266 bp. Additionally, the individual SNP analysis and the haplotype-based analysis detected similar regions and common SNPs for these two representative traits. A total of 12 and 7 SNPs in the bovine genome were significantly associated with foreshank weight and triglyceride levels, respectively. By comparison, 4 and 5 haplotype blocks containing the majority of significant SNPs were strongly associated with foreshank weight and triglyceride levels, respectively. In addition, 36 SNPs with high linkage disequilibrium were detected in the GNAQ gene, a potential hotspot that may play a crucial role in regulating carcass trait components.
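Haplotype blocks of this kind are usually defined from pairwise linkage disequilibrium between SNPs. As a minimal sketch, the function below computes the standard r^2 measure for two biallelic SNPs from phased haplotypes coded 0/1. The function name ld_r2 and the toy data are assumptions for illustration; the text does not specify which LD measure or block-definition algorithm the study used.

```python
def ld_r2(haplotypes):
    """r^2 between two biallelic SNPs from phased haplotypes.

    haplotypes: list of (a, b) pairs, each allele coded 0 or 1.
    Uses D = p_AB - p_A * p_B and r^2 = D^2 / (p_A(1-p_A) p_B(1-p_B)).
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes given")
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("at least one SNP is monomorphic")
    d = p_ab - p_a * p_b
    return d * d / denom


if __name__ == "__main__":
    # Hypothetical phased haplotypes at two neighbouring SNPs.
    sample = [(0, 0), (1, 1), (1, 1), (0, 0), (1, 0), (0, 0)]
    print(round(ld_r2(sample), 3))
```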
Quantitative traits are conditioned by several genetic determinants. Since such genes influence many important complex traits in various organisms, the identification of quantitative trait loci (QTLs) is of major interest, but still encounters serious difficulties. We detected four linked genes within one QTL, which participate in controlling sporulation efficiency in Saccharomyces cerevisiae. Following the identification of single nucleotide polymorphisms by comparing the sequences of 145 genes between the parental strains SK1 and S288c, we analyzed the segregating progeny of the cross between them. Through reciprocal hemizygosity analysis, four genes, RAS2, PMS1, SWS2, and FKH2, located in a region of 60 kilobases on Chromosome 14, were found to be associated with sporulation efficiency. Three of the four “high” sporulation alleles are derived from the “low” sporulating strain. Two of these sporulation-related genes were verified through allele replacements. For RAS2, the causative variation was suggested to be a single nucleotide difference in the upstream region of the gene. This quantitative trait nucleotide accounts for sporulation variability among a set of ten closely related winery yeast strains. Our results provide a detailed view of genetic complexity in one “QTL region” that controls a quantitative trait and report a single nucleotide polymorphism-trait association in wild strains. Moreover, these findings have implications for QTL identification in higher eukaryotes.
Genes controlling many medically and agriculturally important complex traits in various organisms and their organization as quantitative trait loci (QTLs) are of major interest. To identify QTLs responsible for such a quantitative trait, the authors employed a two-step strategy: First, single-nucleotide markers (called SNPs) distributed throughout the genome were screened for prevalence among progeny with extreme characteristics, thus identifying three candidate genomic regions. Next, in one of these regions, manipulation of individual genes revealed four tightly linked genes that affected the trait, sporulation efficiency. A fifth gene that affects sporulation was recently and independently identified in the same region. This 60-kilobase region has a complex and interesting architecture: One strain, which sporulates efficiently, has sporulation-promoting alleles (alternative forms) at two major genes and inhibiting alleles at the three less important ones, whereas another strain, with inefficient sporulation, has the opposite alleles at the five genes. Moreover, one causative SNP for this trait, in the promoter region of the gene RAS2, explains sporulation differences among a set of ten winery yeast strains. These results provide a detailed view of genetic complexity in one “QTL region” and an SNP-trait association example among wild strains.
Next generation sequencing has dramatically increased our ability to localize disease-causing variants by providing base-pair level information at costs increasingly feasible for the large sample sizes required to detect complex-trait associations. Yet, identification of causal variants within an established region of association remains a challenge. Counter-intuitively, certain factors that increase power to detect an associated region can decrease power to localize the causal variant. First, combining GWAS with imputation or low coverage sequencing to achieve the large sample sizes required for high power can have the unintended effect of producing differential genotyping error among SNPs. This tends to bias the relative evidence for association toward better genotyped SNPs. Second, re-use of GWAS data for fine-mapping exploits previous findings to ensure genome-wide significance in GWAS-associated regions. However, using GWAS findings to inform fine-mapping analysis can bias evidence away from the causal SNP toward the tag SNP and SNPs in high LD with the tag. Together these factors can reduce power to localize the causal SNP by more than half. Other strategies commonly employed to increase power to detect association, namely increasing sample size and using higher density genotyping arrays, can, in certain common scenarios, actually exacerbate these effects and further decrease power to localize causal variants. We develop a re-ranking procedure that accounts for these adverse effects and substantially improves the accuracy of causal SNP identification, often doubling the probability that the causal SNP is top-ranked. Application to the NCI BPC3 aggressive prostate cancer GWAS with imputation meta-analysis identified a new top SNP at 2 of 3 associated loci and several additional possible causal SNPs at these loci that may have otherwise been overlooked. This method is simple to implement using R scripts provided on the author's website.
As next-generation sequencing (NGS) costs continue to fall and genome-wide association study (GWAS) platform coverage improves, the human genetics community is positioned to identify potentially causal variants. However, current NGS or imputation-based studies of either the whole genome or regions previously identified by GWAS have not yet been very successful in identifying causal variants. A major hurdle is the development of methods to distinguish disease-causing variants from their highly-correlated proxies within an associated region. We show that various common factors, such as differential sequencing or imputation accuracy rates and linkage disequilibrium patterns, with or without GWAS-informed region selection, can substantially decrease the probability of identifying the correct causal SNP, often by more than half. We then describe a novel and easy-to-implement re-ranking procedure that can double the probability that the causal SNP is top-ranked in many settings. Application to the NCI Breast and Prostate Cancer (BPC3) Cohort Consortium aggressive prostate cancer data identified new top SNPs within two associated loci previously established via GWAS, as well as several additional possible causal SNPs that had been previously overlooked.
Maize is the most widely grown cereal in the world. In addition to its role in global agriculture, it has also long served as a model organism for genetic research. Maize stands at a genetic crossroads, as it has access to all the tools available for plant genetics but exhibits a genetic architecture more similar to other outcrossing organisms than to self-pollinating crops and model plants. In this review, we summarize recent advances in maize genetics, including the development of powerful populations for genetic mapping and genome-wide association studies (GWAS), and the insights these studies yield on the mechanisms underlying complex maize traits. Most maize traits are controlled by a large number of genes, and linkage analysis of several traits implicates a ‘common gene, rare allele’ model of genetic variation where some genes have many individually rare alleles contributing. Most natural alleles exhibit small effect sizes with little-to-no detectable pleiotropy or epistasis. Additionally, many of these genes are locked away in low-recombination regions that encourage the formation of multi-gene blocks that may underlie maize's strong heterotic effect. Domestication left strong marks on the maize genome, and some of the differences in trait architectures may be due to different selective pressures over time. Overall, maize's advantages as a model system make it highly desirable for studying the genetics of outcrossing species, and results from it can provide insight into other such species, including humans.
maize; nested association mapping; quantitative traits; genome-wide association; heterosis
Osteoporosis is a complex disorder and commonly leads to fractures in elderly persons. Genome-wide association studies (GWAS) have become an unbiased approach to identify variations in the genome that potentially affect health. However, the genetic variants identified so far only explain a small proportion of the heritability for complex traits. Due to the modest genetic effect size and inadequate power, true association signals may not be revealed based on a stringent genome-wide significance threshold. Here, we take advantage of SNP and transcript arrays and integrate GWAS and expression signature profiling relevant to the skeletal system in cellular and animal models to prioritize the discovery of novel candidate genes for osteoporosis-related traits, including bone mineral density (BMD) at the lumbar spine (LS) and femoral neck (FN), as well as geometric indices of the hip (femoral neck-shaft angle, NSA; femoral neck length, NL; and narrow-neck width, NW). A two-stage meta-analysis of GWAS from 7,633 Caucasian women and 3,657 men revealed three novel loci associated with osteoporosis-related traits, including chromosome 1p13.2 (RAP1A, p = 3.6×10−8), 2q11.2 (TBC1D8), and 18q11.2 (OSBPL1A), and confirmed a previously reported region near the TNFRSF11B/OPG gene. We also prioritized 16 suggestive genome-wide significant candidate genes based on their potential involvement in skeletal metabolism. Among them, 3 candidate genes were associated with BMD in women. Notably, 2 out of these 3 genes (GPR177, p = 2.6×10−13; SOX6, p = 6.4×10−10) associated with BMD in women have been successfully replicated in a large-scale meta-analysis of BMD, but none of the non-prioritized candidates (associated with BMD) were. Our results support the concept of our prioritization strategy.
In the absence of direct biological support for identified genes, we highlighted the efficiency of subsequent functional characterization using publicly available expression profiling relevant to the skeletal system in cellular or whole animal models to prioritize candidate genes for further functional validation.
BMD and hip geometry are two major predictors of osteoporotic fractures, the most severe consequence of osteoporosis in elderly persons. We performed sex-specific genome-wide association studies (GWAS) for BMD at the lumbar spine and femoral neck skeletal sites as well as hip geometric indices (NSA, NL, and NW) in the Framingham Osteoporosis Study and then replicated the top findings in two independent studies. Three novel loci were significant: in women, chromosome 1p13.2 (RAP1A) for NW; in men, 2q11.2 (TBC1D8) for NSA and 18q11.2 (OSBPL1A) for NW. We confirmed a previously reported region on 8q24.12 (TNFRSF11B/OPG) for lumbar spine BMD in women. In addition, we integrated GWAS signals with eQTL in several tissues and publicly available expression signature profiling in cellular and whole-animal models, and prioritized 16 candidate genes/loci based on their potential involvement in skeletal metabolism. Among three prioritized loci (GPR177, SOX6, and CASR genes) associated with BMD in women, GPR177 and SOX6 were later successfully replicated in a large-scale meta-analysis, but none of the non-prioritized candidates (associated with BMD) were. Our results support the concept of using expression profiling to support the candidacy of suggestive GWAS signals that may contain important genes of interest.
Genome-wide association studies can have limited power to identify QTL, partly due to the stringent correction for multiple testing and low linkage-disequilibrium between SNPs and QTL. Regional Heritability Mapping (RHM) has been advanced as an alternative approach to capture underlying genetic effects. In this study, RHM was used to identify loci underlying variation in the 16th QTLMAS workshop simulated traits.
The method was implemented by fitting a mixed model where a genomic region and the overall genetic background were added as random effects. Heritabilities for the genetic regional effects were estimated, and the presence of a QTL in the region was tested using a likelihood ratio test (LRT). Several region sizes were considered (100, 50 and 20 adjacent SNPs). Bonferroni correction was used to calculate the LRT thresholds for genome-wide (p < 0.05) and suggestive (i.e., one false positive per genome scan) significance.
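The testing step described above can be sketched as follows. The likelihood ratio statistic is twice the difference in log-likelihood between the model with and without the regional effect, and the two Bonferroni thresholds follow from the number of regions tested. Two points are assumptions not stated in the text: the p-value uses a 50:50 mixture of chi-squared distributions with 0 and 1 degrees of freedom, a common choice when a variance component is tested on the boundary of its parameter space, and the correction is taken over the number of regions scanned. All function names and the log-likelihood values are hypothetical.

```python
import math


def chi2_1df_sf(x):
    """Upper-tail probability of a chi-squared variable with 1 df."""
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0))


def lrt_pvalue(loglik_full, loglik_null):
    """LRT statistic and p-value for one regional variance component.

    Assumption: the null distribution is a 50:50 mixture of chi2(0)
    and chi2(1), because the variance is tested on its boundary.
    """
    lrt = max(0.0, 2.0 * (loglik_full - loglik_null))
    p = 0.5 * chi2_1df_sf(lrt) if lrt > 0 else 1.0
    return lrt, p


def bonferroni_thresholds(n_regions, alpha=0.05):
    """Genome-wide (alpha / n) and suggestive (1 / n) p-value thresholds."""
    if n_regions < 1:
        raise ValueError("n_regions must be at least 1")
    return alpha / n_regions, 1.0 / n_regions


if __name__ == "__main__":
    genome_wide, suggestive = bonferroni_thresholds(n_regions=100)
    lrt, p = lrt_pvalue(loglik_full=-100.0, loglik_null=-105.0)
    print(lrt, p, p < genome_wide, p < suggestive)
```

The suggestive threshold of 1/n corresponds to one expected false positive per genome scan, as stated in the text.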
Genomic heritabilities (0.31, 0.32 and 0.48, respectively) and genetic correlations (0.80, -0.42 and 0.19, between trait-pairs 1&2, 1&3 and 2&3) were similar to the simulated ones. RHM identified 7 QTL (4 at genome-wide and 3 at suggestive level) for Trait1; 4 (2 genome-wide and 2 suggestive) for Trait2; and 7 (6 genome-wide and 1 suggestive) for Trait3. Only one of the identified suggestive QTL was a false positive. The positions of these QTL tended to coincide with the positions where the largest QTL (or several of them) were simulated. Several signals were detected for the simulated QTL with smaller effect. A combined analysis including all significant regions showed that they explain more than half of the total genetic variance of the traits. However, this might be overestimated, due to the Beavis effect. All QTL affecting traits 1&2 and 2&3 had positive correlations, following the trend of the overall correlation of both trait-pairs. All but one QTL affecting traits 1&3 were negatively correlated, in agreement with the simulated situation. Moreover, RHM identified extra loci that were not found by association and linkage analysis, highlighting the improved power of this approach.
RHM identified the largest QTL among the simulated ones, with some signals for the ones with small effect. Moreover, RHM performed better than association and linkage analysis, in terms of both power and resolution.
In recent years GWA studies have successfully identified common SNPs associated with complex diseases. However, most of the variants found this way account for only a small portion of the trait variance. This fact has led researchers to focus on rare-variant mapping with large scale sequencing, which can be facilitated by using linkage information. The question arises why linkage analysis often fails to identify genes when analyzing complex diseases. Using simulations, we investigated the power of parametric and nonparametric linkage statistics (KC-LOD, NPL, LOD and MOD scores) to detect the effect of genes responsible for complex diseases using different pedigree structures.
As expected, a small number of pedigrees with fewer than three affected individuals has low power to map disease genes with modest effect. Interestingly, the power decreases when unaffected individuals are included in the analysis, irrespective of the true mode of inheritance. Furthermore, we found that the best performing statistic depends not only on the type of pedigrees but also on the true mode of inheritance.
When applied in a sensible way, linkage is an appropriate and robust technique to map genes for complex disease. Unlike association analysis, linkage analysis is not hampered by allelic heterogeneity. So, why does linkage analysis often fail with complex diseases? Evidently, when using an insufficient number of small pedigrees, one might miss a true genetic linkage even though a real effect exists. Furthermore, we show that the test statistic has an important effect on the power to detect linkage as well. Therefore, a linkage analysis might fail if an inadequate test statistic is employed. We provide recommendations regarding the most favorable test statistics, in terms of power, for a given mode of inheritance and type of pedigrees under study, in order to reduce the probability of missing a true linkage.
Linkage; Parametric analysis; Nonparametric analysis; NPL score; LOD score; MOD score; Complex diseases; Rare variants
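As a point of reference for the LOD score named above, the following is a minimal sketch of the textbook two-point LOD score for phase-known, fully informative meioses. It is not the simulation setup or any of the pedigree-based statistics compared in the study. The function names `two_point_lod` and `max_lod`, the grid search, and the example counts are illustrative assumptions.

```python
import math


def two_point_lod(recombinants, meioses, theta):
    """Textbook two-point LOD score for phase-known meioses (illustrative).

    LOD(theta) = log10( theta^r * (1 - theta)^(n - r) / 0.5^n ),
    where r is the number of recombinants among n informative meioses.
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must lie in [0, meioses]")
    if theta == 0.0 and recombinants > 0:
        # Any observed recombinant is impossible under complete linkage.
        return float("-inf")

    def xlog10y(x, y):
        # Treat 0 * log10(0) as 0 so that theta = 0 with r = 0 is well defined.
        return 0.0 if x == 0 else x * math.log10(y)

    log_lik_linked = (xlog10y(recombinants, theta)
                      + xlog10y(meioses - recombinants, 1.0 - theta))
    log_lik_unlinked = meioses * math.log10(0.5)
    return log_lik_linked - log_lik_unlinked


def max_lod(recombinants, meioses, steps=500):
    """Maximise the LOD score over a grid of theta values in [0, 0.5]."""
    grid = [0.5 * i / steps for i in range(steps + 1)]
    best_theta = max(grid, key=lambda t: two_point_lod(recombinants, meioses, t))
    return best_theta, two_point_lod(recombinants, meioses, best_theta)


if __name__ == "__main__":
    theta_hat, lod = max_lod(recombinants=2, meioses=10)
    print(f"theta_hat={theta_hat:.3f}, LOD={lod:.3f}")
```

With few informative meioses the maximum attainable LOD is small, which is one simple way to see why a small number of small pedigrees has low power.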
Many complex disease syndromes, such as asthma, consist of a large number of highly related, rather than independent, clinical or molecular phenotypes. This raises a new technical challenge in identifying genetic variations associated simultaneously with correlated traits. In this study, we propose a new statistical framework called graph-guided fused lasso (GFlasso) to directly and effectively incorporate the correlation structure of multiple quantitative traits, such as clinical metrics and gene expression levels, in association analysis. Our approach represents correlation information explicitly among the quantitative traits as a quantitative trait network (QTN) and then leverages this network to encode structured regularization functions in a multivariate regression model over the genotypes and traits. The result is that the genetic markers that jointly influence subgroups of highly correlated traits can be detected with high sensitivity and specificity. While most traditional methods examine each phenotype independently and combine the results afterwards, our approach analyzes all of the traits jointly in a single statistical framework. This allows our method to borrow information across correlated phenotypes to discover the genetic markers that perturb a subset of the correlated traits synergistically. Using simulated datasets based on the HapMap consortium and an asthma dataset, we compared the performance of our method with other methods based on single-marker analysis and regression-based methods that do not use any of the relational information in the traits. We found that our method showed increased power in detecting causal variants affecting correlated traits.
Our results showed that, when correlation patterns among traits in a QTN are considered explicitly and directly during a structured multivariate genome association analysis using our proposed methods, the power of detecting true causal SNPs with possibly pleiotropic effects increased significantly without compromising performance on non-pleiotropic SNPs.
An association study examines a phenotype against genotypic variations over a large set of individuals in order to find the genetic variant that gives rise to the variation in the phenotype. Many complex disease syndromes consist of a large number of highly related clinical phenotypes, and the patient cohorts are routinely surveyed with a large number of traits, such as hundreds of clinical phenotypes and genome-wide profiling of thousands of gene expressions, many of which are correlated. However, most of the conventional approaches for association mapping or eQTL analysis consider a single phenotype at a time instead of taking advantage of the relatedness of traits by analyzing them jointly. Assuming that a group of tightly correlated traits may share a common genetic basis, in this paper, we present a new framework for association analysis that searches for genetic variations influencing a group of correlated traits. We explicitly represent the correlation information in multiple quantitative traits as a quantitative trait network and directly incorporate this network information to scan the genome for association. Our results on simulated and asthma data show that our approach has a significant advantage in detecting associations when a genetic marker perturbs synergistically a group of traits.
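The network-guided regularization described above can be illustrated with a small sketch that only evaluates an objective function; it does not fit a model. The exact form is an assumption: squared error, plus a lasso penalty, plus a fusion penalty over network edges weighted by the absolute trait correlation. The names `gflasso_objective`, `lam`, and `gamma` and the toy data are hypothetical and not taken from the original work.

```python
def gflasso_objective(X, Y, B, edges, lam, gamma):
    """Evaluate a graph-guided fused lasso style objective (illustrative).

    X     : n x J list of lists, genotypes (individuals x markers)
    Y     : n x K list of lists, trait values (individuals x traits)
    B     : J x K list of lists, regression coefficients (markers x traits)
    edges : list of (m, l, r) with r the correlation between traits m and l
    lam   : weight of the lasso (sparsity) penalty      -- assumed name
    gamma : weight of the network fusion penalty        -- assumed name
    """
    n, J, K = len(X), len(B), len(B[0])

    # Residual sum of squares over all individuals and traits.
    rss = 0.0
    for i in range(n):
        for k in range(K):
            pred = sum(X[i][j] * B[j][k] for j in range(J))
            rss += (Y[i][k] - pred) ** 2

    # Lasso penalty: encourages most marker-trait effects to be zero.
    lasso = sum(abs(B[j][k]) for j in range(J) for k in range(K))

    # Fusion penalty: for each edge, pull a marker's effects on the two
    # connected traits together (or to opposite signs if r is negative).
    fusion = 0.0
    for m, l, r in edges:
        sign = 1.0 if r >= 0 else -1.0
        fusion += abs(r) * sum(abs(B[j][m] - sign * B[j][l]) for j in range(J))

    return rss + lam * lasso + gamma * fusion


if __name__ == "__main__":
    X = [[1, 0], [0, 1]]             # two individuals, two markers
    Y = [[1, 1], [0, 0]]             # two positively correlated traits
    network = [(0, 1, 0.9)]          # one edge in the trait network
    shared = [[1, 1], [0, 0]]        # marker 0 affects both traits
    single = [[1, 0], [0, 0]]        # marker 0 affects trait 0 only
    print(gflasso_objective(X, Y, shared, network, lam=0.1, gamma=1.0))
    print(gflasso_objective(X, Y, single, network, lam=0.1, gamma=1.0))
```

In this toy case the coefficient matrix in which one marker affects both correlated traits scores lower than the one in which it affects a single trait, which is the behavior the fusion term is meant to encourage.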
Genome-wide association studies (GWAS) have provided valuable insights into the genetic basis of complex traits. However, they have explained relatively little trait heritability. Recently, we proposed a new analytical approach called regional heritability mapping (RHM) that captures more of the missing genetic variation. This method is applicable to both related and unrelated populations. Here, we demonstrate the power of RHM in comparison with single-SNP GWAS and gene-based association approaches under a wide range of scenarios with variable numbers of quantitative trait loci (QTL) with common and rare causal variants in a narrow genomic region. Simulations based on real genotype data were performed to assess power to capture QTL variance, and we demonstrate that RHM has greater power to detect rare variants and/or multiple alleles in a region than other approaches. In addition, we show that RHM can capture the QTL variance more accurately when it is caused by multiple independent effects and/or rare variants. We applied RHM to three biometrical eye traits for which single-SNP GWAS have been published or performed, to evaluate the effectiveness of the method on real data, and detected some additional loci that were not detected by other GWAS methods. RHM has the potential to explain some of the missing heritability by capturing variance caused by QTL with low minor allele frequency (MAF) and by multiple independent QTL in a region, which is not captured by other GWAS methods. RHM analyses can be implemented using the software REACTA (http://www.epcc.ed.ac.uk/projects-portfolio/reacta).
common and rare variants; GWAS; regional heritability mapping; multiple independent effects; missing heritability
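The abstract above does not spell out how regional variance is estimated. As a hedged illustration only, one ingredient commonly used in region-based variance-component analyses is a genomic relationship matrix built from the SNPs inside a window; a mixed model can then test the variance attributed to that matrix. The sketch below shows just that ingredient, under assumed conventions (0/1/2 allele counts, per-SNP standardization). It does not use the REACTA software or its interface, and the function names `regional_grm` and `windows` are hypothetical.

```python
def regional_grm(genotypes, start, stop):
    """Relationship matrix from the SNPs in the window [start, stop).

    genotypes : list of individuals, each a list of 0/1/2 allele counts
    Returns an n x n list of lists, averaged over the polymorphic SNPs
    in the window (assumed standardisation: centre by 2p, scale by 2p(1-p)).
    """
    n = len(genotypes)
    grm = [[0.0] * n for _ in range(n)]
    used = 0
    for s in range(start, stop):
        p = sum(ind[s] for ind in genotypes) / (2.0 * n)  # allele frequency
        var = 2.0 * p * (1.0 - p)
        if var == 0.0:
            continue  # a monomorphic SNP carries no information
        z = [ind[s] - 2.0 * p for ind in genotypes]       # centred genotypes
        for a in range(n):
            for b in range(n):
                grm[a][b] += z[a] * z[b] / var
        used += 1
    if used == 0:
        raise ValueError("no polymorphic SNPs in window")
    return [[value / used for value in row] for row in grm]


def windows(n_snps, size, step):
    """Half-open SNP index windows; the last window may be shorter."""
    return [(s, min(s + size, n_snps)) for s in range(0, n_snps, step)]


if __name__ == "__main__":
    geno = [[0, 2, 1, 0], [2, 0, 1, 2], [1, 1, 0, 1]]
    for start, stop in windows(n_snps=4, size=2, step=2):
        print((start, stop), regional_grm(geno, start, stop))
```

Because each window pools all of its SNPs into one matrix, several weak or rare effects in the same region contribute to a single variance component rather than being tested one SNP at a time.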