Complex trait genome-wide association studies (GWAS) provide an efficient strategy for evaluating large numbers of common variants in large numbers of individuals and for identifying trait-associated variants. Nevertheless, GWAS often leave much of the trait heritability unexplained. We hypothesized that some of this unexplained heritability might be due to common and rare variants that reside in GWAS identified loci but lack appropriate proxies in modern genotyping arrays. To assess this hypothesis, we re-examined 7 genes (APOE, APOC1, APOC2, SORT1, LDLR, APOB, and PCSK9) in 5 loci associated with low-density lipoprotein cholesterol (LDL-C) in multiple GWAS. For each gene, we first catalogued genetic variation by re-sequencing 256 Sardinian individuals with extreme LDL-C values. Next, we genotyped variants identified by us and by the 1000 Genomes Project (totaling 3,277 SNPs) in 5,524 volunteers. We found that in one locus (PCSK9) the GWAS signal could be explained by a previously described low-frequency variant and that in three loci (PCSK9, APOE, and LDLR) there were additional variants independently associated with LDL-C, including a novel and rare LDLR variant that seems specific to Sardinians. Overall, this more detailed assessment of SNP variation in these loci increased estimates of the heritability of LDL-C accounted for by these genes from 3.1% to 6.5%. All association signals and the heritability estimates were successfully confirmed in a sample of ∼10,000 Finnish and Norwegian individuals. Our results thus suggest that focusing on variants accessible via GWAS can lead to clear underestimates of the trait heritability explained by a set of loci. Further, our results suggest that, as prelude to large-scale sequencing efforts, targeted re-sequencing efforts paired with large-scale genotyping will increase estimates of complex trait heritability explained by known loci.
Despite the striking success of genome-wide association studies in identifying genetic loci associated with common complex traits and diseases, much of the heritable risk for these traits and diseases remains unexplained. A higher resolution investigation of the genome through sequencing studies is expected to clarify the sources of this missing heritability. As a preview of what we might learn in these more detailed assessments of genetic variation, we used sequencing to identify potentially interesting variants in seven genes associated with low-density lipoprotein cholesterol (LDL-C) in 256 Sardinian individuals with extreme LDL-C levels, followed by large scale genotyping in 5,524 individuals, to examine newly discovered and previously described variants. We found that a combination of common and rare variants in these loci contributes to variation in LDL-C levels, and also that the initial estimate of the heritability explained by these loci doubled. Importantly, our results include a Sardinian-specific rare variant, highlighting the need for sequencing studies in isolated populations. Our results provide insights about what extensive whole-genome sequencing efforts are likely to reveal for the understanding of the genetic architecture of complex traits.
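As a rough illustration of how "heritability explained by a set of loci" can be tallied, the sketch below uses the standard additive-model approximation in which a biallelic variant with allele frequency p and per-allele effect β contributes 2p(1−p)β² to the trait variance. The function names, the assumption of independent variants, and the example frequencies and effects are illustrative only and are not taken from the study.

```python
def variance_explained(allele_freq, beta, trait_variance=1.0):
    """Fraction of trait variance explained by one additive biallelic variant.

    Assumes Hardy-Weinberg proportions and a per-allele effect `beta`
    expressed in the same units as the trait.
    """
    return 2.0 * allele_freq * (1.0 - allele_freq) * beta ** 2 / trait_variance


def total_variance_explained(variants, trait_variance=1.0):
    """Sum contributions over variants, assuming they are mutually independent.

    `variants` is an iterable of (allele_freq, beta) pairs (hypothetical input).
    """
    return sum(variance_explained(p, b, trait_variance) for p, b in variants)


if __name__ == "__main__":
    # Hypothetical values: one common variant with a small effect,
    # one rare variant with a larger effect.
    example = [(0.5, 0.2), (0.01, 0.5)]
    print(total_variance_explained(example))
```

Under this approximation a rare variant needs a much larger effect than a common one to explain the same share of variance, which is one reason adding well-typed low-frequency variants can shift the estimate for a locus.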
The approach to molecular genetic studies of complex phenotypes has evolved considerably in recent years. The candidate gene approach, restricted to analysis of a few single nucleotide polymorphisms (SNPs) in a modest number of cases and controls, has been supplanted by the unbiased approach of Genome-Wide Association Studies (GWAS), wherein a large number of tagger SNPs are typed in a large number of individuals. GWAS, which are designed upon the common disease-common variant (CD-CV) hypothesis, have identified a large number of SNPs and loci for complex phenotypes. However, alleles identified through GWAS are typically not causative but rather in linkage disequilibrium (LD) with the true causal variants. The common alleles, which may not capture the uncommon and rare variants, account for only a fraction of the heritability of complex traits. Hence, the focus is shifting to the rare variant-common disease (RV-CD) hypothesis, which surmises that rare variants exert large effects on the phenotype. In conjunction with this conceptual shift, technological advances in DNA sequencing have dramatically enhanced whole genome and whole exome sequencing capacity. The sequencing approach affords identification of not only the rare but also the common variants. The approach, whether used to complement GWAS or as a stand-alone approach, could define the genetic architecture of complex phenotypes. Robust phenotyping and large-scale sequencing studies are essential to extract the information content of the vast number of DNA sequence variants (DSVs) in the genome. To garner meaningful clinical information and link the genotype to a phenotype, identification and characterization of a very large number of causal fields beyond the information content of DNA sequence variants would be necessary. This review provides an update on the current progress and limitations in identifying DSVs that are associated with phenotypic effects.
An increasing proportion of the worldwide population is affected by allergic diseases such as allergic rhinitis (AR), atopic dermatitis (AD) and allergic asthma and improved treatment options are needed particularly for severe, refractory disease. Allergic diseases are complex and development involves both environmental and genetic factors. Although the existence of a genetic component for allergy was first described almost 100 years ago, progress in gene identification has been hindered by lack of high throughput technologies to investigate genetic variation in large numbers of subjects. The development of Genome-Wide Association Studies (GWAS), a hypothesis-free method of interrogating large numbers of common variants spanning the entire genome in disease and non-disease subjects has revolutionised our understanding of the genetics of allergic disease. Susceptibility genes for asthma, AR and AD have now been identified with confidence, suggesting there are common and distinct genetic loci associated with these diseases, providing novel insights into potential disease pathways and mechanisms. Genes involved in both adaptive and innate immune mechanisms have been identified, notably including multiple genes involved in epithelial function/secretion, suggesting that the airway epithelium may be particularly important in asthma. Interestingly, concordance/discordance between the genetic factors driving allergic traits such as IgE levels and disease states such as asthma have further supported the accumulating evidence for heterogeneity in these diseases. While GWAS have been useful and continue to identify novel genes for allergic diseases through increased sample sizes and phenotype refinement, future approaches will integrate analyses of rare variants, epigenetic mechanisms and eQTL approaches, leading to greater insight into the genetic basis of these diseases. 
Gene identification will improve our understanding of disease mechanisms and generate potential therapeutic opportunities.
The recent success of genome-wide association studies (GWAS) is now followed by the challenge to determine how the reported susceptibility variants mediate complex traits and diseases. Expression quantitative trait loci (eQTLs) have been implicated in disease associations through overlaps between eQTLs and GWAS signals. However, the abundance of eQTLs and the strong correlation structure (LD) in the genome make it likely that some of these overlaps are coincidental and not driven by the same functional variants. In the present study, we propose an empirical methodology, which we call Regulatory Trait Concordance (RTC) that accounts for local LD structure and integrates eQTLs and GWAS results in order to reveal the subset of association signals that are due to cis eQTLs. We simulate genomic regions of various LD patterns with both a single or two causal variants and show that our score outperforms SNP correlation metrics, be they statistical (r2) or historical (D'). Following the observation of a significant abundance of regulatory signals among currently published GWAS loci, we apply our method with the goal to prioritize relevant genes for each of the respective complex traits. We detect several potential disease-causing regulatory effects, with a strong enrichment for immunity-related conditions, consistent with the nature of the cell line tested (LCLs). Furthermore, we present an extension of the method in trans, where interrogating the whole genome for downstream effects of the disease variant can be informative regarding its unknown primary biological effect. We conclude that integrating cellular phenotype associations with organismal complex traits will facilitate the biological interpretation of the genetic effects on these traits.
Genome-wide association studies have led to the identification of susceptibility loci for a variety of human complex traits. What is still largely missing, however, is the understanding of the biological context in which these candidate variants act and of how they determine each trait. Given the localization of many GWAS loci outside coding regions and the important role of regulatory variation in shaping phenotypic variance, gene expression has been proposed as a plausible informative intermediate phenotype. Here we show that for a subset of the currently published GWAS this is indeed the case, by observing a significant excess of regulatory variants among disease loci. We propose an empirical methodology (regulatory trait concordance—RTC) able to integrate expression and disease data in order to detect causal regulatory effects. We show that the RTC outperforms simple correlation metrics under various simulated linkage disequilibrium (LD) scenarios. Our method is able to recover previously suspected causal regulatory effects from the literature and, as expected given the nature of the tested tissue, an overrepresentation of immunity-related candidates is observed. As the number of available tissues will increase, this prioritization approach will become even more useful in understanding the implication of regulatory variants in disease etiology.
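For readers unfamiliar with the two SNP correlation metrics that RTC is compared against, the sketch below computes D' and r² for a pair of biallelic loci from haplotype and allele frequencies using the textbook definitions. It does not implement the RTC score itself, whose details are not given here; the function name and the example frequencies are illustrative assumptions.

```python
def ld_metrics(p_ab, p_a, p_b):
    """Return (D_prime, r_squared) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and B at locus 2
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d_prime, r_squared


if __name__ == "__main__":
    # Hypothetical case: a rarer allele B found only on A-carrying haplotypes.
    # D' is at its maximum while r^2 stays low because the frequencies differ.
    print(ld_metrics(p_ab=0.1, p_a=0.5, p_b=0.1))
```

The example shows why the two metrics can disagree: D' reflects whether recombination has broken up the haplotypes, whereas r² also depends on how well the allele frequencies match.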
An ultimate goal of genetic research is to understand the connection between genotype and phenotype in order to improve the diagnosis and treatment of diseases. The quantitative genetics field has developed a suite of statistical methods to associate genetic loci with diseases and phenotypes, including quantitative trait loci (QTL) linkage mapping and genome-wide association studies (GWAS). However, each of these approaches has technical and biological shortcomings. For example, the amount of heritable variation explained by GWAS is often surprisingly small and the resolution of many QTL linkage mapping studies is poor. The predictive power and interpretation of QTL and GWAS results are consequently limited. In this study, we propose a complementary approach to quantitative genetics by interrogating the vast amount of high-throughput genomic data in model organisms to functionally associate genes with phenotypes and diseases. Our algorithm combines the genome-wide functional relationship network for the laboratory mouse and a state-of-the-art machine learning method. We demonstrate the superior accuracy of this algorithm through predicting genes associated with each of 1157 diverse phenotype ontology terms. Comparison between our prediction results and a meta-analysis of quantitative genetic studies reveals both overlapping candidates and distinct, accurate predictions uniquely identified by our approach. Focusing on bone mineral density (BMD), a phenotype related to osteoporotic fracture, we experimentally validated two of our novel predictions (not observed in any previous GWAS/QTL studies) and found significant bone density defects for both Timp2 and Abcg8 deficient mice. Our results suggest that the integration of functional genomics data into networks, which itself is informative of protein function and interactions, can successfully be utilized as a complementary approach to quantitative genetics to predict disease risks.
All supplementary material is available at http://cbfg.jax.org/phenotype.
Many recent efforts to understand the genetic origins of complex diseases utilize statistical approaches to analyze phenotypic traits measured in genetically well-characterized populations. While these quantitative genetics methods are powerful, their success is limited by sampling biases and other confounding factors, and the biological interpretation of results can be challenging since these methods are not based on any functional information for candidate loci. On the other hand, the functional genomics field has greatly expanded in past years, both in terms of experimental approaches and analytical algorithms. However, functional approaches have been applied to understanding phenotypes in only the most basic ways. In this study, we demonstrate that functional genomics can complement traditional quantitative genetics by analytically extracting protein function information from large collections of high throughput data, which can then be used to predict genotype-phenotype associations. We applied our prediction methodology to the laboratory mouse, and we experimentally confirmed a role in osteoporosis for two of our predictions that were not candidates from any previous quantitative genetics study. The ability of our approach to produce accurate and unique predictions implies that functional genomics can complement quantitative genetics and can help address previous limitations in identifying disease genes.
Genome-wide association studies (GWAS) for type 1 diabetes (T1D) have successfully identified more than 40 independent T1D associated tagging single nucleotide polymorphisms (SNPs). However, owing to technical limitations of copy number variant (CNV) genotyping assays, the assessment of the role of CNVs has been limited to the subset of these in high linkage disequilibrium with tag SNPs. The contribution of untagged CNVs, often multi-allelic and difficult to genotype using existing assays, to the heritability of T1D remains an open question. To investigate this issue, we built a custom comparative genetic hybridization array (aCGH) specifically designed to assay untagged CNV loci identified from a variety of sources. To overcome the technical limitations of the case control design for this class of CNVs, we genotyped the Type 1 Diabetes Genetics Consortium (T1DGC) family resource (representing 3,903 transmissions from parents to affected offspring) and used an association testing strategy that does not necessitate obtaining discrete genotypes. Our design targeted 4,309 CNVs, of which 3,410 passed stringent quality control filters. As a positive control, the scan confirmed the known T1D association at the INS locus by direct typing of the 5′ variable number of tandem repeat (VNTR) locus. Our results clarify that the disease association is indistinguishable from the two main polymorphic allele classes of the INS VNTR, class I and class III. We also identified novel technical artifacts resulting in spurious associations at the somatically rearranging loci, T cell receptor, TCRA/TCRD and TCRB, and Immunoglobulin heavy chain, IGH, loci on chromosomes 14q11.2, 7q34 and 14q32.33, respectively. However, our data did not identify novel T1D loci. Our results do not support a major role of untagged CNVs in T1D heritability.
For many complex traits, and in particular type 1 diabetes (T1D), the genome-wide association study (GWAS) design has been successful at detecting a large number of loci that contribute to disease risk. However, in the case of T1D as well as almost all other traits, the sum of these loci does not fully explain the heritability estimated from familial studies. This observation raises the possibility that additional variants exist but have not yet been found because they have not effectively been targeted by the GWAS design. Here, we focus on a specific class of large deletions/duplications called copy number variants (CNVs), and more precisely on the subset of these loci that mutate rapidly and are highly polymorphic. A consequence of this high level of polymorphism is that these variants have typically not been captured by previous GWAS studies. We use a family based design that is optimized to capture these previously untested variants. We then perform a genome-wide scan to assess their contribution to T1D. Our scan was technically successful but did not identify novel associations. This suggests that little was missed by the GWAS strategy, and that the remaining heritability of T1D is most likely driven by a large number of variants, either rare or common, but with a small individual contribution to disease risk.
Genome-wide association studies (GWAS) have identified more than 2,000 trait-SNP associations, and the number continues to increase. GWAS have focused on traits with potential consequences for human fitness, including many immunological, metabolic, cardiovascular, and behavioral phenotypes. Given the polygenic nature of complex traits, selection may exert its influence on them by altering allele frequencies at many associated loci, a possibility which has yet to be explored empirically. Here we use 38 different measures of allele frequency variation and 8 iHS scores to characterize over 1,300 GWAS SNPs in 53 globally distributed human populations. We apply these same techniques to evaluate SNPs grouped by trait association. We find that groups of SNPs associated with pigmentation, blood pressure, infectious disease, and autoimmune disease traits exhibit unusual allele frequency patterns and elevated iHS scores in certain geographical locations. We also find that GWAS SNPs have generally elevated scores for measures of allele frequency variation and for iHS in Eurasia and East Asia. Overall, we believe that our results provide evidence for selection on several complex traits that has caused changes in allele frequencies and/or elevated iHS scores at a number of associated loci. Since GWAS SNPs collectively exhibit elevated allele frequency measures and iHS scores, selection on complex traits may be quite widespread. Our findings are most consistent with this selection being either positive or negative, although the relative contributions of the two are difficult to discern. Our results also suggest that trait-SNP associations identified in Eurasian samples may not be present in Africa, Oceania, and the Americas, possibly due to differences in linkage disequilibrium patterns. This observation suggests that non-Eurasian and non-East Asian sample populations should be included in future GWAS.
Natural selection exerts its influence by changing allele frequencies at genomic polymorphisms. Alleles associated with harmful traits decrease in frequency while those associated with beneficial traits become more common. In a simple case, selection acts on a trait controlled by a single polymorphism; a large change in allele frequency at this polymorphism can eliminate a deleterious phenotype from a population or fix a beneficial one. However, many phenotypes, including diseases like Type 2 Diabetes, Crohn's disease, and prostate cancer, and physiological traits like height, weight, and hair color, are controlled by multiple genomic loci. Selection may act on such traits by influencing allele frequencies at a single associated polymorphism or by altering allele frequencies at many associated polymorphisms. To search for cases of the latter, we assembled groups of genomic polymorphisms sharing a common trait association and examined their allele frequencies across 53 globally distributed populations looking for commonalities in allelic behavior across geographical space. We find that variants associated with blood pressure tend to correlate with latitude, while those associated with HIV/AIDS progression correlate well with longitude. We also find evidence that selection may be acting worldwide to increase the frequencies of alleles that elevate autoimmune disease risk.
Obesity is a classical complex trait, influenced by both genetic and lifestyle factors. The number of obesity gene variants is currently unknown but, based on sound evolutionary principles, likely to be many, each with a modest effect on the phenotype. Recent advances in our knowledge of variation in the human genome and high throughput genotyping technologies have made possible genome-wide association (GWA) analysis and the identification of bona fide susceptibility genes for many complex diseases and phenotypes, including obesity and its comorbid conditions. GWA analysis in even larger numbers of individuals through collaborative efforts of many investigators will likely identify those polygenes of moderate and modest effect size that manifest in our typical environment. Once the subset of real-world-relevant obesity susceptibility variants is identified, follow-up studies, including detailed molecular analysis of the loci, stratified analyses, prospective and interventional studies in humans, and mechanistic studies in cells and animals will allow us to define the genetic architecture of the locus and dissect how these genes interact with specific environmental and other factors. The molecular and analytical tools to accomplish these goals are now in hand, but cooperation among investigators will be necessary to amass the requisite numbers of phenotyped and genotyped individuals. Identification of susceptibility genes for obesity and determining how they interact with each other and the environment will lead to new insights into the molecular, cellular, and physiological basis of energy homeostasis, and novel strategies for prevention and treatment.
Migraine with and without aura (MA and MO, respectively) have a strong genetic basis. Different approaches using linkage-, candidate gene- and genome-wide association studies have been explored, yielding limited results. This may indicate that the genetic component in migraine is due to rare variants; capturing these will require more detailed sequencing in order to be discovered. Next-generation sequencing (NGS) techniques such as whole exome and whole genome sequencing have been successful in finding genes in especially monogenic disorders. As the molecular genetics research progresses, the technology will follow, rendering these approaches more applicable in the search for causative migraine genes in MO and MA. To date, no studies using NGS in migraine genetics have been published. In order to gain insight into the future possibilities of migraine genetics, we have looked at NGS studies in other diseases and have interviewed three experts in the field of genetics and complex traits. The experts’ ideas suggest that the preferred NGS approach depends on the expected effect size and the frequency of the variants of interest. Family-specific variants can be found by sequencing a small number of individuals, while a large number of unrelated cases are needed to find common and rare variants. NGS is currently hampered by high cost and technical problems concurrent with analyzing large amounts of data generated, especially by whole genome sequencing. As genome-wide association chips, exome sequencing and whole genome sequencing gradually become more affordable, these approaches will be used on a larger scale. This may reveal new risk variants in migraine which may offer previously unsuspected biological insights.
Genetics; Migraine; Migraine with aura; Next-generation sequencing; GWAS; Exome sequencing
Age-related macular degeneration (AMD) is a common condition among the elderly population that leads to progressive central vision loss and serious compromise of quality of life for its sufferers. It is also one of the few disorders for which the investigation of its genetics has yielded rich insights into its diversity and causality and holds the promise of enabling clinicians to provide better risk assessments for individuals as well as to develop and selectively deploy new therapeutics to either prevent or slow the development of disease and lessen the threat of vision loss. The genetics of AMD began initially with the appreciation of familial aggregation and increased risk and expanded with the initial association of APOE variants with the disease. The first major breakthroughs came with family-based linkage studies of affected (and discordant) sibs, which identified a number of genetic loci and led to the targeted search of the 1q31 and 10q26 loci for associated variants. Three of the initial four reports for the CFH variant, Y402H, were based on regional candidate searches, as were the two initial reports of the ARMS2/HTRA1 locus variants. Case-control association studies initially also played a role in discovering the major genetic variants for AMD, and the success of those early studies has been used to fuel enthusiasm for the methodology for a number of diseases. Until 2010, all of the subsequent genetic variants associated with AMD came from candidate gene testing based on the complement factor pathway. In 2010, several large-scale genome-wide association studies (GWAS) identified genes that had not been previously identified.
Much of this historical information is available in a number of recent reviews (Chen et al., 2010b; Deangelis et al., 2011; Fafowora and Gorin, 2012b; Francis and Klein, 2011; Kokotas et al., 2011). Large meta-analyses of AMD GWAS have added new loci and variants to this collection (Chen et al., 2010a; Kopplin et al., 2010; Yu et al., 2011). This paper will focus on the ongoing controversies that are confronting AMD genetics at this time, rather than attempting to summarize this field, which has exploded in the past 5 years.
molecular genetics; Age-related macular degeneration; Association studies; Family-based linkage; Risk factors; Genetics-based therapeutics
Genome-wide association studies (GWAS) have demonstrated the ability to identify the strongest causal common variants in complex human diseases. However, to date, the massive data generated from GWAS have not been maximally explored to identify true associations that fail to meet the stringent level of association required to achieve genome-wide significance. Genetics of gene expression (GGE) studies have shown promise towards identifying DNA variations associated with disease and providing a path to functionally characterize findings from GWAS. Here, we present the first empiric study to systematically characterize the set of single nucleotide polymorphisms associated with expression (eSNPs) in liver, subcutaneous fat, and omental fat tissues, demonstrating these eSNPs are significantly more enriched for SNPs that associate with type 2 diabetes (T2D) in three large-scale GWAS than a matched set of randomly selected SNPs. This enrichment for T2D association increases as we restrict to eSNPs that correspond to genes comprising gene networks constructed from adipose gene expression data isolated from a mouse population segregating a T2D phenotype. Finally, by restricting to eSNPs corresponding to genes comprising an adipose subnetwork strongly predicted as causal for T2D, we dramatically increased the enrichment for SNPs associated with T2D and were able to identify a functionally related set of diabetes susceptibility genes. We identified and validated malic enzyme 1 (Me1) as a key regulator of this T2D subnetwork in mouse and provided support for the association of this gene to T2D in humans. This integration of eSNPs and networks provides a novel approach to identify disease susceptibility networks rather than the single SNPs or genes traditionally identified through GWAS, thereby extracting additional value from the wealth of data currently being generated by GWAS.
Genome-wide association studies (GWAS) seek to identify loci in which changes in DNA are correlated with disease. However, GWAS do not necessarily lead directly to genes associated with disease, and they do not typically inform the broader context in which disease genes operate, thereby providing limited insights into the mechanisms driving disease. One critical task to providing further insights into GWAS is developing an understanding of the genetics of gene expression (GGE). We present the first empiric study demonstrating that SNPs in human cohorts that associate with gene expression in liver and adipose tissues are enriched for associating with Type 2 Diabetes (T2D) in humans. By filtering “eSNPs” based on causal gene networks defined in an experimental cross population segregating T2D traits, we demonstrate a dramatically increased enrichment of T2D SNPs that enhance our ability to assess T2D risk. We demonstrate the utility of this approach by identifying malic enzyme 1 (ME1) as a novel T2D susceptibility gene in humans and then functionally validating the causal connection between ME1 and T2D in a mouse knockout model for Me1. This approach provides a path to identifying disease susceptibility networks rather than single SNPs or genes traditionally identified through GWAS.
Type 2 diabetes is one of the most common complex diseases, and considerable efforts have been made to unravel its pathophysiological mechanisms. Recently, large‐scale genome‐wide association (GWA) studies have successfully identified genetic loci robustly associated with type 2 diabetes by searching for susceptibility variants across the entire genome in an unbiased, hypothesis‐free manner. The number of loci has climbed from just three in 2006 to approximately 70 today. For the common type 2 diabetes‐associated variants, three features have been noted. First, genetic impacts of individual variants are generally modest; mostly, allelic odds ratios range between 1.06 and 1.20. Second, most of the loci identified to date are not in or near obvious candidate genes, and some are located in intergenic regions. Third, although the number of loci is limited, there might be some population specificity in type 2 diabetes association. Although we can estimate a single or a few target genes for individual loci detected in GWA studies by referring to data from in vitro experiments, biological function remains largely unknown for a substantial part of such target genes. Nevertheless, new biology is arising from GWA study discoveries; for example, genes implicated in β‐cell dysfunction are over‐represented within type 2 diabetes‐associated regions. Toward translational advances, we have just begun to face new challenges: elucidation of multifaceted (i.e., molecular, cellular and physiological) mechanistic insights into disease biology by considering interaction with the environment. The present review summarizes recent advances in the genetics of type 2 diabetes, together with its realistic potential.
Genetics; Plasma glucose; Type 2 diabetes
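To make the "allelic odds ratio" figures above concrete, the following sketch computes an allelic odds ratio and an approximate 95% confidence interval (Woolf's log-based method) from a 2×2 table of allele counts. The function name and the counts in the example are hypothetical and chosen only so that the result falls in the modest range described above.

```python
import math


def allelic_odds_ratio(case_risk, case_other, control_risk, control_other):
    """Allelic odds ratio with an approximate 95% CI (Woolf's method).

    Arguments are allele counts (not genotype counts); all must be positive.
    """
    odds_ratio = (case_risk * control_other) / (case_other * control_risk)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(
        1 / case_risk + 1 / case_other + 1 / control_risk + 1 / control_other
    )
    lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
    upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Hypothetical counts: 1,000 cases and 1,000 controls (2,000 alleles each).
    print(allelic_odds_ratio(660, 1340, 600, 1400))
```

With effects this small, the interval from a study of about a thousand cases can easily include 1, which illustrates why such loci typically require very large samples to be detected.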
Recent work has shown that much of the missing heritability of complex traits can be resolved by estimates of heritability explained by all genotyped SNPs. However, it is currently unknown how much heritability is missing due to poor tagging or additional causal variants at known GWAS loci. Here, we use variance components to quantify the heritability explained by all SNPs at known GWAS loci in nine diseases from WTCCC1 and WTCCC2. After accounting for expectation, we observed all SNPs at known GWAS loci to explain more heritability than GWAS-associated SNPs on average (). For some diseases, this increase was individually significant: for Multiple Sclerosis (MS) () and for Crohn's Disease (CD) (); all analyses of autoimmune diseases excluded the well-studied MHC region. Additionally, we found that GWAS loci from other related traits also explained significant heritability. The union of all autoimmune disease loci explained more MS heritability than known MS SNPs () and more CD heritability than known CD SNPs (), with an analogous increase for all autoimmune diseases analyzed. We also observed significant increases in an analysis of Rheumatoid Arthritis (RA) samples typed on ImmunoChip, with more heritability from all SNPs at GWAS loci () and more heritability from all autoimmune disease loci () compared to known RA SNPs (including those identified in this cohort). Our methods adjust for LD between SNPs, which can bias standard estimates of heritability from SNPs even if all causal variants are typed. By comparing adjusted estimates, we hypothesize that the genome-wide distribution of causal variants is enriched for low-frequency alleles, but that causal variants at known GWAS loci are skewed towards common alleles. These findings have important ramifications for fine-mapping study design and our understanding of complex disease architecture.
Heritable diseases have an unknown underlying “genetic architecture” that defines the distribution of effect-sizes for disease-causing mutations. Understanding this genetic architecture is an important first step in designing disease-mapping studies, and many theories have been developed on the nature of this distribution. Here, we evaluate the hypothesis that additional heritable variation lies at previously known associated loci but is not fully explained by the single most associated marker. We develop methods based on variance-components analysis to quantify this type of “local” heritability, demonstrating that standard strategies can be falsely inflated or deflated due to correlation between neighboring markers and propose a robust adjustment. In analysis of nine common diseases we find a significant average increase of local heritability, consistent with multiple common causal variants at an average locus. Intriguingly, for autoimmune diseases we also observe significant local heritability in loci not associated with the specific disease but with other autoimmune diseases, implying a highly correlated underlying disease architecture. These findings have important implications to the design of future studies and our general understanding of common disease.
Genetic association studies, in particular the genome-wide association study (GWAS) design, have provided a wealth of novel insights into the aetiology of a wide range of human diseases and traits, in particular cardiovascular diseases and lipid biomarkers. The next challenge consists of understanding the molecular basis of these associations. The integration of multiple association datasets, including gene expression datasets, can contribute to this goal. We have developed a novel statistical methodology to assess whether two association signals are consistent with a shared causal variant. An application is the integration of disease scans with expression quantitative trait locus (eQTL) studies, but any pair of GWAS datasets can be integrated in this framework. We demonstrate the value of the approach by re-analysing a gene expression dataset in 966 liver samples with a published meta-analysis of lipid traits including >100,000 individuals of European ancestry. Combining all lipid biomarkers, our re-analysis supported 26 out of 38 reported colocalisation results with eQTLs and identified 14 new colocalisation results, hence highlighting the value of a formal statistical test. In three cases of reported eQTL-lipid pairs (SYPL2, IFT172, TBKBP1) for which our analysis suggests that the eQTL pattern is not consistent with the lipid association, we identify alternative colocalisation results with SORT1, GCKR, and KPNB1, indicating that these genes are more likely to be causal in these genomic intervals. A key feature of the method is the ability to derive the output statistics from single SNP summary statistics, hence making it possible to perform systematic meta-analysis type comparisons across multiple GWAS datasets (implemented online at http://coloc.cs.ucl.ac.uk/coloc/). 
Our methodology provides information about candidate causal genes in associated intervals and has direct implications for the understanding of complex diseases as well as the design of drugs to target disease pathways.
Genome-wide association studies (GWAS) have found a large number of genetic regions (“loci”) affecting clinical end-points and phenotypes, many outside coding intervals. One approach to understanding the biological basis of these associations has been to explore whether GWAS signals from intermediate cellular phenotypes, in particular gene expression, are located in the same loci (“colocalise”) and are potentially mediating the disease signals. However, it is not clear how to assess whether the same variants are responsible for the two GWAS signals or whether it is distinct causal variants close to each other. In this paper, we describe a statistical method that can use simply single variant summary statistics to test for colocalisation of GWAS signals. We describe one application of our method to a meta-analysis of blood lipids and liver expression, although any two datasets resulting from association studies can be used. Our method is able to detect the subset of GWAS signals explained by regulatory effects and identify candidate genes affected by the same GWAS variants. As summary GWAS data are increasingly available, applications of colocalisation methods to integrate the findings will be essential for functional follow-up, and will also be particularly useful to identify tissue specific signals in eQTL datasets.
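The idea of testing colocalisation from single-variant summary statistics can be illustrated with a minimal Python sketch. It assumes Wakefield-style approximate Bayes factors computed from each SNP's effect estimate and standard error, at most one causal variant per trait in the region, and at least two SNPs. The function names, the prior standard deviation (0.15) and the prior probabilities (p1, p2, p12) are illustrative assumptions and are not taken from the text above.

```python
import math

def log_abf(beta, se, prior_sd=0.15):
    """Log approximate Bayes factor for association from summary statistics."""
    z = beta / se
    r = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)  # shrinkage factor
    return 0.5 * (math.log(1.0 - r) + r * z * z)

def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def coloc_posteriors(l1, l2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities of five hypotheses for two traits in one region.

    H0: no association; H1: trait 1 only; H2: trait 2 only;
    H3: both traits, distinct causal SNPs; H4: both traits, shared causal SNP.
    l1 and l2 are per-SNP log Bayes factors for the same SNPs, in the same order.
    """
    shared = [a + b for a, b in zip(l1, l2)]
    # H3 sums over ordered pairs of different SNPs (quadratic, fine for a sketch).
    distinct = [a + b
                for i, a in enumerate(l1)
                for j, b in enumerate(l2) if i != j]
    log_h = [
        0.0,
        math.log(p1) + logsumexp(l1),
        math.log(p2) + logsumexp(l2),
        math.log(p1) + math.log(p2) + logsumexp(distinct),
        math.log(p12) + logsumexp(shared),
    ]
    norm = logsumexp(log_h)
    return [math.exp(h - norm) for h in log_h]

# Toy region of five SNPs with a common standard error.
se = 0.03
trait1 = [log_abf(b, se) for b in (0.01, 0.02, 0.30, 0.01, 0.00)]
same_peak = [log_abf(b, se) for b in (0.00, 0.01, 0.30, 0.02, 0.01)]
other_peak = [log_abf(b, se) for b in (0.30, 0.01, 0.00, 0.02, 0.01)]
pp_shared = coloc_posteriors(trait1, same_peak)
pp_distinct = coloc_posteriors(trait1, other_peak)
```

In this toy region, the hypothesis of a shared causal variant dominates when both traits peak at the same SNP, and the hypothesis of distinct causal variants dominates when the peaks fall on different SNPs.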
Following the widespread use of genome-wide association studies (GWAS), focus is turning towards identification of causal variants rather than simply genetic markers of diseases and traits. As a step towards a high-throughput method to identify genome-wide, non-coding, functional regulatory variants, we describe the technique of allele-specific FAIRE, utilising large-scale genotyping technology (FAIRE-gen) to determine allelic effects on chromatin accessibility and regulatory potential. FAIRE-gen was explored using lymphoblastoid cells and the 50,000 SNP Illumina CVD BeadChip. The technique identified an allele-specific regulatory polymorphism within NR1H3 (coding for LXR-α), rs7120118, coinciding with a previously GWAS-identified SNP for HDL-C levels. This finding was confirmed using FAIRE-gen with the 200,000 SNP Illumina Metabochip and verified with the established method of TaqMan allelic discrimination. Examination of this SNP in two prospective Caucasian cohorts comprising 15,000 individuals confirmed the association with HDL-C levels (combined beta = 0.016; p = 0.0006), and analysis of gene expression identified an allelic association with LXR-α expression in heart tissue. Using increasingly comprehensive genotyping chips and distinct tissues for examination, FAIRE-gen has the potential to aid the identification of many causal SNPs associated with disease from GWAS.
The identification of genetic variants associated with complex diseases has rapidly grown through lowering costs of genome sequencing and the use of large-scale genotyping chips based on this sequencing data. There have not been corresponding advances in the identification of causal genetic variants compared to variants simply associated with diseases or traits. Most of these causal variants are thought to be located not within regions coding for proteins, but within genomic regions that regulate the level of protein. We have combined the use of large-scale gene chips with functional analysis, to determine regions of the genome that confer a greater potential for controlling gene regulation dependent on the genotype of that individual. Combining this data with population data and gene expression data, we identify a potential causal variant that alters regulation of LXR-α, a key mediator in lipid metabolism, and show that this variant is associated with HDL-C levels. This methodology provides a model for future analyses to identify further causal variants for disease.
Standard statistical approaches for prioritization of variants for functional testing in fine-mapping studies either use marginal association statistics or estimate posterior probabilities for variants to be causal under simplifying assumptions. Here, we present a probabilistic framework that integrates association strength with functional genomic annotation data to improve accuracy in selecting plausible causal variants for functional validation. A key feature of our approach is that it empirically estimates the contribution of each functional annotation to the trait of interest directly from summary association statistics while allowing for multiple causal variants at any risk locus. We devise efficient algorithms that estimate the parameters of our model across all risk loci to further increase performance. Using simulations starting from the 1000 Genomes data, we find that our framework consistently outperforms the current state-of-the-art fine-mapping methods, reducing the number of variants that need to be selected to capture 90% of the causal variants from an average of 13.3 to 10.4 SNPs per locus (as compared to the next-best performing strategy). Furthermore, we introduce a cost-to-benefit optimization framework for determining the number of variants to be followed up in functional assays and assess its performance using real and simulation data. We validate our findings using a large scale meta-analysis of four blood lipids traits and find that the relative probability for causality is increased for variants in exons and transcription start sites and decreased in repressed genomic regions at the risk loci of these traits. Using these highly predictive, trait-specific functional annotations, we estimate causality probabilities across all traits and variants, reducing the size of the 90% confidence set from an average of 17.5 to 13.5 variants per locus in this data.
Genome-wide association studies (GWAS) have successfully identified numerous regions in the genome that harbor genetic variants that increase risk for various complex traits and diseases. However, it is generally the case that GWAS risk variants are not themselves causally affecting the trait, but rather, are correlated to the true causal variant through linkage disequilibrium (LD). Plausible causal variants are identified in fine-mapping studies through targeted sequencing followed by prioritization of variants for functional validation. In this work, we propose methods that leverage two sources of independent information, the association strength and genomic functional location, to prioritize causal variants. We demonstrate in simulations and empirical data that our approach reduces the number of SNPs that need to be selected for follow-up to identify the true causal variants at GWAS risk loci.
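A minimal Python sketch of the prioritization step is given below. It is a simplification: it assumes a single causal variant per locus (the framework described above allows several), takes per-variant log Bayes factors as given, and treats annotation information as fixed prior weights rather than estimating it from the data. The function names and the weights are hypothetical.

```python
import math

def posterior_causal(log_bfs, prior_weights=None):
    """Posterior probability that each variant is the single causal one.

    prior_weights are relative prior odds per variant, for example larger
    for variants in an annotation believed to be enriched for causal variants.
    """
    if prior_weights is None:
        prior_weights = [1.0] * len(log_bfs)
    m = max(log_bfs)  # subtract the maximum for numerical stability
    unnormalised = [w * math.exp(l - m) for l, w in zip(log_bfs, prior_weights)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

def credible_set(posteriors, level=0.9):
    """Smallest set of variant indices whose posteriors sum to at least level."""
    order = sorted(range(len(posteriors)), key=lambda i: posteriors[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += posteriors[i]
        if cumulative >= level:
            break
    return chosen

# Four variants with identical association evidence.
flat = posterior_causal([0.0, 0.0, 0.0, 0.0])
# Same evidence, but the first variant carries ten times the prior weight.
weighted = posterior_causal([0.0, 0.0, 0.0, 0.0], prior_weights=[10.0, 1.0, 1.0, 1.0])
set_flat = credible_set(flat)
set_weighted = credible_set(weighted)
```

With identical association evidence, all four variants are needed to reach 90% coverage, while up-weighting one variant shrinks the set to three, which is the kind of reduction in follow-up size the text describes.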
Genome-wide association mapping is highly sensitive to environmental changes, but network analysis allows rapid causal gene identification.
Genome-wide association (GWA) is gaining popularity as a means to study the architecture of complex quantitative traits, partially due to the improvement of high-throughput low-cost genotyping and phenotyping technologies. Glucosinolate (GSL) secondary metabolites within Arabidopsis spp. can serve as a model system to understand the genomic architecture of adaptive quantitative traits. GSL are key anti-herbivory defenses that impart adaptive advantages within field trials. While little is known about how variation in the external or internal environment of an organism may influence the efficiency of GWA, GSL variation is known to be highly dependent upon the external stresses and developmental processes of the plant, making it an excellent model for studying conditional GWA.
To understand how development and environment can influence GWA, we conducted a study using 96 Arabidopsis thaliana accessions, >40 GSL phenotypes across three conditions (one developmental comparison and one environmental comparison) and ∼230,000 SNPs. Developmental stage had dramatic effects on the outcome of GWA, with each stage identifying different loci associated with GSL traits. Further, while the molecular bases of numerous quantitative trait loci (QTL) controlling GSL traits have been identified, there is currently no estimate of how many additional genes may control natural variation in these traits. We developed a novel co-expression network approach to prioritize the thousands of GWA candidates and successfully validated a large number of these genes as influencing GSL accumulation within A. thaliana using single gene isogenic lines.
Together, these results suggest that complex traits imparting environmentally contingent adaptive advantages are likely influenced by up to thousands of loci that are sensitive to fluctuations in the environment or developmental state of the organism. Additionally, while GWA is highly conditional upon genetics, the use of additional genomic information can rapidly identify causal loci en masse.
Understanding how genetic variation can control phenotypic variation is a fundamental goal of modern biology. A major push has been made using genome-wide association mapping in all organisms in an attempt to rapidly identify the genes contributing to phenotypes such as disease and nutritional disorders. But a number of fundamental questions have not been answered about the use of genome-wide association: for example, how does the internal or external environment influence the genes found? Furthermore, the simple question of how many genes may influence a trait is unknown. Finally, a number of studies have identified significant false-positive and -negative issues within genome-wide association studies that are not solvable by direct statistical approaches. We have used genome-wide association mapping in the plant Arabidopsis thaliana to begin exploring these questions. We show that both external and internal environments significantly alter the identified genes, such that using different tissues can lead to the identification of nearly completely different gene sets. Given the large number of potential false-positives, we developed an orthogonal approach to filtering the possible genes, by identifying co-functioning networks using the nominal candidate gene list derived from genome-wide association studies. This allowed us to rapidly identify and validate a large number of novel and unexpected genes that affect Arabidopsis thaliana defense metabolism within phenotypic ranges that have been shown to be selectable within the field. These genes and the associated networks suggest that Arabidopsis thaliana defense metabolism is more consistent with the infinite gene hypothesis, according to which there is a vast number of causative genes controlling natural variation in this phenotype. It remains to be seen how frequently this is true for other organisms and other phenotypes.
Coronary heart disease (CHD) is the leading cause of mortality in African Americans. To identify common genetic polymorphisms associated with CHD and its risk factors (LDL- and HDL-cholesterol (LDL-C and HDL-C), hypertension, smoking, and type-2 diabetes) in individuals of African ancestry, we performed a genome-wide association study (GWAS) in 8,090 African Americans from five population-based cohorts. We replicated 17 loci previously associated with CHD or its risk factors in Caucasians. For five of these regions (CHD: CDKN2A/CDKN2B; HDL-C: FADS1-3, PLTP, LPL, and ABCA1), we could leverage the distinct linkage disequilibrium (LD) patterns in African Americans to identify DNA polymorphisms more strongly associated with the phenotypes than the previously reported index SNPs found in Caucasian populations. We also developed a new approach for association testing in admixed populations that uses allelic and local ancestry variation. Using this method, we discovered several loci that would have been missed using the basic allelic and global ancestry information only. Our conclusions suggest that no major loci uniquely explain the high prevalence of CHD in African Americans. Our project has developed resources and methods that address both admixture- and SNP-association to maximize power for genetic discovery in even larger African-American consortia.
To date, most large-scale genome-wide association studies (GWAS) carried out to identify risk factors for complex human diseases and traits have focused on populations of European ancestry. It is currently unknown whether the same loci associated with complex diseases and traits in Caucasians will replicate in populations of African ancestry. Here, we conducted a large GWAS to identify common DNA polymorphisms associated with coronary heart disease (CHD) and its risk factors (type-2 diabetes, hypertension, smoking status, and LDL- and HDL-cholesterol) in 8,090 African Americans as part of the NHLBI Candidate gene Association Resource (CARe) Project. We replicated 17 associations previously reported in Caucasians, suggesting that the same loci carry common DNA sequence variants associated with CHD and its risk factors in Caucasians and African Americans. At five of these 17 loci, we used the different patterns of linkage disequilibrium between populations of European and African ancestry to identify DNA sequence variants more strongly associated with phenotypes than the index SNPs found in Caucasians, suggesting smaller genomic intervals to search for causal alleles. We also used the CARe data to develop new statistical methods to perform association studies in admixed populations. The CARe Project data represent an extraordinary resource to expand our understanding of the genetics of complex diseases and traits in non-European-derived populations.
Rapid advances in sequencing technologies set the stage for the large-scale medical sequencing efforts to be performed in the near future, with the goal of assessing the importance of rare variants in complex diseases. The discovery of new disease susceptibility genes requires powerful statistical methods for rare variant analysis. The low frequency and the expected large number of such variants pose great difficulties for the analysis of these data. We propose here a robust and powerful testing strategy to study the role rare variants may play in affecting susceptibility to complex traits. The strategy is based on assessing whether rare variants in a genetic region collectively occur at significantly higher frequencies in cases compared with controls (or vice versa). A main feature of the proposed methodology is that, although it is an overall test assessing a possibly large number of rare variants simultaneously, the disease variants can be both protective and risk variants, with moderate decreases in statistical power when both types of variants are present. Using simulations, we show that this approach can be powerful under complex and general disease models, as well as in larger genetic regions where the proportion of disease susceptibility variants may be small. Comparisons with previously published tests on simulated data show that the proposed approach can have better power than the existing methods. An application to a recently published study on Type-1 Diabetes finds rare variants in gene IFIH1 to be protective against Type-1 Diabetes.
Risk of common diseases, such as diabetes and heart disease, is influenced by a complex interaction among genetic and environmental factors. Most of the disease-association studies conducted so far have focused on common variants, widely available on genotyping platforms. However, recent advances in sequencing technologies pave the way for large-scale medical sequencing studies with the goal of elucidating the role rare variants may play in affecting susceptibility to complex traits. The large number of rare variants and their low frequencies pose great challenges for the analysis of these data. We present here a novel testing strategy, based on a weighted-sum statistic, that is less sensitive than existing methods to the presence of both risk and protective variants in the genetic region under investigation. We show applications to simulated data and to a real dataset on Type-1 Diabetes.
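The general idea of testing whether rare variants in a region collectively occur more often in cases than in controls, or vice versa, can be sketched in Python as follows. This is not the weighted-sum statistic of the study: it is a deliberately simplified, unweighted statistic that sums the per-variant excess of rare alleles in each direction and takes the larger total, with significance assessed by permuting case/control labels. It assumes equal numbers of cases and controls so that raw counts are comparable. All names and the toy data are hypothetical.

```python
import random

def burden_statistic(genotypes, labels):
    """Larger of the total rare-allele excess in cases and in controls.

    genotypes: one list per individual of rare-allele counts per variant.
    labels: 1 for a case, 0 for a control, in the same order.
    """
    n_variants = len(genotypes[0])
    excess_cases = 0
    excess_controls = 0
    for j in range(n_variants):
        in_cases = sum(g[j] for g, y in zip(genotypes, labels) if y == 1)
        in_controls = sum(g[j] for g, y in zip(genotypes, labels) if y == 0)
        if in_cases > in_controls:
            excess_cases += in_cases - in_controls
        else:
            excess_controls += in_controls - in_cases
    return max(excess_cases, excess_controls)

def permutation_pvalue(genotypes, labels, n_perm=999, seed=0):
    """Permutation p-value obtained by shuffling case/control labels."""
    rng = random.Random(seed)
    observed = burden_statistic(genotypes, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if burden_statistic(genotypes, shuffled) >= observed:
            hits += 1
    return (1 + hits) / (1 + n_perm)

# Toy data: 20 cases then 20 controls; each of 5 rare variants is carried
# by 3 different cases and by no control.
labels = [1] * 20 + [0] * 20
genotypes = [[1 if (i < 15 and i // 3 == j) else 0 for j in range(5)]
             for i in range(40)]
observed = burden_statistic(genotypes, labels)
p_value = permutation_pvalue(genotypes, labels)
```

Tracking the two directions separately means that a region containing only protective variants is detected as readily as one containing only risk variants, which is the property the text emphasizes.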
Genome-wide association studies (GWAS) have successfully identified loci associated with quantitative traits, such as blood lipids. Deep resequencing studies are being utilized to catalogue the allelic spectrum at GWAS loci. The goal of these studies is to identify causative variants and missing heritability, including heritability due to low frequency and rare alleles with large phenotypic impact. Whereas rare variant efforts have primarily focused on nonsynonymous coding variants, we hypothesized that noncoding variants in these loci are also functionally important. Using the HDL-C gene LIPG as an example, we explored the effect of regulatory variants identified through resequencing of subjects at HDL-C extremes on gene expression, protein levels, and phenotype. Resequencing a portion of the LIPG promoter and 5′ UTR in human subjects with extreme HDL-C, we identified several rare variants in individuals from both extremes. Luciferase reporter assays were used to measure the effect of these rare variants on LIPG expression. Variants conferring opposing effects on gene expression were enriched in opposite extremes of the phenotypic distribution. Minor alleles of a common regulatory haplotype and noncoding GWAS SNPs were associated with reduced plasma levels of the LIPG gene product endothelial lipase (EL), consistent with its role in HDL-C catabolism. Additionally, we found that a common nonfunctional coding variant associated with HDL-C (rs2000813) is in linkage disequilibrium with a 5′ UTR variant (rs34474737) that decreases LIPG promoter activity. We attribute the gene regulatory role of rs34474737 to the observed association of the coding variant with plasma EL levels and HDL-C. Taken together, the findings show that both rare and common noncoding regulatory variants are important contributors to the allelic spectrum in complex trait loci.
Genetic association studies have identified genomic regions that affect quantifiable traits such as lipid levels. When a gene and a trait are found to be associated with one another, the gene is often further studied to determine its role in affecting the trait. One approach is to sequence the gene in individuals at the extremes of the trait's distribution with the hope of finding rare mutations that directly contribute to the trait. Until now studies using this approach have focused on genetic variation in the protein coding sequence of these genes and have been largely successful in identifying functionally important mutations. However, other studies have found an abundance of noncoding variation in the genome that may also contribute to the heritability of these traits. Here we seek to determine the contribution of such noncoding mutations to high density lipoprotein cholesterol (HDL-C) levels in humans using the HDL-C candidate gene LIPG as an example. Through a sequencing study in individuals with high and low HDL-C levels, we demonstrate that both rare and common noncoding mutations are influential contributors to the allelic spectrum of such traits and should be further characterized after initial association with the trait.
Mitochondrial dysfunction has been observed in skeletal muscle of people with diabetes and insulin-resistant individuals. Furthermore, inherited mutations in mitochondrial DNA can cause a rare form of diabetes. However, it is unclear whether mitochondrial dysfunction is a primary cause of the common form of diabetes. To date, common genetic variants robustly associated with type 2 diabetes (T2D) are not known to affect mitochondrial function. One possibility is that multiple mitochondrial genes contain modest genetic effects that collectively influence T2D risk. To test this hypothesis we developed a method named Meta-Analysis Gene-set Enrichment of variaNT Associations (MAGENTA; http://www.broadinstitute.org/mpg/magenta). MAGENTA, in analogy to Gene Set Enrichment Analysis, tests whether sets of functionally related genes are enriched for associations with a polygenic disease or trait. MAGENTA was specifically designed to exploit the statistical power of large genome-wide association (GWA) study meta-analyses whose individual genotypes are not available. This is achieved by combining variant association p-values into gene scores and then correcting for confounders, such as gene size, variant number, and linkage disequilibrium properties. Using simulations, we determined the range of parameters for which MAGENTA can detect associations likely missed by single-marker analysis. We verified MAGENTA's performance on empirical data by identifying known relevant pathways in lipid and lipoprotein GWA meta-analyses. We then tested our mitochondrial hypothesis by applying MAGENTA to three gene sets: nuclear regulators of mitochondrial genes, oxidative phosphorylation genes, and ∼1,000 nuclear-encoded mitochondrial genes. The analysis was performed using the most recent T2D GWA meta-analysis of 47,117 people and meta-analyses of seven diabetes-related glycemic traits (up to 46,186 non-diabetic individuals). 
This well-powered analysis found no significant enrichment of associations to T2D or any of the glycemic traits in any of the gene sets tested. These results suggest that common variants affecting nuclear-encoded mitochondrial genes have at most a small genetic contribution to T2D susceptibility.
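The two steps described above, combining variant p-values into gene scores and then testing a gene set for enrichment, can be sketched in Python under strong simplifications. Here each gene is scored by its smallest variant p-value, the correction for confounders such as gene size and linkage disequilibrium is omitted entirely, and enrichment is evaluated with an exact hypergeometric tail for the number of set members among the top-scoring genes. The function names, the cutoff, and the toy data are illustrative assumptions and do not reproduce the MAGENTA implementation.

```python
from math import comb

def gene_scores(variant_pvalues):
    """Score each gene by its most significant variant p-value (uncorrected)."""
    return {gene: min(pvalues) for gene, pvalues in variant_pvalues.items()}

def enrichment_pvalue(scores, gene_set, top_fraction=0.05):
    """Hypergeometric P(at least k set members among the top-scoring genes)."""
    ranked = sorted(scores, key=scores.get)  # smallest p-value first
    n_top = max(1, int(round(top_fraction * len(ranked))))
    top = set(ranked[:n_top])
    members = [g for g in gene_set if g in scores]
    k = sum(g in top for g in members)
    total, n = len(ranked), len(members)
    tail = sum(comb(n_top, i) * comb(total - n_top, n - i)
               for i in range(k, min(n_top, n) + 1))
    return tail / comb(total, n)

# Toy data: ten genes, two of which carry a strongly associated variant.
variants = {f"g{i}": [0.5, 0.8] for i in range(10)}
variants["g0"] = [1e-8, 0.3]
variants["g1"] = [0.2, 1e-6]
scores = gene_scores(variants)
p_enriched = enrichment_pvalue(scores, ["g0", "g1"], top_fraction=0.2)
p_null = enrichment_pvalue(scores, ["g8", "g9"], top_fraction=0.2)
```

Because only p-values per variant are needed, a sketch like this can run on meta-analysis summary results without individual genotypes, which is the setting the method was designed for.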
Mitochondria play a crucial role in metabolic homeostasis, and alteration of mitochondrial function is a hallmark of diabetes. While mitochondrial activity is reduced in people with diabetes, it is unclear whether mitochondrial dysfunction is a cause or effect of type 2 diabetes. Genome-wide association studies for type 2 diabetes have explained ≈10% of the heritability of the disease, but none of the loci are known to affect mitochondrial activity. It is possible though that a mitochondrial contribution is hidden in the remaining 90%. Hence, we tested the hypothesis that multiple mitochondria-related genes encoded in the nucleus, each having a weak effect (hard to detect individually), can collectively influence type 2 diabetes. To address this, we developed a computational method (MAGENTA) that allowed us to adequately analyze large collective datasets of human genetic variation obtained from collaborative studies of type 2 diabetes and related glycemic traits. Despite the increased sensitivity of MAGENTA compared to single-DNA variant analysis, we found no support for a causal relationship between mitochondrial dysfunction and type 2 diabetes. These results may help steer future efforts in understanding the pathogenesis of the disease. MAGENTA is broadly applicable to testing associations between other biological pathways and common diseases or traits.
For the past five years, genome-wide association studies (GWAS) have identified hundreds of common variants associated with human diseases and traits, including high-density lipoprotein cholesterol (HDL-C), low-density lipoprotein cholesterol (LDL-C), and triglyceride (TG) levels. Approximately 95 loci associated with lipid levels have been identified primarily among populations of European ancestry. The Population Architecture using Genomics and Epidemiology (PAGE) study was established in 2008 to characterize GWAS–identified variants in diverse population-based studies. We genotyped 49 GWAS–identified SNPs associated with one or more lipid traits in at least two PAGE studies and across six racial/ethnic groups. We performed a meta-analysis testing for SNP associations with fasting HDL-C, LDL-C, and ln(TG) levels in self-identified European American (∼20,000), African American (∼9,000), American Indian (∼6,000), Mexican American/Hispanic (∼2,500), Japanese/East Asian (∼690), and Pacific Islander/Native Hawaiian (∼175) adults, regardless of lipid-lowering medication use. We replicated 55 of 60 (92%) SNP associations tested in European Americans at p<0.05. Despite sufficient power, we were unable to replicate ABCA1 rs4149268 and rs1883025, CETP rs1864163, and TTC39B rs471364 previously associated with HDL-C and MAFB rs6102059 previously associated with LDL-C. Based on significance (p<0.05) and consistent direction of effect, a majority of replicated genotype-phenotype associations for HDL-C, LDL-C, and ln(TG) in European Americans generalized to African Americans (48%, 61%, and 57%), American Indians (45%, 64%, and 77%), and Mexican Americans/Hispanics (57%, 56%, and 86%). Overall, 16 associations generalized across all three populations. For the associations that did not generalize, differences in effect sizes, allele frequencies, and linkage disequilibrium offer clues to the next generation of association studies for these traits.
Low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol (HDL-C), and triglyceride (TG) levels are well known independent risk factors for cardiovascular disease. Lipid-associated genetic variants are being discovered in genome-wide association studies (GWAS) in samples of European descent, but an insufficient amount of data exist in other populations. Therefore, there is a strong need to characterize the effect of these GWAS–identified variants in more diverse cohorts. In this study, we selected over forty genetic loci previously associated with lipid levels and tested for replication in a large European American cohort. We also investigated if the effect of these variants generalizes to non-European descent populations, including African Americans, American Indians, and Mexican Americans/Hispanics. A majority of these GWAS–identified associations replicated in our European American cohort. However, the ability of associations to generalize across other racial/ethnic populations varied greatly, indicating that some of these GWAS–identified variants may not be functional and are more likely to be in linkage disequilibrium with the functional variant(s).
Identifying the genes that influence levels of pro-inflammatory molecules can help to elucidate the mechanisms underlying this process. We first conducted a two-stage genome-wide association scan (GWAS) for the key inflammatory biomarkers Interleukin-6 (IL-6), the general measure of inflammation erythrocyte sedimentation rate (ESR), monocyte chemotactic protein-1 (MCP-1), and high-sensitivity C-reactive protein (hsCRP) in a large cohort of individuals from the founder population of Sardinia. By analysing 731,213 autosomal or X chromosome SNPs and an additional ∼1.9 million imputed variants in 4,694 individuals, we identified several SNPs associated with the selected quantitative trait loci (QTLs) and replicated all the top signals in an independent sample of 1,392 individuals from the same population. Next, to increase power to detect and resolve associations, we further genotyped the whole cohort (6,145 individuals) for 293,875 variants included on the ImmunoChip and MetaboChip custom arrays. Overall, our combined approach led to the identification of 9 genome-wide significant novel independent signals (5 of which were identified only with the custom arrays) and provided confirmatory evidence for an additional 7. Novel signals include: for IL-6, in the ABO gene (rs657152, p = 2.13×10^−29); for ESR, at the HBB (rs4910472, p = 2.31×10^−11) and UNC119B/SPPL3 (rs11829037, p = 8.91×10^−10) loci; for MCP-1, near its receptor CCR2 (rs17141006, p = 7.53×10^−13) and in CADM3 (rs3026968, p = 7.63×10^−13); for hsCRP, within the CRP gene (rs3093077, p = 5.73×10^−21), near DARC (rs3845624, p = 1.43×10^−10), UNC119B/SPPL3 (rs11829037, p = 1.50×10^−14), and ICOSLG/AIRE (rs113459440, p = 1.54×10^−08) loci. Confirmatory evidence was found for IL-6 in the IL-6R gene (rs4129267); for ESR at CR1 (rs12567990) and TMEM57 (rs10903129); for MCP-1 at DARC (rs12075); and for hsCRP at CRP (rs1205), HNF1A (rs225918), and APOC-I (rs4420638).
Our results improve the current knowledge of genetic variants underlying inflammation and provide novel clues for the understanding of the molecular mechanisms regulating this complex process.
Inflammation is a protective response of our organism to harmful stimuli—such as germs, damaged cells, or irritants—and to initiate the healing process. It has also been implicated, with both protective and predisposing effects, in a number of different diseases; but many important details of this complex phenomenon are still unknown. Identifying the genes that influence levels of pro-inflammatory molecules can help to elucidate the factors and mechanisms underlying inflammation and their consequence on health. Genome-wide association scans (GWAS) have proved successful in revealing robust associations in both common diseases and quantitative traits. Here, we thus performed a multistage GWAS in a large cohort of individuals from Sardinia to examine the role of common genetic variants on the key inflammatory biomarkers Interleukin-6, erythrocyte sedimentation rate, monocyte chemotactic protein-1, and high-sensitivity C-reactive protein. Our work identified new genetic determinants associated with the quantitative levels of these inflammatory biomarkers and confirmed known ones. Overall, the data highlight an intricate regulation of this complex biological phenomenon and reveal proteins and mechanisms that can now be followed up with adequate functional studies.
Importance
In the past decade, significant progress in genomic medicine and technological advances have revolutionized our approach to common complex disorders in many areas of medicine, including ophthalmology. One major disorder for which substantial genetic progress is still needed is diabetic retinopathy (DR), one of the leading causes of blindness in adults.
Objective
To perform a literature review, present the current findings, and highlight some key challenges.
Evidence Review
Thorough literature review of the genetic factors for DR, including heritability scores, twin studies, family studies, candidate gene studies, linkage studies, and genome-wide association studies (GWAS).
Findings
While there is clear demonstration of a genetic contribution to the development and progression of DR, the identification of susceptibility loci through candidate gene approaches, linkage studies, and GWAS is still in its infancy. The greatest obstacles remain a lack of power due to the small sample sizes of available studies and a lack of phenotype standardization. In this review, we also discuss novel technologies and novel approaches, such as intermediate phenotypes for biomarkers, proteomics, metabolomics, exome chips, and next-generation sequencing, that may facilitate future studies of DR.
Conclusions and Relevance
The genetics of DR is a field still in its infancy, and progress is challenging because of the complexity of the disease itself. This review outlines some strategies and lessons for future investigation to improve our understanding of this most complex of genetic disorders.
Next-generation sequencing has dramatically increased our ability to localize disease-causing variants by providing base-pair level information at costs increasingly feasible for the large sample sizes required to detect complex-trait associations. Yet, identification of causal variants within an established region of association remains a challenge. Counter-intuitively, certain factors that increase power to detect an associated region can decrease power to localize the causal variant. First, combining GWAS with imputation or low-coverage sequencing to achieve the large sample sizes required for high power can have the unintended effect of producing differential genotyping error among SNPs. This tends to bias the relative evidence for association toward better genotyped SNPs. Second, re-use of GWAS data for fine-mapping exploits previous findings to ensure genome-wide significance in GWAS-associated regions. However, using GWAS findings to inform fine-mapping analysis can bias evidence away from the causal SNP toward the tag SNP and SNPs in high LD with the tag. Together these factors can reduce power to localize the causal SNP by more than half. Other strategies commonly employed to increase power to detect association, namely increasing sample size and using higher-density genotyping arrays, can, in certain common scenarios, actually exacerbate these effects and further decrease power to localize causal variants. We develop a re-ranking procedure that accounts for these adverse effects and substantially improves the accuracy of causal SNP identification, often doubling the probability that the causal SNP is top-ranked. Application to the NCI BPC3 aggressive prostate cancer GWAS with imputation meta-analysis identified a new top SNP at 2 of 3 associated loci and several additional possible causal SNPs at these loci that may have otherwise been overlooked. This method is simple to implement using R scripts provided on the author's website.
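The first bias described above, in which differential genotyping error shifts evidence toward better genotyped SNPs, can be illustrated with a small simulation. This is only a sketch under assumed parameters and is not the re-ranking procedure itself, whose details the abstract does not give. The allele frequency, effect size, LD model (a tag allele that copies the causal allele except for occasional flips), and error model (a fraction of causal genotype calls replaced by random draws) are all hypothetical choices made for illustration.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def assoc_stat(genotypes, phenotype):
    """Approximate 1-df association statistic, n * r^2."""
    r = pearson_r(genotypes, phenotype)
    return len(genotypes) * r * r


def draw_allele(rng, maf):
    """Draw one allele (1 = minor allele) at frequency maf."""
    return 1 if rng.random() < maf else 0


def simulate(n=30000, maf=0.3, ld_flip=0.05, impute_err=0.3, beta=0.25, seed=1):
    """Compare evidence at a causal SNP (true and noisy calls) and a tag SNP.

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    causal, tag, imputed, pheno = [], [], [], []
    for _ in range(n):
        g_causal = g_tag = 0
        for _ in range(2):  # two haplotypes per individual
            a = draw_allele(rng, maf)
            # The tag allele copies the causal allele unless flipped (imperfect LD).
            t = a if rng.random() >= ld_flip else 1 - a
            g_causal += a
            g_tag += t
        # ASSUMED error model: a fraction of causal calls carry no information.
        if rng.random() < impute_err:
            g_imputed = draw_allele(rng, maf) + draw_allele(rng, maf)
        else:
            g_imputed = g_causal
        causal.append(g_causal)
        tag.append(g_tag)
        imputed.append(g_imputed)
        # The phenotype depends on the true causal genotype only.
        pheno.append(beta * g_causal + rng.gauss(0.0, 1.0))
    return {
        "causal_true": assoc_stat(causal, pheno),
        "tag": assoc_stat(tag, pheno),
        "causal_imputed": assoc_stat(imputed, pheno),
    }


stats = simulate()
for name, value in stats.items():
    print(f"{name:15s} {value:8.1f}")
```

With perfectly called genotypes the causal SNP carries the strongest signal in this setup, but once its calls are degraded the accurately genotyped tag SNP tends to rank above it, which is the pattern the text describes.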
As next-generation sequencing (NGS) costs continue to fall and genome-wide association study (GWAS) platform coverage improves, the human genetics community is positioned to identify potentially causal variants. However, current NGS or imputation-based studies of either the whole genome or regions previously identified by GWAS have not yet been very successful in identifying causal variants. A major hurdle is the development of methods to distinguish disease-causing variants from their highly-correlated proxies within an associated region. We show that various common factors, such as differential sequencing or imputation accuracy rates and linkage disequilibrium patterns, with or without GWAS-informed region selection, can substantially decrease the probability of identifying the correct causal SNP, often by more than half. We then describe a novel and easy-to-implement re-ranking procedure that can double the probability that the causal SNP is top-ranked in many settings. Application to the NCI Breast and Prostate Cancer (BPC3) Cohort Consortium aggressive prostate cancer data identified new top SNPs within two associated loci previously established via GWAS, as well as several additional possible causal SNPs that had been previously overlooked.