Standard statistical approaches for prioritization of variants for functional testing in fine-mapping studies either use marginal association statistics or estimate posterior probabilities for variants to be causal under simplifying assumptions. Here, we present a probabilistic framework that integrates association strength with functional genomic annotation data to improve accuracy in selecting plausible causal variants for functional validation. A key feature of our approach is that it empirically estimates the contribution of each functional annotation to the trait of interest directly from summary association statistics while allowing for multiple causal variants at any risk locus. We devise efficient algorithms that estimate the parameters of our model across all risk loci to further increase performance. Using simulations starting from the 1000 Genomes data, we find that our framework consistently outperforms the current state-of-the-art fine-mapping methods, reducing the number of variants that need to be selected to capture 90% of the causal variants from an average of 13.3 to 10.4 SNPs per locus (as compared to the next-best performing strategy). Furthermore, we introduce a cost-to-benefit optimization framework for determining the number of variants to be followed up in functional assays and assess its performance using real and simulation data. We validate our findings using a large scale meta-analysis of four blood lipids traits and find that the relative probability for causality is increased for variants in exons and transcription start sites and decreased in repressed genomic regions at the risk loci of these traits. Using these highly predictive, trait-specific functional annotations, we estimate causality probabilities across all traits and variants, reducing the size of the 90% confidence set from an average of 17.5 to 13.5 variants per locus in this data.
Genome-wide association studies (GWAS) have successfully identified numerous regions in the genome that harbor genetic variants that increase risk for various complex traits and diseases. However, it is generally the case that GWAS risk variants are not themselves causally affecting the trait, but rather, are correlated to the true causal variant through linkage disequilibrium (LD). Plausible causal variants are identified in fine-mapping studies through targeted sequencing followed by prioritization of variants for functional validation. In this work, we propose methods that leverage two sources of independent information, the association strength and genomic functional location, to prioritize causal variants. We demonstrate in simulations and empirical data that our approach reduces the number of SNPs that need to be selected for follow-up to identify the true causal variants at GWAS risk loci.
Asthma originates from genetic and environmental factors with about half the risk of disease attributable to heritable causes. Genome-wide association studies, mostly in populations of European ancestry, have identified numerous asthma-associated single nucleotide polymorphisms (SNPs). Studies in populations with diverse ancestries allow both for identification of robust associations that replicate across ethnic groups and for improved resolution of associated loci due to different patterns of linkage disequilibrium between ethnic groups. Here we report on an analysis of 745 African-American subjects with asthma and 3,238 African-American control subjects from the Candidate Gene Association Resource (CARe) Consortium, including analysis of SNPs imputed using 1,000 Genomes reference panels and adjustment for local ancestry. We show strong evidence that variation near RAD50/IL13, implicated in studies of European ancestry individuals, replicates in individuals largely of African ancestry. Fine mapping in African ancestry populations also refined the variants of interest for this association. We also provide strong or nominal evidence of replication at loci near ORMDL3/GSDMB, IL1RLML18R1, and 10pl4, all previously associated with asthma in European or Japanese populations, but not at the PYHIN1 locus previously reported in studies of African-American samples. These results improve the understanding of asthma genetics and further demonstrate the utility of genetic studies in populations other than those of largely European ancestry.
Mixed linear models are emerging as a method of choice for conducting genetic association studies in humans and other organisms. The advantages of mixed linear model association (MLMA) include preventing false-positive associations due to population or relatedness structure, and increasing power by applying a correction that is specific to this structure. An underappreciated point is that MLMA can also increase power in studies without sample structure, by implicitly conditioning on associated loci other than the candidate locus. Numerous variations on the standard MLMA approach have recently been published, with a focus on reducing computational cost. These advances provide researchers applying MLMA methods with many options to choose from, but we caution that MLMA methods are still subject to potential pitfalls. Here, we describe and quantify the advantages and pitfalls of MLMA methods as a function of study design, and provide recommendations for the application of these methods in practical settings.
The success of genome-wide association studies has led to increasing interest in making predictions of complex trait phenotypes including disease from genotype data. Rigorous assessment of the value of predictors is critical before implementation. Here we discuss some of the limitations and pitfalls of prediction analysis and show how naïve implementations can lead to severe bias and misinterpretation of results.
Motivation: Inference of ancestry using genetic data is motivated by applications in genetic association studies, population genetics and personal genomics. Here, we provide methods and software for improved ancestry inference using genome-wide single nucleotide polymorphism (SNP) weights from external reference panels. This approach makes it possible to leverage the rich ancestry information that is available from large external reference panels, without the administrative and computational complexities of re-analyzing the raw genotype data from the reference panel in subsequent studies.
Results: We extensively validate our approach in multiple African American, Latino American and European American datasets, making use of genome-wide SNP weights derived from large reference panels, including HapMap 3 populations and 6546 European Americans from the Framingham Heart Study. We show empirically that our approach provides much greater accuracy than either the prevailing ancestry-informative marker (AIM) approach or the analysis of genome-wide target genotypes without a reference panel. For example, in an independent set of 1636 European American genome-wide association study samples, we attained prediction accuracy (R2) of 1.000 and 0.994 for the first two principal components using our method, compared with 0.418 and 0.407 using 150 published AIMs or 0.955 and 0.003 by applying principal component analysis directly to the target samples. We finally show that the higher accuracy in inferring ancestry using our method leads to more effective correction for population stratification in association studies.
Availability: The SNPweights software is available online at http://www.hsph.harvard.edu/faculty/alkes-price/software/.
firstname.lastname@example.org or email@example.com.
Supplementary data are available at Bioinformatics online.
Using a reduced subset of SNPs in a linear mixed model can improve power for genome-wide association studies, yet this can result in insufficient correction for population stratification. We propose a hybrid approach using principal components that does not inflate statistics in the presence of population stratification and improves power over standard linear mixed models.
mixed models; population stratification; GWAS
Recent work has shown that much of the missing heritability of complex traits can be resolved by estimates of heritability explained by all genotyped SNPs. However, it is currently unknown how much heritability is missing due to poor tagging or additional causal variants at known GWAS loci. Here, we use variance components to quantify the heritability explained by all SNPs at known GWAS loci in nine diseases from WTCCC1 and WTCCC2. After accounting for expectation, we observed all SNPs at known GWAS loci to explain more heritability than GWAS-associated SNPs on average (). For some diseases, this increase was individually significant: for Multiple Sclerosis (MS) () and for Crohn's Disease (CD) (); all analyses of autoimmune diseases excluded the well-studied MHC region. Additionally, we found that GWAS loci from other related traits also explained significant heritability. The union of all autoimmune disease loci explained more MS heritability than known MS SNPs () and more CD heritability than known CD SNPs (), with an analogous increase for all autoimmune diseases analyzed. We also observed significant increases in an analysis of Rheumatoid Arthritis (RA) samples typed on ImmunoChip, with more heritability from all SNPs at GWAS loci () and more heritability from all autoimmune disease loci () compared to known RA SNPs (including those identified in this cohort). Our methods adjust for LD between SNPs, which can bias standard estimates of heritability from SNPs even if all causal variants are typed. By comparing adjusted estimates, we hypothesize that the genome-wide distribution of causal variants is enriched for low-frequency alleles, but that causal variants at known GWAS loci are skewed towards common alleles. These findings have important ramifications for fine-mapping study design and our understanding of complex disease architecture.
Heritable diseases have an unknown underlying “genetic architecture” that defines the distribution of effect-sizes for disease-causing mutations. Understanding this genetic architecture is an important first step in designing disease-mapping studies, and many theories have been developed on the nature of this distribution. Here, we evaluate the hypothesis that additional heritable variation lies at previously known associated loci but is not fully explained by the single most associated marker. We develop methods based on variance-components analysis to quantify this type of “local” heritability, demonstrating that standard strategies can be falsely inflated or deflated due to correlation between neighboring markers and propose a robust adjustment. In analysis of nine common diseases we find a significant average increase of local heritability, consistent with multiple common causal variants at an average locus. Intriguingly, for autoimmune diseases we also observe significant local heritability in loci not associated with the specific disease but with other autoimmune diseases, implying a highly correlated underlying disease architecture. These findings have important implications to the design of future studies and our general understanding of common disease.
Tens of millions of base pairs of euchromatic human genome sequence, including many protein-coding genes, have no known location in the human genome. We describe an approach for localizing the human genome's missing pieces by utilizing the patterns of genome sequence variation created by population admixture. We mapped the locations of 70 scaffolds spanning four million base pairs of the human genome's unplaced euchromatic sequence, including more than a dozen protein-coding genes, and identified eight large novel inter-chromosomal segmental duplications. We find that most of these sequences are hidden in the genome's heterochromatin, particularly its pericentromeric regions. Many cryptic, pericentromeric genes are expressed in RNA and have been maintained intact for millions of years while their expression patterns diverged from those of paralogous genes elsewhere in the genome. We describe how knowledge of the locations of these sequences can inform disease association and genome biology studies.
Exome sequencing is emerging as a popular approach to study the effect of rare coding variants on complex phenotypes. The promise of exome sequencing is grounded in theoretical population genetics and in empirical successes of candidate gene sequencing studies. Many projects aimed at common diseases are underway, and their results are eagerly anticipated. In this Perspective, using exome sequencing data from 438 individuals, we discuss several aspects of exome sequencing studies that we view as particularly important. We review processing and quality control of raw sequence data, evaluate the statistical properties of exome sequencing studies, discuss rare variant burden tests to detect association to phenotypes, and demonstrate the importance of accounting for population stratification in the analysis of rare variants. We conclude that enthusiasm for exome sequencing studies of complex traits should be combined with the caution that thousands of samples may be required to reach sufficient statistical power.
Motivation: The question of how to best use information from known associated variants when conducting disease association studies has yet to be answered. Some studies compute a marginal P-value for each Several Nucleotide Polymorphisms independently, ignoring previously discovered variants. Other studies include known variants as covariates in logistic regression, but a weakness of this standard conditioning strategy is that it does not account for disease prevalence and non-random ascertainment, which can induce a correlation structure between candidate variants and known associated variants even if the variants lie on different chromosomes. Here, we propose a new conditioning approach, which is based in part on the classical technique of liability threshold modeling. Roughly, this method estimates model parameters for each known variant while accounting for the published disease prevalence from the epidemiological literature.
Results: We show via simulation and application to empirical datasets that our approach outperforms both the no conditioning strategy and the standard conditioning strategy, with a properly controlled false-positive rate. Furthermore, in multiple data sets involving diseases of low prevalence, standard conditioning produces a severe drop in test statistics whereas our approach generally performs as well or better than no conditioning. Our approach may substantially improve disease gene discovery for diseases with many known risk variants.
Availability: LTSOFT software is available online http://www.hsph.harvard.edu/faculty/alkes-price/software/
Supplementary information: Supplementary data are available at Bioinformatics online.
Important knowledge about the determinants of complex human phenotypes can be obtained from the estimation of heritability, the fraction of phenotypic variation in a population that is determined by genetic factors. Here, we make use of extensive phenotype data in Iceland, long-range phased genotypes, and a population-wide genealogical database to examine the heritability of 11 quantitative and 12 dichotomous phenotypes in a sample of 38,167 individuals. Most previous estimates of heritability are derived from family-based approaches such as twin studies, which may be biased upwards by epistatic interactions or shared environment. Our estimates of heritability, based on both closely and distantly related pairs of individuals, are significantly lower than those from previous studies. We examine phenotypic correlations across a range of relationships, from siblings to first cousins, and find that the excess phenotypic correlation in these related individuals is predominantly due to shared environment as opposed to dominance or epistasis. We also develop a new method to jointly estimate narrow-sense heritability and the heritability explained by genotyped SNPs. Unlike existing methods, this approach permits the use of information from both closely and distantly related pairs of individuals, thereby reducing the variance of estimates of heritability explained by genotyped SNPs while preventing upward bias. Our results show that common SNPs explain a larger proportion of the heritability than previously thought, with SNPs present on Illumina 300K genotyping arrays explaining more than half of the heritability for the 23 phenotypes examined in this study. Much of the remaining heritability is likely to be due to rare alleles that are not captured by standard genotyping arrays.
Phenotype is a function of a genome and its environment. Heritability is the fraction of variation in a phenotype determined by genetic factors in a population. Current methods to estimate heritability rely on the phenotypic correlations of closely related individuals and are potentially upwardly biased, due to the impact of epistasis and shared environment. We develop new methods to estimate heritability over both closely and distantly related individuals. By examining the phenotypic correlation among different types of related individuals such as siblings, half-siblings, and first cousins, we show that shared environment is the primary determinant of inflated estimates of heritability. For a large number of phenotypes, it is not known how much of the heritability is explained by SNPs included on current genotyping platforms. Existing methods to estimate this component of heritability are biased in the presence of related individuals. We develop a method that permits the inclusion of both closely and distantly related individuals when estimating heritability explained by genotyped SNPs and use it to make estimates for 23 medically relevant phenotypes. These estimates can be used to increase our understanding of the distribution and frequency of functionally relevant variants and thereby inform the design of future studies.
The peopling of the Americas has been the subject of extensive genetic, archaeological and linguistic research; however, central questions remain unresolved1–5. One contentious issue is whether the settlement occurred via a single6–8 or multiple streams of migration from Siberia9–15. The pattern of dispersals within the Americas is also poorly understood. To address these questions at higher resolution than was previously possible, we assembled data from 52 Native American and 17 Siberian groups genotyped at 364,470 single nucleotide polymorphisms. We show that Native Americans descend from at least three streams of Asian gene flow. Most descend entirely from a single ancestral population that we call “First American”. However, speakers of Eskimo-Aleut languages from the Arctic inherit almost half their ancestry from a second stream of Asian gene flow, and the Na-Dene-speaking Chipewyan from Canada inherit roughly one-tenth of their ancestry from a third stream. We show that the initial peopling followed a southward expansion facilitated by the coast, with sequential population splits and little gene flow after divergence, especially in South America. A major exception is in Chibchan-speakers on both sides of the Panama Isthmus, who have ancestry from both North and South America.
Genome wide association studies (GWAS) have proven a powerful method to identify common genetic variants contributing to susceptibility to common diseases. Here we show that extremely low-coverage sequencing (0.1–0.5x) captures almost as much of the common (>5%) and low-frequency (1–5%) variation across the genome as SNP arrays. As an empirical demonstration, we show that genome-wide SNP genotypes can be inferred at a mean r2 of 0.71 using off-target data (0.24x average coverage) in a whole-exome study of 909 samples. Using both simulated and real exome sequencing datasets we show that association statistics obtained using ultra low-coverage sequencing data attain similar P-values at known associated variants as genotyping arrays, without an excess of false positives. Within the context of reductions in sample preparation and sequencing costs, funds invested in ultra low-coverage sequencing can yield several times the effective sample size of SNP-array GWAS, and a commensurate increase in statistical power.
Genetic case-control association studies often include data on clinical covariates, such as body mass index (BMI), smoking status, or age, that may modify the underlying genetic risk of case or control samples. For example, in type 2 diabetes, odds ratios for established variants estimated from low–BMI cases are larger than those estimated from high–BMI cases. An unanswered question is how to use this information to maximize statistical power in case-control studies that ascertain individuals on the basis of phenotype (case-control ascertainment) or phenotype and clinical covariates (case-control-covariate ascertainment). While current approaches improve power in studies with random ascertainment, they often lose power under case-control ascertainment and fail to capture available power increases under case-control-covariate ascertainment. We show that an informed conditioning approach, based on the liability threshold model with parameters informed by external epidemiological information, fully accounts for disease prevalence and non-random ascertainment of phenotype as well as covariates and provides a substantial increase in power while maintaining a properly controlled false-positive rate. Our method outperforms standard case-control association tests with or without covariates, tests of gene x covariate interaction, and previously proposed tests for dealing with covariates in ascertained data, with especially large improvements in the case of case-control-covariate ascertainment. We investigate empirical case-control studies of type 2 diabetes, prostate cancer, lung cancer, breast cancer, rheumatoid arthritis, age-related macular degeneration, and end-stage kidney disease over a total of 89,726 samples. In these datasets, informed conditioning outperforms logistic regression for 115 of the 157 known associated variants investigated (P-value = 1×10−9). The improvement varied across diseases with a 16% median increase in χ2 test statistics and a commensurate increase in power. This suggests that applying our method to existing and future association studies of these diseases may identify novel disease loci.
This work describes a new methodology for analyzing genome-wide case-control association studies of diseases with strong correlations to clinical covariates, such as age in prostate cancer and body mass index in type 2 diabetes. Currently, researchers either ignore these clinical covariates or apply approaches that ignore the disease's prevalence and the study's ascertainment strategy. We take an alternative approach, leveraging external prevalence information from the epidemiological literature and constructing a statistic based on the classic liability threshold model of disease. Our approach not only improves the power of studies that ascertain individuals randomly or based on the disease phenotype, but also improves the power of studies that ascertain individuals based on both the disease phenotype and clinical covariates. We apply our statistic to seven datasets over six different diseases and a variety of clinical covariates. We found that there was a substantial improvement in test statistics relative to current approaches at known associated variants. This suggests that novel loci may be identified by applying our method to existing and future association studies of these diseases.
Recombination, together with mutation, is the ultimate source of genetic variation in populations. We leverage the recent mixture of people of African and European ancestry in the Americas to build a genetic map measuring the probability of crossing-over at each position in the genome, based on about 2.1 million crossovers in 30,000 unrelated African Americans. At intervals of more than three megabases it is nearly identical to a map built in Europeans. At finer scales it differs significantly, and we identify about 2,500 recombination hotspots that are active in people of West African ancestry but nearly inactive in Europeans. The probability of a crossover at these hotspots is almost fully controlled by the alleles an individual carries at PRDM9 (P<10−245). We identify a 17 base pair DNA sequence motif that is enriched in these hotspots, and is an excellent match to the predicted binding target of African-enriched alleles of PRDM9.
Admixed populations such as African Americans and Hispanic Americans are often medically underserved and bear a disproportionately high burden of disease. Owing to the diversity of their genomes, these populations have both advantages and disadvantages for genetic studies of complex phenotypes. Advances in statistical methodologies that can infer genetic contributions from ancestral populations may yield new insights into the aetiology of disease and may contribute to the applicability of genomic medicine to these admixed population groups.
Variation in genes contributing to the host immune response may mediate the relationship between inflammation and prostate carcinogenesis. RNASEL at chromosome 1q25 encodes ribonuclease L, part of the interferon-mediated immune response to viral infection. We therefore investigated the association between variation in RNASEL and prostate cancer risk and progression in a study of 1286 cases and 1264 controls nested within the prospective Physicians’ Health Study. Eleven single-nucleotide polymorphisms (SNPs) were selected using the web-based ‘Tagger’ in the HapMap CEPH panel (Utah residents of Northern and Western European Ancestry). Unconditional logistic regression models assessed the relationship between each SNP and incident advanced stage (T3/T4, T0-T4/M1 and lethal disease) and high Gleason grade (≥7) prostate cancer. Further analyses were stratified by calendar year of diagnosis. Cox proportional hazards models examined the relationship between genotype and prostate cancer-specific survival. We also explored associations between genotype and serum inflammatory biomarkers interleukin-6 (IL-6), C-reactive protein (CRP) and tumor necrosis factor-alpha receptor 2 using linear regression. Individuals homozygous for the variant allele of rs12757998 had an increased risk of prostate cancer [AA versus GG; odds ratio (OR): 1.63, 95% confidence interval (CI): 1.18–2.25), and more specifically, high-grade tumors (OR: 1.90, 95% CI: 1.25–2.89). The same genotype was associated with increased CRP (P = 0.02) and IL-6 (P = 0.05) levels. Missense mutations R462Q and D541E were associated with an increased risk of advanced stage disease only in the pre-prostate-specific antigen era. There were no significant associations with survival. The results of this study support a link between RNASEL and prostate cancer and suggest that the association may be mediated through inflammation. These novel findings warrant replication in future studies.
Previous genetic studies have suggested a history of sub-Saharan African gene flow into some West Eurasian populations after the initial dispersal out of Africa that occurred at least 45,000 years ago. However, there has been no accurate characterization of the proportion of mixture, or of its date. We analyze genome-wide polymorphism data from about 40 West Eurasian groups to show that almost all Southern Europeans have inherited 1%–3% African ancestry with an average mixture date of around 55 generations ago, consistent with North African gene flow at the end of the Roman Empire and subsequent Arab migrations. Levantine groups harbor 4%–15% African ancestry with an average mixture date of about 32 generations ago, consistent with close political, economic, and cultural links with Egypt in the late middle ages. We also detect 3%–5% sub-Saharan African ancestry in all eight of the diverse Jewish populations that we analyzed. For the Jewish admixture, we obtain an average estimated date of about 72 generations. This may reflect descent of these groups from a common ancestral population that already had some African ancestry prior to the Jewish Diasporas.
Southern Europeans and Middle Eastern populations are known to have inherited a small percentage of their genetic material from recent sub-Saharan African migrations, but there has been no estimate of the exact proportion of this gene flow, or of its date. Here, we apply genomic methods to show that the proportion of African ancestry in many Southern European groups is 1%–3%, in Middle Eastern groups is 4%–15%, and in Jewish groups is 3%–5%. To estimate the dates when the mixture occurred, we develop a novel method that estimates the size of chromosomal segments of distinct ancestry in individuals of mixed ancestry. We verify using computer simulations that the method produces useful estimates of population mixture dates up to 300 generations in the past. By applying the method to West Eurasians, we show that the dates in Southern Europeans are consistent with events during the Roman Empire and subsequent Arab migrations. The dates in the Jewish groups are older, consistent with events in classical or biblical times that may have occurred in the shared history of Jewish populations.
While genome-wide association studies (GWAS) have primarily examined populations of European ancestry, more recent studies often involve additional populations, including admixed populations such as African Americans and Latinos. In admixed populations, linkage disequilibrium (LD) exists both at a fine scale in ancestral populations and at a coarse scale (admixture-LD) due to chromosomal segments of distinct ancestry. Disease association statistics in admixed populations have previously considered SNP association (LD mapping) or admixture association (mapping by admixture-LD), but not both. Here, we introduce a new statistical framework for combining SNP and admixture association in case-control studies, as well as methods for local ancestry-aware imputation. We illustrate the gain in statistical power achieved by these methods by analyzing data of 6,209 unrelated African Americans from the CARe project genotyped on the Affymetrix 6.0 chip, in conjunction with both simulated and real phenotypes, as well as by analyzing the FGFR2 locus using breast cancer GWAS data from 5,761 African-American women. We show that, at typed SNPs, our method yields an 8% increase in statistical power for finding disease risk loci compared to the power achieved by standard methods in case-control studies. At imputed SNPs, we observe an 11% increase in statistical power for mapping disease loci when our local ancestry-aware imputation framework and the new scoring statistic are jointly employed. Finally, we show that our method increases statistical power in regions harboring the causal SNP in the case when the causal SNP is untyped and cannot be imputed. Our methods and our publicly available software are broadly applicable to GWAS in admixed populations.
This paper presents improved methodologies for the analysis of genome-wide association studies in admixed populations, which are populations that came about by the mixing of two or more distant continental populations over a few hundred years (e.g., African Americans or Latinos). Studies of admixed populations offer the promise of capturing additional genetic diversity compared to studies over homogeneous populations such as Europeans. In admixed populations, correlation between genetic variants exists both at a fine scale in the ancestral populations and at a coarse scale due to chromosomal segments of distinct ancestry. Disease association statistics in admixed populations have previously considered either one or the other type of correlation, but not both. In this work we develop novel statistical methods that account for both types of genetic correlation, and we show that the combined approach attains greater statistical power than that achieved by applying either approach separately. We provide analysis of simulated and real data from major studies performed in African-American men and women to show the improvement obtained by our methods over the standard methods for analyzing association studies in admixed populations.
Family studies of individual tissues have shown that gene expression traits are genetically heritable. Here, we investigate cis and trans components of heritability both within and across tissues by applying variance-components methods to 722 Icelanders from family cohorts, using identity-by-descent (IBD) estimates from long-range phased genome-wide SNP data and gene expression measurements for ∼19,000 genes in blood and adipose tissue. We estimate the proportion of gene expression heritability attributable to cis regulation as 37% in blood and 24% in adipose tissue. Our results indicate that the correlation in gene expression measurements across these tissues is primarily due to heritability at cis loci, whereas there is little sharing of trans regulation across tissues. One implication of this finding is that heritability in tissues composed of heterogeneous cell types is expected to be more dominated by cis regulation than in tissues composed of more homogeneous cell types, consistent with our blood versus adipose results as well as results of previous studies in lymphoblastoid cell lines. Finally, we obtained similar estimates of the cis components of heritability using IBD between unrelated individuals, indicating that transgenerational epigenetic inheritance does not contribute substantially to the “missing heritability” of gene expression in these tissue types.
An important goal in biology is to understand how genotype affects gene expression. Because gene expression varies across tissues, the relationship between genotype and gene expression may be tissue-specific. In this study, we used heritability approaches to study the regulation of gene expression in two tissue types, blood and adipose tissue, as well as the regulation of gene expression that is shared across these tissues. Heritability can be partitioned into cis and trans effects by assessing identity-by-descent (IBD) at the genomic location close to the expressed gene or genome-wide, respectively, and applying variance-components methods to partition the heritability of each gene. We estimated the proportion of gene expression heritability explained by cis regulation as 37% in blood and 24% in adipose tissue. Notably, the heritability shared across tissue types was primarily due to cis regulation. Thus, the relative contribution of cis versus trans regulation is expected to increase with the number of cell types present in the tissue being assayed, just as observed in our study and in a comparison to previous work on lymphoblastoid cell lines (LCL). We specifically ruled out a substantial contribution of transgenerational epigenetic inheritance to heritability of gene expression in these cohorts by repeating our heritability analyses using segments shared IBD in distantly related Icelanders.
India has been underrepresented in genome-wide surveys of human variation. We analyze 25 diverse groups to provide strong evidence for two ancient populations, genetically divergent, that are ancestral to most Indians today. One, the “Ancestral North Indians” (ANI), is genetically close to Middle Easterners, Central Asians, and Europeans, while the other, the “Ancestral South Indians” (ASI), is as distinct from ANI and East Asians as they are from each other. By introducing methods that can estimate ancestry without accurate ancestral populations, we show that ANI ancestry ranges from 39-71% in India, and is higher in traditionally upper caste and Indo-European speakers. Groups with only ASI ancestry may no longer exist in mainland India. However, the Andamanese are an ASI-related group without ANI ancestry, showing that the peopling of the islands must have occurred before ANI-ASI gene flow on the mainland. Allele frequency differences between groups in India are larger than in Europe, reflecting strong founder effects whose signatures have been maintained for thousands of years due to endogamy. We therefore predict that there will be an excess of recessive diseases in India, different in each group, which should be possible to screen and map genetically.
Heart failure is a leading cause of mortality in South Asians. However, its genetic etiology remains largely unknown1. Cardiomyopathies due to sarcomeric mutations are a major monogenic cause for heart failure (MIM600958). Here, we describe a deletion of 25 bp in the gene encoding cardiac myosin binding protein C (MYBPC3) that is associated with heritable cardiomyopathies and an increased risk of heart failure in Indian populations (initial study OR = 5.3 (95% CI = 2.3–13), P = 2 × 10−6; replication study OR = 8.59 (3.19–25.05), P = 3 × 10−8; combined OR = 6.99 (3.68–13.57), P = 4 × 10−11) and that disrupts cardiomyocyte structure in vitro. Its prevalence was found to be high (~4%) in populations of Indian subcontinental ancestry. The finding of a common risk factor implicated in South Asian subjects with cardiomyopathy will help in identifying and counseling individuals predisposed to cardiac diseases in this region.
Identifying the ancestry of chromosomal segments of distinct ancestry has a wide range of applications from disease mapping to learning about history. Most methods require the use of unlinked markers; but, using all markers from genome-wide scanning arrays, it should in principle be possible to infer the ancestry of even very small segments with exquisite accuracy. We describe a method, HAPMIX, which employs an explicit population genetic model to perform such local ancestry inference based on fine-scale variation data. We show that HAPMIX outperforms other methods, and we explore its utility for inferring ancestry, learning about ancestral populations, and inferring dates of admixture. We validate the method empirically by applying it to populations that have experienced recent and ancient admixture: 935 African Americans from the United States and 29 Mozabites from North Africa. HAPMIX will be of particular utility for mapping disease genes in recently admixed populations, as its accurate estimates of local ancestry permit admixture and case-control association signals to be combined, enabling more powerful tests of association than with either signal alone.
The genomes of individuals from admixed populations consist of chromosomal segments of distinct ancestry. For example, the genomes of African American individuals contain segments of both African and European ancestry, so that a specific location in the genome may inherit 0, 1, or 2 copies of European ancestry. Inferring an individual's local ancestry, their number of copies of each ancestry at each location in the genome, has important applications in disease mapping and in understanding human history. Here we describe HAPMIX, a method that analyzes data from dense genotyping chips to infer local ancestry with very high precision. An important feature of HAPMIX is that it makes use of data from haplotypes (blocks of nearby markers), which are more informative for ancestry than individual markers. Our simulations demonstrate the utility of HAPMIX for local ancestry inference, and empirical applications to African American and Mozabite data sets uncover important aspects of the history of these populations.