Admixed populations such as African Americans and Hispanic Americans are often medically underserved and bear a disproportionately high burden of disease. Owing to the diversity of their genomes, these populations have both advantages and disadvantages for genetic studies of complex phenotypes. Advances in statistical methodologies that can infer genetic contributions from ancestral populations may yield new insights into the aetiology of disease and may contribute to the applicability of genomic medicine to these admixed population groups.
Variation in genes contributing to the host immune response may mediate the relationship between inflammation and prostate carcinogenesis. RNASEL at chromosome 1q25 encodes ribonuclease L, part of the interferon-mediated immune response to viral infection. We therefore investigated the association between variation in RNASEL and prostate cancer risk and progression in a study of 1286 cases and 1264 controls nested within the prospective Physicians’ Health Study. Eleven single-nucleotide polymorphisms (SNPs) were selected using the web-based ‘Tagger’ in the HapMap CEPH panel (Utah residents of Northern and Western European Ancestry). Unconditional logistic regression models assessed the relationship between each SNP and incident advanced stage (T3/T4, T0-T4/M1 and lethal disease) and high Gleason grade (≥7) prostate cancer. Further analyses were stratified by calendar year of diagnosis. Cox proportional hazards models examined the relationship between genotype and prostate cancer-specific survival. We also explored associations between genotype and serum inflammatory biomarkers interleukin-6 (IL-6), C-reactive protein (CRP) and tumor necrosis factor-alpha receptor 2 using linear regression. Individuals homozygous for the variant allele of rs12757998 had an increased risk of prostate cancer [AA versus GG; odds ratio (OR): 1.63, 95% confidence interval (CI): 1.18–2.25), and more specifically, high-grade tumors (OR: 1.90, 95% CI: 1.25–2.89). The same genotype was associated with increased CRP (P = 0.02) and IL-6 (P = 0.05) levels. Missense mutations R462Q and D541E were associated with an increased risk of advanced stage disease only in the pre-prostate-specific antigen era. There were no significant associations with survival. The results of this study support a link between RNASEL and prostate cancer and suggest that the association may be mediated through inflammation. These novel findings warrant replication in future studies.
Previous genetic studies have suggested a history of sub-Saharan African gene flow into some West Eurasian populations after the initial dispersal out of Africa that occurred at least 45,000 years ago. However, there has been no accurate characterization of the proportion of mixture, or of its date. We analyze genome-wide polymorphism data from about 40 West Eurasian groups to show that almost all Southern Europeans have inherited 1%–3% African ancestry with an average mixture date of around 55 generations ago, consistent with North African gene flow at the end of the Roman Empire and subsequent Arab migrations. Levantine groups harbor 4%–15% African ancestry with an average mixture date of about 32 generations ago, consistent with close political, economic, and cultural links with Egypt in the late middle ages. We also detect 3%–5% sub-Saharan African ancestry in all eight of the diverse Jewish populations that we analyzed. For the Jewish admixture, we obtain an average estimated date of about 72 generations. This may reflect descent of these groups from a common ancestral population that already had some African ancestry prior to the Jewish Diasporas.
Southern Europeans and Middle Eastern populations are known to have inherited a small percentage of their genetic material from recent sub-Saharan African migrations, but there has been no estimate of the exact proportion of this gene flow, or of its date. Here, we apply genomic methods to show that the proportion of African ancestry in many Southern European groups is 1%–3%, in Middle Eastern groups is 4%–15%, and in Jewish groups is 3%–5%. To estimate the dates when the mixture occurred, we develop a novel method that estimates the size of chromosomal segments of distinct ancestry in individuals of mixed ancestry. We verify using computer simulations that the method produces useful estimates of population mixture dates up to 300 generations in the past. By applying the method to West Eurasians, we show that the dates in Southern Europeans are consistent with events during the Roman Empire and subsequent Arab migrations. The dates in the Jewish groups are older, consistent with events in classical or biblical times that may have occurred in the shared history of Jewish populations.
Family studies of individual tissues have shown that gene expression traits are genetically heritable. Here, we investigate cis and trans components of heritability both within and across tissues by applying variance-components methods to 722 Icelanders from family cohorts, using identity-by-descent (IBD) estimates from long-range phased genome-wide SNP data and gene expression measurements for ∼19,000 genes in blood and adipose tissue. We estimate the proportion of gene expression heritability attributable to cis regulation as 37% in blood and 24% in adipose tissue. Our results indicate that the correlation in gene expression measurements across these tissues is primarily due to heritability at cis loci, whereas there is little sharing of trans regulation across tissues. One implication of this finding is that heritability in tissues composed of heterogeneous cell types is expected to be more dominated by cis regulation than in tissues composed of more homogeneous cell types, consistent with our blood versus adipose results as well as results of previous studies in lymphoblastoid cell lines. Finally, we obtained similar estimates of the cis components of heritability using IBD between unrelated individuals, indicating that transgenerational epigenetic inheritance does not contribute substantially to the “missing heritability” of gene expression in these tissue types.
An important goal in biology is to understand how genotype affects gene expression. Because gene expression varies across tissues, the relationship between genotype and gene expression may be tissue-specific. In this study, we used heritability approaches to study the regulation of gene expression in two tissue types, blood and adipose tissue, as well as the regulation of gene expression that is shared across these tissues. Heritability can be partitioned into cis and trans effects by assessing identity-by-descent (IBD) at the genomic location close to the expressed gene or genome-wide, respectively, and applying variance-components methods to partition the heritability of each gene. We estimated the proportion of gene expression heritability explained by cis regulation as 37% in blood and 24% in adipose tissue. Notably, the heritability shared across tissue types was primarily due to cis regulation. Thus, the relative contribution of cis versus trans regulation is expected to increase with the number of cell types present in the tissue being assayed, just as observed in our study and in a comparison to previous work on lymphoblastoid cell lines (LCL). We specifically ruled out a substantial contribution of transgenerational epigenetic inheritance to heritability of gene expression in these cohorts by repeating our heritability analyses using segments shared IBD in distantly related Icelanders.
India has been underrepresented in genome-wide surveys of human variation. We analyze 25 diverse groups to provide strong evidence for two ancient populations, genetically divergent, that are ancestral to most Indians today. One, the “Ancestral North Indians” (ANI), is genetically close to Middle Easterners, Central Asians, and Europeans, while the other, the “Ancestral South Indians” (ASI), is as distinct from ANI and East Asians as they are from each other. By introducing methods that can estimate ancestry without accurate ancestral populations, we show that ANI ancestry ranges from 39-71% in India, and is higher in traditionally upper caste and Indo-European speakers. Groups with only ASI ancestry may no longer exist in mainland India. However, the Andamanese are an ASI-related group without ANI ancestry, showing that the peopling of the islands must have occurred before ANI-ASI gene flow on the mainland. Allele frequency differences between groups in India are larger than in Europe, reflecting strong founder effects whose signatures have been maintained for thousands of years due to endogamy. We therefore predict that there will be an excess of recessive diseases in India, different in each group, which should be possible to screen and map genetically.
Identifying the ancestry of chromosomal segments of distinct ancestry has a wide range of applications from disease mapping to learning about history. Most methods require the use of unlinked markers; but, using all markers from genome-wide scanning arrays, it should in principle be possible to infer the ancestry of even very small segments with exquisite accuracy. We describe a method, HAPMIX, which employs an explicit population genetic model to perform such local ancestry inference based on fine-scale variation data. We show that HAPMIX outperforms other methods, and we explore its utility for inferring ancestry, learning about ancestral populations, and inferring dates of admixture. We validate the method empirically by applying it to populations that have experienced recent and ancient admixture: 935 African Americans from the United States and 29 Mozabites from North Africa. HAPMIX will be of particular utility for mapping disease genes in recently admixed populations, as its accurate estimates of local ancestry permit admixture and case-control association signals to be combined, enabling more powerful tests of association than with either signal alone.
The genomes of individuals from admixed populations consist of chromosomal segments of distinct ancestry. For example, the genomes of African American individuals contain segments of both African and European ancestry, so that a specific location in the genome may inherit 0, 1, or 2 copies of European ancestry. Inferring an individual's local ancestry, their number of copies of each ancestry at each location in the genome, has important applications in disease mapping and in understanding human history. Here we describe HAPMIX, a method that analyzes data from dense genotyping chips to infer local ancestry with very high precision. An important feature of HAPMIX is that it makes use of data from haplotypes (blocks of nearby markers), which are more informative for ancestry than individual markers. Our simulations demonstrate the utility of HAPMIX for local ancestry inference, and empirical applications to African American and Mozabite data sets uncover important aspects of the history of these populations.
The Icelandic population has been sampled in many disease association studies, providing a strong motivation to understand the structure of this population and its ramifications for disease gene mapping. Previous work using 40 microsatellites showed that the Icelandic population is relatively homogeneous, but exhibits subtle population structure that can bias disease association statistics. Here, we show that regional geographic ancestries of individuals from Iceland can be distinguished using 292,289 autosomal single-nucleotide polymorphisms (SNPs). We further show that subpopulation differences are due to genetic drift since the settlement of Iceland 1100 years ago, and not to varying contributions from different ancestral populations. A consequence of the recent origin of Icelandic population structure is that allele frequency differences follow a null distribution devoid of outliers, so that the risk of false positive associations due to stratification is minimal. Our results highlight an important distinction between population differences attributable to recent drift and those arising from more ancient divergence, which has implications both for association studies and for efforts to detect natural selection using population differentiation.
The Icelandic population is a structured population, in that geographic regions of Iceland exhibit differences in allele frequencies of genetic markers. Although these differences are relatively small, previous work has shown that they can bias association statistics in disease studies if cases and controls are sampled in different proportions across the geographic regions. In this study, we show that by using dense genotype data it is possible to distinguish the regional geographic ancestry of individuals from Iceland. We further show that the allele frequency differences between regions of Iceland are due to genetic drift since the settling of Iceland, not to differences in contributions from ancestral populations. A consequence of this is that the allele frequency differences follow a null distribution, devoid of unusually large differences caused by the action of natural selection, so that ensuing false positive associations in disease studies will be minimal. This is in stark contrast to populations (such as European Americans) in which subpopulation differences are due to more ancient divergence, allowing the action of natural selection to produce unusually large allele frequency differences that can lead to false positive associations. Our results highlight an important distinction between population differences attributable to recent genetic drift and those arising from more ancient divergence.
To identify susceptibility alleles associated with rheumatoid arthritis, we genotyped 397 individuals with rheumatoid arthritis for 116,204 SNPs and carried out an association analysis in comparison to publicly available genotype data for 1,211 related individuals from the Framingham Heart Study1. After evaluating and adjusting for technical and population biases, we identified a SNP at 6q23 (rs10499194, ∼150 kb from TNFAIP3 and OLIG3) that was reproducibly associated with rheumatoid arthritis both in the genome-wide association (GWA) scan and in 5,541 additional case-control samples (P = 10−3, GWA scan; P < 10−6, replication; P = 10−9, combined). In a concurrent study, the Wellcome Trust Case Control Consortium (WTCCC) has reported strong association of rheumatoid arthritis susceptibility to a different SNP located 3.8 kb from rs10499194 (rs6920220; P = 5 × 10−6 in WTCCC)2. We show that these two SNP associations are statistically independent, are each reproducible in the comparison of our data and WTCCC data, and define risk and protective haplotypes for rheumatoid arthritis at 6q23.
Variation in gene expression is a fundamental aspect of human phenotypic variation. Several recent studies have analyzed gene expression levels in populations of different continental ancestry and reported population differences at a large number of genes. However, these differences could largely be due to non-genetic (e.g., environmental) effects. Here, we analyze gene expression levels in African American cell lines, which differ from previously analyzed cell lines in that individuals from this population inherit variable proportions of two continental ancestries. We first relate gene expression levels in individual African Americans to their genome-wide proportion of European ancestry. The results provide strong evidence of a genetic contribution to expression differences between European and African populations, validating previous findings. Second, we infer local ancestry (0, 1, or 2 European chromosomes) at each location in the genome and investigate the effects of ancestry proximal to the expressed gene (cis) versus ancestry elsewhere in the genome (trans). Both effects are highly significant, and we estimate that 12±3% of all heritable variation in human gene expression is due to cis variants.
Variation in gene expression is a fundamental aspect of human phenotypic variation, and understanding how this variation is apportioned among human populations is an important aim. Previous studies have compared gene expression levels between distinct populations, but it is unclear whether the differences that were observed have a genetic or nongenetic basis. Admixed populations, such as African Americans, offer a solution to this problem because individuals vary in their proportion of European ancestry while the analysis of a single population minimizes nongenetic factors. Here, we show that differences in gene expression among African Americans of different ancestry proportions validate gene expression differences between European and African populations. Furthermore, by drawing a distinction between an African American individual's ancestry at the location of a gene whose expression is being analyzed (cis) versus at distal locations (trans), we can use ancestry effects to quantify the relative contributions of cis and trans regulation to human gene expression. We estimate that 12±3% of all heritable variation in human gene expression is due to cis variants.
European Americans are often treated as a homogeneous group, but in fact form a structured population due to historical immigration of diverse source populations. Discerning the ancestry of European Americans genotyped in association studies is important in order to prevent false-positive or false-negative associations due to population stratification and to identify genetic variants whose contribution to disease risk differs across European ancestries. Here, we investigate empirical patterns of population structure in European Americans, analyzing 4,198 samples from four genome-wide association studies to show that components roughly corresponding to northwest European, southeast European, and Ashkenazi Jewish ancestry are the main sources of European American population structure. Building on this insight, we constructed a panel of 300 validated markers that are highly informative for distinguishing these ancestries. We demonstrate that this panel of markers can be used to correct for stratification in association studies that do not generate dense genotype data.
Genetic association studies analyze both phenotypes (such as disease status) and genotypes (at sites of DNA variation) of a given set of individuals. The goal of association studies is to identify DNA variants that affect disease risk or other traits of interest. However, association studies can be confounded by differences in ancestry. For example, misleading results can arise if individuals selected as disease cases have different ancestry, on average, than healthy controls. Although geographic ancestry explains only a small fraction of human genetic variation, there exist genetic variants that are much more frequent in populations with particular ancestries, and such variants would falsely appear to be related to disease. In an effort to avoid these spurious results, association studies often restrict their focus to a single continental group. European Americans are one such group that is commonly studied in the United States. Here, we analyze multiple large European American datasets to show that important differences in ancestry exist even within European Americans, and that components roughly corresponding to northwest European, southeast European, and Ashkenazi Jewish ancestry are the major, consistent sources of variation. We provide an approach that is able to account for these ancestry differences in association studies even if only a small number of genes is studied.
Current methods for inferring population structure from genetic data do not provide formal significance tests for population differentiation. We discuss an approach to studying population structure (principal components analysis) that was first applied to genetic data by Cavalli-Sforza and colleagues. We place the method on a solid statistical footing, using results from modern statistics to develop formal significance tests. We also uncover a general “phase change” phenomenon about the ability to detect structure in genetic data, which emerges from the statistical theory we use, and has an important implication for the ability to discover structure in genetic data: for a fixed but large dataset size, divergence between two populations (as measured, for example, by a statistic like FST) below a threshold is essentially undetectable, but a little above threshold, detection will be easy. This means that we can predict the dataset size needed to detect structure.
When analyzing genetic data, one often wishes to determine if the samples are from a population that has structure. Can the samples be regarded as randomly chosen from a homogeneous population, or does the data imply that the population is not genetically homogeneous? Patterson, Price, and Reich show that an old method (principal components) together with modern statistics (Tracy–Widom theory) can be combined to yield a fast and effective answer to this question. The technique is simple and practical on the largest datasets, and can be applied both to genetic markers that are biallelic and to markers that are highly polymorphic such as microsatellites. The theory also allows the authors to estimate the data size needed to detect structure if their samples are in fact from two populations that have a given, but small level of differentiation.
A graph-based method for the analysis of repeat families in a repeat library is presented that helps elucidating the evolutionary history of repeats.
We present a graph-based method for the analysis of repeat families in a repeat library. We build a repeat domain graph that decomposes a repeat library into repeat domains, short subsequences shared by multiple repeat families, and reveals the mosaic structure of repeat families. Our method recovers documented mosaic repeat structures and suggests additional putative ones. Our method is useful for elucidating the evolutionary history of repeats and annotating de novo generated repeat libraries.
Genome wide association studies (GWAS) have proven a powerful method to identify common genetic variants contributing to susceptibility to common diseases. Here we show that extremely low-coverage sequencing (0.1–0.5x) captures almost as much of the common (>5%) and low-frequency (1–5%) variation across the genome as SNP arrays. As an empirical demonstration, we show that genome-wide SNP genotypes can be inferred at a mean r2 of 0.71 using off-target data (0.24x average coverage) in a whole-exome study of 909 samples. Using both simulated and real exome sequencing datasets we show that association statistics obtained using ultra low-coverage sequencing data attain similar P-values at known associated variants as genotyping arrays, without an excess of false positives. Within the context of reductions in sample preparation and sequencing costs, funds invested in ultra low-coverage sequencing can yield several times the effective sample size of SNP-array GWAS, and a commensurate increase in statistical power.
Heart failure is a leading cause of mortality in South Asians. However, its genetic etiology remains largely unknown1. Cardiomyopathies due to sarcomeric mutations are a major monogenic cause for heart failure (MIM600958). Here, we describe a deletion of 25 bp in the gene encoding cardiac myosin binding protein C (MYBPC3) that is associated with heritable cardiomyopathies and an increased risk of heart failure in Indian populations (initial study OR = 5.3 (95% CI = 2.3–13), P = 2 × 10−6; replication study OR = 8.59 (3.19–25.05), P = 3 × 10−8; combined OR = 6.99 (3.68–13.57), P = 4 × 10−11) and that disrupts cardiomyocyte structure in vitro. Its prevalence was found to be high (~4%) in populations of Indian subcontinental ancestry. The finding of a common risk factor implicated in South Asian subjects with cardiomyopathy will help in identifying and counseling individuals predisposed to cardiac diseases in this region.
A wealth of genetic associations for cardiovascular and metabolic phenotypes in humans has been accumulating over the last decade, in particular a large number of loci derived from recent genome wide association studies (GWAS). True complex disease-associated loci often exert modest effects, so their delineation currently requires integration of diverse phenotypic data from large studies to ensure robust meta-analyses. We have designed a gene-centric 50 K single nucleotide polymorphism (SNP) array to assess potentially relevant loci across a range of cardiovascular, metabolic and inflammatory syndromes. The array utilizes a “cosmopolitan” tagging approach to capture the genetic diversity across ∼2,000 loci in populations represented in the HapMap and SeattleSNPs projects. The array content is informed by GWAS of vascular and inflammatory disease, expression quantitative trait loci implicated in atherosclerosis, pathway based approaches and comprehensive literature searching. The custom flexibility of the array platform facilitated interrogation of loci at differing stringencies, according to a gene prioritization strategy that allows saturation of high priority loci with a greater density of markers than the existing GWAS tools, particularly in African HapMap samples. We also demonstrate that the IBC array can be used to complement GWAS, increasing coverage in high priority CVD-related loci across all major HapMap populations. DNA from over 200,000 extensively phenotyped individuals will be genotyped with this array with a significant portion of the generated data being released into the academic domain facilitating in silico replication attempts, analyses of rare variants and cross-cohort meta-analyses in diverse populations. These datasets will also facilitate more robust secondary analyses, such as explorations with alternative genetic models, epistasis and gene-environment interactions.