1.  Using population admixture to help complete maps of the human genome 
Nature genetics  2013;45(4):406-414e2.
Tens of millions of base pairs of euchromatic human genome sequence, including many protein-coding genes, have no known location in the human genome. We describe an approach for localizing the human genome's missing pieces by utilizing the patterns of genome sequence variation created by population admixture. We mapped the locations of 70 scaffolds spanning four million base pairs of the human genome's unplaced euchromatic sequence, including more than a dozen protein-coding genes, and identified eight large novel inter-chromosomal segmental duplications. We find that most of these sequences are hidden in the genome's heterochromatin, particularly its pericentromeric regions. Many cryptic, pericentromeric genes are expressed in RNA and have been maintained intact for millions of years while their expression patterns diverged from those of paralogous genes elsewhere in the genome. We describe how knowledge of the locations of these sequences can inform disease association and genome biology studies.
doi:10.1038/ng.2565
PMCID: PMC3683849  PMID: 23435088
2.  Exome sequencing and the genetic basis of complex traits 
Nature genetics  2012;44(6):623-630.
Exome sequencing is emerging as a popular approach to study the effect of rare coding variants on complex phenotypes. The promise of exome sequencing is grounded in theoretical population genetics and in empirical successes of candidate gene sequencing studies. Many projects aimed at common diseases are underway, and their results are eagerly anticipated. In this Perspective, using exome sequencing data from 438 individuals, we discuss several aspects of exome sequencing studies that we view as particularly important. We review processing and quality control of raw sequence data, evaluate the statistical properties of exome sequencing studies, discuss rare variant burden tests to detect association to phenotypes, and demonstrate the importance of accounting for population stratification in the analysis of rare variants. We conclude that enthusiasm for exome sequencing studies of complex traits should be combined with the caution that thousands of samples may be required to reach sufficient statistical power.
doi:10.1038/ng.2303
PMCID: PMC3727622  PMID: 22641211
3.  New approaches to disease mapping in admixed populations 
Nature reviews. Genetics  2011;12(8):523-528.
Admixed populations such as African Americans and Hispanic Americans are often medically underserved and bear a disproportionately high burden of disease. Owing to the diversity of their genomes, these populations have both advantages and disadvantages for genetic studies of complex phenotypes. Advances in statistical methodologies that can infer genetic contributions from ancestral populations may yield new insights into the aetiology of disease and may contribute to the applicability of genomic medicine to these admixed population groups.
doi:10.1038/nrg3002
PMCID: PMC3142784  PMID: 21709689
4.  Genetic variation in RNASEL associated with prostate cancer risk and progression 
Carcinogenesis  2010;31(9):1597-1603.
Variation in genes contributing to the host immune response may mediate the relationship between inflammation and prostate carcinogenesis. RNASEL at chromosome 1q25 encodes ribonuclease L, part of the interferon-mediated immune response to viral infection. We therefore investigated the association between variation in RNASEL and prostate cancer risk and progression in a study of 1286 cases and 1264 controls nested within the prospective Physicians’ Health Study. Eleven single-nucleotide polymorphisms (SNPs) were selected using the web-based ‘Tagger’ in the HapMap CEPH panel (Utah residents of Northern and Western European Ancestry). Unconditional logistic regression models assessed the relationship between each SNP and incident advanced stage (T3/T4, T0-T4/M1 and lethal disease) and high Gleason grade (≥7) prostate cancer. Further analyses were stratified by calendar year of diagnosis. Cox proportional hazards models examined the relationship between genotype and prostate cancer-specific survival. We also explored associations between genotype and serum inflammatory biomarkers interleukin-6 (IL-6), C-reactive protein (CRP) and tumor necrosis factor-alpha receptor 2 using linear regression. Individuals homozygous for the variant allele of rs12757998 had an increased risk of prostate cancer [AA versus GG; odds ratio (OR): 1.63, 95% confidence interval (CI): 1.18–2.25], and more specifically, high-grade tumors (OR: 1.90, 95% CI: 1.25–2.89). The same genotype was associated with increased CRP (P = 0.02) and IL-6 (P = 0.05) levels. Missense mutations R462Q and D541E were associated with an increased risk of advanced stage disease only in the pre-prostate-specific antigen era. There were no significant associations with survival. The results of this study support a link between RNASEL and prostate cancer and suggest that the association may be mediated through inflammation. These novel findings warrant replication in future studies.
doi:10.1093/carcin/bgq132
PMCID: PMC2930803  PMID: 20576793
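As a reading aid for the odds ratios and 95% confidence intervals quoted in the abstract above, here is a minimal sketch of how such an interval can be computed from a 2x2 genotype-by-status table. It uses the simple Woolf (log odds ratio) approximation, not the unconditional logistic regression the study reports, and the function name `odds_ratio_ci` and all counts are hypothetical illustrations rather than data from the paper.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald-type confidence interval for a 2x2 table.

    a: cases with the genotype      b: cases without it
    c: controls with the genotype   d: controls without it
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    odds_ratio = (a * d) / (b * c)
    # Woolf's standard error of log(OR); assumes all four cells are non-zero.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, chosen only to exercise the function.
or_, lo, hi = odds_ratio_ci(60, 540, 40, 560)
print(f"OR = {or_:.2f}, 95% CI: {lo:.2f}-{hi:.2f}")
```

An interval that excludes 1 corresponds to a nominally significant association at the 5% level; a regression model, as used in the study, would additionally allow adjustment for covariates.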
5.  The History of African Gene Flow into Southern Europeans, Levantines, and Jews 
PLoS Genetics  2011;7(4):e1001373.
Previous genetic studies have suggested a history of sub-Saharan African gene flow into some West Eurasian populations after the initial dispersal out of Africa that occurred at least 45,000 years ago. However, there has been no accurate characterization of the proportion of mixture, or of its date. We analyze genome-wide polymorphism data from about 40 West Eurasian groups to show that almost all Southern Europeans have inherited 1%–3% African ancestry with an average mixture date of around 55 generations ago, consistent with North African gene flow at the end of the Roman Empire and subsequent Arab migrations. Levantine groups harbor 4%–15% African ancestry with an average mixture date of about 32 generations ago, consistent with close political, economic, and cultural links with Egypt in the late middle ages. We also detect 3%–5% sub-Saharan African ancestry in all eight of the diverse Jewish populations that we analyzed. For the Jewish admixture, we obtain an average estimated date of about 72 generations. This may reflect descent of these groups from a common ancestral population that already had some African ancestry prior to the Jewish Diasporas.
Author Summary
Southern Europeans and Middle Eastern populations are known to have inherited a small percentage of their genetic material from recent sub-Saharan African migrations, but there has been no estimate of the exact proportion of this gene flow, or of its date. Here, we apply genomic methods to show that the proportion of African ancestry in many Southern European groups is 1%–3%, in Middle Eastern groups is 4%–15%, and in Jewish groups is 3%–5%. To estimate the dates when the mixture occurred, we develop a novel method that estimates the size of chromosomal segments of distinct ancestry in individuals of mixed ancestry. We verify using computer simulations that the method produces useful estimates of population mixture dates up to 300 generations in the past. By applying the method to West Eurasians, we show that the dates in Southern Europeans are consistent with events during the Roman Empire and subsequent Arab migrations. The dates in the Jewish groups are older, consistent with events in classical or biblical times that may have occurred in the shared history of Jewish populations.
doi:10.1371/journal.pgen.1001373
PMCID: PMC3080861  PMID: 21533020
7.  Single-Tissue and Cross-Tissue Heritability of Gene Expression Via Identity-by-Descent in Related or Unrelated Individuals 
PLoS Genetics  2011;7(2):e1001317.
Family studies of individual tissues have shown that gene expression traits are genetically heritable. Here, we investigate cis and trans components of heritability both within and across tissues by applying variance-components methods to 722 Icelanders from family cohorts, using identity-by-descent (IBD) estimates from long-range phased genome-wide SNP data and gene expression measurements for ∼19,000 genes in blood and adipose tissue. We estimate the proportion of gene expression heritability attributable to cis regulation as 37% in blood and 24% in adipose tissue. Our results indicate that the correlation in gene expression measurements across these tissues is primarily due to heritability at cis loci, whereas there is little sharing of trans regulation across tissues. One implication of this finding is that heritability in tissues composed of heterogeneous cell types is expected to be more dominated by cis regulation than in tissues composed of more homogeneous cell types, consistent with our blood versus adipose results as well as results of previous studies in lymphoblastoid cell lines. Finally, we obtained similar estimates of the cis components of heritability using IBD between unrelated individuals, indicating that transgenerational epigenetic inheritance does not contribute substantially to the “missing heritability” of gene expression in these tissue types.
Author Summary
An important goal in biology is to understand how genotype affects gene expression. Because gene expression varies across tissues, the relationship between genotype and gene expression may be tissue-specific. In this study, we used heritability approaches to study the regulation of gene expression in two tissue types, blood and adipose tissue, as well as the regulation of gene expression that is shared across these tissues. Heritability can be partitioned into cis and trans effects by assessing identity-by-descent (IBD) at the genomic location close to the expressed gene or genome-wide, respectively, and applying variance-components methods to partition the heritability of each gene. We estimated the proportion of gene expression heritability explained by cis regulation as 37% in blood and 24% in adipose tissue. Notably, the heritability shared across tissue types was primarily due to cis regulation. Thus, the relative contribution of cis versus trans regulation is expected to increase with the number of cell types present in the tissue being assayed, just as observed in our study and in a comparison to previous work on lymphoblastoid cell lines (LCL). We specifically ruled out a substantial contribution of transgenerational epigenetic inheritance to heritability of gene expression in these cohorts by repeating our heritability analyses using segments shared IBD in distantly related Icelanders.
doi:10.1371/journal.pgen.1001317
PMCID: PMC3044684  PMID: 21383966
8.  Reconstructing Indian Population History 
Nature  2009;461(7263):489-494.
India has been underrepresented in genome-wide surveys of human variation. We analyze 25 diverse groups to provide strong evidence for two ancient populations, genetically divergent, that are ancestral to most Indians today. One, the “Ancestral North Indians” (ANI), is genetically close to Middle Easterners, Central Asians, and Europeans, while the other, the “Ancestral South Indians” (ASI), is as distinct from ANI and East Asians as they are from each other. By introducing methods that can estimate ancestry without accurate ancestral populations, we show that ANI ancestry ranges from 39-71% in India, and is higher in traditionally upper caste and Indo-European speakers. Groups with only ASI ancestry may no longer exist in mainland India. However, the Andamanese are an ASI-related group without ANI ancestry, showing that the peopling of the islands must have occurred before ANI-ASI gene flow on the mainland. Allele frequency differences between groups in India are larger than in Europe, reflecting strong founder effects whose signatures have been maintained for thousands of years due to endogamy. We therefore predict that there will be an excess of recessive diseases in India, different in each group, which should be possible to screen and map genetically.
doi:10.1038/nature08365
PMCID: PMC2842210  PMID: 19779445
9.  Sensitive Detection of Chromosomal Segments of Distinct Ancestry in Admixed Populations 
PLoS Genetics  2009;5(6):e1000519.
Identifying the ancestry of chromosomal segments of distinct ancestry has a wide range of applications from disease mapping to learning about history. Most methods require the use of unlinked markers; but, using all markers from genome-wide scanning arrays, it should in principle be possible to infer the ancestry of even very small segments with exquisite accuracy. We describe a method, HAPMIX, which employs an explicit population genetic model to perform such local ancestry inference based on fine-scale variation data. We show that HAPMIX outperforms other methods, and we explore its utility for inferring ancestry, learning about ancestral populations, and inferring dates of admixture. We validate the method empirically by applying it to populations that have experienced recent and ancient admixture: 935 African Americans from the United States and 29 Mozabites from North Africa. HAPMIX will be of particular utility for mapping disease genes in recently admixed populations, as its accurate estimates of local ancestry permit admixture and case-control association signals to be combined, enabling more powerful tests of association than with either signal alone.
Author Summary
The genomes of individuals from admixed populations consist of chromosomal segments of distinct ancestry. For example, the genomes of African American individuals contain segments of both African and European ancestry, so that a specific location in the genome may inherit 0, 1, or 2 copies of European ancestry. Inferring an individual's local ancestry, their number of copies of each ancestry at each location in the genome, has important applications in disease mapping and in understanding human history. Here we describe HAPMIX, a method that analyzes data from dense genotyping chips to infer local ancestry with very high precision. An important feature of HAPMIX is that it makes use of data from haplotypes (blocks of nearby markers), which are more informative for ancestry than individual markers. Our simulations demonstrate the utility of HAPMIX for local ancestry inference, and empirical applications to African American and Mozabite data sets uncover important aspects of the history of these populations.
doi:10.1371/journal.pgen.1000519
PMCID: PMC2689842  PMID: 19543370
10.  The Impact of Divergence Time on the Nature of Population Structure: An Example from Iceland 
PLoS Genetics  2009;5(6):e1000505.
The Icelandic population has been sampled in many disease association studies, providing a strong motivation to understand the structure of this population and its ramifications for disease gene mapping. Previous work using 40 microsatellites showed that the Icelandic population is relatively homogeneous, but exhibits subtle population structure that can bias disease association statistics. Here, we show that regional geographic ancestries of individuals from Iceland can be distinguished using 292,289 autosomal single-nucleotide polymorphisms (SNPs). We further show that subpopulation differences are due to genetic drift since the settlement of Iceland 1100 years ago, and not to varying contributions from different ancestral populations. A consequence of the recent origin of Icelandic population structure is that allele frequency differences follow a null distribution devoid of outliers, so that the risk of false positive associations due to stratification is minimal. Our results highlight an important distinction between population differences attributable to recent drift and those arising from more ancient divergence, which has implications both for association studies and for efforts to detect natural selection using population differentiation.
Author Summary
The Icelandic population is a structured population, in that geographic regions of Iceland exhibit differences in allele frequencies of genetic markers. Although these differences are relatively small, previous work has shown that they can bias association statistics in disease studies if cases and controls are sampled in different proportions across the geographic regions. In this study, we show that by using dense genotype data it is possible to distinguish the regional geographic ancestry of individuals from Iceland. We further show that the allele frequency differences between regions of Iceland are due to genetic drift since the settling of Iceland, not to differences in contributions from ancestral populations. A consequence of this is that the allele frequency differences follow a null distribution, devoid of unusually large differences caused by the action of natural selection, so that ensuing false positive associations in disease studies will be minimal. This is in stark contrast to populations (such as European Americans) in which subpopulation differences are due to more ancient divergence, allowing the action of natural selection to produce unusually large allele frequency differences that can lead to false positive associations. Our results highlight an important distinction between population differences attributable to recent genetic drift and those arising from more ancient divergence.
doi:10.1371/journal.pgen.1000505
PMCID: PMC2684636  PMID: 19503599
11.  Two independent alleles at 6q23 associated with risk of rheumatoid arthritis 
Nature genetics  2007;39(12):1477-1482.
To identify susceptibility alleles associated with rheumatoid arthritis, we genotyped 397 individuals with rheumatoid arthritis for 116,204 SNPs and carried out an association analysis in comparison to publicly available genotype data for 1,211 related individuals from the Framingham Heart Study [1]. After evaluating and adjusting for technical and population biases, we identified a SNP at 6q23 (rs10499194, ∼150 kb from TNFAIP3 and OLIG3) that was reproducibly associated with rheumatoid arthritis both in the genome-wide association (GWA) scan and in 5,541 additional case-control samples (P = 10⁻³, GWA scan; P < 10⁻⁶, replication; P = 10⁻⁹, combined). In a concurrent study, the Wellcome Trust Case Control Consortium (WTCCC) has reported strong association of rheumatoid arthritis susceptibility to a different SNP located 3.8 kb from rs10499194 (rs6920220; P = 5 × 10⁻⁶ in WTCCC) [2]. We show that these two SNP associations are statistically independent, are each reproducible in the comparison of our data and WTCCC data, and define risk and protective haplotypes for rheumatoid arthritis at 6q23.
doi:10.1038/ng.2007.27
PMCID: PMC2652744  PMID: 17982456
12.  Effects of cis and trans Genetic Ancestry on Gene Expression in African Americans 
PLoS Genetics  2008;4(12):e1000294.
Variation in gene expression is a fundamental aspect of human phenotypic variation. Several recent studies have analyzed gene expression levels in populations of different continental ancestry and reported population differences at a large number of genes. However, these differences could largely be due to non-genetic (e.g., environmental) effects. Here, we analyze gene expression levels in African American cell lines, which differ from previously analyzed cell lines in that individuals from this population inherit variable proportions of two continental ancestries. We first relate gene expression levels in individual African Americans to their genome-wide proportion of European ancestry. The results provide strong evidence of a genetic contribution to expression differences between European and African populations, validating previous findings. Second, we infer local ancestry (0, 1, or 2 European chromosomes) at each location in the genome and investigate the effects of ancestry proximal to the expressed gene (cis) versus ancestry elsewhere in the genome (trans). Both effects are highly significant, and we estimate that 12±3% of all heritable variation in human gene expression is due to cis variants.
Author Summary
Variation in gene expression is a fundamental aspect of human phenotypic variation, and understanding how this variation is apportioned among human populations is an important aim. Previous studies have compared gene expression levels between distinct populations, but it is unclear whether the differences that were observed have a genetic or nongenetic basis. Admixed populations, such as African Americans, offer a solution to this problem because individuals vary in their proportion of European ancestry while the analysis of a single population minimizes nongenetic factors. Here, we show that differences in gene expression among African Americans of different ancestry proportions validate gene expression differences between European and African populations. Furthermore, by drawing a distinction between an African American individual's ancestry at the location of a gene whose expression is being analyzed (cis) versus at distal locations (trans), we can use ancestry effects to quantify the relative contributions of cis and trans regulation to human gene expression. We estimate that 12±3% of all heritable variation in human gene expression is due to cis variants.
doi:10.1371/journal.pgen.1000294
PMCID: PMC2586034  PMID: 19057673
14.  Discerning the Ancestry of European Americans in Genetic Association Studies 
PLoS Genetics  2008;4(1):e236.
European Americans are often treated as a homogeneous group, but in fact form a structured population due to historical immigration of diverse source populations. Discerning the ancestry of European Americans genotyped in association studies is important in order to prevent false-positive or false-negative associations due to population stratification and to identify genetic variants whose contribution to disease risk differs across European ancestries. Here, we investigate empirical patterns of population structure in European Americans, analyzing 4,198 samples from four genome-wide association studies to show that components roughly corresponding to northwest European, southeast European, and Ashkenazi Jewish ancestry are the main sources of European American population structure. Building on this insight, we constructed a panel of 300 validated markers that are highly informative for distinguishing these ancestries. We demonstrate that this panel of markers can be used to correct for stratification in association studies that do not generate dense genotype data.
Author Summary
Genetic association studies analyze both phenotypes (such as disease status) and genotypes (at sites of DNA variation) of a given set of individuals. The goal of association studies is to identify DNA variants that affect disease risk or other traits of interest. However, association studies can be confounded by differences in ancestry. For example, misleading results can arise if individuals selected as disease cases have different ancestry, on average, than healthy controls. Although geographic ancestry explains only a small fraction of human genetic variation, there exist genetic variants that are much more frequent in populations with particular ancestries, and such variants would falsely appear to be related to disease. In an effort to avoid these spurious results, association studies often restrict their focus to a single continental group. European Americans are one such group that is commonly studied in the United States. Here, we analyze multiple large European American datasets to show that important differences in ancestry exist even within European Americans, and that components roughly corresponding to northwest European, southeast European, and Ashkenazi Jewish ancestry are the major, consistent sources of variation. We provide an approach that is able to account for these ancestry differences in association studies even if only a small number of genes is studied.
doi:10.1371/journal.pgen.0030236
PMCID: PMC2211542  PMID: 18208327
17.  Population Structure and Eigenanalysis 
PLoS Genetics  2006;2(12):e190.
Current methods for inferring population structure from genetic data do not provide formal significance tests for population differentiation. We discuss an approach to studying population structure (principal components analysis) that was first applied to genetic data by Cavalli-Sforza and colleagues. We place the method on a solid statistical footing, using results from modern statistics to develop formal significance tests. We also uncover a general “phase change” phenomenon about the ability to detect structure in genetic data, which emerges from the statistical theory we use, and has an important implication for the ability to discover structure in genetic data: for a fixed but large dataset size, divergence between two populations (as measured, for example, by a statistic like FST) below a threshold is essentially undetectable, but a little above threshold, detection will be easy. This means that we can predict the dataset size needed to detect structure.
Synopsis
When analyzing genetic data, one often wishes to determine if the samples are from a population that has structure. Can the samples be regarded as randomly chosen from a homogeneous population, or does the data imply that the population is not genetically homogeneous? Patterson, Price, and Reich show that an old method (principal components) together with modern statistics (Tracy–Widom theory) can be combined to yield a fast and effective answer to this question. The technique is simple and practical on the largest datasets, and can be applied both to genetic markers that are biallelic and to markers that are highly polymorphic such as microsatellites. The theory also allows the authors to estimate the data size needed to detect structure if their samples are in fact from two populations that have a given, but small level of differentiation.
doi:10.1371/journal.pgen.0020190
PMCID: PMC1713260  PMID: 17194218
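The principal components approach summarized in the entry above can be illustrated with a small sketch. It assumes a genotype matrix of 0/1/2 allele counts and a per-SNP normalization by the square root of p(1 - p); this normalization is common practice but is an assumption here, since the abstract does not spell it out. The Tracy–Widom significance test is not implemented, and the function name `genotype_pca` and the simulated data are hypothetical.

```python
import numpy as np


def genotype_pca(genotypes, n_components=2):
    """Eigenanalysis of a (samples x SNPs) matrix of 0/1/2 allele counts."""
    g = np.asarray(genotypes, dtype=float)
    mean = g.mean(axis=0)
    p = mean / 2.0                       # estimated allele frequency per SNP
    keep = (p > 0) & (p < 1)             # drop monomorphic SNPs
    # Center each SNP and scale by sqrt(p(1-p)) (assumed normalization).
    x = (g[:, keep] - mean[keep]) / np.sqrt(p[keep] * (1.0 - p[keep]))
    cov = x @ x.T / x.shape[1]           # sample-by-sample covariance
    evals, evecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:n_components]
    return evals[order], evecs[:, order]


# Simulated example: two populations of 20 samples with diverged frequencies.
rng = np.random.default_rng(0)
n_snps = 2000
freq_a = rng.uniform(0.1, 0.9, n_snps)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.15, n_snps), 0.05, 0.95)
pop_a = rng.binomial(2, freq_a, size=(20, n_snps))
pop_b = rng.binomial(2, freq_b, size=(20, n_snps))
evals, evecs = genotype_pca(np.vstack([pop_a, pop_b]))
pc1 = evecs[:, 0]                        # first principal component per sample
```

In this simulation the two groups are strongly differentiated, so the first component is expected to take opposite signs in the two populations. With much weaker divergence or fewer markers the structure would fall below the detection threshold the abstract describes.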
19.  Identifying repeat domains in large genomes 
Genome Biology  2006;7(1):R7.
A graph-based method for the analysis of repeat families in a repeat library is presented that helps elucidate the evolutionary history of repeats.
We present a graph-based method for the analysis of repeat families in a repeat library. We build a repeat domain graph that decomposes a repeat library into repeat domains, short subsequences shared by multiple repeat families, and reveals the mosaic structure of repeat families. Our method recovers documented mosaic repeat structures and suggests additional putative ones. Our method is useful for elucidating the evolutionary history of repeats and annotating de novo generated repeat libraries.
doi:10.1186/gb-2006-7-1-r7
PMCID: PMC1431705  PMID: 16507140
20.  Quantifying Missing Heritability at Known GWAS Loci 
PLoS Genetics  2013;9(12):e1003993.
Recent work has shown that much of the missing heritability of complex traits can be resolved by estimates of heritability explained by all genotyped SNPs. However, it is currently unknown how much heritability is missing due to poor tagging or additional causal variants at known GWAS loci. Here, we use variance components to quantify the heritability explained by all SNPs at known GWAS loci in nine diseases from WTCCC1 and WTCCC2. After accounting for expectation, we observed all SNPs at known GWAS loci to explain more heritability than GWAS-associated SNPs on average. For some diseases, this increase was individually significant: for Multiple Sclerosis (MS) and for Crohn's Disease (CD); all analyses of autoimmune diseases excluded the well-studied MHC region. Additionally, we found that GWAS loci from other related traits also explained significant heritability. The union of all autoimmune disease loci explained more MS heritability than known MS SNPs and more CD heritability than known CD SNPs, with an analogous increase for all autoimmune diseases analyzed. We also observed significant increases in an analysis of Rheumatoid Arthritis (RA) samples typed on ImmunoChip, with more heritability from all SNPs at GWAS loci and more heritability from all autoimmune disease loci compared to known RA SNPs (including those identified in this cohort). Our methods adjust for LD between SNPs, which can bias standard estimates of heritability from SNPs even if all causal variants are typed. By comparing adjusted estimates, we hypothesize that the genome-wide distribution of causal variants is enriched for low-frequency alleles, but that causal variants at known GWAS loci are skewed towards common alleles. These findings have important ramifications for fine-mapping study design and our understanding of complex disease architecture.
Author Summary
Heritable diseases have an unknown underlying “genetic architecture” that defines the distribution of effect-sizes for disease-causing mutations. Understanding this genetic architecture is an important first step in designing disease-mapping studies, and many theories have been developed on the nature of this distribution. Here, we evaluate the hypothesis that additional heritable variation lies at previously known associated loci but is not fully explained by the single most associated marker. We develop methods based on variance-components analysis to quantify this type of “local” heritability, demonstrate that standard estimates can be falsely inflated or deflated due to correlation between neighboring markers, and propose a robust adjustment. In analysis of nine common diseases we find a significant average increase of local heritability, consistent with multiple common causal variants at an average locus. Intriguingly, for autoimmune diseases we also observe significant local heritability in loci not associated with the specific disease but with other autoimmune diseases, implying a highly correlated underlying disease architecture. These findings have important implications for the design of future studies and our general understanding of common disease.
doi:10.1371/journal.pgen.1003993
PMCID: PMC3873246  PMID: 24385918
21.  Analysis of case–control association studies with known risk variants 
Bioinformatics  2012;28(13):1729-1737.
Motivation: The question of how to best use information from known associated variants when conducting disease association studies has yet to be answered. Some studies compute a marginal P-value for each single nucleotide polymorphism (SNP) independently, ignoring previously discovered variants. Other studies include known variants as covariates in logistic regression, but a weakness of this standard conditioning strategy is that it does not account for disease prevalence and non-random ascertainment, which can induce a correlation structure between candidate variants and known associated variants even if the variants lie on different chromosomes. Here, we propose a new conditioning approach, which is based in part on the classical technique of liability threshold modeling. Roughly, this method estimates model parameters for each known variant while accounting for the published disease prevalence from the epidemiological literature.
Results: We show via simulation and application to empirical datasets that our approach outperforms both the no conditioning strategy and the standard conditioning strategy, with a properly controlled false-positive rate. Furthermore, in multiple data sets involving diseases of low prevalence, standard conditioning produces a severe drop in test statistics whereas our approach generally performs as well as or better than no conditioning. Our approach may substantially improve disease gene discovery for diseases with many known risk variants.
Availability: LTSOFT software is available online at http://www.hsph.harvard.edu/faculty/alkes-price/software/
Contact: nzaitlen@hsph.harvard.edu; aprice@hsph.harvard.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts259
PMCID: PMC3381970  PMID: 22556366
22.  Using Extended Genealogy to Estimate Components of Heritability for 23 Quantitative and Dichotomous Traits 
PLoS Genetics  2013;9(5):e1003520.
Important knowledge about the determinants of complex human phenotypes can be obtained from the estimation of heritability, the fraction of phenotypic variation in a population that is determined by genetic factors. Here, we make use of extensive phenotype data in Iceland, long-range phased genotypes, and a population-wide genealogical database to examine the heritability of 11 quantitative and 12 dichotomous phenotypes in a sample of 38,167 individuals. Most previous estimates of heritability are derived from family-based approaches such as twin studies, which may be biased upwards by epistatic interactions or shared environment. Our estimates of heritability, based on both closely and distantly related pairs of individuals, are significantly lower than those from previous studies. We examine phenotypic correlations across a range of relationships, from siblings to first cousins, and find that the excess phenotypic correlation in these related individuals is predominantly due to shared environment as opposed to dominance or epistasis. We also develop a new method to jointly estimate narrow-sense heritability and the heritability explained by genotyped SNPs. Unlike existing methods, this approach permits the use of information from both closely and distantly related pairs of individuals, thereby reducing the variance of estimates of heritability explained by genotyped SNPs while preventing upward bias. Our results show that common SNPs explain a larger proportion of the heritability than previously thought, with SNPs present on Illumina 300K genotyping arrays explaining more than half of the heritability for the 23 phenotypes examined in this study. Much of the remaining heritability is likely to be due to rare alleles that are not captured by standard genotyping arrays.
Author Summary
Phenotype is a function of a genome and its environment. Heritability is the fraction of variation in a phenotype determined by genetic factors in a population. Current methods to estimate heritability rely on the phenotypic correlations of closely related individuals and are potentially upwardly biased, due to the impact of epistasis and shared environment. We develop new methods to estimate heritability over both closely and distantly related individuals. By examining the phenotypic correlation among different types of related individuals such as siblings, half-siblings, and first cousins, we show that shared environment is the primary determinant of inflated estimates of heritability. For a large number of phenotypes, it is not known how much of the heritability is explained by SNPs included on current genotyping platforms. Existing methods to estimate this component of heritability are biased in the presence of related individuals. We develop a method that permits the inclusion of both closely and distantly related individuals when estimating heritability explained by genotyped SNPs and use it to make estimates for 23 medically relevant phenotypes. These estimates can be used to increase our understanding of the distribution and frequency of functionally relevant variants and thereby inform the design of future studies.
doi:10.1371/journal.pgen.1003520
PMCID: PMC3667752  PMID: 23737753
23.  Extremely low-coverage sequencing and imputation increases power for genome-wide association studies 
Nature genetics  2012;44(6):631-635.
Genome-wide association studies (GWAS) have proven a powerful method to identify common genetic variants contributing to susceptibility to common diseases. Here we show that extremely low-coverage sequencing (0.1–0.5x) captures almost as much of the common (>5%) and low-frequency (1–5%) variation across the genome as SNP arrays. As an empirical demonstration, we show that genome-wide SNP genotypes can be inferred at a mean r² of 0.71 using off-target data (0.24x average coverage) in a whole-exome study of 909 samples. Using both simulated and real exome sequencing datasets we show that association statistics obtained using ultra low-coverage sequencing data attain similar P-values at known associated variants as genotyping arrays, without an excess of false positives. Within the context of reductions in sample preparation and sequencing costs, funds invested in ultra low-coverage sequencing can yield several times the effective sample size of SNP-array GWAS, and a commensurate increase in statistical power.
doi:10.1038/ng.2283
PMCID: PMC3400344  PMID: 22610117
24.  A common MYBPC3 (cardiac myosin binding protein C) variant associated with cardiomyopathies in South Asia 
Nature genetics  2009;41(2):187-191.
Heart failure is a leading cause of mortality in South Asians. However, its genetic etiology remains largely unknown. Cardiomyopathies due to sarcomeric mutations are a major monogenic cause for heart failure (MIM600958). Here, we describe a deletion of 25 bp in the gene encoding cardiac myosin binding protein C (MYBPC3) that is associated with heritable cardiomyopathies and an increased risk of heart failure in Indian populations (initial study OR = 5.3 (95% CI = 2.3–13), P = 2 × 10⁻⁶; replication study OR = 8.59 (3.19–25.05), P = 3 × 10⁻⁸; combined OR = 6.99 (3.68–13.57), P = 4 × 10⁻¹¹) and that disrupts cardiomyocyte structure in vitro. Its prevalence was found to be high (~4%) in populations of Indian subcontinental ancestry. The finding of a common risk factor implicated in South Asian subjects with cardiomyopathy will help in identifying and counseling individuals predisposed to cardiac diseases in this region.
doi:10.1038/ng.309
PMCID: PMC2697598  PMID: 19151713
25.  Concept, Design and Implementation of a Cardiovascular Gene-Centric 50 K SNP Array for Large-Scale Genomic Association Studies 
PLoS ONE  2008;3(10):e3583.
A wealth of genetic associations for cardiovascular and metabolic phenotypes in humans has been accumulating over the last decade, in particular a large number of loci derived from recent genome wide association studies (GWAS). True complex disease-associated loci often exert modest effects, so their delineation currently requires integration of diverse phenotypic data from large studies to ensure robust meta-analyses. We have designed a gene-centric 50 K single nucleotide polymorphism (SNP) array to assess potentially relevant loci across a range of cardiovascular, metabolic and inflammatory syndromes. The array utilizes a “cosmopolitan” tagging approach to capture the genetic diversity across ∼2,000 loci in populations represented in the HapMap and SeattleSNPs projects. The array content is informed by GWAS of vascular and inflammatory disease, expression quantitative trait loci implicated in atherosclerosis, pathway based approaches and comprehensive literature searching. The custom flexibility of the array platform facilitated interrogation of loci at differing stringencies, according to a gene prioritization strategy that allows saturation of high priority loci with a greater density of markers than the existing GWAS tools, particularly in African HapMap samples. We also demonstrate that the IBC array can be used to complement GWAS, increasing coverage in high priority CVD-related loci across all major HapMap populations. DNA from over 200,000 extensively phenotyped individuals will be genotyped with this array with a significant portion of the generated data being released into the academic domain facilitating in silico replication attempts, analyses of rare variants and cross-cohort meta-analyses in diverse populations. These datasets will also facilitate more robust secondary analyses, such as explorations with alternative genetic models, epistasis and gene-environment interactions.
doi:10.1371/journal.pone.0003583
PMCID: PMC2571995  PMID: 18974833
