Related Articles

1.  Patterns of Cis Regulatory Variation in Diverse Human Populations 
PLoS Genetics  2012;8(4):e1002639.
The genetic basis of gene expression variation has long been studied with the aim to understand the landscape of regulatory variants, but also more recently to assist in the interpretation and elucidation of disease signals. To date, many studies have looked in specific tissues and population-based samples, but there has been limited assessment of the degree of inter-population variability in regulatory variation. We analyzed genome-wide gene expression in lymphoblastoid cell lines from a total of 726 individuals from 8 global populations from the HapMap3 project and correlated gene expression levels with HapMap3 SNPs located in cis to the genes. We describe the influence of ancestry on gene expression levels within and between these diverse human populations and uncover a non-negligible impact on global patterns of gene expression. We further dissect the specific functional pathways differentiated between populations. We also identify 5,691 expression quantitative trait loci (eQTLs) after controlling for both non-genetic factors and population admixture and observe that half of the cis-eQTLs are replicated in one or more of the populations. We highlight patterns of eQTL-sharing between populations, which are partially determined by population genetic relatedness, and discover significant sharing of eQTL effects between Asians, European-admixed, and African subpopulations. Specifically, we observe that both the effect size and the direction of effect for eQTLs are highly conserved across populations. We observe an increasing proximity of eQTLs toward the transcription start site as sharing of eQTLs among populations increases, highlighting that variants close to TSS have stronger effects and therefore are more likely to be detected across a wider panel of populations. 
Together these results offer a unique picture and resource of the degree of differentiation among human populations in functional regulatory variation and provide an estimate for the transferability of complex trait variants across populations.
Author Summary
Variation among individuals in the degree to which genes are expressed (i.e. turned on or off) is a characteristic exhibited by all species, and studies have identified regions of the genome harboring genetic variation affecting gene expression levels. To assess the degree of human inter-population variability in regulatory variation, we describe mapping of regions of the genome that have functional effects on gene expression levels. We analyzed genome-wide gene expression in human cell lines derived from 726 unrelated individuals representing 8 global populations that have been genetically well-characterized by the International HapMap Project. We describe the influence of ancestry on gene expression levels within and between these diverse human populations and uncover a non-negligible impact on global patterns of gene expression. We identify ∼5,700 genes whose expression levels are associated with genetic variation located physically close to the gene, and we observe significant sharing of associations that is partially dependent on population genetic relatedness, among Asians, European-admixed, and African subpopulations. We identify biological functions affected by regulatory variation and describe common and unique characteristics of population-specific and population-shared associations. These results offer a unique picture and resource of the degree of differentiation among human populations in functional regulatory variation.
doi:10.1371/journal.pgen.1002639
PMCID: PMC3330104  PMID: 22532805
2.  Widespread Site-Dependent Buffering of Human Regulatory Polymorphism 
PLoS Genetics  2012;8(3):e1002599.
The average individual is expected to harbor thousands of variants within non-coding genomic regions involved in gene regulation. However, it is currently not possible to interpret reliably the functional consequences of genetic variation within any given transcription factor recognition sequence. To address this, we comprehensively analyzed heritable genome-wide binding patterns of a major sequence-specific regulator (CTCF) in relation to genetic variability in binding site sequences across a multi-generational pedigree. We localized and quantified CTCF occupancy by ChIP-seq in 12 related and unrelated individuals spanning three generations, followed by comprehensive targeted resequencing of the entire CTCF–binding landscape across all individuals. We identified hundreds of variants with reproducible quantitative effects on CTCF occupancy (both positive and negative). While these effects paralleled protein–DNA recognition energetics when averaged, they were extensively buffered by striking local context dependencies. In the significant majority of cases buffering was complete, resulting in silent variants spanning every position within the DNA recognition interface irrespective of level of binding energy or evolutionary constraint. The prevalence of complex partial or complete buffering effects severely constrained the ability to predict reliably the impact of variation within any given binding site instance. Surprisingly, 40% of variants that increased CTCF occupancy occurred at positions of human–chimp divergence, challenging the expectation that the vast majority of functional regulatory variants should be deleterious. Our results suggest that, even in the presence of “perfect” genetic information afforded by resequencing and parallel studies in multiple related individuals, genomic site-specific prediction of the consequences of individual variation in regulatory DNA will require systematic coupling with empirical functional genomic measurements.
Author Summary
A comprehensive understanding of the contribution of individual genome sequences to disease and quantitative traits will require the general ability to predict consequences of genetic variation in non-protein-coding regions, particularly those involved in gene regulation. Here we tested the power to predict such consequences when presented with “complete” information encompassing the genomic DNA binding site patterns of a well-studied regulatory protein across multiple related individuals, coupled with all individual genome sequences at the binding positions. We find that, while there is reasonable ability to predict the average effects of variation within the consensus recognition sequence of a transcriptional regulator, it is not possible to determine reliably the consequences of variation at any given genomic instance. This suggests that the interpretation of individual genome sequences will require comprehensive complementation with functional genomic studies.
doi:10.1371/journal.pgen.1002599
PMCID: PMC3310774  PMID: 22457641
3.  Mining the LIPG Allelic Spectrum Reveals the Contribution of Rare and Common Regulatory Variants to HDL Cholesterol 
PLoS Genetics  2011;7(12):e1002393.
Genome-wide association studies (GWAS) have successfully identified loci associated with quantitative traits, such as blood lipids. Deep resequencing studies are being utilized to catalogue the allelic spectrum at GWAS loci. The goal of these studies is to identify causative variants and missing heritability, including heritability due to low frequency and rare alleles with large phenotypic impact. Whereas rare variant efforts have primarily focused on nonsynonymous coding variants, we hypothesized that noncoding variants in these loci are also functionally important. Using the HDL-C gene LIPG as an example, we explored the effect of regulatory variants identified through resequencing of subjects at HDL-C extremes on gene expression, protein levels, and phenotype. Resequencing a portion of the LIPG promoter and 5′ UTR in human subjects with extreme HDL-C, we identified several rare variants in individuals from both extremes. Luciferase reporter assays were used to measure the effect of these rare variants on LIPG expression. Variants conferring opposing effects on gene expression were enriched in opposite extremes of the phenotypic distribution. Minor alleles of a common regulatory haplotype and noncoding GWAS SNPs were associated with reduced plasma levels of the LIPG gene product endothelial lipase (EL), consistent with its role in HDL-C catabolism. Additionally, we found that a common nonfunctional coding variant associated with HDL-C (rs2000813) is in linkage disequilibrium with a 5′ UTR variant (rs34474737) that decreases LIPG promoter activity. We attribute the gene regulatory role of rs34474737 to the observed association of the coding variant with plasma EL levels and HDL-C. Taken together, the findings show that both rare and common noncoding regulatory variants are important contributors to the allelic spectrum in complex trait loci.
Author Summary
Genetic association studies have identified genomic regions that affect quantifiable traits such as lipid levels. When a gene and a trait are found to be associated with one another, the gene is often further studied to determine its role in affecting the trait. One approach is to sequence the gene in individuals at the extremes of the trait's distribution with the hope of finding rare mutations that directly contribute to the trait. Until now studies using this approach have focused on genetic variation in the protein coding sequence of these genes and have been largely successful in identifying functionally important mutations. However, other studies have found an abundance of noncoding variation in the genome that may also contribute to the heritability of these traits. Here we seek to determine the contribution of such noncoding mutations to high density lipoprotein cholesterol (HDL-C) levels in humans using the HDL-C candidate gene LIPG as an example. Through a sequencing study in individuals with high and low HDL-C levels, we demonstrate that both rare and common noncoding mutations are influential contributors to the allelic spectrum of such traits and should be further characterized after initial association with the trait.
doi:10.1371/journal.pgen.1002393
PMCID: PMC3234219  PMID: 22174694
4.  Disease-Associated Mutations That Alter the RNA Structural Ensemble 
PLoS Genetics  2010;6(8):e1001074.
Genome-wide association studies (GWAS) often identify disease-associated mutations in intergenic and non-coding regions of the genome. Given the high percentage of the human genome that is transcribed, we postulate that for some observed associations the disease phenotype is caused by a structural rearrangement in a regulatory region of the RNA transcript. To identify such mutations, we have performed a genome-wide analysis of all known disease-associated Single Nucleotide Polymorphisms (SNPs) from the Human Gene Mutation Database (HGMD) that map to the untranslated regions (UTRs) of a gene. Rather than using minimum free energy approaches (e.g. mFold), we use a partition function calculation that takes into consideration the ensemble of possible RNA conformations for a given sequence. We identified in the human genome disease-associated SNPs that significantly alter the global conformation of the UTR to which they map. For six disease-states (Hyperferritinemia Cataract Syndrome, β-Thalassemia, Cartilage-Hair Hypoplasia, Retinoblastoma, Chronic Obstructive Pulmonary Disease (COPD), and Hypertension), we identified multiple SNPs in UTRs that alter the mRNA structural ensemble of the associated genes. Using a Boltzmann sampling procedure for sub-optimal RNA structures, we are able to characterize and visualize the nature of the conformational changes induced by the disease-associated mutations in the structural ensemble. We observe in several cases (specifically the 5′ UTRs of FTL and RB1) SNP–induced conformational changes analogous to those observed in bacterial regulatory Riboswitches when specific ligands bind. We propose that the UTR and SNP combinations we identify constitute a “RiboSNitch,” that is a regulatory RNA in which a specific SNP has a structural consequence that results in a disease phenotype. Our SNPfold algorithm can help identify RiboSNitches by leveraging GWAS data and an analysis of the mRNA structural ensemble.
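The contrast drawn above between a minimum free energy prediction and a partition function over the whole ensemble can be illustrated with a toy calculation. The sketch below is not the SNPfold algorithm. It only shows, for a hypothetical set of candidate structures with made-up free energies, how Boltzmann weighting assigns a probability to each structure, and why the single lowest-energy structure may account for less than half of the ensemble.

```python
import math

# Gas constant in kcal/(mol*K) and temperature in kelvin (37 degrees C).
R_KCAL = 0.0019872
T_KELVIN = 310.15

def boltzmann_probabilities(energies_kcal):
    """Return the ensemble probability of each structure from its free energy."""
    rt = R_KCAL * T_KELVIN
    weights = [math.exp(-e / rt) for e in energies_kcal]
    z = sum(weights)  # partition function over the listed structures only
    return [w / z for w in weights]

# Hypothetical free energies (kcal/mol) for three candidate structures;
# these values are illustrative assumptions, not data from the article.
candidate_energies = [-10.0, -9.8, -9.7]

probs = boltzmann_probabilities(candidate_energies)
mfe_index = candidate_energies.index(min(candidate_energies))

for energy, p in zip(candidate_energies, probs):
    print(f"dG = {energy:6.1f} kcal/mol  probability = {p:.3f}")
print("MFE structure probability:", round(probs[mfe_index], 3))
```

In this toy case the lowest-energy structure is the most probable one, yet the two near-optimal alternatives together carry more weight, which is the situation an ensemble-based comparison is meant to capture.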
Author Summary
Genome-wide association studies identify mutations in the human genome that correlate with a particular disease. It is common to find mutations associated with disease in the non-coding region of the genome. These non-coding mutations are more difficult to interpret at a molecular level, because they do not affect the protein sequence. In this study, we analyze disease-associated mutations in non-coding regions of our genome in the context of their structural effect on the message of genetic information in our cells, Ribonucleic Acid (RNA). We focus in particular on the regulatory parts of our genes known as untranslated regions. We find that certain disease-associated mutations in these regulatory untranslated regions have a significant effect on the structure of the RNA message. We call these elements “RiboSNitches,” because they act like switches turning on and off genes, but are caused by Single Nucleotide Polymorphisms (SNPs), which are single point mutations in our genome. The RiboSNitches we identify are potentially a new class of pharmaceutical targets, as it is possible to change the structure of RNA with small drug-like molecules.
doi:10.1371/journal.pgen.1001074
PMCID: PMC2924325  PMID: 20808897
5.  Modifier Effects between Regulatory and Protein-Coding Variation 
PLoS Genetics  2008;4(10):e1000244.
Genome-wide associations have shown a lot of promise in dissecting the genetics of complex traits in humans with single variants, yet a large fraction of the genetic effects is still unaccounted for. Analyzing genetic interactions between variants (epistasis) is one of the potential ways forward. We investigated the abundance and functional impact of a specific type of epistasis, namely the interaction between regulatory and protein-coding variants. Using genotype and gene expression data from the 210 unrelated individuals of the original four HapMap populations, we have explored the combined effects of regulatory and protein-coding single nucleotide polymorphisms (SNPs). We predict that about 18% (1,502 out of 8,233 nsSNPs) of protein-coding variants are differentially expressed among individuals and demonstrate that regulatory variants can modify the functional effect of a coding variant in cis. Furthermore, we show that such interactions in cis can affect the expression of downstream targets of the gene containing the protein-coding SNP. In this way, a cis interaction between regulatory and protein-coding variants has a trans impact on gene expression. Given the abundance of both types of variants in human populations, we propose that joint consideration of regulatory and protein-coding variants may reveal additional genetic effects underlying complex traits and disease and may shed light on causes of differential penetrance of known disease variants.
Author Summary
The ultimate goal of genome-wide association studies (GWAS) is to explain the proportion of variation in a phenotypic trait that can be attributed to genetic factors. The past two years have seen a plethora of successes in this field, yet, for most traits, a large fraction of variation remains unexplained. Epistasis, or interaction between genetic variants, is a largely under-explored factor, which may shed some light in this area. We use the HapMap populations to investigate interactions between regulatory and protein-coding variants and their impact on gene expression. We show that if a specific protein-coding variant has a functional impact, this can be modified by a co-segregating regulatory variant (cis interaction). Furthermore, the authors demonstrate that such modification effects between variants at one locus may affect the expression of other genes in the cell in a trans manner. The aim of this article is to present a framework through which variation can be considered in the context of GWAS. Viewing variation from this underappreciated angle may, in some cases, provide an explanation for differential penetrance of complex disease traits, but also for non-replication of GWAS results that may arise as a consequence of such interactions.
doi:10.1371/journal.pgen.1000244
PMCID: PMC2570624  PMID: 18974877
6.  Differential Allelic Expression in the Human Genome: A Robust Approach To Identify Genetic and Epigenetic Cis-Acting Mechanisms Regulating Gene Expression 
PLoS Genetics  2008;4(2):e1000006.
The recent development of whole genome association studies has led to the robust identification of several loci involved in different common human diseases. Interestingly, some of the strongest signals of association observed in these studies arise from non-coding regions located in very large introns or far away from any annotated genes, raising the possibility that these regions are involved in the etiology of the disease through some unidentified regulatory mechanisms. These findings highlight the importance of better understanding the mechanisms leading to inter-individual differences in gene expression in humans. Most of the existing approaches developed to identify common regulatory polymorphisms are based on linkage/association mapping of gene expression to genotypes. However, these methods have some limitations, notably their cost and the requirement of extensive genotyping information from all the individuals studied, which limits their application to a specific cohort or tissue. Here we describe a robust and high-throughput method to directly measure differences in allelic expression for a large number of genes using the Illumina Allele-Specific Expression BeadArray platform and quantitative sequencing of RT-PCR products. We show that this approach allows reliable identification of differences in the relative expression of the two alleles larger than 1.5-fold (i.e., deviations of the allelic ratio larger than 60∶40) and offers several advantages over the mapping of total gene expression, particularly for studying humans or outbred populations. Our analysis of more than 80 individuals for 2,968 SNPs located in 1,380 genes confirms that differential allelic expression is a widespread phenomenon affecting the expression of 20% of human genes and shows that our method successfully captures expression differences resulting from both genetic and epigenetic cis-acting mechanisms.
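The equivalence stated above between a 1.5-fold difference and a 60:40 allelic ratio follows from simple arithmetic: if one allele is expressed f times as much as the other, its share of the transcripts is f / (1 + f). The helper below is a minimal sketch of that conversion; the function name is hypothetical and is not part of the method described in the abstract.

```python
def allelic_fraction(fold_difference):
    """Share of transcripts from the more highly expressed allele.

    fold_difference is the high/low expression ratio between the two
    alleles, so it must be at least 1.
    """
    if fold_difference < 1:
        raise ValueError("fold_difference must be >= 1 (high allele / low allele)")
    return fold_difference / (1.0 + fold_difference)

high_share = allelic_fraction(1.5)
print(f"{high_share:.0%} : {1 - high_share:.0%}")  # share of high vs. low allele
```

Equal expression of both alleles (a fold difference of 1) gives a share of 0.5, the balanced 50:50 case against which deviations are measured.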
Author Summary
We describe a new methodology to identify individual differences in the expression of the two copies of one gene. This is achieved by comparing the mRNA level of the two alleles using a heterozygous polymorphism in the transcript as marker. We show that this approach allows an exhaustive survey of cis-acting regulation in the genome; we can identify allelic expression differences due to epigenetic mechanisms of gene regulation (e.g. imprinting or X-inactivation) as well as differences due to the presence of polymorphisms in regulatory elements. The direct comparison of the expression of both alleles nullifies possible trans-acting regulatory effects (that influence equally both alleles) and thus complements the findings from gene expression association studies. Our approach can be easily applied to any cohort of interest for a wide range of studies. It notably allows following up association signals and testing whether a gene sitting on a particular haplotype is over- or under-expressed, or can be used for screening cancer tissues for aberrant gene expression due to newly arisen mutations or alteration of the methylation patterns.
doi:10.1371/journal.pgen.1000006
PMCID: PMC2265535  PMID: 18454203
7.  Genetic interactions affecting human gene expression identified by variance association mapping 
eLife  2014;3:e01381.
Non-additive interaction between genetic variants, or epistasis, is a possible explanation for the gap between heritability of complex traits and the variation explained by identified genetic loci. Interactions give rise to genotype dependent variance, and therefore the identification of variance quantitative trait loci can be an intermediate step to discover both epistasis and gene by environment effects (GxE). Using RNA-sequence data from lymphoblastoid cell lines (LCLs) from the TwinsUK cohort, we identify a candidate set of 508 variance associated SNPs. Exploiting the twin design we show that GxE plays a role in ∼70% of these associations. Further investigation of these loci reveals 57 epistatic interactions that replicated in a smaller dataset, explaining on average 4.3% of phenotypic variance. In 24 cases, more variance is explained by the interaction than their additive contributions. Using molecular phenotypes in this way may provide a route to uncovering genetic interactions underlying more complex traits.
DOI: http://dx.doi.org/10.7554/eLife.01381.001
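The statement that interactions give rise to genotype-dependent variance can be made concrete with a toy model. The sketch below is an illustrative assumption, not the analysis used in the study: expression is taken to be the product of the genotypes at two loci (a purely multiplicative interaction, with no noise), and the variance of expression is then computed separately for each genotype at the first locus while the second locus varies.

```python
from statistics import pvariance

def expression(a, b, interaction=1.0):
    """Toy expression level driven only by an interaction between two loci.

    a and b are genotypes coded 0, 1 or 2; the multiplicative form and the
    absence of noise are simplifying assumptions for illustration.
    """
    return interaction * a * b

# Second-locus genotypes, equally frequent in this toy population.
b_genotypes = [0, 1, 2] * 10

# Variance of expression within each genotype class at the first locus.
variance_by_genotype = {
    a: pvariance([expression(a, b) for b in b_genotypes]) for a in (0, 1, 2)
}

for a, var in variance_by_genotype.items():
    print(f"genotype {a}: variance = {var:.3f}")
```

Because the effect of the second locus is scaled by the genotype at the first, the spread of expression grows with the first-locus genotype, which is the signal a variance association scan looks for before testing specific pairs of variants.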
eLife digest
Every person has two copies of each gene: one is inherited from their mother and the other from their father. These two copies are often not identical because there can be many different variants of the same gene in the human population. Traits (such as height, body mass and risk of disease) vary from one person to the next—and for many traits this variation depends in part on the different gene variants that each person has inherited. Studies seeking to find the differences in DNA that can predict this variation have often assumed that the changes in DNA act on traits independently of the effect of environment and of other genetic variants.
In contrast, studies with animals have shown that some genetic variants can interact to produce a bigger (or smaller) effect than would be expected from simply ‘adding together’ their individual effects—a phenomenon called epistasis. But how much does epistasis contribute to variation in human traits, if at all? This question has been much disputed, and is difficult to test, not least because of the sheer number of interactions to assess: tens of millions of changes in DNA have been observed in the human genome, and so there are many more than billions of possible combinations of these changes to investigate.
Here, Brown et al. have examined the sequences of all the genes that were expressed in cells taken from a cohort of twins and searched for genetic variants that show these epistatic interactions. By studying gene expression, which can be greatly affected by small changes in the DNA code, Brown et al. were able to identify 508 variants that had a bigger than expected effect on the level of gene expression. This may be a sign that these variants act in combinations: if within one genome a variant increased expression and in another it decreased expression, then this would cause greater variation in gene expression. Further investigation of these 508 variants led to the discovery of 256 examples of epistasis, and 57 of these were replicated in samples from another cohort. Brown et al. calculated that these epistatic interactions explained up to 16% of the variation in gene expression. Furthermore, as well as being involved in epistatic interactions, about 70% of the genetic variants that had an effect on the variation in gene expression were also involved in interactions between genes and the environment.
In addition to showing that epistasis contributes to variation in human traits, the work of Brown et al. could help to uncover interactions behind complex traits—beyond the expression level of a gene—that could not previously be investigated.
DOI: http://dx.doi.org/10.7554/eLife.01381.002
doi:10.7554/eLife.01381
PMCID: PMC4017648  PMID: 24771767
gene expression; epistasis; gene-environment interactions; human
8.  Genome-Wide Associations of Gene Expression Variation in Humans 
PLoS Genetics  2005;1(6):e78.
The exploration of quantitative variation in human populations has become one of the major priorities for medical genetics. The successful identification of variants that contribute to complex traits is highly dependent on reliable assays and genetic maps. We have performed a genome-wide quantitative trait analysis of 630 genes in 60 unrelated Utah residents with ancestry from Northern and Western Europe using the publicly available phase I data of the International HapMap project. The genes are located in regions of the human genome with elevated functional annotation and disease interest including the ENCODE regions spanning 1% of the genome, Chromosome 21 and Chromosome 20q12–13.2. We apply three different methods of multiple test correction, including Bonferroni, false discovery rate, and permutations. For the 374 expressed genes, we find many regions with statistically significant association of single nucleotide polymorphisms (SNPs) with expression variation in lymphoblastoid cell lines after correcting for multiple tests. Based on our analyses, the signal proximal (cis-) to the genes of interest is more abundant and more stable than distal and trans across statistical methodologies. Our results suggest that regulatory polymorphism is widespread in the human genome and show that the 5-kb (phase I) HapMap has sufficient density to enable linkage disequilibrium mapping in humans. Such studies will significantly enhance our ability to annotate the non-coding part of the genome and interpret functional variation. In addition, we demonstrate that the HapMap cell lines themselves may serve as a useful resource for quantitative measurements at the cellular level.
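Two of the multiple test corrections named above can be sketched in a few lines. The abstract does not say which false discovery rate procedure was applied, so the Benjamini-Hochberg step-up procedure used here is an assumption, and the p-values are invented for illustration. The permutation approach is not shown.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a test if its p-value is at most alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every test ranked at or below that cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values for five SNP-expression association tests.
example_p = [0.001, 0.012, 0.02, 0.045, 0.5]
print("Bonferroni:", bonferroni(example_p))
print("Benjamini-Hochberg:", benjamini_hochberg(example_p))
```

On these example values Bonferroni keeps only the smallest p-value while the false discovery rate procedure keeps the three smallest, which illustrates why the choice of correction affects how many associations are reported.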
Synopsis
With the finished reference sequence of the human genome now available, focus has shifted towards trying to identify all of the functional elements within the sequence. Although quite a lot of progress has been made towards identifying some classes of genomic elements, in particular protein-coding sequences, the characterization of regulatory elements remains a challenge. The authors describe the genetic mapping of regions of the genome that have functional effects on quantitative levels of gene expression. Gene expression of 630 genes was measured in cell lines derived from 60 unrelated human individuals, the same Utah residents of Northern and Western European ancestry that have been genetically well-characterized by The International HapMap Project. This paper reports significant variation among individuals with respect to levels of gene expression, and demonstrates that this quantitative trait has a genetic basis. For some genes, the genetic signal was localized to specific locations in the human genome sequence; in most cases the genomic region associated with expression variation was physically close to the gene whose expression it regulated. The authors demonstrate the feasibility of performing whole-genome association scans to map quantitative traits, and highlight statistical issues that are increasingly important for whole-genome disease mapping studies.
doi:10.1371/journal.pgen.0010078
PMCID: PMC1315281  PMID: 16362079
9.  Integrative Modeling of eQTLs and Cis-Regulatory Elements Suggests Mechanisms Underlying Cell Type Specificity of eQTLs 
PLoS Genetics  2013;9(8):e1003649.
Genetic variants in cis-regulatory elements or trans-acting regulators frequently influence the quantity and spatiotemporal distribution of gene transcription. Recent interest in expression quantitative trait locus (eQTL) mapping has paralleled the adoption of genome-wide association studies (GWAS) for the analysis of complex traits and disease in humans. Under the hypothesis that many GWAS associations tag non-coding SNPs with small effects, and that these SNPs exert phenotypic control by modifying gene expression, it has become common to interpret GWAS associations using eQTL data. To fully exploit the mechanistic interpretability of eQTL-GWAS comparisons, an improved understanding of the genetic architecture and causal mechanisms of cell type specificity of eQTLs is required. We address this need by performing an eQTL analysis in three parts: first we identified eQTLs from eleven studies on seven cell types; then we integrated eQTL data with cis-regulatory element (CRE) data from the ENCODE project; finally we built a set of classifiers to predict the cell type specificity of eQTLs. The cell type specificity of eQTLs is associated with eQTL SNP overlap with hundreds of cell type specific CRE classes, including enhancer, promoter, and repressive chromatin marks, regions of open chromatin, and many classes of DNA binding proteins. These associations provide insight into the molecular mechanisms generating the cell type specificity of eQTLs and the mode of regulation of corresponding eQTLs. Using a random forest classifier with cell specific CRE-SNP overlap as features, we demonstrate the feasibility of predicting the cell type specificity of eQTLs. We then demonstrate that CREs from a trait-associated cell type can be used to annotate GWAS associations in the absence of eQTL data for that cell type. 
We anticipate that such integrative, predictive modeling of cell specificity will improve our ability to understand the mechanistic basis of human complex phenotypic variation.
Author Summary
When interpreting genome-wide association studies showing that specific genetic variants are associated with disease risk, scientists look for a link between the genetic variant and a biological mechanism behind that disease. One functional mechanism is that the genetic variant may influence gene transcription via a co-localized genomic regulatory element, such as a transcription factor binding site within an open chromatin region. Often this type of regulation occurs in some cell types but not others. In this study, we look across eleven gene expression studies with seven cell types and consider how genetic transcription regulators, or eQTLs, replicate within and between cell types. We identify pervasive allelic heterogeneity, or transcriptional control of a single gene by multiple, independent eQTLs. We integrate extensive data on cell type specific regulatory elements from ENCODE to identify general methods of transcription regulation through enrichment of eQTLs within regulatory elements. We also build a classifier to predict eQTL replication across cell types. The results in this paper present a path to an integrative, predictive approach to improve our ability to understand the mechanistic basis of human phenotypic variation.
doi:10.1371/journal.pgen.1003649
PMCID: PMC3731231  PMID: 23935528
10.  Genetics and Regulatory Impact of Alternative Polyadenylation in Human B-Lymphoblastoid Cells 
PLoS Genetics  2012;8(8):e1002882.
Gene expression varies widely between individuals of a population, and regulatory change can underlie phenotypes of evolutionary and biomedical relevance. A key question in the field is how DNA sequence variants impact gene expression, with most mechanistic studies to date focused on the effects of genetic change on regulatory regions upstream of protein-coding sequence. By contrast, the role of RNA 3′-end processing in regulatory variation remains largely unknown, owing in part to the challenge of identifying functional elements in 3′ untranslated regions. In this work, we conducted a genomic survey of transcript ends in lymphoblastoid cells from genetically distinct human individuals. Our analysis mapped the cis-regulatory architecture of 3′ gene ends, finding that transcript end positions did not fall randomly in untranslated regions, but rather preferentially flanked the locations of 3′ regulatory elements, including miRNA sites. The usage of these transcript length forms and motifs varied across human individuals, and polymorphisms in polyadenylation signals and other 3′ motifs were significant predictors of expression levels of the genes in which they lay. Independent single-gene experiments confirmed the effects of polyadenylation variants on steady-state expression of their respective genes, and validated the regulatory function of 3′ cis-regulatory sequence elements that mediated expression of these distinct RNA length forms. Focusing on the immune regulator IRF5, we established the effect of natural variation in RNA 3′-end processing on regulatory response to antigen stimulation. Our results underscore the importance of two mechanisms at play in the genetics of 3′-end variation: the usage of distinct 3′-end processing signals and the effects of 3′ sequence elements that determine transcript fate. 
Our findings suggest that the strategy of integrating observed 3′-end positions with inferred 3′ regulatory motifs will prove to be a critical tool in continued efforts to interpret human genome variation.
Author Summary
Messenger RNAs carry the instructions necessary to synthesize proteins that do work for the cell. Extending beyond the protein-coding sequence of a given mRNA is an additional stretch of sequence, harboring signals that govern how much protein is made and how long the mRNA remains in the cell before it is broken down. The incorporation of this end region into mature mRNA is itself subject to change; for the vast majority of human genes, how and why cells use different mRNA ends remains largely unknown. In this work, we surveyed mRNA ends from ∼10,000 genes in immune cells from genetically distinct human individuals. We found that mRNA end positions were not randomly distributed, but rather preferentially flanked the locations of regulatory signals that govern mRNA fate. The usage of these mRNA length forms and regulatory elements varied across individuals and could be dissected molecularly. Our results uncover key mechanisms and regulatory effects of transcript end processing, particularly as these are perturbed by genetic differences between humans.
doi:10.1371/journal.pgen.1002882
PMCID: PMC3420953  PMID: 22916029
11.  Candidate Causal Regulatory Effects by Integration of Expression QTLs with Complex Trait Genetic Associations 
PLoS Genetics  2010;6(4):e1000895.
The recent success of genome-wide association studies (GWAS) is now followed by the challenge to determine how the reported susceptibility variants mediate complex traits and diseases. Expression quantitative trait loci (eQTLs) have been implicated in disease associations through overlaps between eQTLs and GWAS signals. However, the abundance of eQTLs and the strong correlation structure (LD) in the genome make it likely that some of these overlaps are coincidental and not driven by the same functional variants. In the present study, we propose an empirical methodology, which we call Regulatory Trait Concordance (RTC), that accounts for local LD structure and integrates eQTLs and GWAS results in order to reveal the subset of association signals that are due to cis eQTLs. We simulate genomic regions of various LD patterns with either one or two causal variants and show that our score outperforms SNP correlation metrics, be they statistical (r²) or historical (D'). Following the observation of a significant abundance of regulatory signals among currently published GWAS loci, we apply our method with the goal to prioritize relevant genes for each of the respective complex traits. We detect several potential disease-causing regulatory effects, with a strong enrichment for immunity-related conditions, consistent with the nature of the cell line tested (LCLs). Furthermore, we present an extension of the method in trans, where interrogating the whole genome for downstream effects of the disease variant can be informative regarding its unknown primary biological effect. We conclude that integrating cellular phenotype associations with organismal complex traits will facilitate the biological interpretation of the genetic effects on these traits.
Author Summary
Genome-wide association studies have led to the identification of susceptibility loci for a variety of human complex traits. What is still largely missing, however, is the understanding of the biological context in which these candidate variants act and of how they determine each trait. Given the localization of many GWAS loci outside coding regions and the important role of regulatory variation in shaping phenotypic variance, gene expression has been proposed as a plausible informative intermediate phenotype. Here we show that for a subset of the currently published GWAS this is indeed the case, by observing a significant excess of regulatory variants among disease loci. We propose an empirical methodology (regulatory trait concordance, RTC) able to integrate expression and disease data in order to detect causal regulatory effects. We show that the RTC outperforms simple correlation metrics under various simulated linkage disequilibrium (LD) scenarios. Our method is able to recover previously suspected causal regulatory effects from the literature and, as expected given the nature of the tested tissue, an overrepresentation of immunity-related candidates is observed. As the number of available tissues increases, this prioritization approach will become even more useful in understanding the implication of regulatory variants in disease etiology.
doi:10.1371/journal.pgen.1000895
PMCID: PMC2848550  PMID: 20369022
12.  The Genetic Signatures of Noncoding RNAs 
PLoS Genetics  2009;5(4):e1000459.
The majority of the genome in animals and plants is transcribed in a developmentally regulated manner to produce large numbers of non–protein-coding RNAs (ncRNAs), whose incidence increases with developmental complexity. There is growing evidence that these transcripts are functional, particularly in the regulation of epigenetic processes, leading to the suggestion that they compose a hitherto hidden layer of genomic programming in humans and other complex organisms. However, to date, very few have been identified in genetic screens. Here I show that this is explicable by an historic emphasis, both phenotypically and technically, on mutations in protein-coding sequences, and by presumptions about the nature of regulatory mutations. Most variations in regulatory sequences produce relatively subtle phenotypic changes, in contrast to mutations in protein-coding sequences that frequently cause catastrophic component failure. Until recently, most mapping projects have focused on protein-coding sequences, and the limited number of identified regulatory mutations have been interpreted as affecting conventional cis-acting promoter and enhancer elements, although these regions are often themselves transcribed. Moreover, ncRNA-directed regulatory circuits underpin most, if not all, complex genetic phenomena in eukaryotes, including RNA interference-related processes such as transcriptional and post-transcriptional gene silencing, position effect variegation, hybrid dysgenesis, chromosome dosage compensation, parental imprinting and allelic exclusion, paramutation, and possibly transvection and transinduction. The next frontier is the identification and functional characterization of the myriad sequence variations that influence quantitative traits, disease susceptibility, and other complex characteristics, which are being shown by genome-wide association studies to lie mostly in noncoding, presumably regulatory, regions. 
There is every possibility that many of these variations will alter the interactions between regulatory RNAs and their targets, a prospect that should be borne in mind in future functional analyses.
doi:10.1371/journal.pgen.1000459
PMCID: PMC2667263  PMID: 19390609
13.  Evolutionary Processes Acting on Candidate cis-Regulatory Regions in Humans Inferred from Patterns of Polymorphism and Divergence 
PLoS Genetics  2009;5(8):e1000592.
Analysis of polymorphism and divergence in the non-coding portion of the human genome yields crucial information about factors driving the evolution of gene regulation. Candidate cis-regulatory regions spanning more than 15,000 genes in 15 African Americans and 20 European Americans were re-sequenced and aligned to the chimpanzee genome in order to identify potentially functional polymorphism and to characterize and quantify departures from neutral evolution. Distortions of the site frequency spectra suggest a general pattern of selective constraint on conserved non-coding sites in the flanking regions of genes (CNCs). Moreover, there is an excess of fixed differences that cannot be explained by a Gamma model of deleterious fitness effects, suggesting the presence of positive selection on CNCs. Extensions of the McDonald-Kreitman test identified candidate cis-regulatory regions with high probabilities of positive and negative selection near many known human genes, the biological characteristics of which exhibit genome-wide trends that differ from patterns observed in protein-coding regions. Notably, there is a higher probability of positive selection in candidate cis-regulatory regions near genes expressed in the fetal brain, suggesting that a larger portion of adaptive regulatory changes has occurred in genes expressed during brain development. Overall we find that natural selection has played an important role in the evolution of candidate cis-regulatory regions throughout hominid evolution.
Author Summary
It has been suggested that changes in gene expression may have played a more important role in the evolution of modern humans than changes in protein-coding sequences. In order to identify signatures of natural selection on candidate cis-regulatory regions, we examined single nucleotide polymorphisms obtained from the complete re-sequencing of conserved non-coding sites (CNCs) in the flanking regions of over 15,000 genes in 35 humans. Patterns of allele frequencies in CNCs indicate the presence of both positive and negative selection acting on standing variation within these candidate cis-regulatory regions, particularly for the 5′ and 3′ UTRs of genes. Gene-specific tests comparing levels of polymorphism and divergence identify several genes with strong signatures of selection on candidate cis-regulatory regions and suggest that the biological characteristics of genes subject to selection are different between coding and candidate cis-regulatory regions with respect to gene expression and function. For example, we find stronger signatures of positive selection in candidate cis-regulatory regions near genes expressed in the fetal brain, which we do not observe in a concurrent analysis on protein-coding regions. Our results suggest that both positive and negative selection have acted on candidate cis-regulatory regions and that the evolution of non-coding DNA has played an important role throughout hominid evolution.
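The comparison of polymorphism and divergence described above can be illustrated with the classic McDonald-Kreitman 2x2 table. The sketch below is a minimal illustration of the standard neutrality index and the derived estimate alpha, not the extended test used in the study. The function names and the example counts are hypothetical, and "test" and "reference" are assumed here to stand for candidate (e.g. conserved non-coding) sites and putatively neutral sites.

```python
def neutrality_index(poly_test, poly_ref, div_test, div_ref):
    """Classic McDonald-Kreitman neutrality index.

    poly_* are counts of polymorphic sites within a species,
    div_*  are counts of fixed differences between species,
    for the test class of sites and a putatively neutral reference class.
    """
    return (poly_test / poly_ref) / (div_test / div_ref)


def proportion_adaptive(poly_test, poly_ref, div_test, div_ref):
    """alpha = 1 - NI, an estimate of the fraction of fixed
    differences in the test class driven by positive selection."""
    return 1.0 - neutrality_index(poly_test, poly_ref, div_test, div_ref)


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
ni = neutrality_index(poly_test=10, poly_ref=40, div_test=20, div_ref=40)
alpha = proportion_adaptive(poly_test=10, poly_ref=40, div_test=20, div_ref=40)
print(ni, alpha)
```

A neutrality index below 1 indicates an excess of fixed differences relative to polymorphism, which is the pattern expected under positive selection; a value above 1 indicates an excess of polymorphism, as expected when weakly deleterious variants segregate.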
doi:10.1371/journal.pgen.1000592
PMCID: PMC2714078  PMID: 19662163
14.  Mapping of numerous disease-associated expression polymorphisms in primary peripheral blood CD4+ lymphocytes 
Human Molecular Genetics  2010;19(23):4745-4757.
Genome-wide association studies of human gene expression promise to identify functional regulatory genetic variation that contributes to phenotypic diversity. However, it is unclear how useful this approach will be for the identification of disease-susceptibility variants. We generated gene expression profiles for 22 184 mRNA transcripts using RNA derived from peripheral blood CD4+ lymphocytes, and genome-wide genotype data for 516 512 autosomal markers in 200 subjects. We screened for cis-acting variants by testing variants mapping within 50 kb of expressed transcripts for association with transcript abundance using generalized linear models. Significant associations were identified for 1585 genes at a false discovery rate of 0.05 (corresponding to P-values ranging from 1 × 10^−91 to 7 × 10^−4). Importantly, we identified evidence of regulatory variation for 119 previously mapped disease genes, including 24 examples where the variant with the strongest evidence of disease-association demonstrates strong association with specific transcript abundance. The prevalence of cis-acting variants among disease-associated genes was 63% higher than the genome-wide rate in our data set (P = 6.41 × 10^−6), and although many of the implicated loci were associated with immune-related diseases (including asthma, connective tissue disorders and inflammatory bowel disease), associations with genes implicated in non-immune-related diseases including lipid profiles, anthropometric measurements, cancer and neurologic disease were also observed. Genetic variants that confer inter-individual differences in gene expression represent an important subset of variants that contribute to disease susceptibility. Population-based integrative genetic approaches can help identify such variation and enhance our understanding of the genetic basis of complex traits.
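To make the "false discovery rate of 0.05" threshold concrete, the sketch below shows one common way such a cutoff is applied to a list of association P-values. The abstract does not say which FDR procedure was used; the Benjamini-Hochberg step-up procedure is assumed here purely for illustration, and the function name and example P-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where a test is called
    significant by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    # Indices of the tests, ordered by ascending P-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Largest 1-based rank k whose P-value satisfies p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank

    # Every test ranked at or below max_rank is significant,
    # even if its own P-value missed its own threshold.
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            significant[idx] = True
    return significant


# Hypothetical P-values for four transcript-variant pairs.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

Because the procedure is step-up, the largest P-value that passes its rank-specific threshold determines the cutoff for all smaller P-values, which is why the significant P-values in a study of this kind can span a wide range.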
doi:10.1093/hmg/ddq392
PMCID: PMC2972694  PMID: 20833654
15.  Function and Regulation of AUTS2, a Gene Implicated in Autism and Human Evolution 
PLoS Genetics  2013;9(1):e1003221.
Nucleotide changes in the AUTS2 locus, some of which affect only noncoding regions, are associated with autism and other neurological disorders, including attention deficit hyperactivity disorder, epilepsy, dyslexia, motor delay, language delay, visual impairment, microcephaly, and alcohol consumption. In addition, AUTS2 contains the most significantly accelerated genomic region differentiating humans from Neanderthals, which is primarily composed of noncoding variants. However, the function and regulation of this gene remain largely unknown. To characterize auts2 function, we knocked it down in zebrafish, leading to a smaller head size, neuronal reduction, and decreased mobility. To characterize AUTS2 regulatory elements, we tested sequences for enhancer activity in zebrafish and mice. We identified 23 functional zebrafish enhancers, 10 of which were active in the brain. Our mouse enhancer assays characterized three mouse brain enhancers that overlap an ASD–associated deletion and four mouse enhancers that reside in regions implicated in human evolution, two of which are active in the brain. Combined, our results show that AUTS2 is important for neurodevelopment and expose candidate enhancer sequences in which nucleotide variation could lead to neurological disease and human-specific traits.
Author Summary
Autism spectrum disorders (ASDs) are neurodevelopmental disorders that affect 1 in 88 individuals in the United States. Many gene mutations have been associated with autism; however, they explain only a small part of the genetic cause for this disorder. One gene that has been linked to autism is AUTS2. AUTS2 has been shown to be disrupted in more than 30 individuals with ASDs, both in coding and noncoding sequences (regions of the gene that do not encode for protein). However, its function remains largely unknown. We show here that AUTS2 is important for neuronal development in zebrafish. In addition, we characterize potential AUTS2 regulatory elements (DNA sequences that instruct genes as to where, when, and at what levels to turn on) that reside in noncoding regions that are mutated in ASD individuals. AUTS2 was also shown to be implicated in human evolution, having several regions where its human sequence significantly changed when compared to Neanderthals and non-human primates. Here, we identified four mouse enhancers within these evolving regions, two of which are expressed in the brain.
doi:10.1371/journal.pgen.1003221
PMCID: PMC3547868  PMID: 23349641
16.  Low Frequency Variants, Collapsed Based on Biological Knowledge, Uncover Complexity of Population Stratification in 1000 Genomes Project Data 
PLoS Genetics  2013;9(12):e1003959.
Analyses investigating low frequency variants have the potential for explaining additional genetic heritability of many complex human traits. However, the natural frequencies of rare variation between human populations strongly confound genetic analyses. We have applied a novel collapsing method to identify biological features with low frequency variant burden differences in thirteen populations sequenced by the 1000 Genomes Project. Our flexible collapsing tool utilizes expert biological knowledge from multiple publicly available database sources to direct feature selection. Variants were collapsed according to genetically driven features, such as evolutionary conserved regions, regulatory regions, genes, and pathways. We have conducted an extensive comparison of low frequency variant burden differences (MAF<0.03) between populations from 1000 Genomes Project Phase I data. We found that on average 26.87% of gene bins, 35.47% of intergenic bins, 42.85% of pathway bins, 14.86% of ORegAnno regulatory bins, and 5.97% of evolutionary conserved regions show statistically significant differences in low frequency variant burden across populations from the 1000 Genomes Project. The proportion of bins with significant differences in low frequency burden depends on the ancestral similarity of the two populations compared and types of features tested. Even closely related populations had notable differences in low frequency burden, but fewer differences than populations from different continents. Furthermore, conserved or functionally relevant regions had fewer significant differences in low frequency burden than regions under less evolutionary constraint. This degree of low frequency variant differentiation across diverse populations and feature elements highlights the critical importance of considering population stratification in the new era of DNA sequencing and low frequency variant genomic analyses.
Author Summary
Low frequency variants are likely to play an important role in uncovering complex trait heritability; however, they are often continent or population specific. This specificity complicates genetic analyses investigating low frequency variants for two reasons: low frequency variant signals in an association test are often difficult to generalize beyond a single population or continental group, and there is an increase in false positive results in association analyses due to underlying population stratification. In order to reveal the magnitude of low frequency population stratification, we performed pairwise population comparisons using the 1000 Genomes Project Phase I data to investigate differences in low frequency variant burden across multiple biological features. We found that low frequency variant confounding is much more prevalent than one might expect, even within continental groups. The proportion of significant differences in low frequency variant burden was also dependent on the region of interest; for example, annotated regulatory regions showed fewer low frequency burden differences between populations than intergenic regions. Knowledge of population structure and the genomic landscape in a region of interest are important factors in determining the extent of confounding due to population stratification in a low frequency genomic analysis.
doi:10.1371/journal.pgen.1003959
PMCID: PMC3873241  PMID: 24385916
17.  Deciphering a transcriptional regulatory code: modeling short-range repression in the Drosophila embryo 
A well-defined set of transcriptional regulatory modules was created and analyzed in the Drosophila embryo. Fractional occupancy-based models were developed to explain the interaction of short-range transcriptional repressors with endogenous activators by using quantitative data from these modules. Our fractional occupancy-based modeling uncovered specific quantitative features of short-range repressors: a complex nonlinear quenching relationship, similar quenching efficiencies for different activators, and modest levels of cooperativity. The extension of the study to endogenous enhancers highlighted several features of enhancer architecture design in Drosophila embryos.
Transcriptional regulatory information, represented by patterns of protein-binding sites on DNA, comprises an important portion of genetic coding. Despite the abundance of genomic sequences now available, identifying and characterizing this information remain a major challenge. Minor changes in protein-binding sites can have profound effects on gene expression, and such changes have been shown to underlie important aspects of disease and evolution. Thus, an important aim in contemporary systems biology is to develop a global understanding of the transcriptional regulatory code, allowing prediction of gene output based on DNA sequence information. Recent studies have focused on endogenous transcriptional regulatory sequences (Janssens et al, 2006; Zinzen et al, 2006; Segal et al, 2008); however, distinct enhancers differ in many features, including transcription factor activity, spacing, and cooperativity, making it difficult to learn the effects of individual features and generalize them to other cis-regulatory elements. We have pursued a bottom up approach to understand the mechanistic processing of regulatory elements by the transcriptional machinery, using a well-defined and characterized set of repressors and activators in Drosophila blastoderm embryos. The study focuses on the Giant, Krüppel, Knirps, and Snail proteins, which have been characterized as short-range repressors, able to act locally to interfere with activator function (quenching) (Gray et al, 1994; Arnosti et al, 1996a). Such repressors have central functions in development.
The aim of our study was to enable ab initio predictions of enhancer function, given defined quantities of regulatory proteins and the sequence of the enhancer (Figure 1). We have generated a large quantitative data set using fluorescent confocal laser scanning microscopy to determine the inputs (Giant, Krüppel, and Knirps protein levels) and outputs (lacZ mRNA levels) of the regulatory elements introduced into Drosophila by transgenesis. We analyzed the effect of altering specific features of a set of related gene modules, designed to uncover critical aspects of repression, including quenching distance, cooperativity, and overall factor potency.
We generated specific descriptions for each regulatory element using fractional occupancy-based modeling and identified quantitative values for parameters affecting transcriptional regulation in vivo, and these parameters were used to build and test the model. Through this process, we uncovered earlier unknown features that allow correct predictions of regulation by short-range repressors, including a non-monotonic distance function for quenching, which implicates possible phasing effects, a modest contribution for repressor–repressor cooperativity, and similarity in repression of disparate activators.
By applying these parameters to a model of the endogenous rhomboid enhancer, we uncovered novel insights into the architecture of this enhancer (Figure 8). Our study provides essential quantitative elements of a transcriptional regulatory code that will allow extensive analysis of genomic information in Drosophila melanogaster and related organisms. Extension of these predictive models should facilitate the development of more sophisticated computational algorithms for the identification and functional characterization of novel regulatory elements. The development of such quantitative modeling tools will change our understanding of the genome from essentially a parts list to a dynamically regulated system, and will greatly facilitate studies in disease, population genetics, and evolutionary biology.
Systems biology seeks a genomic-level interpretation of transcriptional regulatory information represented by patterns of protein-binding sites. Obtaining this information without direct experimentation is challenging; minor alterations in binding sites can have profound effects on gene expression, and underlie important aspects of disease and evolution. Quantitative modeling offers an alternative path to develop a global understanding of the transcriptional regulatory code. Recent studies have focused on endogenous regulatory sequences; however, distinct enhancers differ in many features, making it difficult to generalize to other cis-regulatory elements. We applied a systematic approach to simpler elements and present here the first quantitative analysis of short-range transcriptional repressors, which have central functions in metazoan development. Our fractional occupancy-based modeling uncovered unexpected features of these proteins' activity that allow accurate predictions of regulation by the Giant, Knirps, Krüppel, and Snail repressors, including modeling of an endogenous enhancer. This study provides essential elements of a transcriptional regulatory code that will allow extensive analysis of genomic information in Drosophila melanogaster and related organisms.
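The idea behind fractional occupancy-based modeling can be sketched with a toy calculation. The code below is not the authors' model: it assumes a single activator site and a single repressor site, uses the generic single-site binding form K·c / (1 + K·c), and treats quenching as a simple multiplicative reduction of activator output. All function names, parameters, and values are hypothetical.

```python
def fractional_occupancy(concentration, affinity):
    """Probability that a single binding site is occupied,
    using the generic form K*c / (1 + K*c)."""
    x = affinity * concentration
    return x / (1.0 + x)


def predicted_expression(activator_conc, repressor_conc,
                         k_act, k_rep, quenching_efficiency):
    """Toy enhancer output: activator occupancy scaled down by the
    fraction of the time a nearby repressor is bound and quenching.

    quenching_efficiency is assumed to lie in [0, 1]; in a fuller model
    it would depend on activator-repressor distance and cooperativity.
    """
    act = fractional_occupancy(activator_conc, k_act)
    rep = fractional_occupancy(repressor_conc, k_rep)
    return act * (1.0 - quenching_efficiency * rep)


# Hypothetical inputs: equal concentrations and affinities, full quenching.
print(predicted_expression(1.0, 1.0, 1.0, 1.0, 1.0))
```

In a sketch like this, the distance dependence, repressor-repressor cooperativity, and activator-specific quenching efficiencies named in the text would enter as additional parameters fitted to the measured protein and mRNA levels.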
doi:10.1038/msb.2009.97
PMCID: PMC2824527  PMID: 20087339
Drosophila; enhancer; modeling; repression; transcription
18.  Hypervariable intronic region in NCX1 is enriched in short insertion-deletion polymorphisms and showed association with cardiovascular traits 
BMC Medical Genetics  2010;11:15.
Background
Conserved non-coding regions (CNR) have been shown to harbor gene expression regulatory elements. Genetic variations in these regions may potentially contribute to complex disease susceptibility.
Methods
We targeted CNRs of the cardiovascular disease (CVD) candidate gene Na(+)-Ca(2+) exchanger (NCX1) with polymorphism screening among CVD patients (n = 46) using DHPLC technology. The flanking region (348 bp) of the 14 bp indel in intron 2 was further genotyped by DGGE assay in two Eastern-European CVD samples: essential hypertension (HYPEST; 470 cases, 652 controls) and coronary artery disease, CAD (CADCZ; 257 cases, 413 controls). Genotype-phenotype associations were tested by regression analysis implemented in PLINK. Alignments of primate sequences were performed by ClustalW2.
Results
Nine of the identified NCX1 variants were either singletons or targeted by commercial platforms. The 14 bp intronic indel (rs11274804) was represented with substantial frequency in HYPEST (6.82%) and CADCZ (14.58%). Genotyping in Eastern-Europeans (n = 1792) revealed the hypervariable nature of this locus, represented by seven alternative alleles. The alignments of human-chimpanzee-macaque sequences showed that the major human variant (allele frequency 90.45%) was actually a human-specific deletion compared to other primates. In humans, this deletion was surrounded by other short (5-43 bp) deletion variants and a duplication (40 bp) polymorphism possessing overlapping breakpoints. This indicates a potential indel hotspot, triggered by the initial deletion in the human lineage. An association was detected between the carrier status of the 14 bp indel ancestral allele and CAD (P = 0.0016, OR = 2.02; Bonferroni significance level alpha = 0.0045), but not with hypertension. The risk of CAD development was even higher among the patients additionally diagnosed with metabolic syndrome (P = 0.0014, OR = 2.34). Consistent with the effect on metabolic processes, suggestive evidence for the association with heart rate, serum triglyceride and LDL levels was detected (P = 0.04).
Conclusions
Compared to SNPs targeted by a large number of locus-specific and genome-wide assays, considerably less attention has been paid to short indel variants in the human genome. Data on the genome dynamics, mutation rate and population genetics of short indels, as well as their impact on gene expression profiles and human disease susceptibility, are limited. The characterization of the NCX1 intronic hypervariable non-coding region enriched in human-specific indel variants helps to fill this gap in knowledge.
doi:10.1186/1471-2350-11-15
PMCID: PMC2832636  PMID: 20109173
19.  Identification of rare X-linked neuroligin variants by massively parallel sequencing in males with autism spectrum disorder 
Molecular Autism  2012;3:8.
Background
Autism spectrum disorder (ASD) is highly heritable, but the genetic risk factors for it remain largely unknown. Although structural variants with large effect sizes may explain up to 15% of ASD, genome-wide association studies have failed to uncover common single nucleotide variants with large effects on phenotype. The focus within ASD genetics is now shifting to the examination of rare sequence variants of modest effect, which is most often achieved via exome selection and sequencing. This strategy has indeed identified some rare candidate variants; however, the approach does not capture the full spectrum of genetic variation that might contribute to the phenotype.
Methods
We surveyed two loci with known rare variants that contribute to ASD, the X-linked neuroligin genes, by performing massively parallel Illumina sequencing of the coding and noncoding regions from these genes in males from families with multiplex autism. We annotated all variant sites and functionally tested a subset to identify other rare mutations contributing to ASD susceptibility.
Results
We found seven rare variants at evolutionary conserved sites in our study population. Functional analyses of the three 3′ UTR variants did not show statistically significant effects on the expression of NLGN3 and NLGN4X. In addition, we identified two NLGN3 intronic variants located within conserved transcription factor binding sites that could potentially affect gene regulation.
Conclusions
These data demonstrate the power of massively parallel, targeted sequencing studies of affected individuals for identifying rare, potentially disease-contributing variation. However, they also point out the challenges and limitations of current methods of direct functional testing of rare variants and the difficulties of identifying alleles with modest effects.
doi:10.1186/2040-2392-3-8
PMCID: PMC3492087  PMID: 23020841
Autism spectrum disorder; Massively parallel DNA sequencing; Rare variation; Evolutionary conservation
20.  Genomic Variation and Its Impact on Gene Expression in Drosophila melanogaster 
PLoS Genetics  2012;8(11):e1003055.
Understanding the relationship between genetic and phenotypic variation is one of the great outstanding challenges in biology. To meet this challenge, comprehensive genomic variation maps of human as well as of model organism populations are required. Here, we present a nucleotide resolution catalog of single-nucleotide, multi-nucleotide, and structural variants in 39 Drosophila melanogaster Genetic Reference Panel inbred lines. Using an integrative, local assembly-based approach for variant discovery, we identify more than 3.6 million distinct variants, among which were more than 800,000 unique insertions, deletions (indels), and complex variants (1 to 6,000 bp). While the SNP density is higher near other variants, we find that variants themselves are not mutagenic, nor are regions with high variant density particularly mutation-prone. Rather, our data suggest that the elevated SNP density around variants is mainly due to population-level processes. We also provide insights into the regulatory architecture of gene expression variation in adult flies by mapping cis-expression quantitative trait loci (cis-eQTLs) for more than 2,000 genes. Indels comprise around 10% of all cis-eQTLs and show larger effects than SNP cis-eQTLs. In addition, we identified two-fold more gene associations in males as compared to females and found that most cis-eQTLs are sex-specific, revealing a partial decoupling of the genomic architecture between the sexes as well as the importance of genetic factors in mediating sex-biased gene expression. Finally, we performed RNA-seq-based allelic expression imbalance analyses in the offspring of crosses between sequenced lines, which revealed that the majority of strong cis-eQTLs can be validated in heterozygous individuals.
Author Summary
One of the principal challenges in current biology is to understand the relationship between genetic and phenotypic variation. The increasing availability of genomic variation maps of human as well as of model organism populations (mouse and Arabidopsis) constitutes an important step towards meeting this challenge. However, despite its excellent track record as a premier model to understand genome function, no genome-wide variation data beyond single-nucleotide variants and microsatellites are currently available for D. melanogaster. Here, we present a comprehensive, nucleotide-resolution catalogue of variants of various types (single-nucleotide, multi-nucleotide, and structural variants) for 39 wild-derived inbred D. melanogaster lines based on high-throughput sequencing. This catalogue confirms that non–SNP variants account for more than half of genomic variation, allowing us to provide new insights into the non-random distribution of variants in the Drosophila genome. We further present genome-wide cis-associations with gene expression based on whole adult fly microarray data, revealing significant associations for about 2,000 genes. Most associations are sex-specific, providing evidence for a decoupling of the genomic, regulatory architecture between males and females.
doi:10.1371/journal.pgen.1003055
PMCID: PMC3499359  PMID: 23189034
21.  An Evolutionary Analysis of Antigen Processing and Presentation across Different Timescales Reveals Pervasive Selection 
PLoS Genetics  2014;10(3):e1004189.
The antigenic repertoire presented by MHC molecules is generated by the antigen processing and presentation (APP) pathway. We analyzed the evolutionary history of 45 genes involved in APP at the inter- and intra-species level. Results showed that 11 genes evolved adaptively in mammals. Several positively selected sites involve positions of fundamental importance to the protein function (e.g. the TAP1 peptide-binding domains, the sugar binding interface of langerin, and the CD1D trafficking signal region). In CYBB, all selected sites cluster in two loops protruding into the endosomal lumen; analysis of missense mutations responsible for chronic granulomatous disease (CGD) showed the action of different selective forces on the very same gene region, as most CGD substitutions involve amino acid positions that are conserved in all mammals. As for ERAP2, different computational methods indicated that positive selection has driven the recurrent appearance of protein-destabilizing variants during mammalian evolution. Application of a population-genetics phylogenetics approach showed that purifying selection represented a major force acting on some APP components (e.g. immunoproteasome subunits and chaperones) and allowed identification of positive selection events in the human lineage.
We also investigated the evolutionary history of APP genes in human populations by developing a new approach that uses several different tests to identify the selection target, and that integrates low-coverage whole-genome sequencing data with Sanger sequencing. This analysis revealed that 9 APP genes underwent local adaptation in human populations. Most positive selection targets are located within noncoding regions with regulatory function in myeloid cells or act as expression quantitative trait loci. Conversely, balancing selection targeted nonsynonymous variants in TAP1 and CD207 (langerin). Finally, we suggest that selected variants in PSMB10 and CD207 contribute to human phenotypes. Thus, we used evolutionary information to generate experimentally-testable hypotheses and to provide a list of sites to prioritize in follow-up analyses.
Author Summary
Antigen-presenting cells digest intracellular and extracellular proteins and display the resulting antigenic repertoire on cell surface molecules for recognition by T cells. This process initiates cell-mediated immune responses and is essential to detect infections. The antigenic repertoire is generated by the antigen processing and presentation pathway. Because several pathogens evade immune recognition by hampering this process, genes involved in antigen processing and presentation may represent common natural selection targets. Thus, we analyzed the evolutionary history of these genes during mammalian evolution and in the more recent history of human populations. Evolutionary analyses in mammals indicated that positive selection targeted a very high proportion of genes (24%), and revealed that many selected sites affect positions of fundamental importance to the protein function. In humans, we found different signatures of natural selection acting both on regions that are expected to regulate gene expression levels or timing and on coding variants; two human selected polymorphisms may modulate the susceptibility to Crohn's disease and to HIV-1 infection. Therefore, we provide a comprehensive evolutionary analysis of antigen processing and we show that evolutionary studies can provide useful information concerning the location and nature of functional variants, ultimately helping to clarify phenotypic differences between and within species.
doi:10.1371/journal.pgen.1004189
PMCID: PMC3967941  PMID: 24675550
22.  The essential genome of a bacterium 
This study reports the essential Caulobacter genome at 8 bp resolution determined by saturated transposon mutagenesis and high-throughput sequencing. This strategy is applicable to full genome essentiality studies in a broad class of bacterial species.
The essential Caulobacter genome was determined at 8 bp resolution using hyper-saturated transposon mutagenesis coupled with high-throughput sequencing. Essential protein-coding sequences comprise 90% of the essential genome; the remaining 10% comprising essential non-coding RNA sequences, gene regulatory elements and essential genome replication features. Of the 3876 annotated open reading frames (ORFs), 480 (12.4%) were essential ORFs, 3240 (83.6%) were non-essential ORFs and 156 (4.0%) were ORFs that severely impacted fitness when mutated. The essential elements are preferentially positioned near the origin and terminus of the Caulobacter chromosome. This high-resolution strategy is applicable to high-throughput, full genome essentiality studies and large-scale genetic perturbation experiments in a broad class of bacterial species.
The regulatory events that control polar differentiation and cell-cycle progression in the bacterium Caulobacter crescentus are highly integrated, and they have to occur in the proper order (McAdams and Shapiro, 2011). Components of the core regulatory circuit are largely known. Full discovery of its essential genome, including non-coding, regulatory and coding elements, is a prerequisite for understanding the complete regulatory network of this bacterial cell. We have identified all the essential coding and non-coding elements of the Caulobacter chromosome using a hyper-saturated transposon mutagenesis strategy that is scalable and can be readily extended to obtain rapid and accurate identification of the essential genome elements of any sequenced bacterial species at a resolution of a few base pairs.
We engineered a Tn5 derivative transposon (Tn5Pxyl) that carries at one end an inducible outward pointing Pxyl promoter (Christen et al, 2010). We showed that this transposon construct inserts into the genome randomly where it can activate or disrupt transcription at the site of integration, depending on the insertion orientation. DNA from hundreds of thousands of transposon insertion sites reading outward into flanking genomic regions was parallel PCR amplified and sequenced by Illumina paired-end sequencing to locate the insertion site in each mutant strain (Figure 1). A single sequencing run on DNA from a mutagenized cell population yielded 118 million raw sequencing reads. Of these, >90 million (>80%) read outward from the transposon element into adjacent genomic DNA regions and the insertion site could be mapped with single nucleotide resolution. This yielded the location and orientation of 428 735 independent transposon insertions in the 4-Mbp Caulobacter genome.
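As a rough check on how the insertion count relates to the reported resolution, the average spacing between insertions can be estimated from the two figures quoted above. This is only a back-of-the-envelope sketch: the genome length is assumed to be exactly 4,000,000 bp (the text gives only "4-Mbp"), insertions are treated as evenly spread, and the function name is hypothetical.

```python
# Assumption: the text gives the genome size only as "4-Mbp"; 4,000,000 bp is a round stand-in.
GENOME_LENGTH_BP = 4_000_000
# Number of independent transposon insertions reported in the abstract.
INDEPENDENT_INSERTIONS = 428_735


def mean_spacing_bp(genome_length_bp: int, insertions: int) -> float:
    """Average distance between insertions if they were spread evenly."""
    return genome_length_bp / insertions


spacing = mean_spacing_bp(GENOME_LENGTH_BP, INDEPENDENT_INSERTIONS)
print(f"Mean spacing: about {spacing:.1f} bp per insertion")
```

Under these assumptions the mean spacing comes out a little above 9 bp, the same order as the 8 bp resolution quoted in the abstract.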
Within non-coding sequences of the Caulobacter genome, we detected 130 non-disruptable DNA segments between 90 and 393 bp long in addition to all essential promoter elements. Among 27 previously identified and validated sRNAs (Landt et al, 2008), three were contained within non-disruptable DNA segments and another three were partially disruptable, that is, insertions caused a notable growth defect. Two additional small RNAs found to be essential are the transfer-messenger RNA (tmRNA) and the ribozyme RNAseP (Landt et al, 2008). In addition to the 8 non-disruptable sRNAs, 29 out of the 130 intergenic essential non-coding sequences contained non-redundant tRNA genes; duplicated tRNA genes were non-essential. We also identified two non-disruptable DNA segments within the chromosomal origin of replication. Thus, we resolved essential non-coding RNAs, tRNAs and essential replication elements within the origin region of the chromosome. An additional 90 non-disruptable small genome elements of currently unknown function were identified. Eighteen of these are conserved in at least one closely related species. Only 2 could encode a protein of over 50 amino acids.
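The idea of a "non-disruptable segment" can be illustrated as a stretch of the chromosome with no mapped insertion. The sketch below is not the authors' pipeline. It assumes 0-based insertion coordinates, treats the chromosome as linear rather than circular, and uses a hypothetical function name and toy data. The 90 bp threshold mirrors the shortest segment length mentioned above.

```python
from typing import List, Tuple


def insertion_free_gaps(
    positions: List[int], genome_length: int, min_gap: int
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals of at least min_gap bp
    that contain no insertion. The genome is treated as linear."""
    gaps = []
    run_start = 0  # first coordinate of the current insertion-free run
    for pos in sorted(set(positions)):
        if pos - run_start >= min_gap:
            gaps.append((run_start, pos))
        run_start = pos + 1
    # Check the tail between the last insertion and the end of the genome.
    if genome_length - run_start >= min_gap:
        gaps.append((run_start, genome_length))
    return gaps


# Toy data (made up for illustration): five insertions on a 300 bp sequence.
toy_positions = [5, 10, 12, 200, 205]
print(insertion_free_gaps(toy_positions, genome_length=300, min_gap=90))
```

A real analysis would also need to account for insertion orientation, chromosome circularity and the chance of gaps arising by sampling alone, none of which this sketch attempts.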
For each of the 3876 annotated open reading frames (ORFs), we analyzed the distribution, orientation, and genetic context of transposon insertions. There are 480 essential ORFs and 3240 non-essential ORFs. In addition, there were 156 ORFs that severely impacted fitness when mutated. The 8-bp resolution allowed a dissection of the essential and non-essential regions of the coding sequences. Sixty ORFs had transposon insertions within a significant portion of their 3′ region but lacked insertions in the essential 5′ coding region, allowing the identification of non-essential protein segments. For example, transposon insertions in the essential cell-cycle regulatory gene divL, a tyrosine kinase, showed that the last 204 C-terminal amino acids did not impact viability, confirming previous reports that the C-terminal ATPase domain of DivL is dispensable for viability (Reisinger et al, 2007; Iniesta et al, 2010). In addition, we found that 30 out of 480 (6.3%) of the essential ORFs appear to be shorter than the annotated ORF, suggesting that these are probably mis-annotated.
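The ORF counts above can be cross-checked against the percentages quoted in the summary of this entry (12.4%, 83.6% and 4.0%). The short sketch below only redoes that arithmetic; the category labels and function name are hypothetical.

```python
# Counts taken from the text; dictionary keys are illustrative labels.
ORF_COUNTS = {
    "essential": 480,
    "non_essential": 3240,
    "high_fitness_cost": 156,
}
TOTAL_ANNOTATED_ORFS = 3876


def category_percentages(counts: dict, total: int) -> dict:
    """Percentage of all annotated ORFs in each category, to one decimal place."""
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# The three categories should account for every annotated ORF.
assert sum(ORF_COUNTS.values()) == TOTAL_ANNOTATED_ORFS
print(category_percentages(ORF_COUNTS, TOTAL_ANNOTATED_ORFS))
```

The three categories sum to the 3876 annotated ORFs, and the rounded shares match the percentages given in the text.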
Among the 480 ORFs essential for growth on rich media, there were 10 essential transcriptional regulatory proteins, including 5 previously identified cell-cycle regulators (McAdams and Shapiro, 2003; Holtzendorff et al, 2004; Collier and Shapiro, 2007; Gora et al, 2010; Tan et al, 2010) and 5 uncharacterized predicted transcription factors. In addition, two RNA polymerase sigma factors RpoH and RpoD, as well as the anti-sigma factor ChrR, which mitigates rpoE-dependent stress response under physiological growth conditions (Lourenco and Gomes, 2009), were also found to be essential. Thus, a set of 10 transcription factors, 2 RNA polymerase sigma factors and 1 anti-sigma factor are the core essential transcriptional regulators for growth on rich media. To further characterize the core components of the Caulobacter cell-cycle control network, we identified all essential regulatory sequences and operon transcripts. Altogether, the 480 essential protein-coding and 37 essential RNA-coding Caulobacter genes are organized into operons such that 402 individual promoter regions are sufficient to regulate their expression. Of these 402 essential promoters, the transcription start sites (TSSs) of 105 were previously identified (McGrath et al, 2007).
The essential genome features are non-uniformly distributed on the Caulobacter genome and enriched near the origin and the terminus regions. In contrast, the chromosomal positions of the published E. coli essential coding sequences (Rocha, 2004) are preferentially located at either side of the origin (Figure 4A). This indicates that there are selective pressures on chromosomal positioning of some essential elements (Figure 4A).
The strategy described in this report could be readily extended to quickly determine the essential genome for a large class of bacterial species.
Caulobacter crescentus is a model organism for the integrated circuitry that runs a bacterial cell cycle. Full discovery of its essential genome, including non-coding, regulatory and coding elements, is a prerequisite for understanding the complete regulatory network of a bacterial cell. Using hyper-saturated transposon mutagenesis coupled with high-throughput sequencing, we determined the essential Caulobacter genome at 8 bp resolution, including 1012 essential genome features: 480 ORFs, 402 regulatory sequences and 130 non-coding elements, including 90 intergenic segments of unknown function. The essential transcriptional circuitry for growth on rich media includes 10 transcription factors, 2 RNA polymerase sigma factors and 1 anti-sigma factor. We identified all essential promoter elements for the cell cycle-regulated genes. The essential elements are preferentially positioned near the origin and terminus of the chromosome. The high-resolution strategy used here is applicable to high-throughput, full genome essentiality studies and large-scale genetic perturbation experiments in a broad class of bacterial species.
doi:10.1038/msb.2011.58
PMCID: PMC3202797  PMID: 21878915
functional genomics; next-generation sequencing; systems biology; transposon mutagenesis
23.  A Catalog of Neutral and Deleterious Polymorphism in Yeast 
PLoS Genetics  2008;4(8):e1000183.
The abundance and identity of functional variation segregating in natural populations is paramount to dissecting the molecular basis of quantitative traits as well as human genetic diseases. Genome sequencing of multiple organisms of the same species provides an efficient means of cataloging rearrangements, insertion or deletion polymorphisms (InDels), and single-nucleotide polymorphisms (SNPs). While inbreeding depression and heterosis imply that a substantial amount of polymorphism is deleterious, distinguishing deleterious from neutral polymorphism remains a significant challenge. To identify deleterious and neutral DNA sequence variation within Saccharomyces cerevisiae, we sequenced the genomes of a vineyard strain and an oak tree strain and compared them to a reference genome. Among these three strains, 6% of the genome is variable, mostly attributable to variation in genome content that results from large InDels. Out of the 88,000 polymorphisms identified, 93% are SNPs and a small but significant fraction can be attributed to recent interspecific introgression and ectopic gene conversion. In comparison to the reference genome, there is substantial evidence for functional variation in gene content and structure that results from large InDels, frame-shifts, and polymorphic start and stop codons. Comparison of polymorphism to divergence reveals scant evidence for positive selection but an abundance of evidence for deleterious SNPs. We estimate that 12% of coding and 7% of noncoding SNPs are deleterious. Based on divergence among 11 yeast species, we identified 1,666 nonsynonymous SNPs that disrupt conserved amino acids and 1,863 noncoding SNPs that disrupt conserved noncoding motifs. The deleterious coding SNPs include those known to affect quantitative traits, and a subset of the deleterious noncoding SNPs occurs in the promoters of genes that show allele-specific expression, implying that some cis-regulatory SNPs are deleterious.
Our results show that the genome sequences of both closely and distantly related species provide a means of identifying deleterious polymorphisms that disrupt functionally conserved coding and noncoding sequences.
Author Summary
DNA sequence variation makes an important contribution to most traits that vary in natural populations. However, mapping mutations that underlie a trait of interest is a significant challenge. Genome sequencing of multiple organisms provides a complete list of DNA sequence differences responsible for any trait that differs among the organisms. Yet, distinguishing those DNA sequence variants that contribute to a trait from all other variants is not easy. Here, we sequence the genomes of two strains of yeast and, through comparisons with a reference genome, we catalog multiple types of DNA sequence variation among the three strains. Using a variety of comparative genomics methods, we show that a substantial fraction of DNA sequence variations has deleterious effects on fitness. Finally, we show that a subset of deleterious mutations is associated with changes in gene expression levels. Our results imply that comparative genomics methods will be a valuable approach to identifying DNA sequence changes underlying numerous traits of interest.
doi:10.1371/journal.pgen.1000183
PMCID: PMC2515631  PMID: 18769710
24.  A Genome-Wide Screen for Genetic Variants That Modify the Recruitment of REST to Its Target Genes 
PLoS Genetics  2012;8(4):e1002624.
Increasing numbers of human diseases are being linked to genetic variants, but our understanding of the mechanistic links leading from DNA sequence to disease phenotype is limited. The majority of disease-causing nucleotide variants fall within the non-protein-coding portion of the genome, making it likely that they act by altering gene regulatory sequences. We hypothesised that SNPs within the binding sites of the transcriptional repressor REST alter the degree of repression of target genes. Given that changes in the effective concentration of REST contribute to several pathologies (various cancers, Huntington's disease, cardiac hypertrophy, vascular smooth muscle proliferation), these SNPs should alter disease-susceptibility in carriers. We devised a strategy to identify SNPs that affect the recruitment of REST to target genes through the alteration of its DNA recognition element, the RE1. A multi-step screen combining genetic, genomic, and experimental filters yielded 56 polymorphic RE1 sequences with robust and statistically significant differences of affinity between alleles. These SNPs have a considerable effect on the functional recruitment of REST to DNA in a range of in vitro, reporter gene, and in vivo analyses. Furthermore, we observe allele-specific biases in deeply sequenced chromatin immunoprecipitation data, consistent with predicted differences in RE1 affinity. Amongst the targets of polymorphic RE1 elements are important disease genes including NPPA, PTPRT, and CDH4. Thus, considerable genetic variation exists in the DNA motifs that connect gene regulatory networks. Recently available ChIP–seq data allow the annotation of human genetic polymorphisms with regulatory information to generate prior hypotheses about their disease-causing mechanism.
Author Summary
Common human diseases such as cancer, heart disease, or epilepsy have a genetic component that predisposes particular individuals to suffer from them. Huge sums have been invested to map the regions of the human genome where small DNA variations, or SNPs (“single-nucleotide polymorphisms”), determine the probability of developing these diseases. A major problem with this approach, however, is that, once the culprit SNPs are discovered, we know very little about how they cause disease—which is critical if we are to use this information to develop drugs and therapies. In this study, we demonstrate a new approach, employing functional maps of the human genome that have recently been published. We begin with regions of the genome recognised by a gene repressor protein—REST—that is involved in a number of important human diseases. Using information on where REST binds in the human genome, we predict and validate common DNA variations that increase or decrease this binding. By affecting how much REST is recruited to important genes, these variations may predispose or protect individuals from a number of diseases. Studies like this show how we can use genomic information to gain a deeper understanding of the genetics behind human disease.
doi:10.1371/journal.pgen.1002624
PMCID: PMC3320604  PMID: 22496669
25.  A Population Genetic Approach to Mapping Neurological Disorder Genes Using Deep Resequencing 
PLoS Genetics  2011;7(2):e1001318.
Deep resequencing of functional regions in human genomes is key to identifying potentially causal rare variants for complex disorders. Here, we present the results from a large-sample resequencing (n = 285 patients) study of candidate genes coupled with population genetics and statistical methods to identify rare variants associated with Autism Spectrum Disorder and Schizophrenia. Three genes, MAP1A, GRIN2B, and CACNA1F, were consistently identified by different methods as having significant excess of rare missense mutations in either one or both disease cohorts. In a broader context, we also found that the overall site frequency spectrum of variation in these cases is best explained by population models of both selection and complex demography rather than neutral models or models accounting for complex demography alone. Mutations in the three disease-associated genes explained much of the difference in the overall site frequency spectrum among the cases versus controls. This study demonstrates that genes associated with complex disorders can be mapped using resequencing and analytical methods with sample sizes far smaller than those required by genome-wide association studies. Additionally, our findings support the hypothesis that rare mutations account for a proportion of the phenotypic variance of these complex disorders.
Author Summary
It is widely accepted that genetic factors play important roles in the etiology of neurological diseases. However, the nature of the underlying genetic variation remains unclear. Critical questions in the field of human genetics relate to the frequency and size effects of genetic variants associated with disease. For instance, the common disease–common variant model is based on the idea that sets of common variants explain a significant fraction of the variance found in common disease phenotypes. On the other hand, rare variants may have strong effects and therefore largely contribute to disease phenotypes. Due to their high penetrance and reduced fitness, such variants are maintained in the population at low frequencies, thus limiting their detection in genome-wide association studies. Here, we use a resequencing approach on a cohort of 285 Autism Spectrum Disorder and Schizophrenia patients and perform several analyses, enhanced with population genetic approaches, to identify variants associated with both diseases. Our results demonstrate an excess of rare variants in these disease cohorts and identify genes with negative (deleterious) selection coefficients, suggesting an accumulation of variants of detrimental effects. Our results present further evidence for rare variants explaining a component of the genetic etiology of autism and schizophrenia.
doi:10.1371/journal.pgen.1001318
PMCID: PMC3044677  PMID: 21383861
