Histone modifications are important markers of function and chromatin state, yet the DNA sequence elements that direct them to specific genomic locations are poorly understood. Here, we identify hundreds of quantitative trait loci, genome-wide, that affect histone modification or RNA polymerase II (Pol II) occupancy in Yoruba lymphoblastoid cell lines (LCLs). In many cases, the same variant is associated with quantitative changes in multiple histone marks and Pol II, as well as in deoxyribonuclease I sensitivity and nucleosome positioning. Transcription factor binding site polymorphisms are correlated overall with differences in local histone modification, and we identify specific transcription factors whose binding leads to histone modification in LCLs. Furthermore, variants that affect chromatin at distal regulatory sites frequently also direct changes in chromatin and gene expression at associated promoters.
One goal of human genetics is to understand how the information for precise and dynamic gene expression programs is encoded in the genome. The interactions of transcription factors (TFs) with DNA regulatory elements clearly play an important role in determining gene expression outputs, yet the regulatory logic underlying functional transcription factor binding is poorly understood. Many studies have focused on characterizing the genomic locations of TF binding, yet it is unclear to what extent TF binding at any specific locus has functional consequences with respect to gene expression output. To evaluate the context of functional TF binding we knocked down 59 TFs and chromatin modifiers in one HapMap lymphoblastoid cell line. We then identified genes whose expression was affected by the knockdowns. We intersected the gene expression data with transcription factor binding data (based on ChIP-seq and DNase-seq) within 10 kb of the transcription start sites of expressed genes. This combination of data allowed us to infer functional TF binding. Using this approach, we found that only a small subset of genes bound by a factor were differentially expressed following the knockdown of that factor, suggesting that most interactions between TF and chromatin do not result in measurable changes in gene expression levels of putative target genes. We found that functional TF binding is enriched in regulatory elements that harbor a large number of TF binding sites, at sites with predicted higher binding affinity, and at sites that are enriched in genomic regions annotated as “active enhancers.”
An important question in genomics is to understand how a class of proteins called “transcription factors” controls the expression level of other genes in the genome in a cell-type-specific manner – a process that is essential to human development. One major approach to this problem is to study where these transcription factors bind in the genome, but this does not tell us about the effect of that binding on gene expression levels and it is generally accepted that much of the binding does not strongly influence gene expression. To address this issue, we artificially reduced the concentration of 59 different transcription factors in the cell and then examined which genes were impacted by the reduced transcription factor level. Our results implicate some attributes that might influence what binding is functional, but they also suggest that a simple model of functional vs. non-functional binding may not suffice.
Chromatin architectural proteins interact with nucleosomes to modulate chromatin accessibility and higher-order chromatin structure. While these proteins are almost certainly important for gene regulation they have been studied far less than the core histone proteins.
Here we describe the genomic distributions and functional roles of two chromatin architectural proteins: histone H1 and the high mobility group protein HMGD1 in Drosophila S2 cells. Using ChIP-seq, biochemical and gene specific approaches, we find that HMGD1 binds to highly accessible regulatory chromatin and active promoters. In contrast, H1 is primarily associated with heterochromatic regions marked with repressive histone marks. We find that the ratio of HMGD1 to H1 binding is a better predictor of gene activity than either protein by itself, which suggests that reciprocal binding between these proteins is important for gene regulation. Using knockdown experiments, we show that HMGD1 and H1 affect the occupancy of the other protein, change nucleosome repeat length and modulate gene expression.
Collectively, our data suggest that dynamic and mutually exclusive binding of H1 and HMGD1 to nucleosomes and their linker sequences may control the fluid chromatin structure that is required for transcriptional regulation. This study provides a framework to further study the interplay between chromatin architectural proteins and epigenetics in gene regulation.
Chromatin structure; Transcriptional regulation; Histone H1; High mobility group protein; Nucleosome repeat length
Current genome-wide association studies (GWAS) have high power to detect intermediate frequency SNPs making modest contributions to complex disease, but they are underpowered to detect rare alleles of large effect (RALE). This has led to speculation that the bulk of variation for most complex diseases is due to RALE. One concern with existing models of RALE is that they do not make explicit assumptions about the evolution of a phenotype and its molecular basis. Rather, much of the existing literature relies on arbitrary mapping of phenotypes onto genotypes obtained either from standard population-genetic simulation tools or from non-genetic models. We introduce a novel simulation of a 100-kilobase gene region, based on the standard definition of a gene, in which mutations are unconditionally deleterious, are continuously arising, have partially recessive and non-complementing effects on phenotype (analogous to what is widely observed for most Mendelian disorders), and are interspersed with neutral markers that can be genotyped. Genes evolving according to this model exhibit a characteristic GWAS signature consisting of an excess of marginally significant markers. Existing tests for an excess burden of rare alleles in cases have low power while a simple new statistic has high power to identify disease genes evolving under our model. The structure of linkage disequilibrium between causative mutations and significantly associated markers under our model differs fundamentally from that seen when rare causative markers are assumed to be neutral. Rather than tagging single haplotypes bearing a large number of rare causative alleles, we find that significant SNPs in a GWAS tend to tag single causative mutations of small effect relative to other mutations in the same gene. Our results emphasize the importance of evaluating the power to detect associations under models that are genetically and evolutionarily motivated.
Current GWA studies typically only explain a small fraction of heritable variation in complex traits, resulting in speculation that a large fraction of variation in such traits may be due to rare alleles of large effect (RALE). The most parsimonious evolutionary mechanism that results in an inverse relationship between the frequency and effect size of causative alleles is an equilibrium between newly arising deleterious mutations and selection eliminating those mutations. This assumption is not built into many current models of RALE and, as a result, power calculations may be misleading. We use forward population genetic simulations to explore the ability of GWAS to detect genes in which unconditionally deleterious, partially recessive mutations arise each generation. Our model is based on the standard definition of a gene as a region within which loss-of-function mutations fail to complement, consistent with the multi-allelic basis for Mendelian disorders. Our model predicts that it may not be uncommon for a single gene to contribute upwards of 5% to variation in a complex trait, and that such genes could be routinely detected via modified GWAS approaches.
Sub-Saharan Africa has been identified as the part of the world with the greatest human genetic diversity. This high level of diversity causes difficulties for genome-wide association (GWA) studies in African populations—for example, by reducing the accuracy of genotype imputation in African populations compared to non-African populations. Here, we investigate haplotype variation and imputation in Africa, using 253 unrelated individuals from 15 Sub-Saharan African populations. We identify the populations that provide the greatest potential for serving as reference panels for imputing genotypes in the remaining groups. Considering reference panels comprising samples of recent African descent in Phase 3 of the HapMap Project, we identify mixtures of reference groups that produce the maximal imputation accuracy in each of the sampled populations. We find that optimal HapMap mixtures and maximal imputation accuracies identified in detailed tests of imputation procedures can instead be predicted by using simple summary statistics that measure relationships between the pattern of genetic variation in a target population and the patterns in potential reference panels. Our results provide an empirical basis for facilitating the selection of reference panels in GWA studies of diverse human populations, especially those of African ancestry. Genet. Epidemiol. 35:766–780, 2011.
haplotype variation; imputation; linkage disequilibrium
Genetic clustering algorithms require a certain amount of data to produce informative results. In the common situation that individuals are sampled at several locations, we show how sample group information can be used to achieve better results when the amount of data is limited. New models are developed for the structure program, both for the cases of admixture and no admixture. These models work by modifying the prior distribution for each individual’s population assignment. The new prior distributions allow the proportion of individuals assigned to a particular cluster to vary by location. The models are tested on simulated data, and illustrated using microsatellite data from the CEPH Human Genome Diversity Panel. We demonstrate that the new models allow structure to be detected at lower levels of divergence, or with less data, than the original structure models or principal components methods, and that they are not biased towards detecting structure when it is not present. These models are implemented in a new version of structure which is freely available online at http://pritch.bsd.uchicago.edu/structure.html.
admixture; divergence; population structure; prior distribution
Although hypoxia is a major stress on physiological processes, several human populations have survived for millennia at high altitudes, suggesting that they have adapted to hypoxic conditions. This hypothesis was recently corroborated by studies of Tibetan highlanders, which showed that polymorphisms in candidate genes show signatures of natural selection as well as well-replicated association signals for variation in hemoglobin levels. We extended genomic analysis to two Ethiopian ethnic groups: Amhara and Oromo. For each ethnic group, we sampled low and high altitude residents, thus allowing genetic and phenotypic comparisons across altitudes and across ethnic groups. Genome-wide SNP genotype data were collected in these samples by using Illumina arrays. We find that variants associated with hemoglobin variation among Tibetans or other variants at the same loci do not influence the trait in Ethiopians. However, in the Amhara, SNP rs10803083 is associated with hemoglobin levels at genome-wide levels of significance. No significant genotype association was observed for oxygen saturation levels in either ethnic group. Approaches based on allele frequency divergence did not detect outliers in candidate hypoxia genes, but the most differentiated variants between high- and lowlanders have a clear role in pathogen defense. Interestingly, a significant excess of allele frequency divergence was consistently detected for genes involved in cell cycle control and DNA damage and repair, thus pointing to new pathways for high altitude adaptations. Finally, a comparison of CpG methylation levels between high- and lowlanders found several significant signals at individual genes in the Oromo.
Although hypoxia is a major stress on physiological processes, several human populations have survived for millennia at high altitudes, suggesting that they have adapted to hypoxic conditions. Consistent with this idea, previous studies have identified genetic variants in Tibetan highlanders associated with reduction in hemoglobin levels, an advantageous phenotype at high altitude. To compare the genetic bases of adaptations to high altitude, we collected genetic and epigenetic data in Ethiopians living at high and low altitude, respectively. We find that variants associated with hemoglobin variation among Tibetans or other variants at the same loci do not influence the trait in Ethiopians. However, we find a different variant that is significantly associated with hemoglobin levels in Ethiopians. Approaches based on the difference in allele frequency between high- and lowlanders detected strong signals in genes with a clear role in defense from pathogens, consistent with known differences in pathogens between altitudes. Finally, we found a few genome-wide significant epigenetic differences between altitudes. These results taken together imply that Ethiopian and Tibetan highlanders adapted to the same environmental stress through different variants and genetic loci.
The mapping of expression quantitative trait loci (eQTLs) has emerged as an important tool for linking genetic variation to changes in gene regulation [1-5]. However, it remains difficult to identify the causal variants underlying eQTLs and little is known about the regulatory mechanisms by which they act. To address this gap, we used DNaseI sequencing to measure chromatin accessibility in 70 Yoruba lymphoblastoid cell lines (LCLs), for which genome-wide genotypes and estimates of gene expression levels are also available [6-8]. We obtained a total of 2.7 billion uniquely mapped DNase-seq reads, which allowed us to produce genome-wide maps of chromatin accessibility for each individual. We identified 9,595 locations at which DNase-seq read depth correlates significantly with genotype at a nearby SNP or indel (FDR = 10%). We call such variants “DNaseI sensitivity Quantitative Trait Loci” (dsQTLs). We found that dsQTLs are strongly enriched within inferred transcription factor binding sites and are frequently associated with allele-specific changes in transcription factor binding. A substantial fraction (16%) of dsQTLs are also associated with variation in the expression levels of nearby genes (that is, these loci are also classified as eQTLs). Conversely, we estimate that as many as 55% of eQTL SNPs are also dsQTLs. Our observations indicate that dsQTLs are highly abundant in the human genome, and are likely to be important contributors to phenotypic variation.
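As a rough illustration of what "read depth correlates with genotype" means for a single site, the sketch below regresses per-individual read depth on genotype dosage (0, 1, or 2 copies of an allele) and reports the slope and Pearson correlation. This is a minimal sketch only: the abstract does not state the statistical procedure used in the study, the function name `genotype_association` and the toy numbers are hypothetical, and no normalization, covariate correction, p-value, or FDR control is attempted here.

```python
import math


def genotype_association(genotypes, depths):
    """Least-squares slope and Pearson r of read depth on genotype dosage.

    genotypes: allele counts per individual (0, 1 or 2).
    depths:    read depth per individual in one window (hypothetical input).
    """
    n = len(genotypes)
    if n != len(depths) or n < 3:
        raise ValueError("need matched inputs for at least three individuals")
    mean_g = sum(genotypes) / n
    mean_d = sum(depths) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    syy = sum((d - mean_d) ** 2 for d in depths)
    sxy = sum((g - mean_g) * (d - mean_d) for g, d in zip(genotypes, depths))
    if sxx == 0 or syy == 0:
        raise ValueError("genotype or depth has no variance at this site")
    slope = sxy / sxx                # change in depth per extra allele copy
    r = sxy / math.sqrt(sxx * syy)   # Pearson correlation
    return slope, r


# Toy data: depth rises with each additional copy of the allele.
slope, r = genotype_association([0, 1, 2, 0, 1, 2], [10, 20, 30, 12, 22, 32])
print(f"slope={slope:.2f} reads per allele, r={r:.3f}")
```

In a genome-wide scan a test of this kind would be repeated for every window and nearby variant, which is why a multiple-testing criterion such as the FDR threshold quoted above is needed.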
Many aspects of the historical relationships between populations in a species are reflected in genetic data. Inferring these relationships from genetic data, however, remains a challenging task. In this paper, we present a statistical model for inferring the patterns of population splits and mixtures in multiple populations. In our model, the sampled populations in a species are related to their common ancestor through a graph of ancestral populations. Using genome-wide allele frequency data and a Gaussian approximation to genetic drift, we infer the structure of this graph. We applied this method to a set of 55 human populations and a set of 82 dog breeds and wild canids. In both species, we show that a simple bifurcating tree does not fully describe the data; in contrast, we infer many migration events. While some of the migration events that we find have been detected previously, many have not. For example, in the human data, we infer that Cambodians trace approximately 16% of their ancestry to a population ancestral to other extant East Asian populations. In the dog data, we infer that both the boxer and basenji trace a considerable fraction of their ancestry (9% and 25%, respectively) to wolves subsequent to domestication and that East Asian toy breeds (the Shih Tzu and the Pekingese) result from admixture between modern toy breeds and “ancient” Asian breeds. Software implementing the model described here, called TreeMix, is available at http://treemix.googlecode.com.
With modern genotyping technology, it is now possible to obtain large amounts of genetic data from many populations in a species. An important question that can be addressed with these data is: what is the history of these populations? There is a long history in population genetics of inferring the relationships among populations as a bifurcating tree, analogous to phylogenetic trees for representing the evolution of species. However, it has long been recognized that, since populations from the same species exchange genes, simple bifurcating trees may be an incorrect representation of population histories. We have developed a method to address this issue, using a model which allows for both population splits and gene flow. In application to humans, we show that we are able to identify a number of both previously known and unknown episodes of gene flow in history, including gene flow into Cambodia of a population only distantly related to modern East Asia. In application to dogs, we show that the boxer and basenji breeds have a considerable component of ancestry from grey wolves subsequent to domestication.
Mapping of expression quantitative trait loci (eQTLs) is an important technique for studying how genetic variation affects gene regulation in natural populations. In a previous study using Illumina expression data from human lymphoblastoid cell lines, we reported that cis-eQTLs are especially enriched around transcription start sites (TSSs) and immediately upstream of transcription end sites (TESs). In this paper, we revisit the distribution of eQTLs using additional data from Affymetrix exon arrays and from RNA sequencing. We confirm that most eQTLs lie close to the target genes; that transcribed regions are generally enriched for eQTLs; that eQTLs are more abundant in exons than introns; and that the peak density of eQTLs occurs at the TSS. However, we find that the intriguing TES peak is greatly reduced or absent in the Affymetrix and RNA-seq data. Instead our data suggest that the TES peak observed in the Illumina data is mainly due to exon-specific QTLs that affect 3′ untranslated regions, where most of the Illumina probes are positioned. Nonetheless, we do observe an overall enrichment of eQTLs in exons versus introns in all three data sets, consistent with an important role for exonic sequences in gene regulation.
We present a high-coverage draft genome assembly of the aye-aye (Daubentonia madagascariensis), a highly unusual nocturnal primate from Madagascar. Our assembly totals ∼3.0 billion bp (3.0 Gb), roughly the size of the human genome, comprising ∼2.6 million scaffolds (N50 scaffold size = 13,597 bp) based on short paired-end sequencing reads. We compared the aye-aye genome sequence data with four other published primate genomes (human, chimpanzee, orangutan, and rhesus macaque) as well as with the mouse and dog genomes as nonprimate outgroups. Unexpectedly, we observed strong evidence for a relatively slow substitution rate in the aye-aye lineage compared with these and other primates. In fact, the aye-aye branch length is estimated to be ∼10% shorter than that of the human lineage, which is known for its low substitution rate. This finding may be explained, in part, by the protracted aye-aye life-history pattern, including late weaning and age of first reproduction relative to other lemurs. Additionally, the availability of this draft lemur genome sequence allowed us to polarize nucleotide and protein sequence changes to the ancestral primate lineage, a critical period in primate evolution for which the relevant fossil record is sparse. Finally, we identified 293,800 high-confidence single nucleotide polymorphisms in the donor individual for our aye-aye genome sequence, a captive-born individual from two wild-born parents. The resulting heterozygosity estimate of 0.051% is the lowest of any primate studied to date, which is understandable considering the aye-aye's extensive home-range size and relatively low population densities. Yet this level of genetic diversity also suggests that conservation efforts benefiting this unusual species should be prioritized, especially in the face of the accelerating degradation and fragmentation of Madagascar's forests.
genome assembly; molecular clock; primate evolution; lemur
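For readers unfamiliar with the N50 statistic quoted for the assembly, the sketch below computes it under the usual definition: the largest length L such that scaffolds of length at least L together cover at least half of the total assembly. The function name `n50` and the toy scaffold lengths are illustrative and are not taken from the aye-aye assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    if not lengths:
        raise ValueError("no scaffold lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy example: total length 54, and 10 + 9 + 8 = 27 reaches half of it.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

A small N50 relative to genome size, as in a short-read draft assembly, indicates that the sequence is split across many short scaffolds.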
Deleterious mutations present a significant obstacle to adaptive evolution. Deleterious mutations can inhibit the spread of linked adaptive mutations through a population; conversely, adaptive substitutions can increase the frequency of linked deleterious mutations and even result in their fixation. To assess the impact of adaptive mutations on linked deleterious mutations, we examined the distribution of deleterious and neutral amino acid polymorphism in the human genome. Within genomic regions that show evidence of recent hitchhiking, we find fewer neutral but a similar number of deleterious SNPs compared to other genomic regions. The higher ratio of deleterious to neutral SNPs is consistent with simulated hitchhiking events and implies that positive selection eliminates some deleterious alleles and increases the frequency of others. The distribution of disease-associated alleles is also altered in hitchhiking regions. Disease alleles within hitchhiking regions have been associated with auto-immune disorders, metabolic diseases, cancers, and mental disorders. Our results suggest that positive selection has had a significant impact on deleterious polymorphism and may be partly responsible for the high frequency of certain human disease alleles.
Deleterious mutations reduce fitness within natural populations and must be continually removed by natural selection. However, some deleterious mutations reach unexpectedly high frequencies. There are a number of mechanisms by which this could occur, including changes in genetic or environmental constraints. Here, we investigate the hypothesis that some deleterious mutations have hitchhiked to high frequency due to linkage to sites that have been under positive selection. Using a collated set of regions likely to have been influenced by positive selection, we find that the number of deleterious polymorphisms in hitchhiking and non-hitchhiking regions is similar, but that the ratio of deleterious to neutral polymorphism is higher in hitchhiking compared to non-hitchhiking regions. Both computer simulations and empirical data indicate that while hitchhiking eliminates many deleterious mutations, some are increased in frequency. The distribution of human disease-associated mutations is also altered in hitchhiking compared to non-hitchhiking regions. Together, our results provide evidence that hitchhiking has influenced the frequency of linked deleterious mutations in humans, implying that the evolutionary dynamics of advantageous and deleterious mutations may often depend on one another.
Counting k-mers (substrings of length k in DNA sequence data) is an essential component of many methods in bioinformatics, including genome and transcriptome assembly, metagenomic sequencing, and error correction of sequence reads. Although simple in principle, counting k-mers in large modern sequence data sets can easily overwhelm the memory capacity of standard computers. In current data sets, a large fraction (often more than 50%) of the storage capacity may be spent on storing k-mers that contain sequencing errors and which are typically observed only a single time in the data. These singleton k-mers are uninformative for many algorithms without some kind of error correction.
We present a new method that identifies all the k-mers that occur more than once in a DNA sequence data set. Our method does this using a Bloom filter, a probabilistic data structure that stores all the observed k-mers implicitly in memory with greatly reduced memory requirements. We then make a second sweep through the data to provide exact counts of all nonunique k-mers. For example data sets, we report up to 50% savings in memory usage compared to current software, with modest costs in computational speed. This approach may reduce memory requirements for any algorithm that starts by counting k-mers in sequence data with errors.
A reference implementation for this methodology, BFCounter, is written in C++ and is GPL licensed. It is available for free download at http://pritch.bsd.uchicago.edu/bfcounter.html
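The two-pass idea described above can be sketched in a few lines of Python. This is an illustrative toy and not the BFCounter implementation: the class `BloomFilter`, the function `count_nonunique_kmers`, the filter size, and the number of hash functions are all assumptions chosen for readability, and reverse-complement (canonical) k-mers are ignored. The first pass records first sightings only in the Bloom filter and promotes a k-mer to the candidate table when the filter reports it as already seen; the second pass counts the candidates exactly, which also removes any singletons admitted by Bloom filter false positives.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over strings (sizes here are arbitrary)."""

    def __init__(self, num_bits=1 << 20, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(read, k):
    """Yield every substring of length k in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_nonunique_kmers(reads, k):
    """Exact counts of k-mers seen more than once; reads is iterated twice."""
    seen = BloomFilter()
    candidates = {}
    # Pass 1: only k-mers the filter has (probably) seen before get a table entry.
    for read in reads:
        for kmer in kmers(read, k):
            if kmer in seen:
                candidates[kmer] = 0
            else:
                seen.add(kmer)
    # Pass 2: exact counting, restricted to the candidates.
    for read in reads:
        for kmer in kmers(read, k):
            if kmer in candidates:
                candidates[kmer] += 1
    # False positives from the filter show up as count 1; drop them.
    return {kmer: n for kmer, n in candidates.items() if n > 1}


print(count_nonunique_kmers(["ACGTACGT", "TTACGTAA"], 4))
```

The memory saving comes from the fact that singleton k-mers occupy only a few bits in the filter instead of a full entry in the hash table; the cost is the second sweep through the data.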
Motivation: Sequencing-based assays such as ChIP-seq, DNase-seq and MNase-seq have become important tools for genome annotation. In these assays, short sequence reads enriched for loci of interest are mapped to a reference genome to determine their origin. Here, we consider whether false positive peak calls can be caused by a particular type of error in the reference genome: multicopy sequences which have been incorrectly assembled and collapsed into a single copy.
Results: Using sequencing data from the 1000 Genomes Project, we systematically scanned the human genome for regions of high sequencing depth. These regions are highly enriched for erroneously inferred transcription factor binding sites, positions of nucleosomes and regions of open chromatin. We suggest a simple masking procedure to remove these regions and reduce false positive calls.
Availability: Files for masking out these regions are available at eqtl.uchicago.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
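A masking step of the kind suggested in the Results can be sketched as follows. This is a hedged illustration rather than the published procedure: the abstract does not specify how high-depth regions were defined, so the fixed `threshold`, the `min_length` parameter, and the function names `high_depth_regions` and `mask_peaks` are assumptions. Intervals are half-open (start, end) positions on a single chromosome.

```python
def high_depth_regions(depths, threshold, min_length=1):
    """Return (start, end) runs of positions whose depth exceeds threshold."""
    regions = []
    start = None
    for pos, depth in enumerate(depths):
        if depth > threshold:
            if start is None:
                start = pos          # a new high-depth run begins here
        elif start is not None:
            if pos - start >= min_length:
                regions.append((start, pos))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(depths) - start >= min_length:
        regions.append((start, len(depths)))
    return regions


def mask_peaks(peaks, masked):
    """Drop peaks that overlap any masked interval."""
    return [
        (p_start, p_end)
        for p_start, p_end in peaks
        if not any(p_start < m_end and m_start < p_end for m_start, m_end in masked)
    ]


# Toy per-base depths with two suspiciously deep stretches.
depths = [5, 5, 50, 60, 55, 5, 5, 70, 5]
masked = high_depth_regions(depths, threshold=30)
print(masked)
print(mask_peaks([(0, 2), (4, 6), (8, 9)], masked))
```

In practice the threshold would be chosen relative to the genome-wide depth distribution, and the resulting intervals would be written to a mask file applied before or after peak calling.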
Understanding the genetic mechanisms underlying natural variation in gene expression is a central goal of both medical and evolutionary genetics, and studies of expression quantitative trait loci (eQTLs) have become an important tool for achieving this goal [1]. Although all eQTL studies so far have assayed messenger RNA levels using expression microarrays, recent advances in RNA sequencing enable the analysis of transcript variation at unprecedented resolution. We sequenced RNA from 69 lymphoblastoid cell lines derived from unrelated Nigerian individuals that have been extensively genotyped by the International HapMap Project [2]. By pooling data from all individuals, we generated a map of the transcriptional landscape of these cells, identifying extensive use of unannotated untranslated regions and more than 100 new putative protein-coding exons. Using the genotypes from the HapMap project, we identified more than a thousand genes at which genetic variation influences overall expression levels or splicing. We demonstrate that eQTLs near genes generally act by a mechanism involving allele-specific expression, and that variation that influences the inclusion of an exon is enriched within and near the consensus splice sites. Our results illustrate the power of high-throughput sequencing for the joint analysis of variation in transcription, splicing and allele-specific expression across individuals.
Humans inhabit a remarkably diverse range of environments, and adaptation through natural selection has likely played a central role in the capacity to survive and thrive in extreme climates. Unlike numerous studies that used only population genetic data to search for evidence of selection, here we scan the human genome for selection signals by identifying the SNPs with the strongest correlations between allele frequencies and climate across 61 worldwide populations. We find a striking enrichment of genic and nonsynonymous SNPs relative to non-genic SNPs among those that are strongly correlated with these climate variables. Among the most extreme signals, several overlap with those from GWAS, including SNPs associated with pigmentation and autoimmune diseases. Further, we find an enrichment of strong signals in gene sets related to UV radiation, infection and immunity, and cancer. Our results imply that adaptations to climate shaped the spatial distribution of variation in humans.
Classical studies that examined the global distributions of human physiological traits such as pigmentation, basal metabolic rate, and body shape and size suggested that natural selection related to climate has been important during recent human evolutionary history. We scanned the human genome using data for about 650,000 variants in 61 worldwide populations to look for correlations between allele frequencies and 9 climate variables and found evidence for adaptations to climate at the genome-wide level. In addition, we detected compelling signals for individual SNPs involved in pigmentation and immune response, as well as for pathways related to UV radiation, infection and immunity, and cancer. A particularly appealing aspect of this approach is that we identify a set of candidate advantageous SNPs associated with specific biological hypotheses, which will be useful for follow-up testing. We developed an online resource to browse the results of our data analyses, allowing researchers to quickly assess evidence for selection in a particular genomic region and to compare it across several studies.
The modification of DNA by methylation is an important epigenetic mechanism that affects the spatial and temporal regulation of gene expression. Methylation patterns have been described in many contexts within and across a range of species. However, the extent to which changes in methylation might underlie inter-species differences in gene regulation, in particular between humans and other primates, has not yet been studied. To this end, we studied DNA methylation patterns in livers, hearts, and kidneys from multiple humans and chimpanzees, using tissue samples for which genome-wide gene expression data were also available. Using the multi-species gene expression and methylation data for 7,723 genes, we were able to study the role of promoter DNA methylation in the evolution of gene regulation across tissues and species. We found that inter-tissue methylation patterns are often conserved between humans and chimpanzees. However, we also found a large number of gene expression differences between species that might be explained, at least in part, by corresponding differences in methylation levels. In particular, we estimate that, in the tissues we studied, inter-species differences in promoter methylation might underlie as much as 12%–18% of differences in gene expression levels between humans and chimpanzees.
It has long been hypothesized that changes in gene regulation have played an important role in primate evolution. However, despite the wealth of comparative gene expression data, there are still only a few studies that focus on the mechanisms underlying inter-primate differences in gene regulation. In particular, we know relatively little about the degree to which changes in epigenetic profiles might explain differences in gene expression levels between primates. To this end, we studied DNA methylation and gene expression levels in livers, hearts, and kidneys from multiple humans and chimpanzees. Using these comparative data, we were able to study the evolution of gene regulation in the context of conservation of or changes in DNA methylation profiles across tissues and species. We found that inter-tissue methylation patterns are often conserved between humans and chimpanzees. In addition, we found a large number of gene expression differences between species, which might be explained, at least in part, by corresponding differences in methylation levels. We estimate that, in the tissues we studied, inter-species differences in methylation levels might underlie as much as 12%–18% of differences in gene expression levels between humans and chimpanzees.
While the majority of multiexonic human genes show some evidence of alternative splicing, it is unclear what fraction of observed splice forms is functionally relevant. In this study, we examine the extent of alternative splicing in human cells using deep RNA sequencing and de novo identification of splice junctions. We demonstrate the existence of a large class of low abundance isoforms, encompassing approximately 150,000 previously unannotated splice junctions in our data. Newly-identified splice sites show little evidence of evolutionary conservation, suggesting that the majority are due to erroneous splice site choice. We show that sequence motifs involved in the recognition of exons are enriched in the vicinity of unconserved splice sites. We estimate that the average intron has a splicing error rate of approximately 0.7% and show that introns in highly expressed genes are spliced more accurately, likely due to their shorter length. These results implicate noisy splicing as an important property of genome evolution.
Most human genes are split into pieces, such that the protein-coding parts (exons) are separated in the genome by large tracts of non-coding DNA (introns) that must be transcribed and spliced out to create a functional transcript. Variation in splicing reactions can create multiple transcripts from the same gene, yet the function of many of these alternative transcripts is unknown. In this study, we show that many of these transcripts are due to splicing errors that are not preserved over evolutionary time. We estimate that the error rate in the splicing of an intron is about 0.7% and demonstrate that there are two major types of splicing error: errors in the recognition of exons and errors in the precise choice of splice site. These results raise the possibility that variation in levels of alternative splicing across species may in part be due to variation in splicing error rate.
There has long been interest in understanding the genetic basis of human adaptation. To what extent are phenotypic differences among human populations driven by natural selection? With the recent arrival of large genome-wide data sets on human variation, there is now unprecedented opportunity for progress on this type of question. Several lines of evidence argue for an important role of positive selection in shaping human variation and differences among populations. These include studies of comparative morphology and physiology, as well as population genetic studies of candidate loci and genome-wide data. However, the data also suggest that it is unusual for strong selection to drive new mutations rapidly to fixation in particular populations (the ‘hard sweep’ model). We argue, instead, for alternatives to the hard sweep model: in particular, polygenic adaptation could allow rapid adaptation while not producing classical signatures of selective sweeps. We close by discussing some of the likely opportunities for progress in the field.
Recently, the observation of a high-frequency private allele, the 9-repeat allele at microsatellite D9S1120, in all sampled Native American and Western Beringian populations has been interpreted as evidence that all modern Native Americans descend primarily from a single founding population. However, this inference assumed that all copies of the 9-repeat allele were identical by descent and that the geographic distribution of this allele had not been influenced by natural selection. To investigate whether these assumptions are satisfied, we genotyped 34 single nucleotide polymorphisms across ∼500 kilobases (kb) around D9S1120 in 21 Native American and Western Beringian populations and 54 other worldwide populations. All chromosomes with the 9-repeat allele share the same haplotypic background in the vicinity of D9S1120, suggesting that all sampled copies of the 9-repeat allele are identical by descent. Ninety-one percent of these chromosomes share the same 76.26 kb haplotype, which we call the “American Modal Haplotype” (AMH). Three observations lead us to conclude that the high frequency and widespread distribution of the 9-repeat allele are unlikely to be the result of positive selection: 1) aside from its association with the 9-repeat allele, the AMH does not have a high frequency in the Americas, 2) the AMH is not unusually long for its frequency compared with other haplotypes in the Americas, and 3) in Latin American mestizo populations, the proportion of Native American ancestry at D9S1120 is not unusual compared with that observed at other genomewide microsatellites. Using a new method for estimating the time to the most recent common ancestor (MRCA) of all sampled copies of an allele on the basis of an estimate of the length of the genealogy descended from the MRCA, we calculate the mean time to the MRCA of the 9-repeat allele to be between 7,325 and 39,900 years, depending on the demographic model used. 
The results support the hypothesis that all modern Native Americans and Western Beringians trace a large portion of their ancestry to a single founding population that may have been isolated from other Asian populations prior to expanding into the Americas.
private allele; D9S1120; Homo sapiens; Native American; migration
Motivation: Next-generation sequencing has become an important tool for genome-wide quantification of DNA and RNA. However, a major technical hurdle lies in the need to map short sequence reads back to their correct locations in a reference genome. Here, we investigate the impact of SNP variation on the reliability of read-mapping in the context of detecting allele-specific expression (ASE).
Results: We generated 16 million 35 bp reads from mRNA of each of two HapMap Yoruba individuals. When we mapped these reads to the human genome we found that, at heterozygous SNPs, there was a significant bias toward higher mapping rates of the allele in the reference sequence, compared with the alternative allele. Masking known SNP positions in the genome sequence eliminated the reference bias but, surprisingly, did not lead to more reliable results overall. We find that even after masking, ∼5–10% of SNPs still have an inherent bias toward more effective mapping of one allele. Filtering out inherently biased SNPs removes 40% of the top signals of ASE. The remaining SNPs showing ASE are enriched in genes previously known to harbor cis-regulatory variation or known to show uniparental imprinting. Our results have implications for a variety of applications involving detection of alternate alleles from short-read sequence data.
Availability: Scripts, written in Perl and R, for simulating short reads, masking SNP variation in a reference genome and analyzing the simulation output are available upon request from JFD. Raw short read data were deposited in GEO (http://www.ncbi.nlm.nih.gov/geo/) under accession number GSE18156.
Supplementary information: Supplementary data are available at Bioinformatics online.
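As a rough illustration of the masking and allele-count comparison described in the abstract above, the following Python sketch replaces known SNP positions in a reference string with a placeholder base, then applies an exact two-sided binomial test to the reference and alternative read counts at one heterozygous SNP. It is not the authors' Perl/R code: the function names, the choice of `N` as the mask character, the 0-based coordinates, and the `p_ref` parameter are all assumptions made for this example.

```python
from math import comb
from typing import Iterable


def mask_snps(reference: str, snp_positions: Iterable[int], mask_char: str = "N") -> str:
    """Return a copy of `reference` with each 0-based SNP position set to `mask_char`.

    Hypothetical helper: masking with 'N' is an assumption, not the paper's stated method.
    """
    seq = list(reference)
    for pos in snp_positions:
        if not 0 <= pos < len(seq):
            raise ValueError(f"SNP position {pos} is outside the sequence")
        seq[pos] = mask_char
    return "".join(seq)


def binomial_two_sided_p(ref_reads: int, alt_reads: int, p_ref: float = 0.5) -> float:
    """Exact two-sided binomial p-value for allelic imbalance at one heterozygous SNP.

    `p_ref` is the expected fraction of reads carrying the reference allele
    under the null hypothesis (0.5 if mapping is assumed to be unbiased).
    """
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0  # no reads, no evidence of imbalance
    # Probability of every possible reference-read count under the null.
    pmf = [comb(n, k) * p_ref**k * (1 - p_ref) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_reads]
    # Sum the outcomes that are at least as unlikely as the observed one;
    # the small tolerance guards against floating-point ties.
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))
```

For instance, `mask_snps("ACGTACGT", [1, 5])` gives `"ANGTANGT"`. Setting `p_ref` away from 0.5 is one hypothetical way to account for a SNP whose mapping remains biased after masking.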
When performing association studies in populations that have not been the focus of large-scale investigations of haplotype variation, it is often helpful to rely on genomic databases in other populations for study design and analysis – such as in the selection of tag SNPs and in the imputation of missing genotypes. One way of improving the use of these databases is to rely on a mixture of database samples that is similar to the population of interest, rather than using the single most similar database sample. We demonstrate the effectiveness of the mixture approach in the application of African, European, and East Asian HapMap samples for tag SNP selection in populations from India, a genetically intermediate region underrepresented in genomic studies of haplotype variation.
Various observations argue for a role of adaptation in recent human evolution, including results from genome-wide studies and analyses of selection signals at candidate genes. Here, we use genome-wide SNP data from the HapMap and CEPH-Human Genome Diversity Panel samples to study the geographic distributions of putatively selected alleles at a range of geographic scales. We find that the average allele frequency divergence is highly predictive of the most extreme FST values across the whole genome. On a broad scale, the geographic distribution of putatively selected alleles almost invariably conforms to population clusters identified using randomly chosen genetic markers. Given this structure, there are surprisingly few fixed or nearly fixed differences between human populations. Among the nearly fixed differences that do exist, nearly all are due to fixation events that occurred outside of Africa, and most appear in East Asia. These patterns suggest that selection is often weak enough that neutral processes—especially population history, migration, and drift—exert powerful influences over the fate and geographic distribution of selected alleles.
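To make the FST statistic mentioned above concrete, here is a minimal Python sketch of the textbook definition for a single biallelic SNP in two populations, FST = (H_T - H_S) / H_T, where H_S is the mean within-population expected heterozygosity and H_T is the expected heterozygosity at the pooled allele frequency. The abstract does not say which estimator the study used, so this sketch assumes equal population weights and no sample-size correction, and the function name is hypothetical.

```python
def fst_two_populations(p1: float, p2: float) -> float:
    """Wright's F_ST for one biallelic SNP from two population allele frequencies.

    Assumes equal population weights and no sample-size correction.
    """
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("allele frequencies must lie in [0, 1]")
    p_bar = (p1 + p2) / 2                      # pooled allele frequency
    h_total = 2 * p_bar * (1 - p_bar)          # H_T: heterozygosity of the pooled population
    if h_total == 0.0:
        return 0.0                             # site is monomorphic in both populations
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # H_S: mean within-population value
    return (h_total - h_within) / h_total
```

Under this definition a fixed difference (frequencies 0 and 1) gives an FST of 1, and identical frequencies give 0, which matches the abstract's use of extreme FST values as markers of fixed or nearly fixed differences.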
Since the beginning of the study of evolution, people have been fascinated by recent human evolution and adaptation. Despite great progress in our understanding of human history, we still know relatively little about the selection pressures and historical factors that have been important over the past 100,000 years. In that time human populations have spread around the world and adapted in a wide variety of ways to the new environments they have encountered. Here, we investigate the genomic signal of these adaptations using a large set of geographically diverse human populations typed at thousands of genetic markers across the genome. We find that patterns at selected loci are predictable from the patterns found at all markers genome-wide. On the basis of this, we argue that selection has been strongly constrained by the historical relationships and gene flow between populations.