A variety of extremely challenging biological sequence analyses were conducted on the XSEDE large shared memory resource Blacklight, using current bioinformatics tools and encompassing a wide range of scientific applications. These include genomic sequence assembly, very large metagenomic sequence assembly, transcriptome assembly, and sequencing error correction. The data sets used in these analyses included uncategorized fungal species, reference microbial data, very large soil and human gut microbiome sequence data, and primate transcriptomes, composed of both short-read and long-read sequence data. A new parallel command execution program was developed on the Blacklight resource to handle some of these analyses. These results, initially reported at XSEDE13 and expanded here, represent significant advances for their respective scientific communities. The breadth and depth of the results achieved demonstrate the ease of use, versatility, and unique capabilities of the Blacklight XSEDE resource for scientific analysis of genomic and transcriptomic sequence data, and the power of these resources, together with XSEDE support, in meeting the most challenging scientific problems.
bioinformatics; genomics; genome; transcriptome; de novo assembly; primates; RNA-seq; NGS; metagenome; large shared memory computing; high-performance computing; data-intensive computing
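As a rough illustration of what a parallel command execution helper does, the following Python sketch runs a list of independent commands concurrently and collects their exit codes and output. It is not the program developed on Blacklight; the function name `run_commands`, the `max_workers` parameter, and the toy commands are assumptions made purely for illustration.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_commands(commands, max_workers=4):
    """Run each command (a list of argv strings) in its own process.

    Returns a list of (returncode, stdout) tuples in the same order
    as the input commands. Hypothetical helper, for illustration only.
    """
    def run_one(argv):
        proc = subprocess.run(argv, capture_output=True, text=True)
        return proc.returncode, proc.stdout

    # Threads are sufficient here because each worker only waits
    # on an external process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, commands))


# Toy workload: four independent commands that each print a square.
commands = [[sys.executable, "-c", f"print({i} * {i})"] for i in range(4)]
results = run_commands(commands)
```

In a real analysis each command would be an independent assembly or error-correction job, and `max_workers` would be chosen to match the cores and memory available.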
The complex correlation structure of a collection of orthologous DNA sequences is uniquely captured by the “ancestral recombination graph” (ARG), a complete record of coalescence and recombination events in the history of the sample. However, existing methods for ARG inference are computationally intensive, highly approximate, or limited to small numbers of sequences, and, as a consequence, explicit ARG inference is rarely used in applied population genomics. Here, we introduce a new algorithm for ARG inference that is efficient enough to apply to dozens of complete mammalian genomes. The key idea of our approach is to sample an ARG of n chromosomes conditional on an ARG of n − 1 chromosomes, an operation we call “threading.” Using techniques based on hidden Markov models, we can perform this threading operation exactly, up to the assumptions of the sequentially Markov coalescent and a discretization of time. An extension allows for threading of subtrees instead of individual sequences. Repeated application of these threading operations results in highly efficient Markov chain Monte Carlo samplers for ARGs. We have implemented these methods in a computer program called ARGweaver. Experiments with simulated data indicate that ARGweaver converges rapidly to the posterior distribution over ARGs and is effective in recovering various features of the ARG for dozens of sequences generated under realistic parameters for human populations. In applications of ARGweaver to 54 human genome sequences from Complete Genomics, we find clear signatures of natural selection, including regions of unusually ancient ancestry associated with balancing selection and reductions in allele age in sites under directional selection. The patterns we observe near protein-coding genes are consistent with a primary influence from background selection rather than hitchhiking, although we cannot rule out a contribution from recurrent selective sweeps.
The unusual and complex correlation structure of population samples of genetic sequences presents a fundamental statistical challenge that pervades nearly all areas of population genetics. Historical recombination events produce an intricate network of intertwined genealogies, which impedes demography inference, the detection of natural selection, association mapping, and other applications. It is possible to capture these complex relationships using a representation called the ancestral recombination graph (ARG), which provides a complete description of coalescence and recombination events in the history of the sample. However, previous methods for ARG inference have not been adequately fast and accurate for practical use with large-scale genomic sequence data. In this article, we introduce a new algorithm for ARG inference that has vastly improved scaling properties. Our algorithm is implemented in a computer program called ARGweaver, which is fast enough to be applied to sequences megabases in length. With the aid of a large computer cluster, ARGweaver can be used to sample full ARGs for entire mammalian genome sequences. We show that ARGweaver performs well in simulation experiments and demonstrate that it can be used to provide new insights about both demographic processes and natural selection when applied to real human genome sequence data.
Complete genome sequences contain valuable information about natural selection, but this
information is difficult to access for short, widely scattered noncoding elements such as
transcription factor binding sites or small noncoding RNAs. Here, we introduce a new
computational method, called Inference of Natural Selection from Interspersed Genomically
coHerent elemenTs (INSIGHT), for measuring the influence of natural selection on such
elements. INSIGHT uses
a generative probabilistic model to contrast patterns of polymorphism and divergence in
the elements of interest with those in flanking neutral sites, pooling weak information
from many short elements in a manner that accounts for variation among loci in mutation
rates and coalescent times. The method is able to disentangle the contributions of weak
negative, strong negative, and positive selection based on their distinct effects on
patterns of polymorphism and divergence. It obtains information about divergence from
multiple outgroup genomes using a general statistical phylogenetic approach. The INSIGHT
model is efficiently fitted to genome-wide data using an approximate expectation
maximization algorithm. Using simulations, we show that the method can accurately estimate
the parameters of interest even in complex demographic scenarios, and that it
significantly improves on methods based on summary statistics describing polymorphism and
divergence. To demonstrate the usefulness of INSIGHT, we apply it to several classes of
human noncoding RNAs and to GATA2-binding sites in the human genome.
molecular evolution; population genetics; noncoding DNA; regulatory sequences; probabilistic graphical models
RNA polymerase II (Pol II) transcribes hundreds of kilobases of DNA, limiting the production of mRNAs and lncRNAs. We used Global Run-on Sequencing (GRO-seq) to measure the rates of transcription by Pol II following gene activation. Elongation rates vary as much as 4-fold at different genomic loci and in response to two distinct cellular signaling pathways [i.e., 17β-estradiol (E2) and TNFα]. The rates are slowest near the promoter and increase during the first ~15 kb transcribed. Gene body elongation rates correlate with Pol II density, resulting in systematically higher rates of transcript production at genes with higher Pol II density. Pol II dynamics following short inductions indicate that E2 stimulates gene expression by increasing Pol II initiation, whereas TNFα reduces Pol II residence time at pause sites. Collectively, our results identify previously uncharacterized variation in the rate of transcription and highlight elongation as an important, variable, and regulated rate-limiting step during transcription.
Elongation; Estrogen; Estrogen receptor (ERα); Gene regulation; GRO-seq; Nuclear factor kappa B (NF-κB); Rate; RNA Polymerase II; Signal-regulated transcription; Transcription; Tumor necrosis factor alpha (TNFα)
Streptococcus mutans is widely recognized as one of the key etiological agents of human dental caries. Despite its role in this important disease, our present knowledge of gene content variability across the species and its relationship to adaptation is minimal. Estimates of its demographic history are not available. In this study, we generated genome sequences of 57 S. mutans isolates, as well as representative strains of the most closely related species to S. mutans (S. ratti, S. macacae, and S. criceti), to identify the overall structure and potential adaptive features of the dispensable and core components of the genome. We also performed population genetic analyses on the core genome of the species, aimed at understanding its demographic history and the impact of selection in shaping its genetic variation. The maximum gene content divergence among strains was approximately 23%, with the majority of strains diverging by 5–15%. The core genome consisted of 1,490 genes and the pan-genome approximately 3,296. Maximum likelihood analysis of the synonymous site frequency spectrum (SFS) suggested that the S. mutans population started expanding exponentially approximately 10,000 years ago (95% confidence interval [CI]: 3,268–14,344 years ago), coincidental with the onset of human agriculture. Analysis of the replacement SFS indicated that a majority of these substitutions are under strong negative selection, and the remainder evolved neutrally. A set of 14 genes was identified as being under positive selection, most of which were involved in either sugar metabolism or acid tolerance. Analysis of the core genome suggested that among 73 genes present in all isolates of S. mutans but absent in other species of the mutans taxonomic group, the majority can be associated with metabolic processes that could have contributed to the successful adaptation of S. mutans to its new niche, the human mouth, and with the dietary changes that accompanied the origin of agriculture.
Streptococcus mutans; demographic inference; cavities; bacterial evolution; pan and core genome; infectious disease
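The site frequency spectrum (SFS) referred to above is a simple summary of a sample: for each count i, the number of segregating sites at which the derived allele appears in exactly i sequences. The Python sketch below computes an unfolded SFS from a toy 0/1 haplotype matrix. It only illustrates the summary itself, not the maximum likelihood demographic analysis of the study; the function name and the toy data are assumptions.

```python
from collections import Counter


def site_frequency_spectrum(haplotypes):
    """Return the unfolded SFS for a list of equal-length 0/1 haplotypes.

    Entry i-1 of the result counts the sites at which the derived
    allele (coded 1) is carried by exactly i of the n haplotypes.
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    length = len(haplotypes[0])
    if any(len(h) != length for h in haplotypes):
        raise ValueError("haplotypes must have equal length")

    # Derived-allele count at each site (column sums).
    counts = Counter(sum(h[j] for h in haplotypes) for j in range(length))

    # Only segregating sites (1..n-1 derived copies) enter the spectrum;
    # monomorphic and fixed-derived sites are excluded.
    return [counts.get(i, 0) for i in range(1, n)]


# Toy sample: 4 haplotypes, 5 sites (hypothetical data).
toy = [
    [0, 1, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 1, 0],
]
sfs = site_frequency_spectrum(toy)  # one singleton, one doubleton
```

A spectrum computed separately for synonymous and replacement sites is the kind of input that the demographic and selection analyses described above start from.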
For decades, it has been hypothesized that gene regulation has had a central role in human evolution, yet much remains unknown about the genome-wide impact of regulatory mutations. Here we use whole-genome sequences and genome-wide chromatin immunoprecipitation and sequencing data to demonstrate that natural selection has profoundly influenced human transcription factor binding sites since the divergence of humans from chimpanzees 4–6 million years ago. Our analysis uses a new probabilistic method, called INSIGHT, for measuring the influence of selection on collections of short, interspersed noncoding elements. We find that, on average, transcription factor binding sites have experienced somewhat weaker selection than protein-coding genes. However, the binding sites of several transcription factors show clear evidence of adaptation. Several measures of selection are strongly correlated with predicted binding affinity. Overall, regulatory elements seem to contribute substantially to both adaptive substitutions and deleterious polymorphisms with key implications for human evolution and disease.
To identify genetic changes underlying dog domestication and reconstruct their early evolutionary history, we generated high-quality genome sequences from three gray wolves, one from each of the three putative centers of dog domestication, two basal dog lineages (Basenji and Dingo) and a golden jackal as an outgroup. Analysis of these sequences supports a demographic model in which dogs and wolves diverged through a dynamic process involving population bottlenecks in both lineages and post-divergence gene flow. In dogs, the domestication bottleneck involved at least a 16-fold reduction in population size, a much more severe bottleneck than estimated previously. A sharp bottleneck in wolves occurred soon after their divergence from dogs, implying that the pool of diversity from which dogs arose was substantially larger than represented by modern wolf populations. We narrow the plausible range for the date of initial dog domestication to an interval spanning 11–16 thousand years ago, predating the rise of agriculture. In light of this finding, we expand upon previous work regarding the increase in copy number of the amylase gene (AMY2B) in dogs, which is believed to have aided digestion of starch in agricultural refuse. We find standing variation for amylase copy number variation in wolves and little or no copy number increase in the Dingo and Husky lineages. In conjunction with the estimated timing of dog origins, these results provide additional support to archaeological finds, suggesting the earliest dogs arose alongside hunter-gatherers rather than agriculturists. Regarding the geographic origin of dogs, we find that, surprisingly, none of the extant wolf lineages from putative domestication centers is more closely related to dogs, and, instead, the sampled wolves form a sister monophyletic clade. This result, in combination with dog-wolf admixture during the process of domestication, suggests that a re-evaluation of past hypotheses regarding dog origins is necessary.
The process of dog domestication is still poorly understood, largely because no studies thus far have leveraged deeply sequenced whole genomes from wolves and dogs to simultaneously evaluate support for the proposed source regions: East Asia, the Middle East, and Europe. To investigate dog origins, we sequence three wolf genomes from the putative centers of origin, two basal dog breeds (Basenji and Dingo), and a golden jackal as an outgroup. We find that none of the wolf lineages from the hypothesized domestication centers is supported as the source lineage for dogs, and that dogs and wolves diverged 11,000–16,000 years ago in a process involving extensive admixture and that was followed by a bottleneck in wolves. In addition, we investigate the amylase (AMY2B) gene family expansion in dogs, which has recently been suggested as being critical to domestication in response to increased dietary starch. We find standing variation in AMY2B copy number in wolves and show that some breeds, such as Dingo and Husky, lack the AMY2B expansion. This suggests that, at the beginning of the domestication process, dogs may have been characterized by a more carnivorous diet than their modern day counterparts, a diet held in common with early hunter-gatherers.
The prominent role of Horizontal Gene Transfer (HGT) in the evolution of bacteria is now well documented, but few studies have differentiated between evolutionary events that predominantly cause genes in one lineage to be replaced by homologs from another lineage (“replacing HGT”) and events that result in the addition of substantial new genomic material (“additive HGT”). Herein, we make use of the distinct phylogenetic signatures of replacing and additive HGTs in a genome-wide study of the important human pathogen Streptococcus pyogenes (SPY) and its close relatives S. dysgalactiae subspecies equisimilis (SDE) and S. dysgalactiae subspecies dysgalactiae (SDD). Using recently developed statistical models and computational methods, we find evidence for abundant gene flow of both kinds within each of the SPY and SDE clades and of reduced levels of exchange between SPY and SDD. In addition, our analysis strongly supports a pronounced asymmetry in SPY–SDE gene flow, favoring the SPY-to-SDE direction. This finding is of particular interest in light of the recent increase in virulence of pathogenic SDE. We find much stronger evidence for SPY–SDE gene flow among replacing than among additive transfers, suggesting a primary influence from homologous recombination between co-occurring SPY and SDE cells in human hosts. Putative virulence genes are correlated with transfer events, but this correlation is found to be driven by additive, not replacing, HGTs. The genes affected by additive HGTs are enriched for functions having to do with transposition, recombination, and DNA integration, consistent with previous findings, whereas replacing HGTs seem to influence a more diverse set of genes. Additive transfers are also found to be associated with evidence of positive selection. These findings shed new light on the manner in which HGT has shaped pathogenic bacterial genomes.
bacterial evolutionary genomics; recombination; Streptococcus pyogenes; Streptococcus dysgalactiae
To gain insights into evolutionary forces that have shaped the history of Bornean and Sumatran populations of orang-utans, we compare patterns of variation across more than 11 million single nucleotide polymorphisms found by previous mitochondrial and autosomal genome sequencing of 10 wild-caught orang-utans. Our analysis of the mitochondrial data yields a far more ancient split time between the two populations (∼3.4 million years ago) than estimates based on autosomal data (0.4 million years ago), suggesting a complex speciation process with moderate levels of primarily male migration. We find that the distribution of selection coefficients consistent with the observed frequency spectrum of autosomal non-synonymous polymorphisms in orang-utans is similar to the distribution in humans. Our analysis indicates that 35% of genes have evolved under detectable negative selection. Overall, our findings suggest that purifying natural selection, genetic drift, and a complex demographic history are the dominant drivers of genome evolution for the two orang-utan populations.
GC-biased gene conversion (gBGC) is a recombination-associated process that favors the fixation of G/C alleles over A/T alleles. In mammals, gBGC is hypothesized to contribute to variation in GC content, rapidly evolving sequences, and the fixation of deleterious mutations, but its prevalence and general functional consequences remain poorly understood. gBGC is difficult to incorporate into models of molecular evolution and so far has primarily been studied using summary statistics from genomic comparisons. Here, we introduce a new probabilistic model that captures the joint effects of natural selection and gBGC on nucleotide substitution patterns, while allowing for correlations along the genome in these effects. We implemented our model in a computer program, called phastBias, that can accurately detect gBGC tracts about 1 kilobase or longer in simulated sequence alignments. When applied to real primate genome sequences, phastBias predicts gBGC tracts that cover roughly 0.3% of the human and chimpanzee genomes and account for 1.2% of human-chimpanzee nucleotide differences. These tracts fall in clusters, particularly in subtelomeric regions; they are enriched for recombination hotspots and fast-evolving sequences; and they display an ongoing fixation preference for G and C alleles. They are also significantly enriched for disease-associated polymorphisms, suggesting that they contribute to the fixation of deleterious alleles. The gBGC tracts provide a unique window into historical recombination processes along the human and chimpanzee lineages. They supply additional evidence of long-term conservation of megabase-scale recombination rates accompanied by rapid turnover of hotspots. Together, these findings shed new light on the evolutionary, functional, and disease implications of gBGC. The phastBias program and our predicted tracts are freely available.
Interpreting patterns of DNA sequence variation in the genomes of closely related species is critically important for understanding the causes and functional effects of nucleotide substitutions. Classical models describe patterns of substitution in terms of the fundamental forces of mutation, recombination, neutral drift, and natural selection. However, an entirely separate force, called GC-biased gene conversion (gBGC), also appears to have an important influence on substitution patterns in many species. gBGC is a recombination-associated evolutionary process that favors the fixation of strong (G/C) over weak (A/T) alleles. In mammals, gBGC is thought to promote variation in GC content, rapidly evolving sequences, and the fixation of deleterious mutations. However, its genome-wide influence remains poorly understood, in part because it is difficult to incorporate gBGC into statistical models of evolution. In this paper, we describe a new evolutionary model that jointly describes the effects of selection and gBGC and apply it to the human and chimpanzee genomes. Our genome-wide predictions of gBGC tracts indicate that gBGC has been an important force in recent human evolution. Our publicly available computer program, called phastBias, and our genome-wide predictions will enable other researchers to consider gBGC in their analyses.
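Because gBGC favors strong (G/C) over weak (A/T) alleles, its simplest footprint is an excess of weak-to-strong over strong-to-weak substitutions. The Python sketch below tallies substitutions by class between an ancestral and a derived sequence. This is the kind of summary-statistic view that the probabilistic model described above is meant to go beyond, and it is not how phastBias works; the function name and the toy alignment are assumptions.

```python
STRONG = {"G", "C"}
WEAK = {"A", "T"}


def classify_substitutions(ancestral, derived):
    """Count substitutions by class between two aligned sequences.

    Classes are W>S (weak-to-strong), S>W (strong-to-weak), W>W and
    S>S. Columns with gaps or ambiguous bases are skipped.
    """
    if len(ancestral) != len(derived):
        raise ValueError("sequences must be aligned to equal length")

    counts = {"W>S": 0, "S>W": 0, "W>W": 0, "S>S": 0}
    valid = STRONG | WEAK
    for a, d in zip(ancestral.upper(), derived.upper()):
        if a == d or a not in valid or d not in valid:
            continue  # unchanged site, gap, or ambiguity code
        key = ("S" if a in STRONG else "W") + ">" + ("S" if d in STRONG else "W")
        counts[key] += 1
    return counts


# Toy aligned pair (hypothetical data).
counts = classify_substitutions("ATGCATNA", "GTACTC-A")
```

In this toy pair the W>S count (2) exceeds the S>W count (1). On real data, such a comparison would also need to account for base composition and mutation bias before it could be read as evidence of gBGC.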
A bacterial transcriptome of the primary etiological agent of human dental caries, Streptococcus mutans, is described here using deep RNA sequencing. Differential expression profiles of the transcriptome in the context of carbohydrate source, and of the presence or absence of the catabolite control protein CcpA, revealed good agreement with previously-published DNA microarrays. In addition, RNA-seq considerably expanded the repertoire of DNA sequences that showed statistically-significant changes in expression as a function of the presence of CcpA and growth carbohydrate. Novel mRNAs and small RNAs were identified, some of which were differentially expressed in conditions tested in this study, suggesting that the function of the S. mutans CcpA protein and the influence of carbohydrate sources have a more substantial impact on gene regulation than previously appreciated. Likewise, the data reveal that the mechanisms underlying prioritization of carbohydrate utilization are more diverse than what is currently understood. Collectively, this study demonstrates the validity of RNA-seq as a potentially more-powerful alternative to DNA microarrays in studying gene regulation in S. mutans because of the capacity of this approach to yield a more precise landscape of transcriptomic changes in response to specific mutations and growth conditions.
The gain, loss, and modification of gene regulatory elements may underlie a significant proportion of phenotypic changes on animal lineages. To investigate the gain of regulatory elements throughout vertebrate evolution we identified genome-wide sets of putative regulatory regions for five vertebrates, including human. These putative regulatory regions are conserved non-exonic elements (CNEEs), which are evolutionarily conserved yet do not overlap any mature transcript, coding or noncoding. We then inferred the branch on which each CNEE came under selective constraint. This analysis identified three extended periods in the evolution of gene regulatory elements. Early vertebrate evolution was characterized by regulatory gains near transcription factors and developmental genes, but this trend was replaced by innovations near extra-cellular signaling genes, and then innovations near post-translational protein modifiers.
We report the immediate effects of estrogen signaling on the transcriptome of breast cancer cells using Global Run-On and sequencing (GRO-seq). The data were analyzed using a new bioinformatic approach that allowed us to identify transcripts directly from the GRO-seq data. We found that estrogen signaling directly regulates a strikingly large fraction of the transcriptome in a rapid, robust, and unexpectedly transient manner. In addition to protein coding genes, estrogen regulates the distribution and activity of all three RNA polymerases, and virtually every class of non-coding RNA that has been described to date. We also identified a large number of previously undetected estrogen-regulated intergenic transcripts, many of which are found proximal to estrogen receptor binding sites. Collectively, our results provide the most comprehensive measurement of the primary and immediate estrogen effects to date and a resource for understanding rapid signal-dependent transcription in other systems.
Estrogen; Transcriptome; GRO-seq; Gene annotation; Transcript; Signal-regulated transcription
Comparison of related genomes has emerged as a powerful lens for genome interpretation. Here, we report the sequencing and comparative analysis of 29 eutherian genomes. We confirm that at least 5.5% of the human genome has undergone purifying selection, and report constrained elements covering ~4.2% of the genome. We use evolutionary signatures and comparison with experimental datasets to suggest candidate functions for ~60% of constrained bases. These elements reveal a small number of new coding exons, candidate stop codon readthrough events, and over 10,000 regions of overlapping synonymous constraint within protein-coding exons. We find 220 candidate RNA structural families, and nearly a million elements overlapping potential promoter, enhancer and insulator regions. We report specific amino acid residues that have undergone positive selection, 280,000 non-coding elements exapted from mobile elements, and ~1,000 primate- and human-accelerated elements. Overlap with disease-associated variants suggests our findings will be relevant for studies of human biology and health.
Besides their value for biomedicine, individual genome sequences are a rich source of information about human evolution. Here we describe an effort to estimate key evolutionary parameters from sequences for six individuals from diverse human populations. We use a Bayesian, coalescent-based approach to extract information about ancestral population sizes, divergence times, and migration rates from inferred genealogies at many neutrally evolving loci from across the genome. We introduce new methods for accommodating gene flow between populations and integrating over possible phasings of diploid genotypes. We also describe a custom pipeline for genotype inference to mitigate biases from heterogeneous sequencing technologies and coverage levels. Our analysis indicates that the San of Southern Africa diverged from other human populations 108–157 thousand years ago (kya), that Eurasians diverged from an ancestral African population 38–64 kya, and that the effective population size of the ancestors of all modern humans was ~9,000.
DNA sequence and local chromatin landscape act jointly to determine transcription factor (TF) binding intensity profiles. To disentangle these influences, we developed an experimental approach, called protein/DNA binding followed by high-throughput sequencing (PB–seq), that allows the binding energy landscape to be characterized genome-wide in the absence of chromatin. We applied our methods to the Drosophila Heat Shock Factor (HSF), which inducibly binds a target DNA sequence element (HSE) following heat shock stress. PB–seq involves incubating sheared naked genomic DNA with recombinant HSF, partitioning the HSF–bound and HSF–free DNA, and then detecting HSF–bound DNA by high-throughput sequencing. We compared PB–seq binding profiles with ones observed in vivo by ChIP–seq and developed statistical models to predict the observed departures from idealized binding patterns based on covariates describing the local chromatin environment. We found that DNase I hypersensitivity and tetra-acetylation of H4 were the most influential covariates in predicting changes in HSF binding affinity. We also investigated the extent to which DNA accessibility, as measured by digital DNase I footprinting data, could be predicted from MNase–seq data and the ChIP–chip profiles for many histone modifications and TFs, and found GAGA element associated factor (GAF), tetra-acetylation of H4, and H4K16 acetylation to be the most predictive covariates. Lastly, we generated an unbiased model of HSF binding sequences, which revealed distinct biophysical properties of the HSF/HSE interaction and a previously unrecognized substructure within the HSE. These findings provide new insights into the interplay between the genomic sequence and the chromatin landscape in determining transcription factor binding intensity.
Transcription factors (TFs) bind DNA to modulate levels of gene expression. TF binding sites change throughout development, in response to environmental stimuli, and different tissues have distinct TF binding profiles. The mechanism by which TFs discriminate between binding sites in a context dependent manner is an area of active research, but it is clear that the chromatin environment in which potential binding sites reside strongly influences binding. This study used the Heat Shock TF (HSF) to study the effect chromatin has upon induced HSF binding. We implemented an experimental technique to quantify all potential HSF binding sites in the genome. These data were incorporated into a modeling framework along with chromatin landscape information prior to HSF binding to accurately predict the intensities of inducible HSF binding sites. DNase I hypersensitivity and tetra-acetylation of H4 were the most influential covariates in the model. The binding data enabled the development of a more complete HSF/DNA interaction model, providing insight into the biophysical interaction of HSF trimer subunits and target DNA pentamers.
The ratio of genetic diversity on chromosome X to that on the autosomes is sensitive to both natural selection and demography. Based on whole-genome sequences of 69 females, we report that while this ratio increases with genetic distance from genes across populations, it is lower in Europeans than in West Africans independent of proximity to genes. This relative reduction is most parsimoniously explained by differences in demographic history without the need to invoke natural selection.
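To make the quantity concrete, the Python sketch below computes nucleotide diversity (the average number of pairwise differences per site) for X-linked and autosomal sequences and takes their ratio. It is a minimal illustration under stated assumptions, not the study's pipeline: it omits any normalization by divergence or mutation rate, and the function names and toy sequences are hypothetical. For reference, in a constant-size neutral population with equal numbers of breeding males and females, the expected ratio is 3/4, because there are three X chromosomes for every four autosomes.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Average number of pairwise differences per site (pi)."""
    n = len(seqs)
    if n < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be non-empty and equal length")

    # Sum of differences over all unordered pairs of sequences.
    total = sum(
        sum(a != b for a, b in zip(s, t))
        for s, t in combinations(seqs, 2)
    )
    pairs = n * (n - 1) // 2
    return total / (pairs * length)


def x_to_autosome_ratio(x_seqs, autosomal_seqs):
    """Ratio of X-linked to autosomal nucleotide diversity."""
    pi_a = nucleotide_diversity(autosomal_seqs)
    if pi_a == 0:
        raise ValueError("autosomal diversity is zero; ratio undefined")
    return nucleotide_diversity(x_seqs) / pi_a


# Toy data (hypothetical): three aligned sequences per compartment.
x_seqs = ["AAAA", "AAAT", "AAAA"]
a_seqs = ["AAAA", "AATT", "AAAA"]
ratio = x_to_autosome_ratio(x_seqs, a_seqs)
```

A ratio like this could then be compared across populations and across bins of genetic distance from genes, which is the comparison the abstract describes.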
The PHylogenetic Analysis with Space/Time models (PHAST) software package consists of a collection of command-line programs and supporting libraries for comparative genomics. PHAST is best known as the engine behind the Conservation tracks in the University of California, Santa Cruz (UCSC) Genome Browser. However, it also includes several other tools for phylogenetic modeling and functional element identification, as well as utilities for manipulating alignments, trees and genomic annotations. PHAST has been in development since 2002 and has now been downloaded more than 1000 times, but so far it has been released only as provisional (‘beta’) software. Here, we describe the first official release (v1.0) of PHAST, with improved stability, portability and documentation and several new features. We outline the components of the package and detail recent improvements. In addition, we introduce a new interface to the PHAST libraries from the R statistical computing environment, called RPHAST, and illustrate its use in a series of vignettes. We demonstrate that RPHAST can be particularly useful in applications involving both large-scale phylogenomics and complex statistical analyses. The R interface also makes the PHAST libraries accessible to non-C programmers, and is useful for rapid prototyping. PHAST v1.0 and RPHAST v1.0 are available for download at http://compgen.bscb.cornell.edu/phast, under the terms of an unrestrictive BSD-style license. RPHAST can also be obtained from the Comprehensive R Archive Network (CRAN; http://cran.r-project.org).
statistical phylogenetics; functional element identification
GC-biased gene conversion (gBGC) is a recombination-associated evolutionary process that accelerates the fixation of guanine or cytosine alleles, regardless of their effects on fitness. gBGC can increase the overall rate of substitutions, a hallmark of positive selection. Many fast-evolving genes and noncoding sequences in the human genome have GC-biased substitution patterns, suggesting that gBGC—in contrast to adaptive processes—may have driven the human changes in these sequences. To investigate this hypothesis, we developed a substitution model for DNA sequence evolution that quantifies the nonlinear interacting effects of selection and gBGC on substitution rates and patterns. Based on this model, we used a series of lineage-specific likelihood ratio tests to evaluate sequence alignments for evidence of changes in mode of selection, action of gBGC, or both. With a false positive rate of less than 5% for individual tests, we found that the majority (76%) of previously identified human accelerated regions are best explained without gBGC, whereas a substantial minority (19%) are best explained by the action of gBGC alone. Further, more than half (55%) have substitution rates that significantly exceed local estimates of the neutral rate, suggesting that these regions may have been shaped by positive selection rather than by relaxation of constraint. By distinguishing the effects of gBGC, relaxation of constraint, and positive selection we provide an integrated analysis of the evolutionary forces that shaped the fastest evolving regions of the human genome, which facilitates the design of targeted functional studies of adaptation in humans.
genome evolution; conserved noncoding elements; lineage-specific adaptation; human accelerated regions; GC-biased gene conversion
“Orangutan” is derived from the Malay term “man of the forest” and aptly describes the Southeast Asian great apes native to Sumatra and Borneo. The orangutan species, Pongo abelii (Sumatran) and Pongo pygmaeus (Bornean), are the most phylogenetically distant great apes from humans, thereby providing an informative perspective on hominid evolution. Here we present a Sumatran orangutan draft genome assembly and short read sequence data from five Sumatran and five Bornean orangutan genomes. Our analyses reveal that, compared to other primates, the orangutan genome has many unique features. Structural evolution of the orangutan genome has proceeded much more slowly than other great apes, evidenced by fewer rearrangements, less segmental duplication, a lower rate of gene family turnover and surprisingly quiescent Alu repeats, which have played a major role in restructuring other primate genomes. We also describe the first primate polymorphic neocentromere, found in both Pongo species, emphasizing the gradual evolution of orangutan genome structure. Orangutans have extremely low energy usage for a eutherian mammal, far lower than their hominid relatives. Adding their genome to the repertoire of sequenced primates illuminates new signals of positive selection in several pathways including glycolipid metabolism. From the population perspective, both Pongo species are deeply diverse; however, Sumatran individuals possess greater diversity than their Bornean counterparts, and more species-specific variation. Our estimate of Bornean/Sumatran speciation time, 400k years ago (ya), is more recent than most previous studies and underscores the complexity of the orangutan speciation process. Despite a smaller modern census population size, the Sumatran effective population size (Ne) expanded exponentially relative to the ancestral Ne after the split, while Bornean Ne declined over the same period.
Overall, the resources and analyses presented here offer new opportunities in evolutionary genomics, insights into hominid biology, and an extensive database of variation for conservation efforts.
The recent release of twenty-two new genome sequences has dramatically increased the data available for mammalian comparative genomics, but twenty of these new sequences are currently limited to ∼2× coverage. Here we examine the extent of sequencing error in these 2× assemblies, and its potential impact in downstream analyses. By comparing 2× assemblies with high-quality sequences from the ENCODE regions, we estimate the rate of sequencing error to be 1–4 errors per kilobase. While this error rate is fairly modest, sequencing error can still have surprising effects. For example, an apparent lineage-specific insertion in a coding region is more likely to reflect sequencing error than a true biological event, and the length distribution of coding indels is strongly distorted by error. We find that most errors are contributed by a small fraction of bases with low quality scores, in particular, by the ends of reads in regions of single-read coverage in the assembly. We explore several approaches for automatic sequencing error mitigation (SEM), making use of the localized nature of sequencing error, the fact that it is well predicted by quality scores, and information about errors that comes from comparisons across species. Our automatic methods for error mitigation cannot replace the need for additional sequencing, but they do allow substantial fractions of errors to be masked or eliminated at the cost of modest amounts of over-correction, and they can reduce the impact of error in downstream phylogenomic analyses. Our error-mitigated alignments are available for download.
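As a minimal illustration of why quality scores are useful for error mitigation, the sketch below masks bases whose Phred quality falls below a threshold. It is not the SEM procedure described above: the function names, the threshold of 20, and the choice of "N" as the mask character are assumptions made for this example, and only the standard Phred relation (error probability = 10^(-Q/10)) is taken as given.

```python
from typing import Sequence


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to the estimated per-base error probability."""
    return 10 ** (-q / 10)


def mask_low_quality(seq: str, quals: Sequence[int], min_q: int = 20) -> str:
    """Replace every base whose quality is below min_q with 'N'.

    min_q = 20 (about 1 error in 100 bases) is an illustrative default,
    not a value taken from the study.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must have the same length")
    return "".join(base if q >= min_q else "N" for base, q in zip(seq, quals))


if __name__ == "__main__":
    # Hypothetical read with low-quality bases at positions 1 and 3.
    print(mask_low_quality("ACGT", [30, 10, 25, 5]))
    print(phred_to_error_prob(20))
```

Masking trades a small loss of true sequence for fewer spurious differences, which is the over-correction cost the abstract mentions.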
Comparative genomics of closely related bacterial species with different pathogenesis and host preference can provide a means of identifying the specifics of adaptive differences. Streptococcus dysgalactiae (SD) comprises two subspecies: S. dysgalactiae subsp. equisimilis is both a human commensal organism and a human pathogen, and S. dysgalactiae subsp. dysgalactiae is strictly an animal pathogen. Here, we present complete genome sequences for both taxa, with analyses involving other species of Streptococcus but focusing on adaptation in the SD species group. We found little evidence for enrichment in biochemical categories of genes carried by each SD strain; however, differences in the virulence gene repertoire were apparent. Some of the differences could be ascribed to prophage and integrative conjugative elements. We identified approximately 9% of the nonrecombinant core genome to be under positive selection, some of which involved known virulence factors in other bacteria. Analyses of proteomes by pooling data across genes, by biochemical category, clade, or branch, provided evidence for increased rates of evolution in several gene categories, as well as external branches of the tree. Promoters were primarily evolving under purifying selection but with certain categories of genes evolving faster. Many of these fast-evolving categories were the same as those associated with rapid evolution in proteins. Overall, these results suggest that adaptation to changing environments and new hosts in the SD species group has involved the acquisition of key virulence genes along with selection of orthologous protein-coding loci and operon promoters.
Streptococcus dysgalactiae subsp. equisimilis; S. dysgalactiae subsp. dysgalactiae; gene content; molecular adaptation; promoter evolution
The largest genetic study to date of morphology in domestic dogs identifies genes
controlling nearly 100 morphological traits and identifies important trends in
phenotypic variation within this species.
Domestic dogs exhibit tremendous phenotypic diversity, including a greater
variation in body size than any other terrestrial mammal. Here, we generate a
high density map of canine genetic variation by genotyping 915 dogs from 80
domestic dog breeds, 83 wild canids, and 10 outbred African shelter dogs across
60,968 single-nucleotide polymorphisms (SNPs). Coupling this genomic resource
with external measurements from breed standards and individuals as well as
skeletal measurements from museum specimens, we identify 51 regions of the dog
genome associated with phenotypic variation among breeds in 57 traits. The
complex traits include average breed body size and external body dimensions and
cranial, dental, and long bone shape and size with and without allometric
scaling. In contrast to the results from association mapping of quantitative
traits in humans and domesticated plants, we find that across dog breeds, a
small number of quantitative trait loci (≤3) explain the majority of
phenotypic variation for most of the traits we studied. In addition, many
genomic regions show signatures of recent selection, with most of the highly
differentiated regions being associated with breed-defining traits such as body
size, coat characteristics, and ear floppiness. Our results demonstrate the
efficacy of mapping multiple traits in the domestic dog using a database of
genotyped individuals and highlight the important role human-directed selection
has played in altering the genetic architecture of key traits in this important
Dogs offer a unique system for the study of genes controlling morphology. DNA
from 915 dogs from 80 domestic breeds, as well as a set of feral dogs, was
tested at over 60,000 points of variation and the dataset analyzed using novel
methods to find loci regulating body size, head shape, leg length, ear position,
and a host of other traits. Because each dog breed has undergone strong
selection by breeders to have a particular appearance, there is a strong
footprint of selection in regions of the genome that are important for
controlling traits that define each breed. These analyses identified new regions
of the genome, or loci, that are important in controlling body size and shape.
Our results, which feature the largest number of domestic dogs studied at such a
high level of genetic detail, demonstrate the power of the dog as a model for
finding genes that control the body plan of mammals. Further, we show that the
remarkable diversity of form in the dog, in contrast to some other species
studied to date, appears to have a simple genetic basis dominated by genes of
Clusters of genes that evolved from single progenitors via repeated segmental duplications present significant challenges to the generation of a truly complete human genome sequence. Such clusters can confound both accurate sequence assembly and downstream computational analysis, yet they represent a hotbed of functional innovation, making them of extreme interest. We have developed an algorithm for reconstructing the evolutionary history of gene clusters using only human genomic sequence data, which allows the tempo of large-scale evolutionary events in human gene clusters to be estimated. We further propose an extension of the method to simultaneously reconstructing the evolutionary histories of orthologous gene clusters in multiple primates, which will facilitate primate comparative sequencing studies that aim to reconstruct their evolutionary history more fully.
alignment; computational molecular biology; genetic mapping; haplotypes; Markov chains