Heart development is exquisitely sensitive to the precise temporal regulation of thousands of genes that govern developmental decisions during differentiation. However, we currently lack a detailed understanding of how chromatin and gene expression patterns are coordinated during developmental transitions in the cardiac lineage. Here, we interrogated the transcriptome and several histone modifications across the genome during defined stages of cardiac differentiation. We find distinct chromatin patterns that are coordinated with stage-specific expression of functionally related genes, including many human disease-associated genes. Moreover, we discover a novel pre-activation chromatin pattern at the promoters of genes associated with heart development and cardiac function. We further identify stage-specific distal enhancer elements and find enriched DNA binding motifs within these regions that predict sets of transcription factors that orchestrate cardiac differentiation. Together, these findings form a basis for understanding developmentally regulated chromatin transitions during lineage commitment and the molecular etiology of congenital heart disease.
Even in genomes lacking operons, a gene's position in the genome influences its potential for expression. The mechanisms by which adjacent genes are co-expressed are still not completely understood. Using lactation and the mammary gland as a model system, we explore the hypothesis that chromatin state contributes to the co-regulation of gene neighborhoods. The mammary gland represents a unique evolutionary model, due to its recent appearance, in the context of vertebrate genomes. An understanding of how the mammary gland is regulated to produce milk is also of biomedical and agricultural importance for human lactation and dairying. Here, we integrate epigenomic and transcriptomic data to develop a comprehensive regulatory model. Neighborhoods of mammary-expressed genes were determined using expression data derived from pregnant and lactating mice and a neighborhood scoring tool, G-NEST. Regions of open and closed chromatin were identified by ChIP-Seq of histone modifications H3K36me3, H3K4me2, and H3K27me3 in the mouse mammary gland and liver tissue during lactation. We found that neighborhoods of genes in regions of uniquely active chromatin in the lactating mammary gland, compared with liver tissue, were extremely rare. Rather, genes in most neighborhoods were suppressed during lactation as reflected in their expression levels and their location in regions of silenced chromatin. Chromatin silencing was largely shared between the liver and mammary gland during lactation, and what distinguished the mammary gland was mainly a small tissue-specific repertoire of isolated, expressed genes. These findings suggest that an advantage of the neighborhood organization is in the collective repression of groups of genes via a shared mechanism of chromatin repression. Genes essential to the mammary gland's uniqueness are isolated from neighbors, and likely have less tolerance for variation in expression, properties they share with genes responsible for an organism's survival.
Genomic approaches to characterizing bacterial communities are revealing significant differences in diversity and composition between environments. But bacterial distributions have not been mapped at a global scale. Although current community surveys are way too sparse to map global diversity patterns directly, there is now sufficient data to fit accurate models of how bacterial distributions vary across different environments and to make global scale maps from these models. We apply this approach to map the global distributions of bacteria in marine surface waters. Our spatially and temporally explicit predictions suggest that bacterial diversity peaks in temperate latitudes across the world's oceans. These global peaks are seasonal, occurring 6 months apart in the two hemispheres, in the boreal and austral winters. This pattern is quite different from the tropical, seasonally consistent diversity patterns observed for most macroorganisms. However, like other marine organisms, surface water bacteria are particularly diverse in regions of high human environmental impacts on the oceans. Our maps provide the first picture of bacterial distributions at a global scale and suggest important differences between the diversity patterns of bacteria compared with other organisms.
bacteria; marine; species distribution model; niche model; diversity gradient; range map
GC-biased gene conversion (gBGC) is a recombination-associated process that favors the fixation of G/C alleles over A/T alleles. In mammals, gBGC is hypothesized to contribute to variation in GC content, rapidly evolving sequences, and the fixation of deleterious mutations, but its prevalence and general functional consequences remain poorly understood. gBGC is difficult to incorporate into models of molecular evolution and so far has primarily been studied using summary statistics from genomic comparisons. Here, we introduce a new probabilistic model that captures the joint effects of natural selection and gBGC on nucleotide substitution patterns, while allowing for correlations along the genome in these effects. We implemented our model in a computer program, called phastBias, that can accurately detect gBGC tracts about 1 kilobase or longer in simulated sequence alignments. When applied to real primate genome sequences, phastBias predicts gBGC tracts that cover roughly 0.3% of the human and chimpanzee genomes and account for 1.2% of human-chimpanzee nucleotide differences. These tracts fall in clusters, particularly in subtelomeric regions; they are enriched for recombination hotspots and fast-evolving sequences; and they display an ongoing fixation preference for G and C alleles. They are also significantly enriched for disease-associated polymorphisms, suggesting that they contribute to the fixation of deleterious alleles. The gBGC tracts provide a unique window into historical recombination processes along the human and chimpanzee lineages. They supply additional evidence of long-term conservation of megabase-scale recombination rates accompanied by rapid turnover of hotspots. Together, these findings shed new light on the evolutionary, functional, and disease implications of gBGC. The phastBias program and our predicted tracts are freely available.
Interpreting patterns of DNA sequence variation in the genomes of closely related species is critically important for understanding the causes and functional effects of nucleotide substitutions. Classical models describe patterns of substitution in terms of the fundamental forces of mutation, recombination, neutral drift, and natural selection. However, an entirely separate force, called GC-biased gene conversion (gBGC), also appears to have an important influence on substitution patterns in many species. gBGC is a recombination-associated evolutionary process that favors the fixation of strong (G/C) over weak (A/T) alleles. In mammals, gBGC is thought to promote variation in GC content, rapidly evolving sequences, and the fixation of deleterious mutations. However, its genome-wide influence remains poorly understood, in part because, it is difficult to incorporate gBGC into statistical models of evolution. In this paper, we describe a new evolutionary model that jointly describes the effects of selection and gBGC and apply it to the human and chimpanzee genomes. Our genome-wide predictions of gBGC tracts indicate that gBGC has been an important force in recent human evolution. Our publicly available computer program, called phastBias, and our genome-wide predictions will enable other researchers to consider gBGC in their analyses.
Sequence-based phylogenetic trees are a well-established tool for characterizing diversity of both macroorganisms and microorganisms. Phylogenetic methods have recently been applied to shotgun metagenomic data from microbial communities, particularly with the aim of classifying reads. But the accuracy of gene-family phylogenies that characterize evolutionary relationships among short, non-overlapping sequencing reads has not been thoroughly evaluated.
To quantify errors in metagenomic read trees, we developed MetaPASSAGE, a software pipeline to generate in silico bacterial communities, simulate a sample of shotgun reads from a gene family represented in the community, orient or translate reads, and produce a profile-based alignment of the reads from which a gene-family phylogenetic tree can be built. We applied MetaPASSAGE to a variety of RNA and protein-coding gene families, built trees using a range of different phylogenetic methods, and compared the resulting trees using topological and branch-length error metrics. We identified read length as one of the major sources of error. Because phylogenetic methods use a reference database of full-length sequences from the gene family to guide construction of alignments and trees, we found that error can also be substantially reduced through increasing the size and diversity of the reference database. Finally, UniFrac analysis, which compares metagenomic samples based on a summary statistic computed over all branches in a read tree, is very robust to the level of error we observe.
Bacterial community diversity can be quantified using phylogenetic approaches applied to shotgun metagenomic data. As sequencing reads get longer and more genomes across the bacterial tree of life are sequenced, the accuracy of this approach will continue to improve, opening the door to more applications.
Phylogenetics; Metagenomics; Simulations
New computational resources are needed to manage the increasing volume of biological data from genome sequencing projects. One fundamental challenge is the ability to maintain a complete and current catalog of protein diversity. We developed a new approach for the identification of protein families that focuses on the rapid discovery of homologous protein sequences.
We implemented fully automated and high-throughput procedures to de novo cluster proteins into families based upon global alignment similarity. Our approach employs an iterative clustering strategy in which homologs of known families are sifted out of the search for new families. The resulting reduction in computational complexity enables us to rapidly identify novel protein families found in new genomes and to perform efficient, automated updates that keep pace with genome sequencing. We refer to protein families identified through this approach as “Sifting Families,” or SFams. Our analysis of ~10.5 million protein sequences from 2,928 genomes identified 436,360 SFams, many of which are not represented in other protein family databases. We validated the quality of SFam clustering through statistical as well as network topology–based analyses.
We describe the rapid identification of SFams and demonstrate how they can be used to annotate genomes and metagenomes. The SFam database catalogs protein-family quality metrics, multiple sequence alignments, hidden Markov models, and phylogenetic trees. Our source code and database are publicly available and will be subject to frequent updates (http://edhar.genomecenter.ucdavis.edu/sifting_families/).
In previous studies, gene neighborhoods—spatial clusters of co-expressed genes in the genome—have been defined using arbitrary rules such as requiring adjacency, a minimum number of genes, a fixed window size, or a minimum expression level. In the current study, we developed a Gene Neighborhood Scoring Tool (G-NEST) which combines genomic location, gene expression, and evolutionary sequence conservation data to score putative gene neighborhoods across all possible window sizes simultaneously.
Using G-NEST on atlases of mouse and human tissue expression data, we found that large neighborhoods of ten or more genes are extremely rare in mammalian genomes. When they do occur, neighborhoods are typically composed of families of related genes. Both the highest scoring and the largest neighborhoods in mammalian genomes are formed by tandem gene duplication. Mammalian gene neighborhoods contain highly and variably expressed genes. Co-localized noisy gene pairs exhibit lower evolutionary conservation of their adjacent genome locations, suggesting that their shared transcriptional background may be disadvantageous. Genes that are essential to mammalian survival and reproduction are less likely to occur in neighborhoods, although neighborhoods are enriched with genes that function in mitosis. We also found that gene orientation and protein-protein interactions are partially responsible for maintenance of gene neighborhoods.
Our experiments using G-NEST confirm that tandem gene duplication is the primary driver of non-random gene order in mammalian genomes. Non-essentiality, co-functionality, gene orientation, and protein-protein interactions are additional forces that maintain gene neighborhoods, especially those formed by tandem duplicates. We expect G-NEST to be useful for other applications such as the identification of core regulatory modules, common transcriptional backgrounds, and chromatin domains. The software is available at http://docpollard.org/software.html
Computational biology; Genomics; Gene expression; Gene duplication; Transcription; Cluster analysis; Gene neighborhood; Gene cluster; Bioinformatics; Evolution
The evolutionary history of a protein reflects the functional history of its ancestors. Recent phylogenetic studies identified distinct evolutionary signatures that characterize proteins involved in cancer, Mendelian disease, and different ontogenic stages. Despite the potential to yield insight into the cellular functions and interactions of proteins, such comparative phylogenetic analyses are rarely performed, because they require custom algorithms. We developed ProteinHistorian to make tools for performing analyses of protein origins widely available. Given a list of proteins of interest, ProteinHistorian estimates the phylogenetic age of each protein, quantifies enrichment for proteins of specific ages, and compares variation in protein age with other protein attributes. ProteinHistorian allows flexibility in the definition of protein age by including several algorithms for estimating ages from different databases of evolutionary relationships. We illustrate the use of ProteinHistorian with three example analyses. First, we demonstrate that proteins with high expression in human, compared to chimpanzee and rhesus macaque, are significantly younger than those with human-specific low expression. Next, we show that human proteins with annotated regulatory functions are significantly younger than proteins with catalytic functions. Finally, we compare protein length and age in many eukaryotic species and, as expected from previous studies, find a positive, though often weak, correlation between protein age and length. ProteinHistorian is available through a web server with an intuitive interface and as a set of command line tools; this allows biologists and bioinformaticians alike to integrate these approaches into their analysis pipelines. ProteinHistorian's modular, extensible design facilitates the integration of new datasets and algorithms. The ProteinHistorian web server, source code, and pre-computed ages for 32 eukaryotic genomes are freely available under the GNU public license at http://lighthouse.ucsf.edu/ProteinHistorian/.
The human gut harbors thousands of bacterial taxa. A profusion of metagenomic sequence data has been generated from human stool samples in the last few years, raising the question of whether more taxa remain to be identified. We assessed metagenomic data generated by the Human Microbiome Project Consortium to determine if novel taxa remain to be discovered in stool samples from healthy individuals. To do this, we established a rigorous bioinformatics pipeline that uses sequence data from multiple platforms (Illumina GAIIX and Roche 454 FLX Titanium) and approaches (whole-genome shotgun and 16S rDNA amplicons) to validate novel taxa. We applied this approach to stool samples from 11 healthy subjects collected as part of the Human Microbiome Project. We discovered several low-abundance, novel bacterial taxa, which span three major phyla in the bacterial tree of life. We determined that these taxa are present in a larger set of Human Microbiome Project subjects and are found in two sampling sites (Houston and St. Louis). We show that the number of false-positive novel sequences (primarily chimeric sequences) would have been two orders of magnitude higher than the true number of novel taxa without validation using multiple datasets, highlighting the importance of establishing rigorous standards for the identification of novel taxa in metagenomic data. The majority of novel sequences are related to the recently discovered genus Barnesiella, further encouraging efforts to characterize the members of this genus and to study their roles in the microbial communities of the gut. A better understanding of the effects of less-abundant bacteria is important as we seek to understand the complex gut microbiome in healthy individuals and link changes in the microbiome to disease.
Several previous comparisons of the human genome with other primate and vertebrate genomes identified genomic regions that are highly conserved in vertebrate evolution but fast-evolving on the human lineage. These human accelerated regions (HARs) may be regions of past adaptive evolution in humans. Alternatively, they may be the result of non-adaptive processes, such as biased gene conversion. We captured and sequenced DNA from a collection of previously published HARs using DNA from an Iberian Neandertal. Combining these new data with shotgun sequence from the Neandertal and Denisova draft genomes, we determine at least one archaic hominin allele for 84% of all positions within HARs. We find that 8% of HAR substitutions are not observed in the archaic hominins and are thus recent in the sense that the derived allele had not come to fixation in the common ancestor of modern humans and archaic hominins. Further, we find that recent substitutions in HARs tend to have come to fixation faster than substitutions elsewhere in the genome and that substitutions in HARs tend to cluster in time, consistent with an episodic rather than a clock-like process underlying HAR evolution. Our catalog of sequence changes in HARs will help prioritize them for functional studies of genomic elements potentially responsible for modern human adaptations.
The PHylogenetic Analysis with Space/Time models (PHAST) software package consists of a collection of command-line programs and supporting libraries for comparative genomics. PHAST is best known as the engine behind the Conservation tracks in the University of California, Santa Cruz (UCSC) Genome Browser. However, it also includes several other tools for phylogenetic modeling and functional element identification, as well as utilities for manipulating alignments, trees and genomic annotations. PHAST has been in development since 2002 and has now been downloaded more than 1000 times, but so far it has been released only as provisional (‘beta’) software. Here, we describe the first official release (v1.0) of PHAST, with improved stability, portability and documentation and several new features. We outline the components of the package and detail recent improvements. In addition, we introduce a new interface to the PHAST libraries from the R statistical computing environment, called RPHAST, and illustrate its use in a series of vignettes. We demonstrate that RPHAST can be particularly useful in applications involving both large-scale phylogenomics and complex statistical analyses. The R interface also makes the PHAST libraries acccessible to non-C programmers, and is useful for rapid prototyping. PHAST v1.0 and RPHAST v1.0 are available for download at http://compgen.bscb.cornell.edu/phast, under the terms of an unrestrictive BSD-style license. RPHAST can also be obtained from the Comprehensive R Archive Network (CRAN; http://cran.r-project.org).
statistical phylogenetics; functional element identification
GC-biased gene conversion (gBGC) is a recombination-associated evolutionary process that accelerates the fixation of guanine or cytosine alleles, regardless of their effects on fitness. gBGC can increase the overall rate of substitutions, a hallmark of positive selection. Many fast-evolving genes and noncoding sequences in the human genome have GC-biased substitution patterns, suggesting that gBGC—in contrast to adaptive processes—may have driven the human changes in these sequences. To investigate this hypothesis, we developed a substitution model for DNA sequence evolution that quantifies the nonlinear interacting effects of selection and gBGC on substitution rates and patterns. Based on this model, we used a series of lineage-specific likelihood ratio tests to evaluate sequence alignments for evidence of changes in mode of selection, action of gBGC, or both. With a false positive rate of less than 5% for individual tests, we found that the majority (76%) of previously identified human accelerated regions are best explained without gBGC, whereas a substantial minority (19%) are best explained by the action of gBGC alone. Further, more than half (55%) have substitution rates that significantly exceed local estimates of the neutral rate, suggesting that these regions may have been shaped by positive selection rather than by relaxation of constraint. By distinguishing the effects of gBGC, relaxation of constraint, and positive selection we provide an integrated analysis of the evolutionary forces that shaped the fastest evolving regions of the human genome, which facilitates the design of targeted functional studies of adaptation in humans.
genome evolution; conserved noncoding elements; lineage-specific adaption; human accelerated regions; GC-biased gene conversion
Conserved noncoding elements (CNEs) in vertebrate genomes often act as developmental enhancers, but a critical issue is how well orthologous CNE sequences retain the same activity in their respective species, a characteristic important for generalization of model organism studies. To quantify how well CNE enhancer activity has been preserved, we compared the anatomy-specific activities of 41 zebra fish CNEs in zebra fish embryos with the activities of orthologous human CNEs in mouse embryos. We found that 13/41 (∼30%) of the orthologous CNE pairs exhibit conserved positive activity in zebra fish and mouse. Conserved positive activity is only weakly associated with either sequence conservation or the absence of bases undergoing accelerated evolution. A stronger effect is that disparate activity is associated with transcription factor binding site divergence. To distinguish the contributions of cis- versus trans-regulatory changes, we analyzed 13 CNEs in a three-way experimental comparison: human CNE tested in zebra fish, human CNE tested in mouse, and orthologous zebra fish CNE tested in zebra fish. Both cis- and trans-changes affect a significant fraction of CNEs, although human and zebra fish sequences exhibit disparate activity in zebra fish (indicating cis regulatory changes) twice as often as human sequences show disparate activity when tested in mouse and zebra fish (indicating trans regulatory changes). In all four cases where the zebra fish and human CNE display a similar expression pattern in zebra fish, the human CNE also displays a similar expression pattern in mouse. This suggests that the endogenous enhancer activity of ∼30% of human CNEs can be determined from experiments in zebra fish alone, and to identify these CNEs, both the zebra fish and the human sequences should be tested.
enhancer; conserved noncoding elements; evolution; development; transcription factor; expression; cis; trans
Phylogenetic diversity—patterns of phylogenetic relatedness among organisms in ecological communities—provides important insights into the mechanisms underlying community assembly. Studies that measure phylogenetic diversity in microbial communities have primarily been limited to a single marker gene approach, using the small subunit of the rRNA gene (SSU-rRNA) to quantify phylogenetic relationships among microbial taxa. In this study, we present an approach for inferring phylogenetic relationships among microorganisms based on the random metagenomic sequencing of DNA fragments. To overcome challenges caused by the fragmentary nature of metagenomic data, we leveraged fully sequenced bacterial genomes as a scaffold to enable inference of phylogenetic relationships among metagenomic sequences from multiple phylogenetic marker gene families. The resulting metagenomic phylogeny can be used to quantify the phylogenetic diversity of microbial communities based on metagenomic data sets. We applied this method to understand patterns of microbial phylogenetic diversity and community assembly along an oceanic depth gradient, and compared our findings to previous studies of this gradient using SSU-rRNA gene and metagenomic analyses. Bacterial phylogenetic diversity was highest at intermediate depths beneath the ocean surface, whereas taxonomic diversity (diversity measured by binning sequences into taxonomically similar groups) showed no relationship with depth. Phylogenetic diversity estimates based on the SSU-rRNA gene and the multi-gene metagenomic phylogeny were broadly concordant, suggesting that our approach will be applicable to other metagenomic data sets for which corresponding SSU-rRNA gene sequences are unavailable. Our approach opens up the possibility of using metagenomic data to study microbial diversity in a phylogenetic context.
Fast evolving regions of many metazoan genomes show a bias toward substitutions that change weak (A,T) into strong (G,C) base pairs. Single-nucleotide polymorphisms (SNPs) do not share this pattern, suggesting that it results from biased fixation rather than biased mutation. Supporting this hypothesis, analyses of polymorphism in specific regions of the human genome have identified a positive correlation between weak to strong (W→S) SNPs and derived allele frequency (DAF), suggesting that SNPs become increasingly GC biased over time, especially in regions of high recombination. Using polymorphism data generated by the 1000 Genomes Project from 179 individuals from 4 human populations, we evaluated the extent and distribution of ongoing GC-biased evolution in the human genome. We quantified GC fixation bias by comparing the DAFs of W→S mutations and S→W mutations using a Mann–Whitney U test. Genome-wide, W→S SNPs have significantly higher DAFs than S→W SNPs. This pattern is widespread across the human genome but varies in magnitude along the chromosomes. We found extreme GC-biased evolution in neighborhoods of recombination hot spots, a significant correlation between GC bias and recombination rate, and an inverse correlation between GC bias and chromosome arm length. These findings demonstrate the presence of ongoing fixation bias favoring G and C alleles throughout the human genome and suggest that the bias is caused by a recombination-associated process, such as GC-biased gene conversion.
fixation bias; weak to strong; biased gene conversion; polymorphism; 1000 Genomes Project
The fastest-evolving regions in the human and chimpanzee genomes show a remarkable excess of weak (A,T) to strong (G,C) nucleotide substitutions since divergence from their common ancestor. We investigated the phylogenetic extent and possible causes of this weak to strong (W→S) bias in divergent sequences (BDS) using recently sequenced genomes and recombination maps from eight trios of eukaryotic species. To quantify evidence for BDS, we inferred substitution histories using an efficient maximum likelihood approach with a context-dependent evolutionary model. We then annotated all lineage-specific substitutions in terms of W→S bias and density on the chromosomes. Finally, we used the inferred substitutions to calculate a BDS score—a log odds ratio between substitution type and density—and assessed its statistical significance with Fisher's exact test. Applying this approach, we found significant BDS in the coding and noncoding sequence of human, mouse, dog, stickleback, fruit fly, and worm. We also observed a significant lack of W→S BDS in chicken and yeast. The BDS score varies between species and across the chromosomes within each species. It is most strongly correlated with different genomic features in different species, but a strong correlation with recombination rates is found in several species. Our results demonstrate that a W→S substitution bias in fast-evolving sequences is a widespread phenomenon. The patterns of BDS observed suggest that a recombination-associated process, such as GC-biased gene conversion, is involved in the production of the bias in many species, but the strength of the BDS likely depends on many factors, including genome stability, variability in recombination rate over time and across the genome, the frequency of meiosis, and the amount of outcrossing in each species.
fast-evolving sequence; clustered substitution; fixation bias; genome analysis; biased gene conversion
SIRT1 and SIRT3 are NAD+-dependent protein deacetylases that are evolutionarily conserved across mammals. These proteins are located in the cytoplasm/nucleus and mitochondria, respectively. Previous reports demonstrated that human SIRT1 deacetylates Acetyl-CoA Synthase 1 (AceCS1) in the cytoplasm, whereas SIRT3 deacetylates the homologous Acetyl-CoA Synthase 2 (AceCS2) in the mitochondria. We recently showed that 3-hydroxy-3-methylglutaryl CoA synthase 2 (HMGCS2) is deacetylated by SIRT3 in mitochondria, and we demonstrate here that SIRT1 deacetylates the homologous 3-hydroxy-3-methylglutaryl CoA synthase 1 (HMGCS1) in the cytoplasm. This novel pattern of substrate homology between cytoplasmic SIRT1 and mitochondrial SIRT3 suggests that considering evolutionary relationships between the sirtuins and their substrates may help to identify and understand the functions and interactions of this gene family. In this perspective, we take a first step by characterizing the evolutionary history of the sirtuins and these substrate families.
sirtuins; evolution; deacetylases; aging
Dominant mutations in cardiac transcription factor genes cause human inherited congenital heart defects (CHDs); however, their molecular basis is not understood. Interactions between transcription factors and the Brg1/Brm-associated factor (BAF) chromatin remodelling complex suggest potential mechanisms; however, the role of BAF complexes in cardiogenesis is not known. In this study, we show that dosage of Brg1 is critical for mouse and zebrafish cardiogenesis. Disrupting the balance between Brg1 and disease-causing cardiac transcription factors, including Tbx5, Tbx20 and Nkx2–5, causes severe cardiac anomalies, revealing an essential allelic balance between Brg1 and these cardiac transcription factor genes. This suggests that the relative levels of transcription factors and BAF complexes are important for heart development, which is supported by reduced occupancy of Brg1 at cardiac gene promoters in Tbx5 haploinsufficient hearts. Our results reveal complex dosage-sensitive interdependence between transcription factors and BAF complexes, providing a potential mechanism underlying transcription factor haploinsufficiency, with implications for multigenic inheritance of CHDs.
The developmental mechanisms through which the cerebral cortex increased in size and complexity during primate evolution are essentially unknown. To uncover genetic networks active in the developing cerebral cortex, we combined three-dimensional reconstruction of human fetal brains at midgestation and whole genome expression profiling. This novel approach enabled transcriptional characterization of neurons from accurately defined cortical regions containing presumptive Broca and Wernicke language areas, as well as surrounding associative areas. We identified hundreds of genes displaying differential expression between the two regions, but no significant difference in gene expression between left and right hemispheres. Validation by qRTPCR and in situ hybridization confirmed the robustness of our approach and revealed novel patterns of area- and layer-specific expression throughout the developing cortex. Genes differentially expressed between cortical areas were significantly associated with fast-evolving non-coding sequences harboring human-specific substitutions that could lead to divergence in their repertoires of transcription factor binding sites. Strikingly, while some of these sequences were accelerated in the human lineage only, many others were accelerated in chimpanzee and/or mouse lineages, indicating that genes important for cortical development may be particularly prone to changes in transcriptional regulation across mammals. Genes differentially expressed between cortical regions were also enriched for transcriptional targets of FoxP2, a key gene for the acquisition of language abilities in humans. Our findings point to a subset of genes with a unique combination of cortical areal expression and evolutionary patterns, suggesting that they play important roles in the transcriptional network underlying human-specific neural traits.
Microbial diversity is typically characterized by clustering ribosomal RNA (SSU-rRNA) sequences into operational taxonomic units (OTUs). Targeted sequencing of environmental SSU-rRNA markers via PCR may fail to detect OTUs due to biases in priming and amplification. Analysis of shotgun sequenced environmental DNA, known as metagenomics, avoids amplification bias but generates fragmentary, non-overlapping sequence reads that cannot be clustered by existing OTU-finding methods. To circumvent these limitations, we developed PhylOTU, a computational workflow that identifies OTUs from metagenomic SSU-rRNA sequence data through the use of phylogenetic principles and probabilistic sequence profiles. Using simulated metagenomic data, we quantified the accuracy with which PhylOTU clusters reads into OTUs. Comparisons of PCR and shotgun sequenced SSU-rRNA markers derived from the global open ocean revealed that while PCR libraries identify more OTUs per sequenced residue, metagenomic libraries recover a greater taxonomic diversity of OTUs. In addition, we discover novel species, genera and families in the metagenomic libraries, including OTUs from phyla missed by analysis of PCR sequences. Taken together, these results suggest that PhylOTU enables characterization of part of the biosphere currently hidden from PCR-based surveys of diversity?
Microorganisms comprise the majority of the biodiversity on the planet. Because the overwhelming majority of microbes are not readily cultured in the laboratory, researchers often rely on PCR-based investigations of genomic sequence to characterize microbial diversity. These analyses have dramatically expanded our understanding of biodiversity, but due to methodological biases PCR-based approaches may only reveal part of the microbial biosphere. Shotgun sequencing of environmental DNA, known as metagenomics, avoids the biases associated with targeted amplification of genomic sequence and can provide insight into the diversity hidden from traditional investigations. However, the fragmentary, non-overlapping nature of shotgun sequence data makes it intractable to analyze with existing tools. Here, we present PhylOTU, a novel computational method that enables accurate characterization of microbial diversity from metagenomic data. We process over 10 million metagenomic sequences obtained from the global open ocean to identify novel Bacterial taxa and reveal the presence of microorganisms overlooked by investigation of PCR-based sequences from the same samples. These results suggest that to fully characterize microbial biodiversity requires a novel bioinformatics toolbox for analysis of shotgun metagenomic data.
Genes are created by a variety of evolutionary processes, some of which generate duplicate copies of an entire gene, while others rearrange pre-existing genetic elements or co-opt previously non-coding sequence to create genes with 'novel' sequences. These novel genes are thought to contribute to distinct phenotypes that distinguish organisms. The creation, evolution, and function of duplicated genes are well-studied; however, the genesis and early evolution of novel genes are not well-characterized. We developed a computational approach to investigate these issues by integrating genome-wide comparative phylogenetic analysis with functional and interaction data derived from small-scale and high-throughput experiments.
We examine the function and evolution of new genes in the yeast Saccharomyces cerevisiae. We observed significant differences in the functional attributes and interactions of genes created at different times and by different mechanisms. Novel genes are initially less integrated into cellular networks than duplicate genes, but they appear to gain functions and interactions more quickly than duplicates. Recently created duplicated genes show evidence of adapting existing functions to environmental changes, while young novel genes do not exhibit enrichment for any particular functions. Finally, we found a significant preference for genes to interact with other genes of similar age and origin.
Our results suggest a strong relationship between how and when genes are created and the roles they play in the cell. Overall, genes tend to become more integrated into the functional networks of the cell with time, but the dynamics of this process differ significantly between duplicate and novel genes.
Gene expression divergence and chromosomal rearrangements have been put forward as major contributors to phenotypic differences between closely related species. It has also been established that duplicated genes show enhanced rates of positive selection in their amino acid sequences. If functional divergence is largely due to changes in gene expression, it follows that regulatory sequences in duplicated loci should also evolve rapidly. To investigate this hypothesis, we performed likelihood ratio tests (LRTs) on all noncoding loci within 5 kb of every transcript in the human genome and identified sequences with increased substitution rates in the human lineage since divergence from Old World Monkeys. The fraction of rapidly evolving loci is significantly higher nearby genes that duplicated in the common ancestor of humans and chimps compared with nonduplicated genes. We also conducted a genome-wide scan for nucleotide substitutions predicted to affect transcription factor binding. Rates of binding site divergence are elevated in noncoding sequences of duplicated loci with accelerated substitution rates. Many of the genes associated with these fast-evolving genomic elements belong to functional categories identified in previous studies of positive selection on amino acid sequences. In addition, we find enrichment for accelerated evolution nearby genes involved in establishment and maintenance of pregnancy, processes that differ significantly between humans and monkeys. Our findings support the hypothesis that adaptive evolution of the regulation of duplicated genes has played a significant role in human evolution.
accelerated substitution; noncoding sequence; gene duplication
Regions of the genome that have been the target of positive selection specifically along the human lineage are of special importance in human biology. We used high throughput sequencing combined with methods to enrich human genomic samples for particular targets to obtain the sequence of 22 chromosomal samples at high depth in 40 kb neighborhoods of 49 previously identified 100–400 bp elements that show evidence for human accelerated evolution. In addition to selection, the pattern of nucleotide substitutions in several of these elements suggested an historical bias favoring the conversion of weak (A or T) alleles into strong (G or C) alleles. Here we found strong evidence in the derived allele frequency spectra of many of these 40 kb regions for ongoing weak-to-strong fixation bias. Comparison of the nucleotide composition at polymorphic loci to the composition at sites of fixed substitutions additionally reveals the signature of historical weak-to-strong fixation bias in a subset of these regions. Most of the regions with evidence for historical bias do not also have signatures of ongoing bias, suggesting that the evolutionary forces generating weak-to-strong bias are not constant over time. To investigate the role of selection in shaping these regions, we analyzed the spatial pattern of polymorphism in our samples. We found no significant evidence for selective sweeps, possibly because the signal of such sweeps has decayed beyond the power of our tests to detect them. Together, these results do not rule out functional roles for the observed changes in these regions—indeed there is good evidence that the first two are functional elements in humans—but they suggest that a fixation process (such as biased gene conversion) that is biased at the nucleotide level, but is otherwise selectively neutral, could be an important evolutionary force at play in them, both historically and at present.
The search for functional regions in the human genome, beyond the protein-coding portion, often relies on signals of conservation across species. The Human Accelerated Regions (HARs) are strongly conserved elements, ranging in size from 100–400 bp, that show an unexpected number of human-specific changes. This pattern suggests that HARs may be functional elements that have significantly changed during human evolution. To analyze the evolutionary forces that led these changes, we studied 40 kb neighborhoods of the top 49 HARs. We took advantage of recently developed DNA sequencing technology, coupled with methods to isolate genomic DNA for our target regions only, to determine the genotypes in 22 chromosomal samples. This polymorphism data showed no significant evidence for adaptive selective sweeps in HAR regions. By contrast, we found strong evidence for a nucleotide bias in the fixation of mutations from A or T to G or C basepairs. Our work reveals that this bias in the HAR neighborhoods is not just an historic phenomenon, but is ongoing in the present day human population. This finding adds credence to the possibility that non-selective forces, such as biased gene conversion, could have contributed to the evolution of several of these regions.