Microarray technology has considerably advanced our understanding of transcriptome evolution at the genome level over the past decade, yet further investigation has been hampered by several technical limitations of this technology. Recent innovations in next-generation sequencing, particularly RNA-seq, offer a way to overcome these limitations. Although a number of statistical and computational methods have been developed to analyze RNA-seq data, an analytical framework specifically designed for evolutionary genomics is still lacking. In this article we develop a new method for estimating the genome expression distance from RNA-seq data, which has an explicit interpretation under a model of gene expression evolution. Moreover, this distance measure takes data overdispersion, gene length variation, and sequencing depth variation into account, so that it can be applied to multiple genomes from different species. Using mammalian RNA-seq data as an example, we demonstrate that this expression distance is useful in phylogenomic analysis.
transcriptome evolution; RNA-seq; genome expression distance
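As a rough illustration of what a length- and depth-aware expression distance involves, the sketch below converts raw read counts to log2 RPKM and averages the squared differences across genes. It is a simplified stand-in and not the estimator developed in the article: it ignores overdispersion and assumes orthologs share one length, and the function names, the pseudocount, and the toy counts are all assumptions made for the example.

```python
import math

def log_rpkm(counts, lengths_bp, pseudocount=1.0):
    """Convert raw read counts to log2 RPKM values.

    RPKM (reads per kilobase per million mapped reads) corrects for
    gene length and sequencing depth. The pseudocount avoids log(0).
    """
    total_reads = sum(counts)
    values = []
    for count, length in zip(counts, lengths_bp):
        rpkm = (count + pseudocount) / (length / 1e3) / (total_reads / 1e6)
        values.append(math.log2(rpkm))
    return values

def expression_distance(counts_a, counts_b, lengths_bp):
    """Mean squared difference of log2 RPKM between two samples (hypothetical measure)."""
    xa = log_rpkm(counts_a, lengths_bp)
    xb = log_rpkm(counts_b, lengths_bp)
    return sum((a - b) ** 2 for a, b in zip(xa, xb)) / len(xa)

# Toy data: three orthologous genes (made-up numbers).
lengths = [1000, 2000, 1500]
species_a = [100, 400, 50]
species_b = [200, 800, 100]   # same profile at twice the sequencing depth
species_c = [100, 50, 400]    # a different expression profile

d_ab = expression_distance(species_a, species_b, lengths)
d_ac = expression_distance(species_a, species_c, lengths)
```

With these toy counts, `d_ab` stays close to zero because the depth difference is normalized away, whereas `d_ac` is large because the expression profile itself differs.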
Congenital heart disease (CHD) is one of the most prevalent developmental anomalies and the leading cause of noninfectious morbidity and mortality in newborns. Despite its prevalence and clinical significance, the etiology of CHD remains largely unknown. GATA4 is a highly conserved transcription factor that regulates a variety of physiological processes and has been extensively studied, particularly for its role in heart development. In combination with TBX5 and MEF2C, GATA4 can directly reprogram postnatal fibroblasts into functional cardiomyocytes. In the past decade, a variety of GATA4 mutations have been identified, initially in studies of familial CHD pedigrees. Given that familial and sporadic CHD cases are thought to share a common genetic basis, we explored GATA4 mutations in different types of CHD. In this study, by direct sequencing of the GATA4 coding region and exon-intron boundaries in 384 sporadic Chinese CHD patients, we identified 12 heterozygous non-synonymous mutations, 8 of which were found only in CHD patients when compared with 957 controls. Six of these non-synonymous mutations have not been previously reported. Subsequent functional analyses revealed that the transcriptional activity, subcellular localization and DNA binding affinity of some mutant GATA4 proteins were significantly altered. Our results expand the spectrum of GATA4 mutations linked to cardiac defects. Together with the newly reported mutations, approximately 110 non-synonymous mutations have now been identified in GATA4. Our future analysis will explore why the evolutionarily conserved GATA4 appears to be hypermutable.
One difficulty in conducting biologically meaningful dynamic analysis at the systems biology level is that in vivo system regulation is complex. Moreover, many kinetic rates are unknown, making global system analysis intractable in practice. In this article, we demonstrate a computational pipeline to help solve this problem, using the exocytotic process as an example. Exocytosis is an essential process in all eukaryotic cells that allows cells to communicate through vesicles carrying a wide range of intracellular molecules. During this process, a set of proteins called SNAREs acts as the engine of vesicle-membrane fusion by forming a four-helix bundle complex between target (membrane)-specific and vesicle-specific SNAREs. As expected, the regulatory network for exocytosis is very complex. Based on the current understanding of the protein-protein interaction network related to exocytosis, we formulated the whole system mathematically as a set of ordinary differential equations (ODEs). We then applied an inverse-problem approach to estimate the kinetic parameters of the fundamental subsystem (without regulation) from limited in vitro experimental data; the estimates agree well with values reported from conventional assays. These estimates allowed us to conduct an efficient stability analysis over a specified parameter space for the exocytotic process with or without regulation. Finally, we discuss the potential of this approach to explain experimental observations and to generate testable hypotheses for further experimentation.
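The ODE formulation and the inverse problem described above can be illustrated with a deliberately tiny example. The sketch below models a single reversible binding step, T + V <-> C (target SNARE, vesicle SNARE, complex), integrates it with the explicit Euler method, and then recovers the association rate from synthetic data by a grid search. The one-step reaction, the rate values, and the grid search are all assumptions chosen for brevity; the article's actual system and estimation procedure are not reproduced here.

```python
def simulate_complex(k_on, k_off, t0=1.0, v0=1.0, dt=0.01, steps=1000):
    """Euler integration of T + V <-> C; returns the complex concentration over time."""
    t_free, v_free, c = t0, v0, 0.0
    trajectory = [c]
    for _ in range(steps):
        # Net formation rate: mass-action association minus dissociation.
        rate = k_on * t_free * v_free - k_off * c
        t_free -= rate * dt
        v_free -= rate * dt
        c += rate * dt
        trajectory.append(c)
    return trajectory

def fit_k_on(observed, k_off, candidates):
    """Inverse problem in miniature: pick the candidate k_on with the least squared error."""
    def sse(k_on):
        sim = simulate_complex(k_on, k_off, steps=len(observed) - 1)
        return sum((s - o) ** 2 for s, o in zip(sim, observed))
    return min(candidates, key=sse)

# Synthetic "measurements" generated with a known rate, then recovered.
true_k_on = 2.0
data = simulate_complex(true_k_on, k_off=0.1, steps=500)
estimate = fit_k_on(data, k_off=0.1, candidates=[0.5, 1.0, 2.0, 4.0])
```

A realistic pipeline would replace the grid search with a proper optimizer and fit noisy experimental time courses, but the structure (a forward model plus a misfit function) stays the same.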
Gene duplication and subsequent functional divergence, especially expression divergence, are widely considered to be main sources of evolutionary innovation. Many studies have shown that genetic regulatory networks evolve rapidly shortly after gene duplication, leading to accelerated expression divergence and diversification. However, little is known about whether epigenetic factors have mediated the evolution of expression regulation since gene duplication. In this study, we conducted detailed analyses of yeast histone modification (HM), the major type of epigenetic mark in this organism, together with other available functional genomics data, to address this issue.
Duplicate genes, on average, share more common HM-code patterns than random singleton pairs in their promoters and open reading frames (ORFs). Although HM-code divergence between duplicates in both promoter and ORF regions increases with their sequence divergence, the HM-code in ORF regions evolves more slowly than that in promoter regions, probably owing to the functional constraints imposed on protein sequences. After excluding the confounding effect of sequence divergence (or evolutionary time), we found evidence supporting the notion that in yeast, the HM-code may co-evolve with cis- and trans-regulatory factors. Moreover, we observed that deletion of some yeast HM-related enzymes increases the expression divergence between duplicate genes, although the effect is smaller than that of transcription factor (TF) deletion or environmental stress.
Our analyses demonstrate that after gene duplication, the histone modification profiles of yeast duplicates diverge with evolutionary time, as do genetic regulatory elements. Moreover, we found evidence of co-evolution between genetic and epigenetic elements since gene duplication, which together contribute to the expression divergence between duplicate genes.
Histone modification; Histone modification code divergence; Gene duplication; Expression divergence; Epigenetic divergence; cis-regulation; trans-regulation
Network motifs, recurring subnetwork patterns, provide significant insight into the biological networks which are believed to govern cellular processes.
We present a comparative network motif experimental approach, which helps to explain complex biological phenomena and increases the understanding of biological functions at the molecular level by exploring evolutionary design principles of network motifs.
Using this framework to analyze the SM (Sec1/Munc18)-SNARE (N-ethylmaleimide-sensitive factor activating protein receptor) system in exocytic membrane fusion in yeast and neurons, we find that the SM-SNARE network motifs of yeast and neurons show distinct dynamical behaviors. We identify the closed binding mode of neuronal SM (Munc18-1) and SNARE (syntaxin-1) as the key factor leading to mechanistic divergence of membrane fusion systems in yeast and neurons. We also predict that it underlies the conflicting observations in SM overexpression experiments. Furthermore, hypothesis-driven lipid mixing assays validated the prediction.
This study therefore provides a new method to resolve these discrepancies and to generalize the functional role of SM proteins.
Even though the genomes of many model species have already been sequenced, our knowledge of the evolution of gene regulation is still very limited. One major obstacle is that it is hard to predict the target genes of transcription factors accurately from sequence. In this respect, microRNAs (miRNAs) differ from transcription factors, as their target genes can be readily predicted from sequence. This feature of miRNAs offers an unprecedented vantage point for evolutionary analysis of gene regulation.
In this study, we analyzed a particular aspect of miRNA evolution: the differences in "apparent repression effectiveness (ARE)" between human miRNAs of different conservation levels. ARE is a measure we designed to evaluate the repression effect of miRNAs on target genes, based on publicly available gene expression data from normal tissues together with miRNA targeting and expression data. We found that the ARE values of more conserved miRNAs are, in general, significantly higher than those of less conserved miRNAs. We also found that gains in the expression abundance and breadth of miRNAs during evolution contributed to the gain in ARE.
The ARE measure quantifies the repressive effects of miRNAs and enables us to study the influence of many factors on miRNA-mediated repression, such as the conservation levels and expression levels of miRNAs. The gain in ARE can be explained by an evolutionary trend for miRNAs to effectively control more target genes, which is beneficial to the miRNAs but not necessarily to the organism at all times. Our results from miRNAs provide insight into the complex interplay between regulators and target genes in evolution.
In the eight years since phylogenomics was introduced as the intersection of genomics and phylogenetics, the field has provided fundamental insights into gene function, genome history and organismal relationships. The utility of phylogenomics is growing with the increase in the number and diversity of taxa for which whole genome and large transcriptome sequence sets are being generated. We assert that the synergy between genomic and phylogenetic perspectives in comparative biology would be enhanced by the development and refinement of minimal reporting standards for phylogenetic analyses. Encouraged by the development of the Minimum Information About a Microarray Experiment (MIAME) standard, we propose a similar roadmap for the development of a Minimal Information About a Phylogenetic Analysis (MIAPA) standard. Key in the successful development and implementation of such a standard will be broad participation by developers of phylogenetic analysis software, phylogenetic database developers, practitioners of phylogenomics, and journal editors.
The debate over genomic correlations among sequence conservation, protein connectivity, gene essentiality and gene expression has generated a number of new hypotheses that challenge the classical framework of molecular evolution. For instance, the translational selection hypothesis claims that the rate of protein evolution is determined by the need for protein stability to avoid misfolding toxicity. In this short article, we propose that gene pleiotropy, the capacity to affect multiple phenotypes, may play a vital role in molecular evolution. We discuss several approaches to testing this hypothesis.
This article was reviewed by Dr Eugene Koonin, Dr Arcady Mushegian and Dr Claus Wilke.
The availability of genome and transcriptome sequences for a number of species permits the identification and characterization of conserved as well as divergent genes such as lineage-specific genes which have no detectable sequence similarity to genes from other lineages. While genes conserved among taxa provide insight into the core processes among species, lineage-specific genes provide insights into evolutionary processes and biological functions that are likely clade or species specific.
Comparative analyses using the Arabidopsis thaliana genome and sequences from 178 other species within the Plant Kingdom enabled the identification of 24,624 A. thaliana genes (91.7%) that were termed Evolutionary Conserved (EC), as defined by sequence similarity to a database entry, as well as two sets of lineage-specific genes within A. thaliana. One of the A. thaliana lineage-specific gene sets shares sequence similarity only with sequences from species within the Brassicaceae family and is termed Conserved Brassicaceae-Specific Genes (914, 3.4%, CBSG). The other set of A. thaliana lineage-specific genes, the Arabidopsis Lineage-Specific Genes (1,324, 4.9%, ALSG), lacks sequence similarity to any sequence outside A. thaliana. While many CBSGs (76.7%) and ALSGs (52.9%) are transcribed, the majority of the CBSGs (76.1%) and ALSGs (94.4%) have no annotated function. Co-expression analysis indicated significant enrichment of the CBSGs and ALSGs in multiple functional categories, suggesting their involvement in a wide range of biological functions. Subcellular localization prediction revealed that the CBSGs were significantly enriched in proteins targeted to the secretory pathway (412, 45.1%). Among the 107 putatively secreted CBSGs with known functions, 67 encode a putative pollen coat protein or cysteine-rich protein with sequence similarity to the S-locus cysteine-rich protein, the pollen determinant controlling allele-specific pollen rejection in self-incompatible Brassicaceae species. Overall, the ALSGs and CBSGs were more highly methylated in floral tissue than the ECs. Single Nucleotide Polymorphism (SNP) analysis showed an elevated ratio of non-synonymous to synonymous SNPs within the ALSGs (1.99) and CBSGs (1.65) relative to the EC set (0.92), mainly caused by an elevated number of non-synonymous SNPs, indicating that they are fast-evolving at the protein sequence level.
Our analyses suggest that while a significant fraction of the A. thaliana proteome is conserved within the Plant Kingdom, evolutionarily distinct sets of genes that may function in defining biological processes unique to these lineages have arisen within the Brassicaceae and A. thaliana.
Gene and genome duplication is the principal creative force in evolution. Recently, protein subcellular relocalization, or neolocalization, was proposed as one of the mechanisms responsible for the retention of duplicated genes. This hypothesis has received support from the analysis of yeast genomes, but has not been tested thoroughly on animal genomes. To evaluate the importance of subcellular relocalization for the retention of duplicated genes in animal genomes, we systematically analyzed nuclear-encoded mitochondrial proteins in the human genome by reconstructing phylogenies of mitochondrial multigene families.
The 456 human mitochondrial proteins selected for this study were clustered into 305 gene families including 92 multigene families. Among the multigene families, 59 (64%) consisted of both mitochondrial and cytosolic (non-mitochondrial) proteins (mt-cy families) while the remaining 33 (36%) were composed of mitochondrial proteins (mt-mt families). Phylogenetic analyses of mt-cy families revealed three different scenarios of their neolocalization following gene duplication: 1) relocalization from mitochondria to cytosol, 2) from cytosol to mitochondria and 3) multiple subcellular relocalizations. The neolocalizations were most commonly enabled by the gain or loss of N-terminal mitochondrial targeting signals. The majority of detected subcellular relocalization events occurred early in animal evolution, preceding the evolution of tetrapods. Mt-mt protein families showed a somewhat different pattern, where gene duplication occurred more evenly in time. However, for both types of protein families, most duplication events appear to roughly coincide with two rounds of genome duplications early in vertebrate evolution. Finally, we evaluated the effects of inaccurate and incomplete annotation of mitochondrial proteins and found that our conclusion of the importance of subcellular relocalization after gene duplication on the genomic scale was robust to potential gene misannotation.
Our results suggest that protein subcellular relocalization is an important mechanism for the retention and gain of function of duplicated genes in animal genome evolution.
How gene duplication has influenced the evolution of gene networks is one of the core problems in evolution. Current duplication-divergence theories generally suggested that genes on the periphery of the networks were preferentially retained after gene duplication. However, previous studies were mostly based on gene networks in invertebrate species, and they had the inherent shortcoming of not being able to provide information on how the duplication-divergence process proceeded along the time axis during major speciation events.
In this study, we constructed a model system consisting of human G protein-coupled receptors (GPCRs) and their downstream genes in the GPCR pathways. These two groups of genes offer a natural partition into the peripheral and backbone layers of the network. Analysis of the age distributions of duplication events in the human GPCR and "downstream gene" families indicated that both experienced an explosive expansion at the time of early vertebrate emergence. However, we found that only GPCR families continued to expand after the early vertebrates, most prominently in several small GPCR subfamilies involved in immune and sensory responses.
In general, in the human GPCR model system, we found that the position of a gene in the gene networks has significant influences on the likelihood of fixation of its duplicates. However, for a super gene family, the influence was not uniform among subfamilies. For super families, such as GPCRs, whose gene basis of expression diversity was well established at early vertebrates, continued expansions were mostly prominent in particular small subfamilies mainly involved in lineage-specific functions.
High gene numbers in plant genomes reflect polyploidy and major gene duplication events. Oryza sativa, cultivated rice, is a diploid monocotyledonous species with a ~390 Mb genome that has undergone segmental duplication of a substantial portion of its genome. This, coupled with other genetic events such as tandem duplications, has resulted in a substantial number of its genes, and resulting proteins, occurring in paralogous families.
Using a computational pipeline that utilizes Pfam and novel protein domains, we characterized paralogous families in rice and compared these with paralogous families in the model dicotyledonous diploid species, Arabidopsis thaliana. Arabidopsis, which has undergone genome duplication as well, has a substantially smaller genome (~120 Mb) and gene complement compared to rice. Overall, 53% and 68% of the non-transposable element-related rice and Arabidopsis proteins could be classified into paralogous protein families, respectively. Singleton and paralogous family genes differed substantially in their likelihood of encoding a protein of known or putative function; 26% and 66% of singleton genes compared to 73% and 96% of the paralogous family genes encode a known or putative protein in rice and Arabidopsis, respectively. Furthermore, a major skew in the distribution of specific gene function was observed; a total of 17 Gene Ontology categories in both rice and Arabidopsis were statistically significant in their differential distribution between paralogous family and singleton proteins. In contrast to mammalian organisms, we found that duplicated genes in rice and Arabidopsis tend to have more alternative splice forms. Using data from Massively Parallel Signature Sequencing, we show that a significant portion of the duplicated genes in rice show divergent expression although a correlation between sequence divergence and correlation of expression could be seen in very young genes.
Collectively, these data suggest that while co-regulation and conserved function are present in some paralogous protein family members, evolutionary pressures have resulted in functional divergence with differential expression patterns.
Phylogenetically related miRNAs (miRNA families) convey important information about the function and evolution of miRNAs. Because of the special sequence features of miRNAs, pair-wise sequence identity between miRNA precursors alone is often inadequate for unequivocally judging the phylogenetic relationships between miRNAs. Most current methods for miRNA classification rely heavily on manual inspection and lack measures of the reliability of the results.
In this study, we designed an analysis pipeline (the Phylogeny-Bootstrap-Cluster (PBC) pipeline) to identify miRNA families based on branch stability in the bootstrap trees derived from overlapping genome-wide miRNA sequence sets. We tested the PBC analysis pipeline with the miRNAs from six animal species, H. sapiens, M. musculus, G. gallus, D. rerio, D. melanogaster, and C. elegans. The resulting classification was compared with the miRNA families defined in miRBase. The two classifications were largely consistent.
The PBC analysis pipeline is an efficient method for classifying large numbers of heterogeneous miRNA sequences. It requires minimum human involvement and provides measurements of the reliability of the classification results.
Analysis of over 3,000 co-linear paired genes in rice shows more intron loss than intron gain following segmental duplication.
Introns are under less selection pressure than exons, and consequently, intronic sequences have a higher rate of gain and loss than exons. In a number of plant species, a large portion of the genome has been segmentally duplicated, giving rise to a large set of duplicated genes. The recent completion of the rice genome in which segmental duplication has been documented has allowed us to investigate intron evolution within rice, a diploid monocotyledonous species.
Analysis of segmental duplication in rice revealed that 159 Mb of the 371 Mb genome and 21,570 of the 43,719 non-transposable element-related genes were contained within a duplicated region. In these duplicated regions, 3,101 collinear paired genes were present. Using this set of segmentally duplicated genes, we investigated intron evolution from full-length cDNA-supported non-transposable element-related gene models of rice. Using gene pairs that have an ortholog in the dicotyledonous model species Arabidopsis thaliana, we identified more intron loss (49 introns within 35 gene pairs) than intron gain (5 introns within 5 gene pairs) following segmental duplication. We were unable to demonstrate preferential intron loss at the 3' end of genes as previously reported in mammalian genomes. However, we did find that the four nucleotides of exons that flank lost introns had less frequently used 4-mers.
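To give a sense of how strongly the reported counts favor intron loss, the sketch below runs an exact two-sided binomial (sign) test on 49 losses versus 5 gains under the null hypothesis that loss and gain are equally likely. This test is an illustration added here, not an analysis reported in the study, and it treats every intron event as independent, which is an assumption because several events fall within the same gene pair.

```python
from math import comb

def binomial_two_sided_p(k, n):
    """Exact two-sided binomial test p-value under a success probability of 0.5."""
    tail = min(k, n - k)
    # Probability of a result at least as extreme in one tail.
    one_sided = sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    # The null distribution is symmetric, so double the tail (capped at 1).
    return min(1.0, 2 * one_sided)

losses, gains = 49, 5   # intron counts quoted in the text
p_value = binomial_two_sided_p(losses, losses + gains)
```

The resulting `p_value` is far below conventional significance thresholds, consistent with the description of intron evolution as dominated by loss.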
We observed that intron evolution within rice following segmental duplication is largely dominated by intron loss. In two of the five cases of intron gain within segmentally duplicated genes, the gained sequences were similar to transposable elements.
The available web-based genome data and related resources provide great opportunities for biomedical scientists to identify functional elements in a particular genome region or to explore the evolutionary pattern of genome dynamics. Comparative genomics is an indispensable tool for achieving these goals. Because of the broad scope of comparative genomics, it is difficult to address all of its aspects in a short survey. A few currently 'hot' topics have therefore been selected, and a brief review of the available web-based databases and software is given.
comparative genomics; software; web-based database
Many protein families have undergone functional divergence after gene duplications such that current subgroups of the family carry out overlapping but distinct biological roles. For the protein families with known functional subtypes (a functional split), we developed the software, SplitTester, to identify potential regions that are responsible for the observed distinct functional subtypes within the same protein family.
Our software, SplitTester, takes as input a multiple protein sequence alignment generated from members of two subgroups with known functional divergence. SplitTester constructs a neighbor-joining tree (a split cluster) from variable-sized sliding windows across the alignment, in a process called split-clustering. It identifies regions whose split cluster is consistent with the functional split but may be inconsistent with the phylogeny of the protein family. We hypothesize that at least some of these regions, which do not follow a random mutation process, are responsible for the observed functional split. To test our method, we used reverse transcriptases from a group of Pseudoviridae retrotransposons to identify residues specific for diverged primer recognition. Candidate regions were then mapped onto the three-dimensional structures of reverse transcriptase. The locations of these amino acids within the enzyme are consistent with their biological roles.
SplitTester aims to identify specific domain sequences responsible for the functional divergence of subgroups within a protein family. From the analysis of the retroelement reverse transcriptase family, we successfully identified regions that split this family according to primer specificity, implying that they function in specific primer selection.
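The sliding-window idea can be sketched in a few lines. The example below is not SplitTester itself: instead of building a neighbor-joining tree per window, it uses a simpler criterion (every between-group p-distance must exceed every within-group p-distance), and it uses a single fixed window size rather than variable-sized windows. The function names and the toy alignment are assumptions for illustration only.

```python
from itertools import combinations

def p_distance(seq_a, seq_b):
    """Fraction of aligned positions that differ (columns gapped in both are skipped)."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if not (a == "-" and b == "-")]
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)

def window_supports_split(group_1, group_2, start, size):
    """True if, in this window, all between-group distances exceed all within-group ones."""
    g1 = [s[start:start + size] for s in group_1]
    g2 = [s[start:start + size] for s in group_2]
    within = [p_distance(a, b) for g in (g1, g2) for a, b in combinations(g, 2)]
    between = [p_distance(a, b) for a in g1 for b in g2]
    return min(between) > max(within, default=0.0)

def scan_alignment(group_1, group_2, size):
    """Return the start positions of windows whose grouping matches the functional split."""
    length = len(group_1[0])
    return [i for i in range(length - size + 1)
            if window_supports_split(group_1, group_2, i, size)]

# Toy alignment (made-up sequences): only the K/R block distinguishes the groups.
group_1 = ["ACDEFKKKKGHIL", "ACDEYKKKKGHIL"]
group_2 = ["ACDEFRRRRGHIL", "ACDEYRRRRGHIL"]
hits = scan_alignment(group_1, group_2, size=4)
```

Windows that overlap the K/R block are reported, while windows confined to the shared flanks, or dominated by the F/Y site that varies within both groups, are not.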
Protein-protein interactions play a critical role in protein function. Completion of many genomes is being followed rapidly by major efforts to identify interacting protein pairs experimentally in order to decipher the networks of interacting, coordinated-in-action proteins. Identification of protein-protein interaction sites and detection of specific amino acids that contribute to the specificity and the strength of protein interactions is an important problem with broad applications ranging from rational drug design to the analysis of metabolic and signal transduction networks.
To increase the power of predictive methods for protein-protein interaction sites, we have developed a consensus methodology that combines four different methods: data mining using Support Vector Machines, threading through protein structures, prediction of conserved residues on the protein surface by analysis of phylogenetic trees, and the Conservatism of Conservatism method of Mirny and Shakhnovich. Results obtained on a dataset of hydrolase-inhibitor complexes demonstrate that the combination of all four methods yields improved predictions over the individual methods.
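One simple way to combine several interface predictors is vote counting, sketched below. The abstract does not state how the four methods are actually combined, so the voting rule, the `min_votes` threshold, the method labels, and the residue numbers are all assumptions made for illustration.

```python
def consensus_interface(predictions, min_votes=2):
    """Combine per-residue predictions from several methods by vote counting.

    predictions: dict mapping a method name to the set of residue numbers
    that method predicts to lie in the interface.
    """
    votes = {}
    for residues in predictions.values():
        for residue in residues:
            votes[residue] = votes.get(residue, 0) + 1
    # Keep residues supported by at least min_votes methods.
    return sorted(r for r, n in votes.items() if n >= min_votes)

# Made-up predictions standing in for the outputs of four methods.
predictions = {
    "svm": {10, 11, 12, 40},
    "threading": {11, 12, 13},
    "phylogeny": {12, 13, 55},
    "conservatism": {12, 40, 41},
}
consensus = consensus_interface(predictions, min_votes=2)
strict = consensus_interface(predictions, min_votes=4)
```

Raising `min_votes` trades sensitivity for specificity: requiring agreement among all four methods keeps only the residues that every method supports.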
We developed a consensus method for predicting protein-protein interface residues by combining sequence and structure-based methods. The success of our consensus approach suggests that similar methodologies can be developed to improve prediction accuracies for other bioinformatic problems.
Myb genes from Arabidopsis and rice were clustered into subgroups. The distribution of introns in the phylogenetic tree suggests that introns were inserted during evolution.
Myb proteins contain a conserved DNA-binding domain composed of one to four repeat motifs (referred to as R0R1R2R3); each repeat is approximately 50 amino acids in length, with regularly spaced tryptophan residues. Although the Myb proteins comprise one of the largest families of transcription factors in plants, little is known about the functions of most Myb genes. Here we use computational techniques to classify Myb genes on the basis of sequence similarity and gene structure, and to identify possible functional relationships among subgroups of Myb genes from Arabidopsis and rice (Oryza sativa L. ssp. indica).
This study analyzed 130 Myb genes from Arabidopsis and 85 from rice. The collected Myb proteins were clustered into subgroups based on sequence similarity and phylogeny. Interestingly, the exon-intron structure differed between subgroups, but was conserved in the same subgroup. Moreover, the Myb domains contained a significant excess of phase 1 and 2 introns, as well as an excess of nonsymmetric exons. Conserved motifs were detected in carboxy-terminal coding regions of Myb genes within subgroups. In contrast, no common regulatory motifs were identified in the noncoding regions. Additionally, some Myb genes with similar functions were clustered in the same subgroups.
The distribution of introns in the phylogenetic tree suggests that Myb domains originally were compact in size; introns were inserted and the splicing sites conserved during evolution. Conserved motifs identified in the carboxy-terminal regions are specific for Myb genes, and the identified Myb gene subgroups may reflect functional conservation.
In spite of only a 1-2 per cent difference in genomic DNA sequence, humans and chimpanzees differ considerably in behaviour and cognition. Affymetrix microarray technology provides a novel approach to addressing the long-standing debate on whether the differences between humans and chimpanzees result from alterations in gene expression. Here, we used several statistical methods (distance method, two-sample t-tests, regularised t-tests, ANOVA and bootstrapping) to detect differential expression patterns between humans and great apes. Our analysis shows that the pattern we observed before is robust across these statistical methods; that is, pronounced expression changes occurred on the human lineage after the split from chimpanzees, and the dramatic brain expression alterations in humans may be driven mainly by a set of genes with increased expression (up-regulated) rather than decreased expression (down-regulated).
microarray; Affymetrix; differential expression; human evolution