Since its first release in 2001 as mainly a software package for phylogenetic analysis, data analysis for molecular biology and evolution (DAMBE) has gained many new functions that may be classified into six categories: 1) sequence retrieval, editing, manipulation, and conversion among more than 20 standard sequence formats including MEGA, NEXUS, PHYLIP, GenBank, and the new NeXML format for interoperability, 2) motif characterization and discovery functions such as position weight matrix and Gibbs sampler, 3) descriptive genomic analysis tools with improved versions of codon adaptation index, effective number of codons, protein isoelectric point profiling, RNA and protein secondary structure prediction and calculation of minimum folding energy, and genomic skew plots with optimized window size, 4) molecular phylogenetics including sequence alignment, testing substitution saturation, distance-based, maximum parsimony, and maximum-likelihood methods for tree reconstructions, testing the molecular clock hypothesis with either a phylogeny or with relative-rate tests, dating gene duplication and speciation events, choosing the best-fit substitution models, and estimating rate heterogeneity over sites, 5) phylogeny-based comparative methods for continuous and discrete variables, and 6) graphic functions including secondary structure display, optimized skew plot, hydrophobicity plot, and many other plots of amino acid properties along a protein sequence, tree display and drawing by dragging nodes to each other, and visual searching of the maximum parsimony tree. DAMBE features a graphic, user-friendly, and intuitive interface and is freely available from http://dambe.bio.uottawa.ca (last accessed April 16, 2013).
bioinformatics; phylogenetics; dating; Gibbs sampler; motif discovery; secondary structure; codon usage; hidden Markov model; genomic analysis
Different patterns of strand asymmetry have been documented in a variety of prokaryotic genomes as well as mitochondrial genomes. Because different replication mechanisms often lead to different patterns of strand asymmetry, much can be learned about replication mechanisms by examining strand asymmetry. Here I summarize the diverse patterns of strand asymmetry among different taxonomic groups to suggest that (1) the single-origin replication may not be universal among bacterial species as the endosymbionts Wigglesworthia glossinidia, Wolbachia species, cyanobacterium Synechocystis 6803 and Mycoplasma pulmonis genomes all exhibit strand asymmetry patterns consistent with the multiple origins of replication, (2) different replication origins in some archaeal genomes leave quite different patterns of strand asymmetry, suggesting that different replication origins in the same genome may be differentially used, (3) mitochondrial genomes from representative vertebrate species share one strand asymmetry pattern consistent with the strand-displacement replication documented in mammalian mtDNA, suggesting that the mtDNA replication mechanism in mammals may be shared among all vertebrate species, and (4) mitochondrial genomes from primitive forms of metazoans such as the sponge and hydra (representing Porifera and Cnidaria, respectively), as well as those from plants, have strand asymmetry patterns similar to single-origin or multi-origin replications observed in prokaryotes and are drastically different from mitochondrial genomes from other metazoans. This may explain why sponge and hydra mitochondrial genomes, as well as plant mitochondrial genomes, evolve much more slowly than those from other metazoans.
Archaea; DNA replication; deamination; GC skew; mitochondria; mutation; origin of replication; selection.
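The strand asymmetry discussed in the abstract above is commonly quantified as GC skew, (G - C) / (G + C), computed in sliding windows along a genome. The following is a minimal sketch of that calculation, not the implementation used in the study; the function names, the default window and step sizes, and the choice to report a skew of 0 for windows without G or C are all assumptions made for illustration.

```python
def gc_skew(seq, window=1000, step=500):
    """Return (window_start, skew) pairs, where skew = (G - C) / (G + C).

    window and step are illustrative defaults, not values from the paper.
    """
    seq = seq.upper()
    result = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # Assumption: a window with no G or C is reported as skew 0.
        skew = (g - c) / (g + c) if (g + c) else 0.0
        result.append((start, skew))
    return result


def cumulative_skew(skews):
    """Running sum of the per-window skews returned by gc_skew."""
    total = 0.0
    out = []
    for _, s in skews:
        total += s
        out.append(total)
    return out
```

In a genome with a single replication origin, the sign of the windowed skew is generally expected to switch near the origin and the terminus, so the cumulative curve tends to show one minimum and one maximum; several sign changes would be the kind of pattern the abstract interprets as consistent with multiple origins.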
In scientific research, integration and synthesis require a common understanding of where data come from, how much they can be trusted, and what they may be used for. To make such an understanding computer-accessible requires standards for exchanging richly annotated data. The challenges of conveying reusable data are particularly acute in regard to evolutionary comparative analysis, which comprises an ever-expanding list of data types, methods, research aims, and subdisciplines. To facilitate interoperability in evolutionary comparative analysis, we present NeXML, an XML standard (inspired by the current standard, NEXUS) that supports exchange of richly annotated comparative data. NeXML defines syntax for operational taxonomic units, character-state matrices, and phylogenetic trees and networks. Documents can be validated unambiguously. Importantly, any data element can be annotated, to an arbitrary degree of richness, using a system that is both flexible and rigorous. We describe how the use of NeXML by the TreeBASE and Phenoscape projects satisfies user needs that cannot be satisfied with other available file formats. By relying on XML Schema Definition, the design of NeXML facilitates the development and deployment of software for processing, transforming, and querying documents. The adoption of NeXML for practical use is facilitated by the availability of (1) an online manual with code samples and a reference to all defined elements and attributes, (2) programming toolkits in most of the languages used commonly in evolutionary informatics, and (3) input–output support in several widely used software applications. An active, open, community-based development process enables future revision and expansion of NeXML.
Data standards; evolutionary informatics; interoperability; phyloinformatics; semantic web; syntax format
Position weight matrix (PWM) is not only one of the most widely used bioinformatic methods, but also a key component in more advanced computational algorithms (e.g., Gibbs sampler) for characterizing and discovering motifs in nucleotide or amino acid sequences. However, few generally applicable statistical tests are available for evaluating the significance of site patterns, PWM, and PWM scores (PWMS) of putative motifs. Statistical significance tests of the PWM output, that is, site-specific frequencies, PWM itself, and PWMS, are in disparate sources and have never been collected in a single paper, with the consequence that many implementations of PWM do not include any significance test. Here I review PWM-based methods used in motif characterization and prediction (including a detailed illustration of the Gibbs sampler for de novo motif discovery), present statistical and probabilistic rationales behind statistical significance tests relevant to PWM, and illustrate their application with real data. The multiple comparison problem associated with the test of site-specific frequencies is best handled by false discovery rate methods. The test of PWM, due to the use of pseudocounts, is best done by resampling methods. The test of individual PWMS for each sequence segment should be based on the extreme value distribution.
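As a concrete illustration of the quantities named above (site-specific frequencies, the PWM, and PWMS), the following minimal sketch builds a log-odds PWM from aligned sites and scores sequence segments with it. It is not the implementation reviewed in the paper: the function names, the pseudocount of 0.5, and the uniform background frequencies are assumptions chosen for illustration, and no significance test is included.

```python
import math


def build_pwm(sites, background=None, pseudocount=0.5, alphabet="ACGT"):
    """Build a log2-odds PWM from equal-length aligned sites.

    pseudocount and the uniform background are illustrative assumptions.
    """
    if background is None:
        background = {b: 1.0 / len(alphabet) for b in alphabet}
    n = len(sites)
    pwm = []
    for i in range(len(sites[0])):
        column = [s[i] for s in sites]
        row = {}
        for b in alphabet:
            # Site-specific frequency with a pseudocount to avoid log(0).
            freq = (column.count(b) + pseudocount) / (n + pseudocount * len(alphabet))
            row[b] = math.log2(freq / background[b])
        pwm.append(row)
    return pwm


def pwm_score(pwm, segment):
    """PWMS of one segment: the sum of position-specific log-odds."""
    return sum(row[b] for row, b in zip(pwm, segment))


def scan(pwm, seq):
    """Score every window of motif length along seq."""
    width = len(pwm)
    return [(i, pwm_score(pwm, seq[i:i + width]))
            for i in range(len(seq) - width + 1)]
```

For example, with the toy sites `["ACG", "ACG", "ACG", "ACT"]`, scanning `"TTACGTT"` gives its highest score at offset 2, where the consensus `ACG` occurs. Because the scores depend on the pseudocount, testing the PWM itself is where the abstract recommends resampling.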
Accurate and efficient splicing is of crucial importance for highly-transcribed intron-containing genes (ICGs) in rapidly replicating unicellular eukaryotes such as the budding yeast Saccharomyces cerevisiae. We characterized the 5′ and 3′ splice sites (ss) by position weight matrix scores (PWMSs), which are highest for the consensus sequence and lowest for splice sites differing most from the consensus sequence, and used PWMS as a proxy for splicing strength. HAC1, which is known to be spliced by a nonspliceosomal mechanism, has the most negative PWMS for both its 5′ ss and 3′ ss. Several genes under strong splicing regulation and requiring additional splicing factors for their splicing also have small or negative PWMS values. Splicing strength is higher for highly transcribed ICGs than for lowly transcribed ICGs and higher for transcripts that bind strongly to spliceosomes than those that bind weakly. The 3′ splice site features a prominent poly-U tract before the 3′AG. Our results suggest the potential of using PWMS as a screening tool for ICGs that are either spliced by a nonspliceosomal mechanism or under strong splicing regulation in yeast and other fungal species.
The 5′-UTR serves as the loading dock for ribosomes during translation initiation and is the key site for translation regulation. Many genes in the yeast Saccharomyces cerevisiae contain poly(A) tracts in their 5′-UTRs. We studied these pre-AUG poly(A) tracts in a set of 3274 recently identified 5′-UTRs in the yeast to characterize their effect on in vivo protein abundance, ribosomal density, and protein synthesis rate in the yeast. The protein abundance and the protein synthesis rate increase with the length of the poly(A), but exhibit a dramatic decrease when the poly(A) length is ≥12. The ribosomal density also reaches the lowest level when the poly(A) length is ≥12. This supports the hypothesis that a pre-AUG poly(A) tract can bind to translation initiation factors to enhance translation initiation, but a long (≥12) pre-AUG poly(A) tract will bind to Pab1p, whose binding size is 12 consecutive A residues in yeast, resulting in repression of translation. The hypothesis explains why a long pre-AUG poly(A) leads to more efficient translation initiation than a short one when PABP is absent, and why pre-AUG poly(A) is short in the early genes but long in the late genes of vaccinia virus.
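The length threshold described above can be illustrated with a short sketch that measures the longest run of consecutive A residues in a 5′-UTR and flags tracts of 12 or more. The function names and the category labels are hypothetical, and the sketch scans the whole UTR rather than applying whatever positional criteria the study used to define a pre-AUG tract.

```python
import re


def longest_poly_a(utr):
    """Length of the longest run of consecutive A residues in a 5'-UTR."""
    runs = re.findall(r"A+", utr.upper())
    return max((len(r) for r in runs), default=0)


def poly_a_class(utr, threshold=12):
    """Label a UTR by its longest poly(A) run.

    threshold=12 follows the abstract (the Pab1p binding size in yeast);
    the labels themselves are hypothetical.
    """
    n = longest_poly_a(utr)
    if n == 0:
        return "none"
    return "long" if n >= threshold else "short"
```

Under the hypothesis in the abstract, UTRs in the "long" class would be the ones expected to bind Pab1p and show repressed translation.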
The ratio of the number of nonsynonymous substitutions per site (Ka) over the number of synonymous substitutions per site (Ks) has often been used to detect positive selection. Investigators now commonly generate Ka/Ks ratio profiles in a sliding window to look for peaks and valleys in order to identify regions under positive selection. Here we show that the interpretation of peaks in the Ka/Ks profile as evidence for positive selection can be misleading. Genic regions with Ka/Ks > 1 in the MRG gene family, previously claimed to be under positive selection, are associated with a high frequency of polar amino acids with a high mutability. This association between an increased Ka and a high proportion of polar amino acids appears general and not limited to the MRG gene family or the sliding-window approach. For example, the sites detected to be under positive selection in the HIV1 protein-coding genes with a high posterior probability turn out to be mostly occupied by polar amino acids. These findings caution against invoking positive selection from Ka/Ks ratios and highlight the need for considering biochemical properties of the protein domains showing high Ka/Ks ratios. In short, a high Ka/Ks ratio may arise from the intrinsic properties of amino acids instead of from extrinsic positive selection.
Despite its poorly adapted codon usage, HIV-1 replicates and is expressed extremely well in human host cells. HIV-1 has recently been shown to package non-lysyl transfer RNAs (tRNAs) in addition to the tRNALys needed for priming reverse transcription and integration of the HIV-1 genome. By comparing the codon usage of HIV-1 genes with that of its human host, we found that tRNAs decoding codons that are highly used by HIV-1 but avoided by its host are overrepresented in HIV-1 virions. In particular, tRNAs decoding A-ending codons, required for the expression of HIV's A-rich genome, are highly enriched. Because the affinity of Gag-Pol for all tRNAs is nonspecific, HIV packaging is most likely passive and reflects the tRNA pool at the time of viral particle formation. Codon usage of HIV-1 early genes is similar to that of highly expressed host genes, but codon usage of HIV-1 late genes was better adapted to the selectively enriched tRNA pool, suggesting that alterations in the tRNA pool are induced late in viral infection. If HIV-1 genes are adapting to an altered tRNA pool, codon adaptation of HIV-1 may be better than previously thought.
HIV-1; tRNA; codon usage; translation efficiency; codon–anticodon adaptation
In spite of extensive research on the effect of mutation and selection on codon usage, a general model of codon usage bias due to mutational bias has been lacking. Because most amino acids allow synonymous GC-content-changing substitutions in the third codon position, the overall GC bias of a genome or genomic region is highly correlated with GC3, a measure of third position GC content. For individual amino acids as well, G/C-ending codon usage generally increases with increasing GC bias and decreases with increasing AT bias. Arginine and leucine, amino acids that allow GC-changing synonymous substitutions in the first and third codon positions, have codons which may be expected to show different usage patterns.
In analyzing codon usage bias in hundreds of prokaryotic and plant genomes and in human genes, we find that two G-ending codons, AGG (arginine) and TTG (leucine), unlike all other G/C-ending codons, show overall usage that decreases with increasing GC bias, contrary to the usual expectation that G/C-ending codon usage should increase with increasing genomic GC bias. Moreover, the usage of some codons appears nonlinear, even nonmonotone, as a function of GC bias. To explain these observations, we propose a continuous-time Markov chain model of GC-biased synonymous substitution. This model correctly predicts the qualitative usage patterns of all codons, including nonlinear codon usage in isoleucine, arginine and leucine. The model accounts for 72%, 64% and 52% of the observed variability of codon usage in prokaryotes, plants and human respectively. When codons are grouped based on common GC content, 87%, 80% and 68% of the variation in usage is explained for prokaryotes, plants and human respectively.
The model clarifies the sometimes-counterintuitive effects that GC mutational bias can have on codon usage, quantifies the influence of GC mutational bias and provides a natural null model relative to which other influences on codon bias may be measured.
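The counterintuitive behaviour of AGG can be illustrated with a deliberately simplified sketch. It is not the continuous-time Markov chain model proposed by the authors. It assumes an F81-style process in which the rate of each synonymous single-nucleotide change depends only on the equilibrium frequency of the target nucleotide, with G and C each at frequency gc/2 and A and T each at (1 - gc)/2. Under that assumption the chain is reversible, and the stationary usage of a codon within its synonymous family is proportional to the product of the frequencies of its three nucleotides. The function name and the parameter `gc` are hypothetical.

```python
# Arginine codons in the standard genetic code.
ARG_FAMILY = ["CGA", "CGC", "CGG", "CGT", "AGA", "AGG"]


def family_usage(codon_family, gc):
    """Stationary codon usage within one synonymous family.

    Assumes 0 < gc < 1 and an F81-style mutation process (an assumption of
    this sketch, not the published model): usage is proportional to the
    product of the equilibrium frequencies of the codon's nucleotides.
    """
    nt = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    weights = {c: nt[c[0]] * nt[c[1]] * nt[c[2]] for c in codon_family}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}
```

In this simplified setting the usage of AGG works out to gc(1 - gc) / (1 + gc), which rises at low GC content, peaks near gc ≈ 0.41, and then falls as GC bias increases, because CGN codons gain a G/C nucleotide at the first position as well. This gives the same qualitative picture as the abstract: a G-ending codon whose usage is nonmonotone and decreases at high GC bias.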
Ovarian follicle development is a complex process. Paracrine interactions between somatic and germ cells are critical for normal follicular development and oocyte maturation. Studies have suggested that the health and function of the granulosa and cumulus cells may be reflective of the health status of the enclosed oocyte. The objective of the present study is to assess, using an in vivo immature rat model, the gene expression profile in granulosa cells, which may be linked to the developmental competence of the oocyte. We hypothesized that expression of specific genes in granulosa cells may be correlated with the developmental competence of the oocyte.
Immature rats were injected with eCG and 24 h thereafter with anti-eCG antibody to induce follicular atresia or with pre-immune serum to stimulate follicle development. A high percentage (30-50%, normal developmental competence, NDC) of oocytes from eCG/pre-immune serum group developed to term after embryo transfer compared to those from eCG/anti-eCG (0%, poor developmental competence, PDC). Gene expression profiles of mural granulosa cells from the above oocyte-collected follicles were assessed by Affymetrix rat whole genome array.
The results showed that twelve genes were up-regulated, while one gene was down-regulated, by more than 1.5-fold in the NDC group compared with the PDC group. Gene ontology classification showed that the up-regulated genes included lysyl oxidase (Lox) and nerve growth factor receptor associated protein 1 (Ngfrap1), which are important in the regulation of protein-lysine 6-oxidase activity, and in apoptosis induction, respectively. The down-regulated genes included glycoprotein-4-beta galactosyltransferase 2 (Ggbt2), which is involved in the regulation of extracellular matrix organization and biogenesis.
The data in the present study demonstrate a close association between specific gene expression in mural granulosa cells and the developmental competence of oocytes. This finding suggests that the most differentially expressed gene, lysyl oxidase, may be a candidate biomarker of oocyte health and useful for the selection of good quality oocytes for assisted reproduction.
Many vertebrates, including the goldfish, exhibit seasonal reproductive rhythms, which are a result of interactions between external environmental stimuli and internal endocrine systems in the hypothalamo-pituitary-gonadal axis. While it has long been believed that differential expression of neuroendocrine genes contributes to establishing seasonal reproductive rhythms, no systems-level investigation has yet been conducted.
In the present study, by analyzing multiple female goldfish brain microarray datasets, we have characterized global gene expression patterns for a seasonal cycle. A core set of genes (873 genes) in the hypothalamus were identified to be differentially expressed between May, August and December, which correspond to the physiologically distinct stages of sexual maturity (prespawning), sexual regression, and early gonadal redevelopment, respectively. Expression changes of these genes are also shared by another brain region, the telencephalon, as revealed by multivariate analysis. More importantly, by examining one dataset obtained from fish in October that were kept under a long-daylength photoperiod (16 h) typical of the springtime breeding season (May), we observed that the expression of the identified genes appears to be regulated by photoperiod, a major factor controlling vertebrate reproductive cyclicity. Gene ontology analysis revealed that hormone genes and genes functionally involved in the G-protein coupled receptor signaling pathway and transmission of nerve impulses are significantly enriched in an expression pattern whose transition is located between the prespawning and sexually regressed stages. The existence of seasonal expression patterns was verified for several genes including isotocin, ependymin II, GABAA gamma2 receptor, calmodulin, and aromatase b by independent samplings of goldfish brains from six seasonal time points and real-time PCR assays.
Using both theoretical and experimental strategies, we report for the first time global gene expression patterns throughout a breeding season which may account for dynamic neuroendocrine regulation of seasonal reproductive development.
The objective of this work was to investigate the hypothesis that eukaryotic Internal Ribosome Entry Sites (IRES) lack secondary structure and to examine the generality of the hypothesis.
IRESs of the yeast and the fruit fly are located in the 5′UTR immediately upstream of the initiation codon. The minimum folding energy (MFE) of 60 nt RNA segments immediately upstream of the initiation codons was calculated as a proxy of secondary structure stability. MFE of the reverse complements of these 60 nt segments was also calculated. The relationship between MFE and empirically determined IRES activity was investigated to test the hypothesis that strong IRES activity is associated with weak secondary structure. We show that IRES activity in the yeast and the fruit fly correlates strongly with the structural stability, with highest IRES activity found in RNA segments that exhibit the weakest secondary structure.
We found that a subset of eukaryotic IRESs exhibits very low secondary structure in the 5′-UTR sequences immediately upstream of the initiation codon. The consistency in results between the yeast and the fruit fly suggests a possible shared mechanism of cap-independent translation initiation that relies on an unstructured RNA segment.
Maelstrom (MAEL) plays a crucial role in a recently-discovered piRNA pathway; however, its specific function remains unknown. Here a novel MAEL-specific domain characterized by a set of conserved residues (Glu-His-His-Cys-His-Cys, EHHCHC) was identified in a broad range of species including vertebrates, sea squirts, insects, nematodes, and protists. It exhibits ancient lineage-specific expansions in several species but appears to be lost in all examined teleost fish species. Functional involvement of MAEL domains in DNA- and RNA-related processes was further revealed by its association with HMG, SR-25-like and HDAC_interact domains. A distant similarity to the DnaQ-H 3'–5' exonuclease family with the RNase H fold was discovered based on the evidence that all MAEL domains adopt the canonical RNase H fold and that several protist MAEL domains contain the conserved 3'–5' exonuclease active site residues (Asp-Glu-Asp-His-Asp, DEDHD). This evolutionary link together with structural examinations leads to a hypothesis that MAEL domains may have a potential nuclease activity or RNA-binding ability that may be implicated in piRNA biogenesis. The observed transition of two sets of characteristic residues between the ancestral DnaQ-H and the descendent MAEL domains may suggest a new mode for protein function evolution called "active site switch", in which the protist MAEL homologues are the likely evolutionary intermediates due to harboring the specific characteristics of both 3'–5' exonuclease and MAEL domains.
This article was reviewed by L Aravind, Wing-Cheong Wong and Frank Eisenhaber. For the full reviews, please go to the Reviewers' Comments section.
It is known that primary sequences of enzymes involved in sterol biosynthesis are well conserved in organisms that produce sterols de novo. However, we provide evidence for a preservation of the corresponding genes in two animals unable to synthesize cholesterol (auxotrophs): Drosophila melanogaster and Caenorhabditis elegans.
We have been able to detect bona fide orthologs of several ERG genes in both organisms using a series of complementary approaches. We have detected strong sequence divergence between the orthologs of the nematode and of the fruitfly; they are also very divergent with respect to the orthologs in organisms able to synthesize sterols de novo (prototrophs). Interestingly, the orthologs in both the nematode and the fruitfly are still under selective pressure. It is possible that these genes, which are not involved in cholesterol synthesis anymore, have been recruited to perform different new functions. We propose a more parsimonious way to explain their accelerated evolution and subsequent stabilization. The products of ERG genes in prototrophs might be involved in several biological roles, in addition to sterol synthesis. In the case of the nematode and the fruitfly, the relevant genes would have lost their ancestral function in cholesterogenesis but would have retained the other function(s), which keep them under pressure.
By exploiting microarray data, we have noticed a strong correlation in expression between the orthologs of ERG24 and ERG25 in D. melanogaster and genes encoding factors involved in intracellular protein trafficking and folding, as well as Start1, which is involved in ecdysteroid synthesis. These potential functional connections are worth exploring not only in Drosophila, but also in Caenorhabditis as well as in sterol prototrophs.
Fungal and animal mitochondrial genomes typically have one tRNA for each synonymous codon family. The codon-anticodon adaptation hypothesis predicts that the wobble nucleotide of a tRNA anticodon should evolve towards maximizing Watson-Crick base pairing with the most frequently used codon within each synonymous codon family, whereas the wobble versatility hypothesis argues that the nucleotide at the wobble site should be occupied by a nucleotide most versatile in wobble pairing, i.e., the tRNA wobble nucleotide should be G for NNY codon families, and U for NNR and NNN codon families (where Y stands for C or U, R for A or G and N for any nucleotide).
We here integrate these two traditional hypotheses on tRNA anticodons into a unified model based on an analysis of the wobble costs associated with different wobble base pairs. This novel approach allows the relative cost of wobble pairing to be qualitatively evaluated. A comprehensive study of 36 fungal genomes suggests very different costs between two kinds of U:G wobble pairs, i.e., (1) between a G at the wobble site of a tRNA anticodon and a U at the third codon position (designated MU3:G) and (2) between a U at the wobble site of a tRNA anticodon and a G at the third codon position (designated MG3:U).
In general, MU3:G is much smaller than MG3:U, suggesting no selection against U-ending codons in NNY codon families with a wobble G in the tRNA anticodon but strong selection against G-ending codons in NNR codon families with a wobble U at the tRNA anticodon. This finding resolves several puzzling observations in fungal genomics and corroborates previous studies showing that U3:G wobble is energetically more favorable than G3:U wobble.
Normalization is essential in dual-labelled microarray data analysis to remove non-biological variations and systematic biases. Many normalization methods have been used to remove such biases within slides (Global, Lowess) and across slides (Scale, Quantile and VSN). However, all these popular approaches make critical assumptions about data distribution, which are often not valid in practice.
In this study, we propose a novel assumption-free normalization method based on the Generalized Procrustes Analysis (GPA) algorithm. Using experimental and simulated normal microarray data and boutique array data, we systematically evaluate the ability of the GPA method in normalization compared with six other popular normalization methods including Global, Lowess, Scale, Quantile, VSN, and one boutique array-specific housekeeping gene method. The assessment of these methods is based on three different empirical criteria: across-slide variability, the Kolmogorov-Smirnov (K-S) statistic and the mean square error (MSE). Compared with other methods, the GPA method performs effectively and consistently better in reducing across-slide variability and removing systematic bias.
The GPA method is an effective normalization approach for microarray data analysis. In particular, it is free from the statistical and biological assumptions inherent in other normalization methods that are often difficult to validate. Therefore, the GPA method has a major advantage in that it can be applied to diverse types of array sets, especially to the boutique array where the majority of genes may be differentially expressed.
The reconstruction of phylogenies is becoming an increasingly simple activity. This is mainly due to two reasons: the democratization of computing power and the increased availability of sophisticated yet user-friendly software. This review describes some of the latest additions to the phylogenetic toolbox, along with some of their theoretical and practical limitations. It is shown that Bayesian methods are under heavy development, as they offer the possibility to solve a number of long-standing issues and to integrate several steps of the phylogenetic analyses into a single framework. Specific topics include not only phylogenetic reconstruction, but also the comparison of phylogenies, the detection of adaptive evolution, and the estimation of divergence times between species.
Codon adaptation index is a widely used index for characterizing gene expression in general and translation efficiency in particular. Current computational implementations have a number of problems leading to various systematic biases. I illustrate these problems and provide a better computer implementation to solve these problems. The improved CAI can predict protein production better than CAI from other commonly used implementations.
Codon usage bias; translation elongation; gene expression; tRNA
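For reference, the classic formulation of the codon adaptation index can be sketched as follows: each codon receives a relative adaptiveness w equal to its count in a reference set of highly expressed genes divided by the count of the most used synonymous codon, and the CAI of a gene is the geometric mean of w over its codons. This is not the improved implementation described in the abstract. The function names are hypothetical, and the handling of edge cases (skipping single-codon families, stop codons, and codons with w = 0) is an assumption of this sketch; such ad hoc choices are one place where implementations can differ.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product("TCAG", repeat=3), _AMINO)}


def codons(seq):
    """Split a coding sequence into complete codons (U treated as T)."""
    seq = seq.upper().replace("U", "T")
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_genes):
    """w for each codon: its count divided by the count of the most used
    synonymous codon in the reference set."""
    counts = Counter(c for g in reference_genes for c in codons(g)
                     if c in CODON_TABLE)
    w = {}
    for aa in set(CODON_TABLE.values()):
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        if aa == "*" or len(family) == 1:
            continue  # stop codons, Met and Trp carry no usage information
        best = max(counts[c] for c in family)
        if best == 0:
            continue  # amino acid absent from the reference set
        for c in family:
            w[c] = counts[c] / best
    return w


def cai(gene, w):
    """Geometric mean of w over the gene's codons.

    Assumption: codons with w == 0 or without a w value are skipped.
    """
    logs = [math.log(w[c]) for c in codons(gene) if w.get(c, 0) > 0]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")
```

With a toy reference set containing GCT three times and GCC once, a gene made only of GCT codons has a CAI of 1, while a gene with one GCT and one GCC has a CAI equal to the square root of 1/3.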
I show several types of topological biases in distance-based methods that use the least-squares method to evaluate branch lengths and the minimum evolution (ME) or the Fitch-Margoliash (FM) criterion to choose the best tree. For a 6-species tree, there are two tree shapes, one with three cherries (a cherry is a pair of adjacent leaves descending from the most recent common ancestor), and the other with two. When genetic distances are underestimated, the 3-cherry tree shape is favored with either the ME or FM criterion. When the genetic distances are overestimated, the ME criterion favors the 2-cherry tree, but the direction of bias with the FM criterion depends on whether negative branches are allowed, i.e. allowing negative branches favors the 3-cherry tree shape but disallowing negative branches favors the 2-cherry tree shape. The extent of the bias is explored by computer simulation of sequence evolution.
topological bias; minimum evolution; least-squares method; Fitch-Margoliash
The strand-biased mutation spectrum in vertebrate mitochondrial genomes results in an AC-rich L-strand and a GT-rich H-strand. Because the L-strand is the sense strand of 12 protein-coding genes out of the 13, the third codon position is overall strongly AC-biased. The wobble site of the anticodon of the 22 mitochondrial tRNAs is either U or G to pair with the most abundant synonymous codon, with only one exception. The wobble site of Met-tRNA is C instead of U, forming the Watson-Crick match with AUG instead of AUA, the latter being much more frequent than the former. This has been attributed to a compromise between translation initiation and elongation; i.e., AUG is not only a methionine codon, but also an initiation codon, and an anticodon matching AUG will increase the initiation rate. However, such an anticodon would impose selection against the use of AUA codons because AUA needs to be wobble-translated. According to this translation conflict hypothesis, AUA should be used relatively less frequently compared to UUA in the UUR codon family. A comprehensive analysis of mitochondrial genomes from a variety of vertebrate species revealed a general deficiency of AUA codons relative to UUA codons. In contrast, urochordate mitochondrial genomes with two tRNAMet genes with CAU and UAU anticodons exhibit increased AUA codon usage. Furthermore, six bivalve mitochondrial genomes with both of their tRNA-Met genes with a CAU anticodon have reduced AUA usage relative to three other bivalve mitochondrial genomes with one of their two tRNA-Met genes having a CAU anticodon and the other having a UAU anticodon. We conclude that the translation conflict hypothesis is empirically supported, and our results highlight the fine details of selection in shaping molecular evolution.
The optimal context for translation initiation in mammalian species is GCCRCCaugG (where R = purine and “aug” is the initiation codon), with the -3R and +4G being particularly important. The presence of +4G has been interpreted as necessary for efficient translation initiation. Accumulated experimental and bioinformatic evidence has suggested an alternative explanation based on amino acid constraint on the second codon, i.e., Ala or Gly is needed as the second amino acid in the nascent peptide for the cleavage of the initiator Met, and the consequent overuse of Ala and Gly codons (GCN and GGN) leads to the +4G consensus. I performed a critical test of these alternative hypotheses on +4G based on 34169 human protein-coding genes and published gene expression data. The result shows that the prevalence of +4G is not related to translation initiation. Among the five G-starting codon families, only alanine codons (GCN), and glycine codons (GGN) to a much smaller extent, are overrepresented at the second codon, whereas the other three are not overrepresented. While highly expressed genes have more +4G than lowly expressed genes, the difference is caused by GCN and GGN codons at the second codon. These results are inconsistent with +4G being needed for efficient translation initiation, but consistent with the amino acid constraint hypothesis.
MBEToolbox is an extensible MATLAB-based software package for analysis of DNA and protein sequences. MBEToolbox version 2.0 includes enhanced functions for phylogenetic analyses by the maximum likelihood method. For example, it is capable of estimating the synonymous and nonsynonymous substitution rates using a novel or several known codon substitution models. MBEToolbox 2.0 introduces new functions for estimating site-specific evolutionary rates by using a maximum likelihood method or an empirical Bayesian method. It also incorporates several different methods for recombination detection. Multi-platform versions of the software are freely available at http://www.bioinformatics.org/mbetoolbox/.
MBEToolbox; MATLAB; Molecular Evolution; Computer software
Bacterial genomes differ dramatically in AT%. We have developed a model to show that the genomic AT% in rapidly replicating bacterial species can be used as an index of the availability of nucleotides A and T for DNA replication in cellular medium. This index is then used to (1) study the evolution and adaptation of the bacteriophage genomic AT% in response to the differential nucleotide availability of the host and (2) test the prediction that double-stranded DNA (dsDNA) phage should exhibit better adaptation than single-stranded DNA (ssDNA) phage because the rate of spontaneous deamination, which leads to C→T or C→U mutations depending on whether C is methylated or not, is about 100-fold greater in ssDNA than in dsDNA.
We retrieved 79 dsDNA phage and 27 ssDNA phage genomes together with their host genomic sequences. The dsDNA phages have their genomic AT% better adapted to the host genomic AT% than ssDNA phage. The poorer adaptation of the ssDNA phage can be partially accounted for by the C→T(U) mutations mediated by the spontaneous deamination. For ssDNA phage, the genomic A% is more strongly correlated with their host genomic AT% than the genomic T%.
A significant fraction of variation in the genomic AT% in the dsDNA phage, and that in the genomic A% and T% of the ssDNA phage, can be explained by the difference in selection and mutation between them.
MATLAB is a high-performance language for technical computing, integrating computation, visualization, and programming in an easy-to-use environment. It has been widely used in many areas, such as mathematics and computation, algorithm development, data acquisition, modeling, simulation, and scientific and engineering graphics. However, few functions are freely available in MATLAB to perform the sequence data analyses specifically required for molecular biology and evolution.
We have developed a MATLAB toolbox, called MBEToolbox, aimed at filling this gap by offering efficient implementations of the most needed functions in molecular biology and evolution. It can be used to manipulate aligned sequences, calculate evolutionary distances, estimate synonymous and nonsynonymous substitution rates, and infer phylogenetic trees. Moreover, it provides an extensible, functional framework for users with more specialized requirements to explore and analyze aligned nucleotide or protein sequences from an evolutionary perspective. All functions in the toolbox are accessible through the command line for seasoned MATLAB users. A graphical user interface, which may be especially useful for non-specialist end users, is also provided.
MBEToolbox is a useful tool that can aid in the exploration, interpretation and visualization of data in molecular biology and evolution. The software is publicly available at and .