Transcription factors (TFs) form large paralogous gene families and have complex evolutionary histories. Here, we ask whether putative orthologs of TFs, from bidirectional best BLAST hits (BBHs), are evolutionary orthologs with conserved functions. We show that BBHs of TFs from distantly related bacteria are usually not evolutionary orthologs. Furthermore, the false orthologs usually respond to different signals and regulate distinct pathways, while the few BBHs that are evolutionary orthologs do have conserved functions. To test the conservation of regulatory interactions, we analyze expression patterns. We find that regulatory relationships between TFs and their regulated genes are usually not conserved for BBHs in Escherichia coli K12 and Bacillus subtilis. Even in the much more closely related bacteria Vibrio cholerae and Shewanella oneidensis MR-1, predicting regulation from E. coli BBHs has high error rates. Using gene–regulon correlations, we identify genes whose expression pattern differs between E. coli and S. oneidensis. Using literature searches and sequence analysis, we show that these changes in expression patterns reflect changes in gene regulation, even for evolutionary orthologs. We conclude that the evolution of bacterial regulation should be analyzed with phylogenetic trees, rather than BBHs, and that bacterial regulatory networks evolve more rapidly than previously thought.
Living organisms use transcription factors (TFs) to control the production of proteins. For example, the bacterium E. coli contains a TF that prevents it from making enzymes that degrade lactose when lactose is absent. Bacterial genomes encode a huge diversity of TFs, and except in a few well-studied organisms, the function of these TFs is not known. To predict the function of a TF, biologists often search for a similar TF, from another organism, that has been characterized. It is generally believed that orthologous TFs—TFs that are derived from the organisms' common ancestor—will have conserved functions. The authors show that a commonly used method to identify orthologous TFs gives misleading results when applied to distantly related bacteria: the “orthologous” TFs are evolutionarily distant, they sense different signals, and they regulate different pathways. Biologists often predict, more specifically, that orthologous TFs will regulate orthologous genes. However, the authors show that even in more closely related bacteria, where the orthologous TFs do have conserved functions, these specific predictions are often incorrect. It seems that gene regulation in bacteria evolves rapidly, and it will be difficult to predict regulation in diverse bacteria from our knowledge of a few well-studied bacteria.
Orthologous relationships between genes are routinely inferred from bidirectional best hits (BBH) in pairwise genome comparisons. However, to our knowledge, it has never been quantitatively demonstrated that orthologs form BBH. To test this “BBH-orthology conjecture,” we take advantage of the operon organization of bacterial and archaeal genomes and assume that, when two genes in compared genomes are flanked by two BBH and show statistically significant sequence similarity to one another, these genes are bona fide orthologs. Under this assumption, we tested whether middle genes in “syntenic orthologous gene triplets” form BBH. We found that this was the case in more than 95% of the syntenic gene triplets in all genome comparisons. A detailed examination of the exceptions to this pattern, including maximum likelihood phylogenetic tree analysis, showed that some of these deviations involved artifacts of genome annotation, whereas very small fractions represented random assignment of the best hit to one of several closely related in-paralogs, paralogous displacement in situ, or even less frequent genuine violations of the BBH–orthology conjecture caused by acceleration of evolution in one of the orthologs. We conclude that, at least in prokaryotes, genes for which independent evidence of orthology is available typically form BBH and, conversely, BBH can serve as a strong indication of gene orthology.
orthology; bidirectional best hit; genome comparison; synteny
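The triplet test described above can be illustrated with a minimal sketch. It assumes linear gene orders, ignores strand and genome circularity, and takes the BBH pairs and the set of significantly similar pairs as precomputed inputs. The function name, gene identifiers, and data layout are hypothetical and are not taken from the study.

```python
from typing import List, Set, Tuple


def middle_gene_bbh_counts(
    order_a: List[str],
    order_b: List[str],
    bbh: Set[Tuple[str, str]],
    similar: Set[Tuple[str, str]],
) -> Tuple[int, int]:
    """Return (number of syntenic triplets, number whose middle genes form a BBH).

    A triplet is counted when both flanking genes in genome A have BBH
    partners that also flank a single gene in genome B, and the two middle
    genes are significantly similar (assumed to be given in `similar`).
    """
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    partner = dict(bbh)  # gene in A -> its BBH partner in B
    triplets = 0
    middle_is_bbh = 0
    for i in range(1, len(order_a) - 1):
        left, mid, right = order_a[i - 1], order_a[i], order_a[i + 1]
        if left not in partner or right not in partner:
            continue
        pl = pos_b.get(partner[left])
        pr = pos_b.get(partner[right])
        # The two partners must enclose exactly one gene in genome B.
        if pl is None or pr is None or abs(pl - pr) != 2:
            continue
        mid_b = order_b[(pl + pr) // 2]
        if (mid, mid_b) not in similar:
            continue
        triplets += 1
        if (mid, mid_b) in bbh:
            middle_is_bbh += 1
    return triplets, middle_is_bbh


# Toy data (hypothetical): a4/b4 are similar and syntenic but not a BBH.
order_a = ["a1", "a2", "a3", "a4", "a5"]
order_b = ["b1", "b2", "b3", "b4", "b5"]
bbh = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a5", "b5")}
similar = {("a2", "b2"), ("a4", "b4")}
result = middle_gene_bbh_counts(order_a, order_b, bbh, similar)
print(result)
```

In this toy case two triplets qualify and only one has a BBH middle pair; the fraction of such triplets is the quantity reported as exceeding 95% in the genome comparisons.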
The bidirectional best hit (BBH) criterion, which entails identifying the pairs of genes in two different genomes that are more similar to each other than either is to any other gene in the other genome, is a simple and widely used method to infer orthology. A recent study analyzed the link between BBH and orthology in bacteria and archaea and concluded that, given the very high consistency in BBH observed among triplets of neighboring genes, a high proportion of BBH are likely to be bona fide orthologs. However, limited by its analysis setup, the previous study could not easily test the reverse question: what proportion of orthologs are BBH? In this follow-up study, we consider this question in theory and answer it based on conceptual arguments, simulated data, and real biological data from all three domains of life. Our analyses corroborate the findings of the previous study, but also show that because of the high rate of gene duplication in plants and animals, as much as 60% of orthologous relations are missed by the BBH criterion.
orthology; bidirectional best hit; reciprocal best hit; comparative genomics; evolutionary relationships; in-paralogy
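The BBH criterion defined above can be sketched in a few lines. This is a simplified illustration: it assumes one symmetric similarity score per gene pair (real BLAST searches give direction-dependent scores), breaks ties by keeping the first hit encountered, and uses hypothetical gene names and scores.

```python
from typing import Dict, Set, Tuple

Scores = Dict[Tuple[str, str], float]  # (gene in A, gene in B) -> similarity


def best_hits(scores: Scores, query_index: int) -> Dict[str, str]:
    """Map each query gene to its highest-scoring gene in the other genome."""
    best = {}
    for pair, score in scores.items():
        query, target = pair[query_index], pair[1 - query_index]
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def bidirectional_best_hits(scores: Scores) -> Set[Tuple[str, str]]:
    """Return the pairs (a, b) where a's best hit is b and b's best hit is a."""
    a_to_b = best_hits(scores, 0)
    b_to_a = best_hits(scores, 1)
    return {(a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a}


# Toy scores (hypothetical): b1 and b1p stand for in-paralogs that arose by
# duplication after speciation, so both would be orthologous to a1.
scores = {
    ("a1", "b1"): 200.0,
    ("a1", "b1p"): 180.0,
    ("a2", "b2"): 150.0,
    ("a2", "b1"): 40.0,
}
pairs = bidirectional_best_hits(scores)
print(sorted(pairs))
```

Only one of the two in-paralogs can be paired with a1, which illustrates how duplication after speciation causes orthologous relations to be missed by the BBH criterion.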
Conserved gene clusters are groups of genes that are located close to one another in the genomes of several species. They tend to code for proteins that have a functional interaction. The identification of conserved gene clusters is an important step towards understanding genome evolution and predicting gene function.
In this paper, we propose a novel pairwise gene cluster model that combines the notion of bidirectional best hits with the r-window model introduced in 2003 by Durand and Sankoff. The bidirectional best hit (BBH) constraint removes the need to specify the minimum number of shared genes in the r-window model and improves the relevance of the results. We design a subquadratic time algorithm to compute the set of BBH r-window gene clusters efficiently.
We apply our cluster model to the comparative analysis of E. coli K-12 and B. subtilis and perform an extensive comparison between our new model and the gene teams model developed by Bergeron et al. As compared to the gene teams model, our new cluster model has a slightly lower recall but a higher precision at all levels of recall when the results are ranked using statistical tests. An analysis of the most significant BBH r-window gene clusters shows that they correspond to known operons.
Ortholog identification is a crucial first step in comparative genomics. Here, we present a rapid method of ortholog grouping which is effective enough to allow the comparison of many genomes simultaneously. The method takes as input all-against-all similarity data and classifies genes based on the traditional hierarchical clustering algorithm UPGMA. In the course of clustering, the method detects domain fusion or fission events, and splits clusters into domains if required. The subsequent procedure splits the resulting trees such that intra-species paralogous genes are divided into different groups so as to create plausible orthologous groups. As a result, the procedure can split genes into the domains minimally required for ortholog grouping. The procedure, named DomClust, was tested using the COG database as a reference. When comparing several clustering algorithms combined with the conventional bidirectional best-hit (BBH) criterion, we found that our method generally showed better agreement with the COG classification. By comparing the clustering results generated from datasets of different releases, we also found that our method showed relatively good stability in comparison to the BBH-based methods.
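As an illustration of the hierarchical clustering step only, the following is a minimal UPGMA sketch on a small distance table. It does not reproduce DomClust: domain fusion/fission detection and the splitting of trees at intra-species paralogs are omitted, and the gene labels and distances are hypothetical.

```python
from itertools import combinations


def upgma(labels, dist):
    """Cluster `labels` by UPGMA and return the tree as nested tuples.

    `dist` maps frozenset({x, y}) to the distance between x and y.
    """
    clusters = {label: 1 for label in labels}  # cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Merge the closest pair of current clusters.
        x, y = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        size_x, size_y = clusters.pop(x), clusters.pop(y)
        merged = (x, y)
        for other in clusters:
            # The new distance is the size-weighted average of the old ones.
            d[frozenset((merged, other))] = (
                size_x * d[frozenset((x, other))]
                + size_y * d[frozenset((y, other))]
            ) / (size_x + size_y)
        clusters[merged] = size_x + size_y
    return next(iter(clusters))


# Toy distances (hypothetical): g1/g2 and g3/g4 are close pairs.
labels = ["g1", "g2", "g3", "g4"]
dist = {
    frozenset(("g1", "g2")): 0.1,
    frozenset(("g3", "g4")): 0.2,
    frozenset(("g1", "g3")): 0.8,
    frozenset(("g1", "g4")): 0.8,
    frozenset(("g2", "g3")): 0.8,
    frozenset(("g2", "g4")): 0.8,
}
tree = upgma(labels, dist)
print(tree)
```

A tree built this way from all-against-all similarity data is the kind of structure that a subsequent splitting step would then cut into orthologous groups.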
Mining gene patterns that are common to multiple genomes is an important biological problem, which can lead us to novel biological insights. When family classification of genes is available, this problem is similar to the pattern mining problem in the data mining community. However, when family classification information is not available, mining gene patterns is a challenging problem. There are several well developed algorithms for predicting gene patterns in a pair of genomes, such as FISH and DAGchainer. These algorithms use the optimization problem formulation which is solved using the dynamic programming technique. Unfortunately, extending these algorithms to multiple genome cases is not trivial due to the rapid increase in time and space complexity.
In this paper, we propose a novel algorithm for mining gene patterns in more than two prokaryote genomes using interchangeable sets. The basic idea is to extend the pattern mining technique from the data mining community to handle the situation where family classification information is not available using interchangeable sets. In an experiment with four newly sequenced genomes (where the gene annotation is unavailable), we show that the gene pattern can capture important biological information. To examine the effectiveness of gene patterns further, we propose an ortholog prediction method based on our gene pattern mining algorithm and compare our method to the bi-directional best hit (BBH) technique in terms of COG orthologous gene classification information. The experiment shows that our algorithm achieves a 3% increase in recall compared to BBH without sacrificing the precision of ortholog detection.
The discovered gene patterns can be used for detecting orthologs and genes that collaborate for a common biological function.
The unparalleled growth in the availability of genomic data offers both a challenge to develop orthology detection methods that are simultaneously accurate and high throughput and an opportunity to improve orthology detection by leveraging evolutionary evidence in the accumulated sequenced genomes. Here, we report a novel orthology detection method, termed QuartetS, that exploits evolutionary evidence in a computationally efficient manner. Based on the well-established evolutionary concept that gene duplication events can be used to discriminate homologous genes, QuartetS uses an approximate phylogenetic analysis of quartet gene trees to infer the occurrence of duplication events and discriminate paralogous from orthologous genes. We used function- and phylogeny-based metrics to perform a large-scale, systematic comparison of the orthology predictions of QuartetS with those of four other methods [bi-directional best hit (BBH), outgroup, OMA and QuartetS-C (QuartetS followed by clustering)], involving 624 bacterial genomes and >2 million genes. We found that QuartetS slightly, but consistently, outperformed the highly specific OMA method and that, while consuming only 0.5% additional computational time, QuartetS predicted 50% more orthologs with a 50% lower false positive rate than the widely used BBH method. We conclude that, for large-scale phylogenetic and functional analysis, QuartetS and QuartetS-C should be preferred, respectively, in applications where high accuracy and high throughput are required.
The type IV secretion system (T4SS) can be classified as a large family of macromolecule transporter systems, divided into three recognized sub-families, according to the well-known functions. The major sub-family is the conjugation system, which allows transfer of genetic material, such as a nucleoprotein, via cell contact among bacteria. Also, the conjugation system can transfer genetic material from bacteria to eukaryotic cells; such is the case with the T-DNA transfer of Agrobacterium tumefaciens to host plant cells. The system of effector protein transport constitutes the second sub-family, and the third one corresponds to the DNA uptake/release system. Genome analyses have revealed numerous T4SS in Bacteria and Archaea. The purpose of this work was to organize, classify, and integrate the T4SS data into a single database, called AtlasT4SS - the first public database devoted exclusively to this prokaryotic secretion system.
The AtlasT4SS is a manually curated database that describes a large number of proteins related to the type IV secretion system reported so far in Gram-negative and Gram-positive bacteria, as well as in Archaea. The database was created using the RDBMS MySQL and the Catalyst Framework, which is based on the Perl programming language, using the Model-View-Controller (MVC) design pattern for the Web. The current version holds a comprehensive collection of 1,617 T4SS proteins from 58 Bacteria (49 Gram-negative and 9 Gram-positive), one archaeon and 11 plasmids. By applying the bi-directional best hit (BBH) relationship in pairwise genome comparison, it was possible to obtain a core set of 134 clusters of orthologous genes encoding T4SS proteins.
In our database we present one way of classifying orthologous groups of T4SSs in a hierarchical classification scheme with three levels. The first level comprises four classes that are based on the organization of genetic determinants, shared homologies, and evolutionary relationships: (i) F-T4SS, (ii) P-T4SS, (iii) I-T4SS, and (iv) GI-T4SS. The second level designates either a specific well-known protein family or an uncharacterized protein family. Finally, in the third level, each protein of an ortholog cluster is classified according to its involvement in a specific cellular process. The AtlasT4SS database is open access and is available at http://www.t4ss.lncc.br.
The flagellum of Salmonella typhimurium is assembled in stages, and the negative regulatory protein, FlgM, is able to sense the completion of an intermediate stage of assembly, the basal body-hook (BBH) structure. Mutations in steps leading to the formation of the BBH structure do not express the flagellar filament structural genes, fliC and fljB, due to negative regulation by FlgM (K. L. Gillen and K. T. Hughes, J. Bacteriol. 173:6453-6459, 1991). We have discovered another novel regulatory gene, flk, which appears to sense the completion of another assembly stage in the flagellar morphogenic pathway just prior to BBH formation: the completion of the P- and L-rings. Cells that are unable to assemble the L- or P-rings do not express the flagellin structural genes. Mutations by insertional inactivation in either the flk or flgM locus allow expression of the fljB flagellin structural gene in strains defective in flagellar P- and L-ring assembly. Mutations in the flgM gene, but not mutations in the flk gene, allow expression of the fljB gene in strains defective in all of the steps leading to BBH formation. The flk gene was mapped to min 52 of the S. typhimurium linkage map between the pdxB and fabB loci. A null allele of flk was complemented in trans by a flk+ allele present in a multicopy pBR-based plasmid. DNA sequence analysis of the flk gene has revealed it to be identical to a gene of Escherichia coli of unknown function which has an overlapping, divergent promoter with the pdxB gene promoter (P. A. Schoenlein, B. B. Roa, and M. E. Winkler, J. Bacteriol. 174:6256-6263, 1992). An open reading frame of 333 amino acids corresponding to the flk gene product of S. typhimurium and 331 amino acids from the E. coli sequence was identified. The transcriptional start site of the S. typhimurium flk gene was determined and transcription of the flk gene was independent of the FlhDC and sigma28 flagellar transcription factors. 
The Flk protein observed in a T7 RNA polymerase-mediated expression system showed an apparent molecular mass of 35 kDa, slightly smaller than the predicted size of 37 kDa. The predicted structure of Flk is a mostly hydrophilic protein with a very C-terminal membrane-spanning segment preceded by positively charged amino acids. This finding predicts Flk to be inserted into the cytoplasmic membrane facing inside the cytoplasm.
The identification of orthologous genes shared by multiple genomes plays an important role in evolutionary studies and gene functional analyses. Based on a recently developed accurate tool, called MSOAR 2.0, for ortholog assignment between a pair of closely related genomes based on genome rearrangement, we present a new system MultiMSOAR 2.0, to identify ortholog groups among multiple genomes in this paper. In the system, we construct gene families for all the genomes using sequence similarity search and clustering, run MSOAR 2.0 for all pairs of genomes to obtain the pairwise orthology relationship, and partition each gene family into a set of disjoint sets of orthologous genes (called super ortholog groups or SOGs) such that each SOG contains at most one gene from each genome. For each such SOG, we label the leaves of the species tree using 1 or 0 to indicate if the SOG contains a gene from the corresponding species or not. The resulting tree is called a tree of ortholog groups (or TOGs). We then label the internal nodes of each TOG based on the parsimony principle and some biological constraints. Ortholog groups are finally identified from each fully labeled TOG. In comparison with a popular tool MultiParanoid on simulated data, MultiMSOAR 2.0 shows significantly higher prediction accuracy. It also outperforms MultiParanoid, the Roundup multi-ortholog repository and the Ensembl ortholog database in real data experiments using gene symbols as a validation tool. In addition to ortholog group identification, MultiMSOAR 2.0 also provides information about gene births, duplications and losses in evolution, which may be of independent biological interest. Our experiments on simulated data demonstrate that MultiMSOAR 2.0 is able to infer these evolutionary events much more accurately than a well-known software tool Notung. The software MultiMSOAR 2.0 is available to the public for free.
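The labeling of internal nodes by parsimony can be illustrated with the bottom-up pass of Fitch's algorithm on a presence/absence (1/0) leaf labeling. This is a generic sketch, not the MultiMSOAR 2.0 procedure: the biological constraints mentioned above are not modeled, and the species tree and labels are hypothetical.

```python
def fitch(tree, presence, sets):
    """Bottom-up Fitch pass on a binary tree given as nested tuples.

    Leaves are species names; `presence` maps each species to 1 or 0.
    Candidate states for every node are stored in `sets`, and the minimum
    number of state changes (gene gains or losses) is returned.
    """
    if isinstance(tree, str):
        sets[tree] = frozenset({presence[tree]})
        return 0
    left, right = tree
    changes = fitch(left, presence, sets) + fitch(right, presence, sets)
    common = sets[left] & sets[right]
    if common:
        sets[tree] = common
    else:
        # The children disagree: keep both states and count one change.
        sets[tree] = sets[left] | sets[right]
        changes += 1
    return changes


# Toy tree and labels (hypothetical): the group lacks a gene from chicken.
species_tree = (("human", "mouse"), ("chicken", "zebrafish"))
presence = {"human": 1, "mouse": 1, "chicken": 0, "zebrafish": 1}
sets = {}
changes = fitch(species_tree, presence, sets)
print(changes, sorted(sets[species_tree]))
```

Here the most parsimonious labeling places the gene at the root with a single change, which corresponds to a loss on the chicken lineage. Events of this kind are what a fully labeled tree makes available in addition to the ortholog groups themselves.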
Accurate genome-wide identification of orthologs is a central problem in
comparative genomics, a fact reflected by the numerous orthology identification
projects developed in recent years. However, only a few reports have compared
their accuracy, and indeed, several recent efforts have not yet been
systematically evaluated. Furthermore, orthology is typically only assessed in
terms of function conservation, despite the phylogeny-based original definition
of Fitch. We collected and mapped the results of nine leading orthology projects
and methods (COG, KOG, Inparanoid, OrthoMCL, Ensembl Compara, Homologene,
RoundUp, EggNOG, and OMA) and two standard methods (bidirectional best-hit and
reciprocal smallest distance). We systematically compared their predictions with
respect to both phylogeny and function, using six different tests. This required
the mapping of millions of sequences, the handling of hundreds of millions of
predicted pairs of orthologs, and the computation of tens of thousands of trees.
In phylogenetic analysis or in functional analysis where high specificity is
required, we find that OMA and Homologene perform best. At lower functional
specificity but higher coverage level, OrthoMCL outperforms Ensembl Compara, and
to a lesser extent Inparanoid. Lastly, the large coverage of the recent EggNOG
can be of interest to build broad functional grouping, but the method is not
specific enough for phylogenetic or detailed function analyses. In terms of
general methodology, we observe that the more sophisticated tree
reconstruction/reconciliation approach of Ensembl Compara was at times
outperformed by pairwise comparison approaches, even in phylogenetic tests.
Furthermore, we show that standard bidirectional best-hit often outperforms
projects with more complex algorithms. First, the present study provides
guidance for the broad community of orthology data users as to which database
best suits their needs. Second, it introduces new methodology to verify
orthology. And third, it sets performance standards for current and future
approaches.
The identification of orthologs, pairs of homologous genes in different species
that started diverging through speciation events, is a central problem in
genomics with applications in many research areas, including comparative
genomics, phylogenetics, protein function annotation, and genome rearrangement.
An increasing number of projects aim at inferring orthologs from complete
genomes, but little is known about their relative accuracy or coverage. Because
the exact evolutionary history of entire genomes remains largely unknown,
predictions can only be validated indirectly, that is, in the context of the
different applications of orthology. The few comparison studies published so far
have assessed orthology exclusively from the expectation that orthologs have
conserved protein function. In the present work, we introduce methodology to
verify orthology in terms of phylogeny and perform a comprehensive comparison of
nine leading ortholog inference projects and two methods using both phylogenetic
and functional tests. The results show large variations among the different
projects in terms of performances, which indicates that the choice of orthology
database can have a strong impact on any downstream analysis.
Functional proteomic profiling can help identify targets for disease diagnosis and therapy. Available methods are limited by the inability to profile many functional properties measured by enzyme kinetics. The functional proteomic profiling approach proposed here seeks to overcome such limitations. It begins with surface-based proteome separations of tissue/cell-line extracts, using SeraFILE, a proprietary protein separations platform. Enzyme kinetic properties of the resulting subproteomes are then characterized, and the data are integrated into proteomic profiles. As a model, SeraFILE-derived subproteomes of cyclic nucleotide-hydrolyzing phosphodiesterases (PDEs) from bovine brain homogenate (BBH) and rat brain homogenate (RBH) were characterized for cAMP hydrolysis activity in the presence (challenge condition) and absence of cGMP. Functional profiles of RBH and BBH were compiled from the enzyme activity response to the challenge condition in each of the respective subproteomes. Intersample analysis showed that comparable profiles differed in only a few data points, and that distinctive subproteomes can be generated from comparable tissue samples from different animals. These results demonstrate that the proposed methods provide a means to simplify intersample differences and to localize proteins attributable to sample-specific responses. The approach can potentially be applied to the comparison of disease and nondisease samples in biomarker discovery and drug discovery profiling.
Carnitine is essential for mitochondrial β-oxidation of long-chain fatty acids. Deficiency of carnitine leads to severe gut atrophy, ulceration and inflammation in animal models of carnitine deficiency. Genetic studies in large populations have linked mutations in the carnitine transporters OCTN1 and OCTN2 with Crohn’s disease (CD), while other studies at the same time have failed to show a similar association and report normal serum carnitine levels in CD patients.
In this report, we have studied the expression of carnitine-synthesizing enzymes in intestinal epithelial cells to determine the capability of these cells to synthesize carnitine de novo. We studied expression of five enzymes involved in carnitine biosynthesis, namely 6-N-trimethyllysine dioxygenase (TMLD), 4-trimethylaminobutyraldehyde dehydrogenase (TMABADH), serine hydroxymethyltransferase 1 & 2 (SHMT1 & 2) and γ-butyrobetaine hydroxylase (BBH) by real-time PCR in mice (C3H strain). We also measured activity of γ-BBH in the intestine using an ex vivo assay and localized its expression by in situ hybridization.
Our investigations show that mouse intestinal epithelium expresses all five enzymes required for de novo carnitine biosynthesis; the expression is localized mainly in villous surface epithelial cells throughout the intestine. The final rate-limiting enzyme γ-BBH is highly active in the small intestine; its activity was 9.7 ± 3.5 pmol/mg/min, compared to 22.7 ± 7.3 pmol/mg/min in the liver.
We conclude that mouse gut epithelium is able to synthesize carnitine de novo. This capacity to synthesize carnitine in the intestine may play an important role in gut health and can help explain the lack of clinical signs of carnitine deficiency in subjects with mutations in OCTN transporters.
Carnitine; Crohn’s disease; γ-butyrobetaine hydroxylase; gut inflammation; ulcerative colitis
Metal chelators have gained much attention as potential anti-cancer agents. However, the effects of chelators are often linked solely to their capacity to bind iron, while the potential complexation of other trace metals has not been fully investigated. In the present study, we evaluated the effects of various lipophilic aroylhydrazone chelators (AHC), including the novel compound HNTMB, on various ovarian cancer cell lines (SKOV-3, OVCAR-3, NUTU-19).
Cell viability was analyzed via MTS cytotoxicity assays and NCI60 cancer cell growth screens. Apoptotic events were monitored via Western blot analysis, fluorescence microscopy and TUNEL assay. FACS analysis was carried out to study cell cycle regulation and to detect intracellular reactive oxygen species (ROS).
HNTMB displayed high cytotoxicity (IC50 200-400 nM) compared to previously developed AHC (oVtBBH, HNtBBH, StBBH/206, HNTh2H/315, HNI/311; IC50 0.8-6 μM) or the cancer drug deferoxamine, a hexadentate iron chelator (IC50 12-25 μM). In an NCI60 cancer cell line screen, HNTMB exhibited growth inhibitory effects with remarkable differences in specificity depending on the cell line studied (GI50 10 nM-2.4 μM). In SKOV-3 ovarian cancer cells, HNTMB treatment led to chromatin fragmentation and activation of the extrinsic and intrinsic pathways of apoptosis with specific down-regulation of Bcl-2. HNTMB caused delayed cell cycle progression of SKOV-3 through G2/M phase arrest. HNTMB can chelate iron and copper of different oxidation states. Complexation with copper led to high cytotoxicity via generation of reactive oxygen species (ROS), while treatment with iron complexes of the drug caused neither cytotoxicity nor increased ROS levels.
The present report suggests that both non-complexed HNTMB, as a chelator of intracellular trace metals, and a cytotoxic HNTMB/copper complex may be developed as potential therapeutic drugs in the treatment of ovarian and other solid tumors.
Ortholog assignment is a critical and fundamental problem in comparative genomics, since orthologs are considered to be functional counterparts in different species and can be used to infer molecular functions of one species from those of other species. MSOAR is a recently developed high-throughput system for assigning one-to-one orthologs between closely related species on a genome scale. It attempts to reconstruct the evolutionary history of input genomes in terms of genome rearrangement and gene duplication events. It assumes that a gene duplication event inserts a duplicated gene into the genome of interest at a random location (i.e., the random duplication model). However, in practice, biologists believe that genes are often duplicated by tandem duplications, where a duplicated gene is located next to the original copy (i.e., the tandem duplication model).
In this paper, we develop MSOAR 2.0, an improved system for one-to-one ortholog assignment. For a pair of input genomes, the system first focuses on the tandemly duplicated genes of each genome and tries to identify among them those that were duplicated after the speciation (i.e., the so-called inparalogs), using a simple phylogenetic tree reconciliation method. For each such set of tandemly duplicated inparalogs, all but one gene will be deleted from the concerned genome (because they cannot possibly appear in any one-to-one ortholog pairs), and MSOAR is invoked. Using both simulated and real data experiments, we show that MSOAR 2.0 is able to achieve a better sensitivity and specificity than MSOAR. In comparison with the well-known genome-scale ortholog assignment tool InParanoid, Ensembl ortholog database, and the orthology information extracted from the well-known whole-genome multiple alignment program MultiZ, MSOAR 2.0 shows the highest sensitivity. Although the specificity of MSOAR 2.0 is slightly worse than that of InParanoid in the real data experiments, it is actually better than that of InParanoid in the simulation tests.
Our preliminary experimental results demonstrate that MSOAR 2.0 is a highly accurate tool for one-to-one ortholog assignment between closely related genomes. The software is available to the public for free and included as online supplementary material.
Species belonging to the Rhizobiales are intriguing and extensively researched because they include both bacteria able to fix nitrogen when in symbiosis with leguminous plants and bacteria pathogenic to animals and plants. Similarities between the strategies adopted by pathogenic and symbiotic Rhizobiales have been described, as well as high variability related to events of horizontal gene transfer. Although it is well known that chromosomal rearrangements, mutations and horizontal gene transfer influence the dynamics of bacterial genomes, in Rhizobiales the scenarios that determine a pathogenic or symbiotic lifestyle are not clear, and very few comparative genomics studies of these classes of prokaryotic microorganisms have tried to delineate the evolutionary characterization of symbiosis and pathogenesis.
Non-symbiotic nitrogen-fixing bacteria and bacteria involved in bioremediation that are closely related to the symbionts and pathogens under study may help to clarify the origin and ancestry of genes and the gene flow occurring in Rhizobiales. The genomic comparisons of 19 species of Rhizobiales, including nitrogen-fixing bacteria, bioremediators and pathogens, yielded 33 clusters common to biological nitrogen fixation and pathogenesis, 15 clusters exclusive to all nitrogen-fixing bacteria and bacteria involved in bioremediation, 13 clusters found in only some nitrogen-fixing and bioremediation bacteria, 1 cluster exclusive to some symbionts, and 1 cluster found only in some of the pathogens analyzed. In the BBH analysis performed on all strains studied, 77 common genes were obtained, 17 of which were related to biological nitrogen fixation and pathogenesis. Phylogenetic reconstructions for Fix, Nif, Nod, Vir, and Trb showed possible horizontal gene transfer events, grouping species of different phenotypes.
The presence of symbiotic and virulence genes in both pathogens and symbionts does not seem to be the only determinant factor for lifestyle evolution in these microorganisms, although they may act in common stages of host infection. The phylogenetic analysis for many distinct operons involved in these processes emphasizes the relevance of horizontal gene transfer events in the symbiotic and pathogenic similarity.
The ortholog conjecture posits that orthologous genes are functionally more similar than paralogous genes. This conjecture is a cornerstone of phylogenomics and is used daily by both computational and experimental biologists in predicting, interpreting, and understanding gene functions. A recent study, however, challenged the ortholog conjecture on the basis of experimentally derived Gene Ontology (GO) annotations and microarray gene expression data in human and mouse. It instead proposed that the functional similarity of homologous genes is primarily determined by the cellular context in which the genes act, explaining why a greater functional similarity of (within-species) paralogs than (between-species) orthologs was observed. Here we show that GO-based functional similarity between human and mouse orthologs, relative to that between paralogs, has been increasing in the last five years. Further, compared with paralogs, orthologs are less likely to be included in the same study, causing an underestimation in their functional similarity. A close examination of functional studies of homologs with identical protein sequences reveals experimental biases, annotation errors, and homology-based functional inferences that are labeled in GO as experimental. These problems and the temporary nature of the GO-based finding make the current GO inappropriate for testing the ortholog conjecture. RNA sequencing (RNA-Seq) is known to be superior to microarray for comparing the expressions of different genes or in different species. Our analysis of a large RNA-Seq dataset of multiple tissues from eight mammals and the chicken shows that the expression similarity between orthologs is significantly higher than that between within-species paralogs, supporting the ortholog conjecture and refuting the cellular context hypothesis for gene expression. 
We conclude that the ortholog conjecture remains largely valid to the extent that it has been tested, but further scrutiny using more and better functional data is needed.
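The expression comparison described above can be illustrated with a minimal sketch, assuming expression similarity is measured as the Pearson correlation of per-tissue expression profiles. The abstract does not state which similarity measure the study used, so the measure, the gene names, and the toy values below are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / sqrt(vx * vy)


def mean_similarity(pairs, expression):
    """Average Pearson correlation over a list of (gene_a, gene_b) pairs."""
    values = [pearson(expression[a], expression[b]) for a, b in pairs]
    return sum(values) / len(values)


# Hypothetical toy data: expression of three genes across four tissues.
expression = {
    "human_A": [10.0, 2.0, 5.0, 1.0],
    "mouse_A": [9.0, 3.0, 6.0, 1.5],    # assumed ortholog of human_A
    "human_A2": [1.0, 8.0, 2.0, 7.0],   # assumed within-species paralog
}
ortholog_pairs = [("human_A", "mouse_A")]
paralog_pairs = [("human_A", "human_A2")]

print("orthologs:", mean_similarity(ortholog_pairs, expression))
print("paralogs: ", mean_similarity(paralog_pairs, expression))
```

In a real analysis the two pair lists would hold many gene pairs, and the difference between the two means would be assessed with a statistical test rather than read off directly.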
Today's exceedingly high speed of genome sequencing, compared with the generally slow pace of functional assay, means that the functions of most genes identified from genome sequences will be annotated only through computational prediction. The primary source of information for this prediction is the functions of orthologous genes in model organisms, because orthologs are widely believed to be functionally similar, especially when compared with paralogs. This belief, known as the ortholog conjecture, was recently challenged on the basis of experimentally derived Gene Ontology (GO) annotations and microarray gene expression data, because these data revealed greater functional and expressional similarities of paralogs than orthologs. Here we show that GO-based estimates of functional similarities are temporary and unreliable, due to experimental biases, annotation errors, and homology-based functional inferences that are incorrectly labeled as experimental in GO. RNA sequencing (RNA-Seq) is superior to microarray for comparing the expressions of different genes or in different species, and our analysis of a large RNA-Seq dataset provides strong support to the ortholog conjecture for gene expression. We conclude that the ortholog conjecture remains largely valid to the extent that it has been tested, but further scrutiny using more and better functional data is needed.
The fluorescence enhancement of berberine hydrochloride (BBH) upon complexation with β-cyclodextrin (β-CD) was investigated. The inclusion mechanism was studied by spectrofluorimetry and infrared spectra. The results showed that a 1:1 (β-CD:BBH) complex was formed with an apparent association constant of 4.23×10² L/mol. Based on the enhancement of the fluorescence intensity of berberine hydrochloride, a new spectrofluorimetric method for the determination of BBH in the presence of β-CD was developed. The linear range was 1.00 to 4.00 µg/mL, with a detection limit of 5.54 ng/mL. The proposed method was successfully applied to the determination of BBH in tablets.
Accurate inference of orthologous genes is a pre-requisite for most comparative genomics studies, and is also important for functional annotation of new genomes. Identification of orthologous gene sets typically involves phylogenetic tree analysis, heuristic algorithms based on sequence conservation, synteny analysis, or some combination of these approaches. The most direct tree-based methods typically rely on the comparison of an individual gene tree with a species tree. Once the two trees are accurately constructed, orthologs are straightforwardly identified by the definition of orthology as those homologs that are related by speciation, rather than gene duplication, at their most recent point of origin. Although ideal for the purpose of orthology identification in principle, phylogenetic trees are computationally expensive to construct for large numbers of genes and genomes, and they often contain errors, especially at large evolutionary distances. Moreover, in many organisms, in particular prokaryotes and viruses, evolution does not appear to have followed a simple ‘tree-like’ mode, which makes conventional tree reconciliation inapplicable. Other, heuristic methods identify probable orthologs as the closest homologous pairs or groups of genes in a set of organisms. These approaches are faster and easier to automate than tree-based methods, with efficient implementations provided by graph-theoretical algorithms enabling comparisons of thousands of genomes. Comparisons of these two approaches show that, despite conceptual differences, they produce similar sets of orthologs, especially at short evolutionary distances. Synteny also can aid in identification of orthologs. Often, tree-based, sequence similarity- and synteny-based approaches can be combined into flexible hybrid methods.
homolog; ortholog; paralog; xenolog; orthologous groups; tree reconciliation; comparative genomics
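The tree-based definition of orthology given above (homologs related by speciation rather than duplication at their most recent common ancestor) can be sketched in a few lines. This is not full gene-tree/species-tree reconciliation: the sketch swaps in the simpler species-overlap heuristic, which calls a node a duplication when the species sets of its two children overlap. The nested-tuple tree encoding and the `species|gene` label convention are assumptions made for illustration.

```python
def leaves(tree):
    """Return the leaf labels under a nested-tuple binary tree."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def species_of(leaf):
    # Assumed label convention: "species|gene"
    return leaf.split("|")[0]


def relation(tree, gene_a, gene_b):
    """Classify two distinct genes by the event at their last common ancestor.

    Species-overlap heuristic: if the species sets of the two child
    subtrees overlap, the node is treated as a duplication ("paralog");
    otherwise it is treated as a speciation ("ortholog").
    """
    if isinstance(tree, str):
        raise ValueError("genes must be two distinct leaves of the tree")
    left, right = tree
    left_leaves, right_leaves = leaves(left), leaves(right)
    # Descend while both genes sit in the same child subtree.
    for child, members in ((left, left_leaves), (right, right_leaves)):
        if gene_a in members and gene_b in members:
            return relation(child, gene_a, gene_b)
    if not {gene_a, gene_b} <= set(left_leaves) | set(right_leaves):
        raise ValueError("gene not found in tree")
    overlap = ({species_of(x) for x in left_leaves}
               & {species_of(x) for x in right_leaves})
    return "paralog" if overlap else "ortholog"


# Hypothetical gene tree: one ancient duplication, then a speciation in each copy.
gene_tree = (("eco|a1", "bsu|a1"), ("eco|a2", "bsu|a2"))

print(relation(gene_tree, "eco|a1", "bsu|a1"))  # ortholog
print(relation(gene_tree, "eco|a1", "bsu|a2"))  # paralog
```

The heuristic needs no species tree, which keeps the sketch short, but it inherits every error in the gene tree and does not model horizontal transfer.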
Moraxella catarrhalis is a mucosal pathogen that causes childhood otitis media and exacerbations of chronic obstructive pulmonary disease in adults. During the course of infection, M. catarrhalis needs to adhere to epithelial cells of different host niches such as the nasopharynx and lungs, and consequently, efficient adhesion to epithelial cells is considered an important virulence trait of M. catarrhalis. By using Tn-seq, a genome-wide negative selection screening technology, we identified 15 genes potentially required for adherence of M. catarrhalis BBH18 to pharyngeal epithelial Detroit 562 and lung epithelial A549 cells. Validation with directed deletion mutants confirmed the importance of aroA (3-phosphoshikimate 1-carboxyvinyl-transferase), ecnAB (entericidin EcnAB), lgt1 (glucosyltransferase), and MCR_1483 (outer membrane lipoprotein) for cellular adherence, with ΔMCR_1483 being most severely attenuated in adherence to both cell lines. Expression profiling of M. catarrhalis BBH18 during adherence to Detroit 562 cells showed increased expression of 34 genes in cell-attached versus planktonic bacteria, including ABC transporters for molybdate and sulfate, whereas reduced expression was observed for 16 genes. Notably, neither the newly identified genes affecting adhesion nor known adhesion genes were differentially expressed during adhesion; instead, they appeared to be constitutively expressed at a high level. Profiling of the transcriptional response of Detroit 562 cells upon adherence of M. catarrhalis BBH18 showed induction of a panel of pro-inflammatory genes as well as genes involved in preventing damage to the epithelial barrier. In conclusion, this study provides new insight into the molecular interplay between M. catarrhalis and host epithelial cells during the process of adherence.
Bacterial cell-cell communication is mediated by small signaling molecules known as autoinducers. Importantly, autoinducer-2 (AI-2) is synthesized via the enzyme LuxS in over 80 species, some of which mediate their pathogenicity by recognizing and transducing this signal in a cell density dependent manner. AI-2-mediated phenotypes are not well understood, however, as the means of signal transduction appear to vary among species, while AI-2 synthesis processes appear conserved. Approaches to reveal the recognition pathways of AI-2 will shed light on pathogenicity, as we believe recognition of the signal is likely as important as, if not more important than, its synthesis. LMNAST (Local Modular Network Alignment Similarity Tool) uses a local similarity search heuristic to study gene order, generating homology hits for the genomic arrangement of a query gene sequence. We develop and apply this tool for the E. coli lac and LuxS regulated (Lsr) systems. Lsr is of great interest as it mediates AI-2 uptake and processing. Both test searches generated results that were subsequently analyzed through a number of different lenses, each with its own level of granularity, from a binary phylogenetic representation down to trackback plots that preserve genomic organizational information. Through a survey of these results, we demonstrate the identification of orthologs, paralogs, hitchhiking genes, gene loss, gene rearrangement within an operon context, and also horizontal gene transfer (HGT). We found a variety of operon structures that are consistent with our hypothesis that the signal can be perceived and transduced by homologous protein complexes, while their regulation may be key to defining subsequent phenotypic behavior.
Bacteria communicate with each other through a network of small molecules that are secreted and perceived by nearest neighbors. In a process known as quorum sensing, bacteria communicate their cell density and certain behaviors emerge wherein the population of cells acts as a coordinated community. One small signaling molecule, AI-2, is synthesized by many bacteria, so that in a natural ecosystem composed of many secreting cells of different species, the molecule may be present in an appreciable concentration. The perception of the signal may be key to unlocking its importance, as some cells may, for example, recognize it at lower concentrations than others. We have created a search algorithm that finds similar gene sets among various bacteria. Here, we looked for signal transduction pathways similar to the one studied in E. coli. We found exact replicas of the E. coli pathway, but also found pathways with missing genes, added genes of unknown function, as well as different patterns by which the genes may be regulated. We suspect these attributes may play a significant role in determining quorum sensing behaviors. This, in turn, may lead to new discoveries for controlling groups of bacteria and possibly reducing the prevalence of infectious disease.
Orthologs (genes that have diverged after a speciation event) tend to have similar function, and so their prediction has become an important component of comparative genomics and genome annotation. The gold standard phylogenetic analysis approach of comparing available organismal phylogeny to gene phylogeny is not easily automated for genome-wide analysis; therefore, ortholog prediction for large genome-scale datasets is typically performed using a reciprocal-best-BLAST-hits (RBH) approach. One problem with RBH is that it will incorrectly predict a paralog as an ortholog when incomplete genome sequences or gene loss is involved. In addition, there is an increasing interest in identifying orthologs most likely to have retained similar function.
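The reciprocal-best-hits idea and its failure under gene loss can be sketched as follows. This is a minimal illustration, not the procedure of any particular pipeline: the score tables stand in for parsed BLAST bit scores, and all gene names and values are hypothetical.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject.

    `scores` maps (query, subject) -> similarity score (e.g. a BLAST bit score).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical scores: a1/b1 and a2/b2 are true orthologs; a1/a2 are paralogs.
a_vs_b = {("a1", "b1"): 300, ("a1", "b2"): 120,
          ("a2", "b2"): 280, ("a2", "b1"): 110}
b_vs_a = {("b1", "a1"): 295, ("b1", "a2"): 105,
          ("b2", "a2"): 285, ("b2", "a1"): 115}

print(reciprocal_best_hits(a_vs_b, b_vs_a))
# [('a1', 'b1'), ('a2', 'b2')]


def without(scores, lost):
    """Drop every hit that involves a lost (or unsequenced) gene."""
    return {pair: s for pair, s in scores.items() if not set(pair) & lost}


# Simulate loss of a2 in genome A and b1 in genome B.
lost = {"a2", "b1"}
print(reciprocal_best_hits(without(a_vs_b, lost), without(b_vs_a, lost)))
# [('a1', 'b2')]  <- a paralogous pair reported as an ortholog
```

With both true orthologs missing, the remaining paralogs become each other's best hit, which is exactly the kind of false positive the text describes.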
To address these issues, we present here a high-throughput computational method named Ortholuge that further evaluates previously predicted orthologs (including those predicted using an RBH-based approach) – identifying which orthologs most closely reflect species divergence and may more likely have similar function. Ortholuge analyzes phylogenetic distance ratios involving two comparison species and an outgroup species, noting cases where relative gene divergence is atypical. It also identifies some cases of gene duplication after species divergence. Through simulations of incomplete genome data/gene loss, we show that the vast majority of genes falsely predicted as orthologs by an RBH-based method can be identified. Ortholuge was then used to estimate the number of false-positives (predominantly paralogs) in selected RBH-predicted ortholog datasets, identifying approximately 10% paralogs in a eukaryotic data set (mouse-rat comparison) and 5% in a bacterial data set (Pseudomonas putida – Pseudomonas syringae species comparison). Higher quality (more precise) datasets of orthologs, which we term "ssd-orthologs" (supporting-species-divergence-orthologs), were also constructed. These datasets, as well as Ortholuge software that may be used to characterize other species' datasets, are available at (software under GNU General Public License).
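The distance-ratio idea can be illustrated with a rough sketch. The text does not give the exact ratios or cutoffs that Ortholuge uses, so everything here is an assumption: the sketch divides the distance between the two ingroup genes by each ingroup-to-outgroup distance and flags a predicted ortholog when either ratio exceeds a user-chosen cutoff. The function names, the cutoff, and the toy distances are hypothetical.

```python
def distance_ratios(d_ingroup, d_out1, d_out2):
    """Return (ingroup / species-1-to-outgroup, ingroup / species-2-to-outgroup).

    d_ingroup: distance between the genes from the two comparison species
    d_out1, d_out2: distance from each of those genes to the outgroup gene
    """
    if d_out1 <= 0 or d_out2 <= 0:
        raise ValueError("outgroup distances must be positive")
    return d_ingroup / d_out1, d_ingroup / d_out2


def flag_atypical(triples, cutoff):
    """Return the sorted names of predicted orthologs with an atypical ratio.

    `triples` maps a name to (d_ingroup, d_out1, d_out2). The cutoff is an
    illustrative parameter, not a value taken from the Ortholuge paper.
    """
    flagged = []
    for name, (d_in, d_out1, d_out2) in triples.items():
        r1, r2 = distance_ratios(d_in, d_out1, d_out2)
        if max(r1, r2) > cutoff:
            flagged.append(name)
    return sorted(flagged)


# Hypothetical distances for three RBH-predicted ortholog pairs.
triples = {
    "geneX": (0.10, 0.40, 0.42),  # ingroup pair much closer than the outgroup
    "geneY": (0.12, 0.45, 0.44),
    "geneZ": (0.50, 0.48, 0.47),  # ingroup pair as diverged as the outgroup
}

print(flag_atypical(triples, cutoff=0.8))  # ['geneZ']
```

For true orthologs the ingroup distance should be small relative to the outgroup distances. A ratio near or above 1 suggests that the two ingroup genes diverged before the species did, as expected for paralogs.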
The Ortholuge method reported here appears to significantly improve the specificity (precision) of high-throughput ortholog prediction for both bacterial and eukaryotic species. This method, and its associated software, will aid those performing various comparative genomics-based analyses, such as the prediction of conserved regulatory elements upstream of orthologous genes.
The reconstruction and synthesis of ancestral RNAs is a feasible goal for paleogenetics. This will require new bioinformatics methods, including a robust statistical framework for reconstructing histories of substitutions, indels and structural changes. We describe a “transducer composition” algorithm for extending pairwise probabilistic models of RNA structural evolution to models of multiple sequences related by a phylogenetic tree. This algorithm draws on formal models of computational linguistics as well as the 1985 protosequence algorithm of David Sankoff. The output of the composition algorithm is a multiple-sequence stochastic context-free grammar. We describe dynamic programming algorithms, which are robust to null cycles and empty bifurcations, for parsing this grammar. Example applications include structural alignment of non-coding RNAs, propagation of structural information from an experimentally-characterized sequence to its homologs, and inference of the ancestral structure of a set of diverged RNAs. We implemented the above algorithms for a simple model of pairwise RNA structural evolution; in particular, the algorithms for maximum likelihood (ML) alignment of three known RNA structures and a known phylogeny and inference of the common ancestral structure. We compared this ML algorithm to a variety of related, but simpler, techniques, including ML alignment algorithms for simpler models that omitted various aspects of the full model and also a posterior-decoding alignment algorithm for one of the simpler models. In our tests, incorporation of basepair structure was the most important factor for accurate alignment inference; appropriate use of posterior-decoding was next; and fine details of the model were least important. Posterior-decoding heuristics can be substantially faster than exact phylogenetic inference, so this motivates the use of sum-over-pairs heuristics where possible (and approximate sum-over-pairs). 
For more exact probabilistic inference, we discuss the use of transducer composition for ML (or MCMC) inference on phylogenies, including possible ways to make the core operations tractable.
A number of leading methods for bioinformatics analysis of structural RNAs use probabilistic grammars as models for pairs of homologous RNAs. We show that any such pairwise grammar can be extended to an entire phylogeny by treating the pairwise grammar as a machine (a “transducer”) that models a single ancestor-descendant relationship in the tree, transforming one RNA structure into another. In addition to phylogenetic enhancement of current applications, such as RNA genefinding, homology detection, alignment and secondary structure prediction, this should enable probabilistic phylogenetic reconstruction of RNA sequences that are ancestral to present-day genes. We describe statistical inference algorithms, software implementations, and a simulation-based comparison of three-taxon maximum likelihood alignment to several other methods for aligning three sibling RNAs. In the Discussion we consider how the three-taxon RNA alignment-reconstruction-folding algorithm, which is currently very computationally-expensive, might be made more efficient so that larger phylogenies could be considered.
The conditionally essential nutrient, L-carnitine, plays a critical role in a number of physiological processes vital to normal neonatal growth and development. We conducted a systematic evaluation of the developmental changes in key L-carnitine homeostasis mechanisms in the postnatal rat to better understand the interrelationship between these pathways and their correlation to ontogenic changes in L-carnitine levels during postnatal development.
mRNA expression levels of heart, kidney and intestinal L-carnitine transporters, liver γ-butyrobetaine hydroxylase (Bbh) and trimethyllysine hydroxylase (Tmlh), and heart carnitine palmitoyltransferase (Cpt) were measured using quantitative RT-PCR. L-Carnitine levels were determined by HPLC-UV. Cpt and Bbh activities were measured by a spectrophotometric method and HPLC, respectively.
Serum and heart L-carnitine levels increased with postnatal development. Increases in serum L-carnitine correlated significantly with postnatal increases in renal organic cation/carnitine transporter 2 (Octn2) expression, and were further matched by postnatal increases in intestinal Octn1 expression and hepatic γ-Bbh activity. Postnatal increases in heart L-carnitine levels were significantly correlated with postnatal increases in heart Octn2 expression. Although cardiac high energy phosphate substrate levels remained constant through postnatal development, creatine showed developmental increases with advancing neonatal age. mRNA levels of Cpt1b and Cpt2 increased significantly at postnatal day 20, but this was not accompanied by a similar increase in activity.
Several L-carnitine homeostasis pathways underwent significant ontogenesis during postnatal development in the rat. This information will facilitate future studies on factors affecting the developmental maturation of L-carnitine homeostasis mechanisms and how such factors might affect growth and development.
L-Carnitine; Homeostasis; Postnatal development; Rat
The reconstruction of ancestral genome architectures and gene orders from homologies between extant species is a long-standing problem, considered by both cytogeneticists and bioinformaticians. The two approaches were recently compared and discussed in a series of papers, sometimes with diverging points of view regarding their performance. We describe a general methodological framework for reconstructing ancestral genome segments from conserved syntenies in extant genomes. We show that this problem, from a computational point of view, is naturally related to physical mapping of chromosomes and benefits from using combinatorial tools developed in this scope. We develop this framework into a new reconstruction method considering conserved gene clusters with similar gene content, mimicking principles used in most cytogenetic studies, although on a different kind of data. We implement and apply it to datasets of mammalian genomes. We perform intensive theoretical and experimental comparisons with other bioinformatics methods for ancestral genome segments reconstruction. We show that the method that we propose is stable and reliable: it gives convergent results using several kinds of data at different levels of resolution, and all predicted ancestral regions are well supported. The results ultimately come very close to those of cytogenetic studies. This suggests that the comparison of methods for ancestral genome reconstruction should include the algorithmic aspects of the methods as well as the disciplinary differences in data acquisition.
No DNA molecule is preserved after a few hundred thousand years, so inferring the DNA sequence organization of ancient living organisms beyond several million years can only be achieved by computational estimations, using the similarities and differences between chromosomes of extant species. This is the scope of “paleogenomics”, and it can help to better understand how genomes have evolved up to the present day. We propose here a computational framework to estimate contiguous segments of ancestral chromosomes, based on techniques of physical mapping that are used to infer chromosome maps of extant species when their genome is not sequenced. This framework is not guided by possible evolutionary events such as rearrangements but only proposes ancestral genome architectures. We developed a method following this framework and applied it to mammalian genomes. We inferred ancestral chromosomal regions that are stable and well supported at different levels of resolution. These ancestral chromosomal regions agree with previous cytogenetic studies and were very probably part of the genome of the common ancestor of humans, macaques, mice, dogs, and cows, living 120 million years ago. We illustrate, through comparison with other bioinformatics methods, the importance of a formal methodological background when comparing ancestral genome architecture proposals obtained from different methods.