MODOMICS, a database devoted to the systems biology of RNA modification, has been substantially improved. It provides comprehensive information on the chemical structure of modified nucleosides, pathways of their biosynthesis, sequences of RNAs containing these modifications and RNA-modifying enzymes. MODOMICS also provides cross-references to other databases and to literature. In addition to the previously available manually curated tRNA sequences from a few model organisms, we have now included additional tRNAs and rRNAs, as well as all RNAs with 3D structures in the Nucleic Acid Database that contain modified nucleosides. In total, 3460 modified bases in RNA sequences of different organisms have been annotated. New RNA-modifying enzymes have also been added. The current collection of enzymes includes mainly proteins for the model organisms Escherichia coli and Saccharomyces cerevisiae, and is currently being expanded to include proteins from other organisms, in particular Archaea and Homo sapiens. For enzymes with known structures, links are provided to the corresponding Protein Data Bank entries, while for many others homology models have been created. Many new options for database searching and querying have been included. MODOMICS can be accessed at http://genesilico.pl/modomics.
MODOMICS is a database of RNA modifications that provides comprehensive information concerning the chemical structures of modified ribonucleosides, their biosynthetic pathways, RNA-modifying enzymes and the location of modified residues in RNA sequences. In the current database version, accessible at http://modomics.genesilico.pl, we have included new features: a census of human and yeast snoRNAs involved in RNA-guided RNA modification, a new section covering the 5′-end capping process, and a catalogue of ‘building blocks’ for chemical synthesis of a large variety of modified nucleosides. The MODOMICS collections of RNA modifications, RNA-modifying enzymes and modified RNAs have also been updated. A number of newly identified modified ribonucleosides and more than one hundred functionally and structurally characterized proteins from various organisms have been added. In the RNA sequences section, snRNAs and snoRNAs with experimentally mapped modified nucleosides have been added and the current collection of rRNA and tRNA sequences has been substantially enlarged. To facilitate literature searches, each record in MODOMICS has been cross-referenced to other databases and to selected key publications. New options for database searching and querying have been implemented, including a BLAST search of protein sequences and a PARALIGN search of the collected nucleic acid sequences.
In the past, a number of methods have been developed for predicting post-translational modifications in proteins. In contrast, only limited attempts have been made to understand post-transcriptional modifications. Recently, it has been shown that tRNA modifications play a direct role in genome structure and codon usage. This study is an attempt to understand kingdom-wise tRNA modifications, particularly uridine modifications (UMs), as the majority of modifications are uridine-derived.
A three-step strategy was applied to develop an efficient method for the prediction of UMs. In the first step, we developed a common prediction model for all kingdoms using a dataset from MODOMICS-2008. Support Vector Machine (SVM)-based prediction models were developed and evaluated by five-fold cross-validation. Of the different approaches tested, a hybrid approach combining binary and structural information achieved the highest area under the curve (AUC), 0.936. In the second step, we used tRNA sequences newly added in MODOMICS-2012 as an independent dataset to evaluate the kingdom-wise performance of the common model developed in the first step, and obtained AUCs between 0.910 and 0.949. In the third and last step, we used different datasets from MODOMICS-2012 to develop individual kingdom-wise prediction models, and obtained AUCs between 0.915 and 0.987.
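As a rough illustration of the "binary" part of such a feature set and of the AUC metric, the following minimal sketch one-hot encodes a sequence window around each uridine and computes the AUC from classifier scores. The window size (`flank`), the padding symbol and all function names are assumptions made for this example; the abstract does not specify them, and the structural features and the SVM itself are not reproduced here.

```python
from typing import Iterator, List, Sequence, Tuple

NUCLEOTIDES = "ACGU"


def binary_encode(window: str) -> List[int]:
    """One-hot encode an RNA window: four bits per position.

    Padding or unknown symbols are encoded as four zeros.
    """
    features: List[int] = []
    for base in window.upper():
        features.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return features


def uridine_windows(seq: str, flank: int = 2, pad: str = "X") -> Iterator[Tuple[int, str]]:
    """Yield (0-based position, window) for every U in seq.

    The window spans `flank` residues on each side (an assumed size);
    sequence ends are padded with `pad`.
    """
    seq = seq.upper()
    padded = pad * flank + seq + pad * flank
    for i, base in enumerate(seq):
        if base == "U":
            yield i, padded[i : i + 2 * flank + 1]


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the probability that a random
    positive example scores higher than a random negative one (ties count half)."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute an AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Each encoded window could then be passed, together with a modified/unmodified label, to any SVM implementation; the pairwise AUC above is quadratic in the number of examples and is meant for small illustrative datasets only.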
The hybrid approach is efficient not only for predicting kingdom-wise modifications but also for classifying them into the two most prominent UMs: pseudouridine (Y) and dihydrouridine (D). A webserver called tRNAmod (http://crdd.osdd.net/raghava/trnamod/) has been developed, which predicts UMs from both tRNA sequences and whole genomes.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2105-15-326) contains supplementary material, which is available to authorized users.
Uridine modifications; Pseudouridine; Dihydrouridine; 5-methyl-uridine; tRNAmod
Naturally occurring RNAs contain numerous enzymatically altered nucleosides. Differences in RNA populations (RNomics) and in the pattern of RNA modifications (Modomics) depend on the organism analyzed and are two of the criteria that distinguish the three kingdoms of life. Whereas the genomic sequences of the RNA molecules can be derived from whole-genome sequence information, the modification profile cannot; it requires either direct sequencing of the RNAs or predictive methods based on the presence or absence of the modification genes.
By employing a comparative genomics approach, we predicted almost all of the genes coding for the t+rRNA modification enzymes in the mesophilic moderate halophile Haloferax volcanii. These encode both guide RNAs and enzymes. Some are orthologous to previously identified genes in Archaea, Bacteria or in Saccharomyces cerevisiae, but several are original predictions.
The number of modifications in t+rRNAs in the halophilic archaeon is surprisingly low when compared with other Archaea or Bacteria, particularly the hyperthermophilic organisms. This may result from the specific lifestyle of halophiles that require high intracellular salt concentration for survival. This salt content could allow RNA to maintain its functional structural integrity with fewer modifications. We predict that the few modifications present must be particularly important for decoding, accuracy of translation or are modifications that cannot be functionally replaced by the electrostatic interactions provided by the surrounding salt-ions. This analysis also guides future experimental validation work aiming to complete the understanding of the function of RNA modifications in Archaeal translation.
N1-methylation of adenosine to m1A occurs in several different positions in tRNAs from various organisms. A methyl group at position N1 prevents Watson–Crick-type base pairing by adenosine and is therefore important for regulation of structure and stability of tRNA molecules. Thus far, only one family of genes encoding enzymes responsible for m1A methylation at position 58 has been identified, while other m1A methyltransferases (MTases) remain elusive. Here, we show that Bacillus subtilis open reading frame yqfN is necessary and sufficient for N1-adenosine methylation at position 22 of bacterial tRNA. Thus, we propose to rename YqfN as TrmK, according to the traditional nomenclature for bacterial tRNA MTases, or TrMet(m1A22) according to the nomenclature from the MODOMICS database of RNA modification enzymes. tRNAs purified from a ΔtrmK strain, like tRNAs from Escherichia coli, which natively lacks m1A22, are good substrates in vitro for the recombinant TrmK protein, which is sufficient for m1A methylation at position 22. TrmK is conserved in Gram-positive bacteria and present in some Gram-negative bacteria, but its orthologs are apparently absent from archaea and eukaryota. Protein structure prediction indicates that the active site of TrmK does not resemble the active site of the m1A58 MTase TrmI, suggesting that these two enzymatic activities evolved independently.
The URM1 pathway functions in a tRNA thiolation reaction that is required for synthesis of the mcm5s2U34 nucleoside found in tRNAs. Growth of Saccharomyces cerevisiae cells at an elevated temperature results in altered levels of modification enzymes, and this leads to decreased levels of tRNA thiolation. tRNA thiolation is tied to cellular stress responses.
Although tRNA modifications have been well catalogued, the precise functions of many modifications and their roles in mediating gene expression are still being elucidated. Whereas tRNA modifications were long assumed to be constitutive, it is now apparent that the modification status of tRNAs changes in response to different environmental conditions. The URM1 pathway is required for thiolation of the cytoplasmic tRNAs tGluUUC, tGlnUUG, and tLysUUU in Saccharomyces cerevisiae. We demonstrate that URM1 pathway mutants have impaired translation, which results in increased basal activation of the Hsf1-mediated heat shock response; we also find that tRNA thiolation levels in wild-type cells decrease when cells are grown at elevated temperature. We show that defects in tRNA thiolation can be conditionally advantageous, conferring resistance to endoplasmic reticulum stress. URM1 pathway proteins are unstable and hence are more sensitive to changes in the translational capacity of cells, which is decreased in cells experiencing stresses. We propose a model in which a stress-induced decrease in translation results in decreased levels of URM1 pathway components, which results in decreased tRNA thiolation levels, which further serves to decrease translation. This mechanism ensures that tRNA thiolation and translation are tightly coupled and coregulated according to need.
This compilation presents in a small space the tRNA sequences so far published in order to enable rapid orientation and comparison. The numbering of tRNAPhe from yeast is used as has been done earlier (1) but following the rules proposed by the participants of the Cold Spring Harbor Meeting on tRNA 1978 (2) (Fig. 1). This numbering allows comparisons with the three dimensional structure of tRNAPhe, the only structure known from X-ray analysis. The secondary structure of tRNAs is indicated by specific underlining. In the primary structure a nucleoside followed by a nucleoside in brackets or a modification in brackets denotes that both types of nucleosides can occupy this position. Part of a sequence in brackets designates a piece of sequence not unambiguously analyzed. Rare nucleosides are named according to the IUPAC-IUB rules (for some more complicated rare nucleosides and their identification see Table 1); those with lengthy names are given with the prefix x and specified in the footnotes. Footnotes are numbered according to the coordinates of the corresponding nucleoside and are indicated in the sequence by an asterisk. The references are restricted to the citation of the latest publication in those cases where several papers deal with one sequence. For additional information the reader is referred either to the original literature or to other tRNA sequence compilations (3-7). Mutant tRNAs are dealt with in a separate compilation prepared by J. Celis (see below). The compilers would welcome any information by the readers regarding missing material or erroneous presentation. On the basis of this numbering system computer printed compilations of tRNA sequences in a linear form and in cloverleaf form are in preparation.
This compilation presents in a small space the tRNA sequences so far published. The numbering of tRNAPhe from yeast is used following the rules proposed by the participants of the Cold Spring Harbor Meeting on tRNA 1978 (1,2; Fig. 1). This numbering allows comparisons with the three dimensional structure of tRNAPhe. The secondary structure of tRNAs is indicated by specific underlining. In the primary structure a nucleoside followed by a nucleoside in brackets or a modification in brackets denotes that both types of nucleosides can occupy this position. Part of a sequence in brackets designates a piece of sequence not unambiguously analyzed. Rare nucleosides are named according to the IUPAC-IUB rules (for complicated rare nucleosides and their identification see Table 1); those with lengthy names are given with the prefix x and specified in the footnotes. Footnotes are numbered according to the coordinates of the corresponding nucleoside and are indicated in the sequence by an asterisk. The references are restricted to the citation of the latest publication in those cases where several papers deal with one sequence. For additional information the reader is referred either to the original literature or to other tRNA sequence compilations (3-7). Mutant tRNAs are dealt with in a compilation by J. Celis (8). The compilers would welcome any information by the readers regarding missing material or erroneous presentation. On the basis of this numbering system computer printed compilations of tRNA sequences in a linear form and in cloverleaf form are in preparation.
Small nucleolar RNAs (snoRNAs) and Cajal body-specific RNAs (scaRNAs) are named for their subcellular localization within nucleoli and Cajal bodies (conserved subnuclear organelles present in the nucleoplasm), respectively. They have been found to play important roles in the modification and processing of rRNAs, tRNAs, snRNAs and even mRNAs. All snoRNAs fall into two categories, box C/D snoRNAs and box H/ACA snoRNAs, according to their distinct sequence and secondary structure features. Box C/D snoRNAs and box H/ACA snoRNAs mainly function in guiding 2′-O-ribose methylation and pseudouridylation, respectively. ScaRNAs possess both box C/D snoRNA and box H/ACA snoRNA sequence motif features, but guide modifications of snRNAs that are transcribed by RNA polymerase II. Here we present a Web-based sno/scaRNA database, called sno/scaRNAbase, to facilitate sno/scaRNA research by providing a more comprehensive knowledge base. Covering 1979 records derived from 85 organisms for the first time, sno/scaRNAbase is not only dedicated to filling gaps between existing organism-specific sno/scaRNA databases that are focused on different sno/scaRNA aspects, but also provides sno/scaRNA scientists with an opportunity to adopt a unified nomenclature for sno/scaRNAs. Derived from a systematic literature curation and annotation effort, the sno/scaRNAbase provides an easy-to-use gateway to important sno/scaRNA features such as sequence motifs, possible functions, homologues, secondary structures, genomic organization, chromosomal location of sno/scaRNA genes, and more. Approximate searches, in addition to accurate and straightforward searches, make the database search more flexible. A BLAST search engine is implemented to enable BLAST searches of query sequences against all sno/scaRNAbase sequences. Thus our sno/scaRNAbase serves as a more uniform and friendly platform for sno/scaRNA research. The database is freely available at .
MetaCyc is a database of metabolic pathways and enzymes located at . Its goal is to serve as a metabolic encyclopedia, containing a collection of non-redundant pathways central to small molecule metabolism, which have been reported in the experimental literature. Most of the pathways in MetaCyc occur in microorganisms and plants, although animal pathways are also represented. MetaCyc contains metabolic pathways, enzymatic reactions, enzymes, chemical compounds, genes and review-level comments. Enzyme information includes substrate specificity, kinetic properties, activators, inhibitors, cofactor requirements and links to sequence and structure databases. Data are curated from the primary literature by curators with expertise in biochemistry and molecular biology. MetaCyc serves as a readily accessible comprehensive resource on microbial and plant pathways for genome analysis, basic research, education, metabolic engineering and systems biology. Querying, visualization and curation of the database are supported by SRI's Pathway Tools software. The PathoLogic component of Pathway Tools is used in conjunction with MetaCyc to predict the metabolic network of an organism from its annotated genome. SRI and the European Bioinformatics Institute employed this tool to create pathway/genome databases (PGDBs) for 165 organisms, available at the website. These PGDBs also include predicted operons and pathway hole fillers.
Neuropeptides play a variety of roles in many physiological processes and serve as potential therapeutic targets for the treatment of some nervous-system disorders. In recent years, there has been a tremendous increase in the number of identified neuropeptides. Therefore, we have developed NeuroPep, a comprehensive resource of neuropeptides, which holds 5949 non-redundant neuropeptide entries originating from 493 organisms belonging to 65 neuropeptide families. In NeuroPep, the number of neuropeptides in invertebrates and vertebrates is 3455 and 2406, respectively. It is currently the most complete neuropeptide database. We extracted entries deposited in UniProt, the database (www.neuropeptides.nl) and NeuroPedia, and used text mining methods to retrieve entries from MEDLINE abstracts and full-text articles. All the entries in NeuroPep have been manually checked. Of the 5949 neuropeptide sequences, 2069 (35%) were collected from the scientific literature. Moreover, NeuroPep contains detailed annotations for each entry, including source organisms, tissue specificity, families, names, post-translational modifications, 3D structures (if available) and literature references. Information derived from these peptide sequences, such as amino acid composition, isoelectric point, molecular weight and other physicochemical properties of peptides, is also provided. A quick search feature allows users to search the database with keywords such as sequence, name, family, etc., and an advanced search page helps users to combine queries with logical operators like AND/OR. In addition, user-friendly web tools like browsing, sequence alignment and mapping are also integrated into the NeuroPep database.
Database URL: http://isyslab.info/NeuroPep
This article addresses the problem of interoperation of heterogeneous bioinformatics databases.
We introduce BioWarehouse, an open source toolkit for constructing bioinformatics database warehouses using the MySQL and Oracle relational database managers. BioWarehouse integrates its component databases into a common representational framework within a single database management system, thus enabling multi-database queries using the Structured Query Language (SQL) but also facilitating a variety of database integration tasks such as comparative analysis and data mining. BioWarehouse currently supports the integration of a pathway-centric set of databases including ENZYME, KEGG, and BioCyc, and in addition the UniProt, GenBank, NCBI Taxonomy, and CMR databases, and the Gene Ontology. Loader tools, written in the C and JAVA languages, parse and load these databases into a relational database schema. The loaders also apply a degree of semantic normalization to their respective source data, decreasing semantic heterogeneity. The schema supports the following bioinformatics datatypes: chemical compounds, biochemical reactions, metabolic pathways, proteins, genes, nucleic acid sequences, features on protein and nucleic-acid sequences, organisms, organism taxonomies, and controlled vocabularies. As an application example, we applied BioWarehouse to determine the fraction of biochemically characterized enzyme activities for which no sequences exist in the public sequence databases. The answer is that no sequence exists for 36% of enzyme activities for which EC numbers have been assigned. These gaps in sequence data significantly limit the accuracy of genome annotation and metabolic pathway prediction, and are a barrier for metabolic engineering. Complex queries of this type provide examples of the value of the data warehousing approach to bioinformatics research.
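The kind of multi-database query described above can be sketched with a toy warehouse. The example below uses Python's built-in sqlite3 module and a deliberately simplified, hypothetical two-table schema (`enzyme_activity`, `protein_sequence`); it is not the actual BioWarehouse schema, and the placeholder identifiers are invented for illustration only.

```python
import sqlite3
from typing import Optional


def build_toy_warehouse() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE enzyme_activity (ec_number TEXT PRIMARY KEY);
        CREATE TABLE protein_sequence (
            id INTEGER PRIMARY KEY,
            ec_number TEXT,
            source_db TEXT
        );
        """
    )
    # Placeholder activity identifiers, not real EC numbers.
    conn.executemany(
        "INSERT INTO enzyme_activity (ec_number) VALUES (?)",
        [("EC_A",), ("EC_B",), ("EC_C",), ("EC_D",)],
    )
    # EC_C deliberately has no sequence in any source database.
    conn.executemany(
        "INSERT INTO protein_sequence (ec_number, source_db) VALUES (?, ?)",
        [("EC_A", "db1"), ("EC_A", "db2"), ("EC_B", "db1"), ("EC_D", "db2")],
    )
    return conn


def fraction_without_sequence(conn: sqlite3.Connection) -> Optional[float]:
    """Fraction of enzyme activities with no sequence in any loaded database.

    Returns None when the enzyme_activity table is empty.
    """
    row = conn.execute(
        """
        SELECT AVG(CASE WHEN p.ec_number IS NULL THEN 1.0 ELSE 0.0 END)
        FROM enzyme_activity AS e
        LEFT JOIN (SELECT DISTINCT ec_number FROM protein_sequence) AS p
               ON p.ec_number = e.ec_number
        """
    ).fetchone()
    return row[0]
```

Because all component databases sit in one relational schema, a single LEFT JOIN suffices to find activities without a matching sequence; on the toy data one of four activities lacks a sequence.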
BioWarehouse embodies significant progress on the database integration problem for bioinformatics.
Transfer RNAs of the extreme halophile Haloferax volcanii contain several modified nucleosides, among them 1-methylpseudouridine (m1 psi), pseudouridine (psi), 2'-O-methylcytosine (Cm) and 1-methylinosine (m1I), present in positions 54, 55, 56 and 57 of the psi-loop, respectively. At the same positions in tRNAs from eubacteria and eukaryotes, ribothymidine (T-54), pseudouridine (psi-55), non-modified cytosine (C-56) and non-modified adenosine or guanosine (A-57 or G-57) are found in the so-called T psi-loop. Using as substrate a T7 transcript of Haloferax volcanii tRNA(Ile) devoid of modified nucleosides, the enzymatic activities of several tRNA modification enzymes, including those for m1 psi-54, psi-55, Cm-56 and m1I-57, were detected in cell extracts of H. volcanii. Here, we demonstrate that modification of A-57 into m1I-57 in H. volcanii tRNA(Ile) occurs via a two-step enzymatic process. The first step corresponds to the formation of m1A-57 catalyzed by an S-adenosylmethionine-dependent tRNA methyltransferase, followed by the deamination of the 6-amino group of the adenine moiety by a 1-methyladenosine-57 deaminase. This enzymatic pathway differs from that leading to the formation of m1I-37 in the anticodon loop of eukaryotic tRNA(Ala). In the latter case, inosine-37 formation precedes the S-adenosylmethionine-dependent methylation of I-37 into m1I-37. Thus, enzymatic strategies for catalyzing the formation of 1-methylinosine in tRNAs differ in organisms from distinct evolutionary kingdoms.
An increasing number of cis-regulatory RNA elements have been found to regulate gene expression post-transcriptionally in various biological processes in bacterial systems. Effective computational tools for large-scale identification of novel regulatory RNAs are strongly desired to facilitate our exploration of gene regulation mechanisms and regulatory networks. We present a new computational program named RSSVM (RNA Sampler+Support Vector Machine), which employs Support Vector Machines (SVMs) for efficient identification of functional RNA motifs from random RNA secondary structures. RSSVM uses a set of distinctive features to represent the common RNA secondary structure and structural alignment predicted by RNA Sampler, a tool for accurate common RNA secondary structure prediction, and is trained with functional RNAs from a variety of bacterial RNA motif/gene families covering a wide range of sequence identities. When tested on a large number of known and random RNA motifs, RSSVM shows a significantly higher sensitivity than other leading RNA identification programs while maintaining the same false positive rate. RSSVM performs particularly well on sets with low sequence identities. The combination of RNA Sampler and RSSVM provides a new, fast, and efficient pipeline for large-scale discovery of regulatory RNA motifs. We applied RSSVM to multiple Shewanella genomes and identified putative regulatory RNA motifs in the 5′ untranslated regions (UTRs) in S. oneidensis, an important bacterial organism with extraordinary respiratory and metal reducing abilities and great potential for bioremediation and alternative energy generation. From 1002 sets of 5′-UTRs of orthologous operons, we identified 166 putative regulatory RNA motifs, including 17 of the 19 known RNA motifs from Rfam, an additional 21 RNA motifs that are supported by literature evidence, 72 RNA motifs overlapping predicted transcription terminators or attenuators, and other candidate regulatory RNA motifs. 
Our study provides a list of promising novel regulatory RNA motifs potentially involved in post-transcriptional gene regulation. Combined with the previous cis-regulatory DNA motif study in S. oneidensis, this genome-wide discovery of cis-regulatory RNA motifs may offer more comprehensive views of gene regulation at a different level in this organism. The RSSVM software, predictions, and analysis results on Shewanella genomes are available at http://ural.wustl.edu/resources.html#RSSVM.
RNA is remarkably versatile, acting not only as messengers to transfer genetic information from DNA to protein but also as critical structural components and catalytic enzymes in the cell. More intriguingly, RNA elements in messenger RNAs have been widely found in bacteria to control the expression of their downstream genes. The functions of these RNA elements are intrinsically linked to their secondary structures, which are usually conserved across multiple closely related species during evolution and often shared by genes in the same metabolic pathways. We developed a new computational approach to find putative functional RNA elements by looking for conserved RNA secondary structures that are distinguished from random RNA secondary structures in the orthologous RNA sequences from related species. We applied this approach to multiple Shewanella genomes and predicted putative regulatory RNA elements in Shewanella oneidensis, a bacterium that has extraordinary respiratory and metal reducing abilities and great potential for bioremediation and alternative energy generation. Our findings not only recovered many RNA elements that are known or supported by literature evidence but also included exciting novel RNA elements for further exploration.
Tuberculosis is an infectious bacterial disease caused by Mycobacterium tuberculosis. It remains a major health threat, killing over one million people every year worldwide. An early antibiotic therapy is the basis of the treatment, and the emergence and spread of multidrug and extensively drug-resistant mutant strains raise significant challenges. As these bacteria grow very slowly, drug resistance mutations are currently detected using molecular biology techniques. Resistance mutations are identified by sequencing the resistance-linked genes followed by a comparison with the literature data. The only online database is the TB Drug Resistance Mutation database (TBDReaM database); however, it requires mutation detection before use, and its interrogation is complex due to its loose syntax and grammar.
The MUBII-TB-DB database is a simple, highly structured text-based database that contains a set of Mycobacterium tuberculosis mutations (DNA and proteins) occurring at seven loci: rpoB, pncA, katG, mabA(fabG1)-inhA, gyrA, gyrB, and rrs. Resistance mutation data were extracted after a systematic review of MEDLINE-referenced publications before March 2013. MUBII analyzes the query sequence obtained by PCR-sequencing using two parallel strategies: i) a BLAST search against a set of previously reconstructed mutated sequences and ii) the alignment of the query sequences (DNA and its protein translation) with the wild-type sequences. The post-treatment includes the extraction of the aligned sequences together with their descriptors (position and nature of mutations). The whole procedure is performed over the internet. The results are graphs (alignments) and text (description of the mutation, therapeutic significance). The system is quick and easy to use, even for technicians without bioinformatics training.
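The second strategy, comparing a query with the wild-type sequence at both the DNA and protein level and reporting the position and nature of each mutation, can be sketched as follows. This is a simplified illustration that assumes the two sequences are already aligned, in frame and gap-free; it does not perform the BLAST search or the alignment step of the actual system, and the function names and the toy fragment are hypothetical.

```python
from itertools import product
from typing import List, Tuple

# Standard genetic code in compact form (codon order: T, C, A, G at each position).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str) -> str:
    """Translate an in-frame DNA sequence; unknown codons become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(dna[i : i + 3], "X") for i in range(0, usable, 3))


def describe_mutations(
    wild_type: str, query: str, first_codon: int = 1
) -> Tuple[List[str], List[str]]:
    """Return (DNA substitutions, protein substitutions) as descriptors such as
    'C5T' or 'S2L'. Assumes pre-aligned, equal-length, gap-free sequences;
    `first_codon` is the codon number of the first triplet in the fragment."""
    wild_type, query = wild_type.upper(), query.upper()
    if len(wild_type) != len(query):
        raise ValueError("sequences must be pre-aligned and of equal length")
    dna_changes = [
        f"{w}{pos}{q}"
        for pos, (w, q) in enumerate(zip(wild_type, query), start=1)
        if w != q
    ]
    protein_changes = [
        f"{w}{pos}{q}"
        for pos, (w, q) in enumerate(
            zip(translate(wild_type), translate(query)), start=first_codon
        )
        if w != q
    ]
    return dna_changes, protein_changes
```

For a toy wild-type fragment `ACCTCGCTG` and query `ACCTTGCTG`, `describe_mutations` reports one nucleotide substitution and the corresponding serine-to-leucine change; mapping such descriptors to therapeutic significance would require the curated mutation table itself.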
MUBII-TB-DB is a structured database of the mutations occurring at seven loci of major therapeutic value in tuberculosis management. Moreover, the system provides interpretation of the mutations in biological and therapeutic terms and can evolve by the addition of newly described mutations. Its goal is to provide easy and comprehensive access through a client–server model over the Web to an up-to-date database of mutations that lead to the resistance of M. tuberculosis to antibiotics.
Tuberculosis; Antibiotics; Mutation database; Sequence database; Web
The Yeast Metabolome Database (YMDB, http://www.ymdb.ca) is a richly annotated ‘metabolomic’ database containing detailed information about the metabolome of Saccharomyces cerevisiae. Modeled closely after the Human Metabolome Database, the YMDB contains >2000 metabolites with links to 995 different genes/proteins, including enzymes and transporters. The information in YMDB has been gathered from hundreds of books, journal articles and electronic databases. In addition to its comprehensive literature-derived data, the YMDB also contains an extensive collection of experimental intracellular and extracellular metabolite concentration data compiled from detailed Mass Spectrometry (MS) and Nuclear Magnetic Resonance (NMR) metabolomic analyses performed in our lab. This is further supplemented with thousands of NMR and MS spectra collected on pure, reference yeast metabolites. Each metabolite entry in the YMDB contains an average of 80 separate data fields including comprehensive compound description, names and synonyms, structural information, physico-chemical data, reference NMR and MS spectra, intracellular/extracellular concentrations, growth conditions and substrates, pathway information, enzyme data, gene/protein sequence data, as well as numerous hyperlinks to images, references and other public databases. Extensive searching, relational querying and data browsing tools are also provided that support text, chemical structure, spectral, molecular weight and gene/protein sequence queries. Because of S. cerevisiae's importance as a model organism for biologists and as a biofactory for industry, we believe this kind of database could have considerable appeal not only to metabolomics researchers, but also to yeast biologists, systems biologists, the industrial fermentation industry, as well as the beer, wine and spirit industry.
To date, more than 90 modified nucleosides have been found in tRNA, and the biosynthetic pathways of the majority of tRNA modifications include one or more methylation steps. Recent studies of the biosynthetic pathways have demonstrated that the availability of methyl group donors for the methylation in tRNA is important for correct and efficient protein synthesis. In this review, I focus on the methylated nucleosides and tRNA methyltransferases. The primary functions of tRNA methylations are linked to the different steps of protein synthesis, such as the stabilization of tRNA structure, reinforcement of the codon-anticodon interaction, regulation of wobble base pairing, and prevention of frameshift errors. However, beyond these basic functions, recent studies have demonstrated that tRNA methylations are also involved in the RNA quality control system and regulation of tRNA localization in the cell. In a thermophilic eubacterium, tRNA modifications and the modification enzymes form a network that responds to temperature changes. Furthermore, several modifications are involved in genetic diseases, infections, and the immune response. Moreover, structural, biochemical, and bioinformatics studies of tRNA methyltransferases have been clarifying the details of tRNA methyltransferases and have enabled these enzymes to be classified. In the final section, the evolution of modification enzymes is discussed.
RNA modification; RNA methylation; RNA maturation
The Escherichia coli Metabolome Database (ECMDB, http://www.ecmdb.ca) is a comprehensively annotated metabolomic database containing detailed information about the metabolome of E. coli (K-12). Modelled closely on the Human and Yeast Metabolome Databases, the ECMDB contains >2600 metabolites with links to ∼1500 different genes and proteins, including enzymes and transporters. The information in the ECMDB has been collected from dozens of textbooks, journal articles and electronic databases. Each metabolite entry in the ECMDB contains an average of 75 separate data fields, including comprehensive compound descriptions, names and synonyms, chemical taxonomy, compound structural and physicochemical data, bacterial growth conditions and substrates, reactions, pathway information, enzyme data, gene/protein sequence data and numerous hyperlinks to images, references and other public databases. The ECMDB also includes an extensive collection of intracellular metabolite concentration data compiled from our own work as well as other published metabolomic studies. This information is further supplemented with thousands of fully assigned reference nuclear magnetic resonance and mass spectrometry spectra obtained from pure E. coli metabolites that we (and others) have collected. Extensive searching, relational querying and data browsing tools are also provided that support text, chemical structure, spectral, molecular weight and gene/protein sequence queries. Because of E. coli’s importance as a model organism for biologists and as a biofactory for industry, we believe this kind of database could have considerable appeal not only to metabolomics researchers but also to molecular biologists, systems biologists and individuals in the biotechnology industry.
RIKEN's FANTOM project has revealed many previously unknown coding sequences, as well as an unexpected degree of variation in transcripts resulting from alternative promoter usage and splicing. Ever more transcripts that do not code for proteins have been identified by transcriptome studies, in general. Increasing evidence points to the important cellular roles of such non-coding RNAs (ncRNAs). The distinction of protein-coding RNA transcripts from ncRNA transcripts is therefore an important problem in understanding the transcriptome and carrying out its annotation. Very few in silico methods have specifically addressed this problem. Here, we introduce CONC (for “coding or non-coding”), a novel method based on support vector machines that classifies transcripts according to features they would have if they were coding for proteins. These features include peptide length, amino acid composition, predicted secondary structure content, predicted percentage of exposed residues, compositional entropy, number of homologs from database searches, and alignment entropy. Nucleotide frequencies are also incorporated into the method. Confirmed coding cDNAs for eukaryotic proteins from the Swiss-Prot database constituted the set of true positives, ncRNAs from RNAdb and NONCODE the true negatives. Ten-fold cross-validation suggested that CONC distinguished coding RNAs from ncRNAs at about 97% specificity and 98% sensitivity. Applied to 102,801 mouse cDNAs from the FANTOM3 dataset, our method reliably identified over 14,000 ncRNAs and estimated the total number of ncRNAs to be about 28,000.
There are two types of RNA: messenger RNAs (mRNAs), which are translated into proteins, and non-coding RNAs (ncRNAs), which function as RNA molecules. Besides textbook examples such as tRNAs and rRNAs, non-coding RNAs have been found to carry out very diverse functions, from mRNA splicing and RNA modification to translational regulation. It has been estimated that non-coding RNAs make up the vast majority of transcription output of higher eukaryotes. Discriminating mRNA from ncRNA has become an important biological and computational problem. The authors describe a computational method based on a machine learning algorithm known as a support vector machine (SVM) that classifies transcripts according to features they would have if they were coding for proteins. These features include peptide length, amino acid composition, secondary structure content, and protein alignment information. The method is applied to the dataset from the FANTOM3 large-scale mouse cDNA sequencing project; it identifies over 14,000 ncRNAs in mouse and estimates the total number of ncRNAs in the FANTOM3 data to be about 28,000.
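As an illustration of the sequence-derived features named above (peptide length, amino acid composition and compositional entropy), the following minimal sketch computes them for a single translated peptide. It is not the CONC implementation: the function names `peptide_features` and `feature_vector` are hypothetical, the entropy is assumed to be the Shannon entropy of the amino acid frequencies, and the remaining features (predicted secondary structure, exposed residues, homolog counts, alignment entropy) are omitted because they require external prediction and search tools.

```python
import math
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature vectors are comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def peptide_features(peptide: str) -> dict:
    """Return length, amino acid composition and Shannon entropy of a peptide.

    Hypothetical helper for illustration only; non-standard residue codes
    are ignored when counting.
    """
    counts = Counter(aa for aa in peptide.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("peptide contains no standard amino acids")
    # Fraction of each amino acid in the peptide.
    composition = {aa: counts[aa] / total for aa in AMINO_ACIDS}
    # Compositional (Shannon) entropy in bits; zero-frequency residues are skipped.
    entropy = -sum(p * math.log2(p) for p in composition.values() if p > 0)
    return {"length": total, "composition": composition, "entropy": entropy}


def feature_vector(peptide: str) -> list:
    """Flatten the features into a numeric vector (length, entropy, 20 fractions)."""
    f = peptide_features(peptide)
    return [f["length"], f["entropy"]] + [f["composition"][aa] for aa in AMINO_ACIDS]
```

A vector of this kind, computed for the putative translation of each transcript, could be passed to any support vector machine implementation together with coding and non-coding training labels; the choice of kernel and parameters is left open here.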
Summary of Recent Advances
The maturation of transfer RNA (tRNA) involves extensive chemical modification of the constituent nucleosides, resulting in the formation of structurally diverse nucleosides. Many of the pathways to these modified nucleosides are characterized by chemically complex transformations, some of which are unprecedented in other areas of biology. To illustrate the scope of the field, recent progress in understanding the enzymology leading to the formation of 2 distinct classes of modified nucleosides is reviewed: the thiouridines and queuosine, a 7-deazaguanosine. In particular, recent data validating the involvement of several proposed intermediates in the formation of thiouridines are discussed, including 2 key enzyme intermediates and the activated tRNA intermediate. The discovery and mechanistic characterization of a new enzyme activity in the queuosine pathway are also discussed.
The BRENDA enzyme information system (http://www.brenda-enzymes.org/) has developed into an elaborate system of enzyme and enzyme-ligand information obtained from different sources, combined with flexible query systems and evaluation tools. The information is obtained by manual extraction from primary literature, text and data mining, data integration, and prediction algorithms. Approximately 300 million data include enzyme function and molecular data from more than 30 000 organisms. The manually derived core contains 3 million data from 77 000 enzymes annotated from 135 000 literature references. Each entry is connected to the literature reference and the source organism. They are complemented by information on occurrence, enzyme/disease relationships from text mining, sequences and 3D structures from other databases, and predicted enzyme location and genome annotation. Functional and structural data of more than 190 000 enzyme ligands are stored in BRENDA. New features improving the functionality and analysis tools were implemented. The human anatomy atlas CAVEman is linked to the BRENDA Tissue Ontology terms providing a connection between anatomical and functional enzyme data. Word Maps for enzymes obtained from PubMed abstracts highlight application and scientific relevance of enzymes. The EnzymeDetector genome annotation tool and the reaction database BKM-react including reactions from BRENDA, KEGG and MetaCyc were improved. The website was redesigned providing new query options.
High throughput proteomics experiments are useful for analyzing the protein expression of an organism, identifying the correct gene structure of a genome, or locating possible post-translational modifications within proteins. High throughput methods necessitate publicly accessible and easily queried databases for efficiently and logically storing, displaying, and analyzing the large volume of data.
EPICDB is a publicly accessible, queryable, relational database that organizes and displays experimental, high throughput proteomics data for Toxoplasma gondii and Cryptosporidium parvum. Along with detailed information on mass spectrometry experiments, the database also provides antibody experimental results and analysis of functional annotations, comparative genomics, and aligned expressed sequence tag (EST) and genomic open reading frame (ORF) sequences. The database contains all available alternative gene datasets for each organism, which together comprise a complete theoretical proteome for the respective organism, and all data are referenced to these sequences. The database is structured around clusters of protein sequences, which allows for the evaluation of redundancy, protein prediction discrepancies, and possible splice variants. The database can be expanded to include genomes of other organisms for which proteome-wide experimental data are available.
EPICDB is a comprehensive database of genome-wide T. gondii and C. parvum proteomics data and incorporates many features that allow for the analysis of the entire proteomes and/or annotation of specific protein sequences. EPICDB is complementary to other genomics databases of these organisms by offering complete mass spectrometry analysis on a comprehensive set of all available protein sequences.
The RNA modification database provides a comprehensive listing of posttranscriptionally modified nucleosides from RNA, and is maintained as an updated version of the initial printed report [Limbach,P.A., Crain,P.F. and McCloskey,J.A. (1994) Nucleic Acids Res., 22, 2183-2196]. Information provided for each nucleoside includes: the type of RNA in which it occurs and phylogenetic distribution; common chemical name and symbol; Chemical Abstracts registry number and index name; chemical structure; initial literature citations for structural characterization or occurrence, and for chemical synthesis. The data are available through the World Wide Web at: http://www-medlib.med.utah/RNAmods/RNAmods.html
REPAIRtoire is the first comprehensive database resource for systems biology of DNA damage and repair. The database collects and organizes the following types of information: (i) DNA damage linked to environmental mutagenic and cytotoxic agents, (ii) pathways comprising individual processes and enzymatic reactions involved in the removal of damage, (iii) proteins participating in DNA repair and (iv) diseases correlated with mutations in genes encoding DNA repair proteins. REPAIRtoire also provides links to publications and external databases. REPAIRtoire contains information about eight main DNA damage checkpoint, repair and tolerance pathways: DNA damage signaling, direct reversal repair, base excision repair, nucleotide excision repair, mismatch repair, homologous recombination repair, nonhomologous end-joining and translesion synthesis. The pathway/protein dataset is currently limited to three model organisms: Escherichia coli, Saccharomyces cerevisiae and Homo sapiens. The DNA repair and tolerance pathways are represented as graphs and in tabular form with descriptions of each repair step and corresponding proteins, and individual entries are cross-referenced to supporting literature and primary databases. REPAIRtoire can be queried by the name of a pathway, protein, enzymatic complex, damage or disease. In addition, a tool for drawing custom DNA–protein complexes is available online. REPAIRtoire is freely available and can be accessed at http://repairtoire.genesilico.pl/.