The evidence that two molecules interact in a living cell is often inferred from multiple different experiments. Experimental data are captured in multiple repositories, but there is no simple way to assess the evidence that an interaction occurs in a cellular environment. Merging and scoring of data are commonly required operations after querying for the details of specific molecular interactions, to remove redundancy and assess the strength of the accompanying experimental evidence. We have developed both a merging algorithm and a scoring system for molecular interactions based on the Proteomics Standards Initiative–Molecular Interaction (PSI-MI) standards. In this manuscript, we introduce these two algorithms and provide community access to the tool suite, describe examples of how these tools are useful to selectively present molecular interaction data and demonstrate a case where the algorithms were successfully used to identify a systematic error in an existing dataset.
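The merge-then-score idea can be illustrated with a minimal Python sketch. It is not the published algorithm: the record layout (interactor A, interactor B, detection method, publication identifier), the function names and the log-scaled scoring formula are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict

def merge_interactions(records):
    """Group evidence by unordered interactor pair; identical evidence collapses."""
    merged = defaultdict(set)
    for a, b, method, pubmed_id in records:
        key = tuple(sorted((a, b)))          # A-B and B-A are the same interaction
        merged[key].add((method, pubmed_id))
    return dict(merged)

def score_evidence(evidence, saturation=10):
    """Hypothetical score in (0, 1]: log-scaled count of distinct evidence items."""
    n = min(len(evidence), saturation)
    return math.log(n + 1) / math.log(saturation + 1)

# Illustrative records only; identifiers are placeholders.
records = [
    ("P12345", "Q67890", "two hybrid", "111"),
    ("Q67890", "P12345", "two hybrid", "111"),   # redundant copy, reversed order
    ("P12345", "Q67890", "pull down", "222"),
    ("P12345", "O11111", "two hybrid", "333"),
]
merged = merge_interactions(records)
scores = {pair: score_evidence(ev) for pair, ev in merged.items()}
```

In this toy input, the redundant reversed record is absorbed during merging, and the pair supported by two independent pieces of evidence scores higher than the pair supported by one.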
Experience in clinical practice and research in systems pharmacology suggest the limitations of the current one-drug-one-target paradigm in new drug discovery. Single-target drugs may not always produce the desired physiological effects on the entire biological system, even if they have successfully regulated the activities of their designated targets. On the other hand, multicomponent therapy, in which two or more agents simultaneously interact with multiple targets, has attracted growing attention. Many drug combinations consisting of multiple agents have already entered clinical practice, especially in treating complex and refractory diseases. The Drug Combination Database (DCDB), launched in 2010, is the first available database that collects and organizes information on drug combinations, with the aim of facilitating systems-oriented new drug discovery. Here, we report the second major release of DCDB (Version 2.0), which includes 866 new drug combinations (1363 in total), consisting of 904 distinctive components. These drug combinations are curated from ∼140 000 clinical studies and the Food and Drug Administration (FDA) Electronic Orange Book. In this update, DCDB collects 237 unsuccessful drug combinations, which may provide a contrast for systematic discovery of the patterns in successful drug combinations.
As one of the most important protein post-translational modifications, reversible phosphorylation is critical for plants in regulating a variety of biological processes such as cellular metabolism, signal transduction and responses to environmental stress. Numerous efforts, especially large-scale phosphoproteome profiling studies, have been devoted to dissecting phosphorylation signaling in various plants, and a large number of phosphorylation events have been identified. To provide an integrated data resource for further investigations, we present here a comprehensive database, dbPPT (database of Phosphorylation site in PlanTs, at http://dbppt.biocuckoo.org), which contains experimentally identified phosphorylation sites in proteins from plants. The phosphorylation sites in dbPPT were manually curated from the literature, and datasets in other public databases were also integrated. In total, dbPPT contains 82 175 phosphorylation sites in 31 012 proteins from 20 plant organisms, presenting a larger quantity of phosphorylation sites and a higher coverage of plant species in comparison with other databases. The proportions of serine, threonine and tyrosine residues were 77.99, 17.81 and 4.20%, respectively. All the phosphoproteins and phosphorylation sites in the database were critically annotated. Since phosphorylation signaling in plants has attracted great attention recently, such a comprehensive resource of plant protein phosphorylation can be useful for the research community.
Plant essential oils are complex mixtures of volatile organic compounds (VOCs), which play indispensable roles in the environment, for the plant itself, as well as for humans. The potential biological information stored in essential oil composition data can provide an insight into the silent language of plants, and the roles of these chemical emissions in defense, communication and pollinator attraction. In order to decipher volatile profile patterns from a global perspective, we have developed the ESSential OIL DataBase (EssOilDB), a continually updated, freely available electronic database designed to provide a knowledge resource for plant essential oils, which enables one to address a multitude of queries on volatile profiles of native, invasive, normal or stressed plants, across taxonomic clades, geographical locations and several other biotic and abiotic influences. To our knowledge, EssOilDB is the only database in the public domain providing an opportunity for context-based scientific research on volatile patterns in plants. EssOilDB presently contains 123 041 essential oil records spanning a century of published reports on volatile profiles, with data from 92 plant taxonomic families, spread across diverse geographical locations all over the globe. We hope that this huge repository of VOCs will facilitate unraveling of the true significance of volatiles in plants, along with creating potential avenues for industrial applications of essential oils. We also illustrate the use of this database in terpene biology and show how EssOilDB can be used to complement data from computational genomics to gain insights into the diversity and variability of terpenoids in the plant kingdom.
EssOilDB would serve as a valuable information resource, for students and researchers in plant biology, in the design and discovery of new odor profiles, as well as for entrepreneurs—the potential for generating consumer specific scents being one of the most attractive and interesting topics in the cosmetic industry.
Database URL: http://nipgr.res.in/Essoildb/
Systems biology analysis of protein modules is important for understanding the functional relationships between proteins in the interactome. Here, we present a comprehensive database named AIM for Arabidopsis (Arabidopsis thaliana) interactome modules. The database contains almost 250 000 modules that were generated using multiple analysis methods and integration of microarray expression data. All the modules in AIM are well annotated using multiple gene function knowledge databases. AIM provides a user-friendly interface for different types of searches and offers a powerful graphical viewer for displaying module networks linked to the enrichment annotation terms. Both an interactive Venn diagram and a power graph viewer are integrated into the database for easy comparison of modules. In addition, predicted interologs from other plant species (homologous proteins from different species that share a conserved interaction module) are available for each Arabidopsis module. AIM is a powerful systems biology platform for obtaining valuable insights into the function of proteins in Arabidopsis and other plants using the modules of the Arabidopsis interactome.
β-Lactam antibiotics are among the most widely used antibiotics against microbial pathogens. However, enzymatic hydrolysis of these antibiotics by bacterial β-lactamases is increasingly compromising their efficiency. Although new generation β-lactam antibiotics have been developed to combat antibiotic resistance, β-lactamases have also evolved along with the new variants of the substrate. A strong selection pressure from the newer generation of β-lactam antibiotics has resulted in the evolution of different families within each class of β-lactamase. To facilitate detailed characterization of different families of β-lactamases, we have created a database, CBMAR, which facilitates comprehensive molecular annotation and discovery of novel β-lactamases. In contrast to the limited scope of existing similar databases, CBMAR provides information useful for molecular and biochemical characterization of each family of β-lactamase. The basic architecture of CBMAR is based on the Ambler classification, which divides β-lactamases into serine β-lactamases (Classes A, C and D) and metallo-β-lactamases (Class B). Each class is further divided into several families on the basis of their hydrolytic character. In CBMAR, each family is annotated with (i) sequence variability, (ii) antibiotic resistance profile, (iii) inhibitor susceptibility, (iv) active site, (v) family fingerprints, (vi) mutational profile, (vii) variants, (viii) gene location, (ix) phylogenetic tree and several other features. Each entry also has external links to the relevant protein/nucleotide sequence and structure databases. The database also supports sequence similarity searches using BLAST and assigns a new β-lactamase protein to its respective family on the basis of family-specific fingerprints.
Understanding prokaryotic transformation of recalcitrant pollutants and the in-situ metabolic nets requires the integration of massive amounts of biological data. Decades of biochemical studies together with novel next-generation sequencing data have exponentially increased information on aerobic aromatic degradation pathways. However, the majority of protein sequences in public databases have not been experimentally characterized, and homology-based methods are still the most routinely used approach to assign protein function, allowing the propagation of misannotations. AromaDeg is a web-based resource targeting aerobic degradation of aromatics that comprises recently updated (September 2013) and manually curated databases constructed using a phylogenomic approach. Grounded in phylogenetic analyses of protein sequences of key catabolic protein families and of proteins of documented function, AromaDeg allows query and data mining of novel genomic, metagenomic or metatranscriptomic data sets. Essentially, each query sequence that matches a given protein family of AromaDeg is associated with a specific cluster of a given phylogenetic tree, and further functional annotation and/or substrate specificity may be inferred from the neighboring cluster members with experimentally validated function. This allows a detailed characterization of individual protein superfamilies as well as high-throughput functional classifications. Thus, AromaDeg addresses the deficiencies of homology-based protein function prediction, combining phylogenetic tree construction and integration of experimental data to obtain more accurate annotations of new biological data related to aerobic aromatic biodegradation pathways. In the future, we plan to expand AromaDeg to other enzyme families involved in aromatic degradation and to update it regularly.
Halophilic archaea/bacteria adapt to different salt concentrations, namely extreme, moderate and low. These types of adaptation may occur as a result of modification of protein structure and other changes in different cell organelles. Thus, proteins may play an important role in the adaptation of halophilic archaea/bacteria to saline conditions. The Halophile protein database (HProtDB) is a systematic attempt to document the biochemical and biophysical properties of proteins from halophilic archaea/bacteria that may be involved in the adaptation of these organisms to saline conditions. In this database, various physicochemical properties such as molecular weight, theoretical pI, amino acid composition, atomic composition, estimated half-life, instability index, aliphatic index and grand average of hydropathicity (Gravy) have been listed. These physicochemical properties play an important role in identifying the structure, bonding pattern and function of specific proteins. The database is a comprehensive, manually curated, non-redundant catalogue of proteins. It currently contains the properties of 59 897 proteins extracted from 21 different strains of halophilic archaea/bacteria. The database can be accessed through the link below.
Database URL: http://webapp.cabgrid.res.in/protein/
piRNAs are a class of small RNAs that is most abundantly expressed in the animal germ line. Presently, substantial research is going on to reveal the functions of piRNAs in the epigenetic and post-transcriptional regulation of transposons and genes. A piRNA database for collection, annotation and structuring of these data will be a valuable contribution to the field, and we have therefore developed the piRBase platform which integrates various piRNA-related high-throughput data. piRBase has the largest collection of piRNAs among existing databases, and contains at present 77 million piRNA sequences from nine organisms. Repeat-derived and gene-derived piRNAs, which possibly participate in the regulation of the corresponding elements, have been given particular attention. Furthermore, epigenetic data and reported piRNA targets were also collected. To our knowledge, this is the first piRNA database that systematically integrates epigenetic and post-transcriptional regulation data to support piRNA functional analysis. We believe that piRBase will contribute to a better understanding of the piRNA functions.
microRNA (miRNA)–promoter interaction resource (microPIR) is a public database containing over 15 million predicted miRNA target sites located within human promoter sequences. These predicted targets are presented along with their related genomic and experimental data, making the microPIR database the most comprehensive repository of miRNA promoter target sites. Here, we describe major updates of the microPIR database including new target predictions in the mouse genome and revised human target predictions. The updated database (microPIR2) now provides ∼80 million human and 40 million mouse predicted target sites. In addition to being a reference database, microPIR2 is a tool for comparative analysis of target sites on the promoters of human–mouse orthologous genes. In particular, this new feature was designed to identify potential miRNA–promoter interactions conserved between species that could be stronger candidates for further experimental validation. We also incorporated additional supporting information to microPIR2 such as nuclear and cytoplasmic localization of miRNAs and miRNA–disease association. Extra search features were also implemented to enable various investigations of targets of interest.
Database URL: http://www4a.biotec.or.th/micropir2
The Protein Data Bank (PDB) is the single global repository for three-dimensional structures of biological macromolecules and their complexes, and its more than 100 000 structures contain more than 20 000 distinct ligands or small molecules bound to proteins and nucleic acids. Information about these small molecules and their interactions with proteins and nucleic acids is crucial for our understanding of biochemical processes and vital for structure-based drug design. Small molecules present in a deposited structure may be attached to a polymer or may occur as a separate, non-covalently linked ligand. During curation of a newly deposited structure by wwPDB annotation staff, each molecule is cross-referenced to the PDB Chemical Component Dictionary (CCD). If the molecule is new to the PDB, a dictionary description is created for it. The information about all small molecule components found in the PDB is distributed via the ftp archive as an external reference file. Small molecule annotation in the PDB also includes information about ligand-binding sites and about covalent and other linkages between ligands and macromolecules. During the remediation of the peptide-like antibiotics and inhibitors present in the PDB archive in 2011, it became clear that additional annotation was required for consistent representation of these molecules, which are quite often composed of several sequential subcomponents including modified amino acids and other chemical groups. The connectivity information of the modified amino acids is necessary for correct representation of these biologically interesting molecules. The combined information is made available via a new resource called the Biologically Interesting molecules Reference Dictionary, which is complementary to the CCD and is now routinely used for annotation of peptide-like antibiotics and inhibitors.
Gene Ontology (GO) and GO annotation are important resources for biological information management and knowledge discovery, but the speed of manual annotation has become a major bottleneck of database curation. The BioCreative IV GO annotation task aims to evaluate the performance of systems that automatically assign GO terms to genes based on the narrative sentences in biomedical literature. This article presents our work in this task as well as the experimental results after the competition. For the evidence sentence extraction subtask, we built a binary classifier to identify evidence sentences using the reference distance estimator (RDE), a recently proposed semi-supervised learning method that learns new features from around 10 million unlabeled sentences, achieving an F1 of 19.3% in exact match and 32.5% in relaxed match. In the post-submission experiment, we obtained 22.1% and 35.7% F1 performance by incorporating bigram features in RDE learning. In both development and test sets, the RDE-based method achieved over 20% relative improvement on F1 and AUC performance against classical supervised learning methods, e.g. support vector machine and logistic regression. For the GO term prediction subtask, we developed an information retrieval-based method to retrieve the GO term most relevant to each evidence sentence using a ranking function that combined cosine similarity and the frequency of GO terms in documents, and a filtering method based on high-level GO classes. The best performance of our submitted runs was 7.8% F1 and 22.2% hierarchy F1. We found that the incorporation of frequency information and hierarchy filtering substantially improved the performance. In the post-submission evaluation, we obtained a 10.6% F1 using a simpler setting. Overall, the experimental analysis showed that our approaches were robust in both tasks.
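A ranking function that combines cosine similarity with GO term frequency might look like the following Python sketch. The abstract does not state how the two signals were combined, so the linear interpolation, the weight `alpha`, the whitespace tokenization and the function names are assumptions for illustration only.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_go_terms(sentence, go_terms, term_doc_freq, alpha=0.8):
    """Rank GO ids by alpha * cosine + (1 - alpha) * normalized document frequency."""
    sent_vec = Counter(sentence.lower().split())
    max_freq = max(term_doc_freq.values(), default=0) or 1
    scored = []
    for go_id, name in go_terms.items():
        sim = cosine(sent_vec, Counter(name.lower().split()))
        freq = term_doc_freq.get(go_id, 0) / max_freq
        scored.append((alpha * sim + (1 - alpha) * freq, go_id))
    return [go_id for _, go_id in sorted(scored, reverse=True)]

go_terms = {
    "GO:0006468": "protein phosphorylation",
    "GO:0016301": "kinase activity",
    "GO:0006412": "translation",
}
term_doc_freq = {"GO:0006468": 4, "GO:0016301": 2, "GO:0006412": 1}  # made-up counts
ranking = rank_go_terms(
    "the kinase mediates protein phosphorylation of the substrate",
    go_terms, term_doc_freq,
)
```

Here the term sharing the most words with the sentence and occurring most often in the (made-up) document counts is ranked first; a filter on high-level GO classes, as described above, would be applied as a separate step.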
Motivation: We examine the task of temporal relation classification for the clinical domain. Our approach to this task departs from existing ones in that it is (i) ‘knowledge-rich’, employing sophisticated knowledge derived from discourse relations as well as both domain-independent and domain-dependent semantic relations, and (ii) ‘hybrid’, combining the strengths of rule-based and learning-based approaches. Evaluation results on the i2b2 Clinical Temporal Relations Challenge corpus show that our approach yields a 17–24% and 8–14% relative reduction in error over a state-of-the-art learning-based baseline system when gold-standard and automatically identified temporal relations are used, respectively.
Insects are one of the most successful animal groups on earth. Some insects, such as the silkworm and honeybee, are beneficial to humans, whereas others are notorious pests of crops. At present, the genomes of 38 insects have been sequenced and made publicly available. In addition, the transcriptomes of dozens of insects have been sequenced. As gene data rapidly accumulate, constructing the pathways of molecular interactions becomes increasingly important for entomological research. Here, we developed an improved tool, iPathCons, for knowledge-based construction of pathways from the transcriptomes or the official gene sets of genomes. Considering the high evolutionary diversity in insects, iPathCons uses a voting system for Kyoto Encyclopedia of Genes and Genomes Orthology assignment. Both stand-alone software and a web server of iPathCons are provided. Using iPathCons, we constructed the pathways of molecular interactions of 52 insects, including 37 genome-sequenced and 15 transcriptome-sequenced ones. These pathways are available in iPathDB, which provides searches, a web server, data downloads, etc. This database will be highly useful for the insect research community.
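The abstract does not describe how the voting system works, so the following Python sketch shows only one plausible reading: each homologous hit from a reference species casts a vote for its KEGG Orthology (KO) identifier, and a KO is assigned only when it wins a strict majority. The function name, the majority threshold and the input layout are hypothetical.

```python
from collections import Counter

def assign_ko_by_vote(hit_kos, min_fraction=0.5):
    """Return the KO backed by more than min_fraction of annotated hits, else None."""
    votes = Counter(ko for ko in hit_kos if ko is not None)  # unannotated hits abstain
    if not votes:
        return None
    ko, count = votes.most_common(1)[0]
    return ko if count / sum(votes.values()) > min_fraction else None

# Illustrative hits for one query gene; None marks a hit without a KO annotation.
assigned = assign_ko_by_vote(["K00844", "K00844", "K12407", None])
tied = assign_ko_by_vote(["K00844", "K12407"])
```

Requiring a strict majority leaves tied or weakly supported genes unassigned rather than forcing a call, which is one way a vote can guard against a single divergent best hit.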
The polyproline helix type II (PPII) is a regular protein secondary structure with remarkable features. Many studies have highlighted different crucial biological roles supported by this local conformation, e.g. in the interactions between biological macromolecules. Although PPII is less frequent than regular secondary structures such as canonical alpha helices and beta strands, it corresponds to 3–10% of residues. To date, PPII is not assigned by most popular assignment tools and therefore remains insufficiently studied. The PolyprOnline database is therefore dedicated to PPII structure assignment and analysis, to facilitate the study of PPII structure and functional roles. This database is freely accessible from www.dsimb.inserm.fr/dsimb_tools/polyproline.
As a model organism database, Xenbase has been providing informatics and genomic data on Xenopus (Silurana) tropicalis and Xenopus laevis frogs for more than a decade. The Xenbase database contains curated, as well as community-contributed and automatically harvested literature, gene and genomic data. A GBrowse genome browser, a BLAST+ server and stock center support are available on the site. When this resource was first built, all software services and components in Xenbase ran on a single physical server, with inherent reliability, scalability and inter-dependence issues. Recent advances in networking and virtualization techniques allowed us to move Xenbase to a virtual environment, and more specifically to a private cloud. To do so we decoupled the different software services and components, such that each would run on a different virtual machine. In the process, we also upgraded many of the components. The resulting system is faster and more reliable. System maintenance is easier, as individual virtual machines can now be updated, backed up and changed independently. We are also experiencing more effective resource allocation and utilization.
Simple sequence repeats (SSRs) are regions in a DNA sequence that contain repeating motifs of length 1–6 nucleotides. These repeats are ubiquitous and are found in both coding and non-coding regions of the genome. A total of 534 complete chloroplast genome sequences (as of 18 September 2014) of Viridiplantae are available at the NCBI organelle genome resource. This provides an opportunity to mine these genomes for SSRs and store them in the form of a database. In an attempt to properly manage and retrieve chloroplastic SSRs, we designed ChloroSSRdb, a relational database developed using SQL Server 2008 and accessed through ASP.NET. It provides information on all three types of SSRs (perfect, imperfect and compound). At present, ChloroSSRdb contains 124 430 mined SSRs, with the majority lying in non-coding regions. Of these, PCR primers were designed for 118 249 SSRs. Tetranucleotide repeats (47 079) were found to be the most frequent repeat type, whereas hexanucleotide repeats (6414) were the least abundant. Additionally, for each species, statistical analyses were performed to calculate the relative frequency, correlation coefficient and chi-square statistics of perfect and imperfect SSRs. In accordance with the growing interest in SSR studies, ChloroSSRdb will prove to be a useful resource for developing genetic markers, phylogenetic analysis, genetic mapping, etc. Moreover, it will serve as a ready reference for mined SSRs in available chloroplast genomes of green plants.
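As an illustration of what mining perfect SSRs involves, the following Python sketch scans a sequence for tandem repeats of motifs 1–6 nucleotides long using regular-expression backreferences. The abstract does not name the mining tool or its thresholds, so the minimum repeat counts in `MIN_REPEATS` and the function name are assumptions; imperfect and compound SSRs are not handled.

```python
import re

# Hypothetical minimum number of tandem copies per motif length (1-6 nt).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 4, 5: 4, 6: 4}

def find_perfect_ssrs(sequence):
    """Return sorted (start, motif, copies) tuples for perfect SSRs."""
    sequence = sequence.upper()
    hits = []
    for motif_len, min_rep in MIN_REPEATS.items():
        # One captured motif followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_rep - 1))
        for m in pattern.finditer(sequence):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ATAT").
            if any(motif == motif[:k] * (motif_len // k)
                   for k in range(1, motif_len) if motif_len % k == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // motif_len))
    return sorted(hits)

seq = "GGC" + "AT" * 7 + "CCG" + "A" * 12 + "TTG"
ssrs = find_perfect_ssrs(seq)
```

On the toy sequence this reports one dinucleotide repeat and one mononucleotide repeat; start positions are 0-based.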
The publicly available online database MelGene provides a comprehensive, regularly updated collection of data from genetic association studies in cutaneous melanoma (CM), including random-effects meta-analysis results of all eligible polymorphisms. The updated database version includes data from 192 publications with information on 1114 significantly associated polymorphisms across 280 genes, along with new front-end and back-end capabilities. Various types of relationships between data are calculated and visualized as networks. We constructed 13 different networks containing the polymorphisms and the genes included in MelGene. We explored the derived network representations under the following questions: (i) Are there nodes that deserve consideration regarding their network connectivity characteristics? (ii) What is the relation of either the genome-wide or nominally significant CM polymorphisms/genes with the ones highlighted by the network representation? We show that our network approach using the MelGene data reveals connections between statistically significant genes/polymorphisms and other genes/polymorphisms acting as ‘hubs’ in the reconstructed networks. To the best of our knowledge, this is the first database containing data from a comprehensive field synopsis and systematic meta-analyses of genetic polymorphisms in CM that provides user-friendly tools for in-depth molecular network visualization and exploration. The proposed network connections highlight potentially new loci requiring further investigation of their relation to melanoma risk.
Database URL: http://www.melgene.org.
Viral microRNAs (miRNAs) regulate gene expression of viral and/or host genes to benefit the virus. Hence, miRNAs play a key role in host–virus interactions and the pathogenesis of viral diseases. Lately, miRNAs have also shown potential as important targets for the development of novel antiviral therapeutics. Although several repositories of miRNAs and their targets are available for human and other organisms, a dedicated resource on viral miRNAs and their targets is lacking. Therefore, we have developed a comprehensive viral miRNA resource harboring information on 9133 entries in three subdatabases. This includes 1308 experimentally validated miRNA sequences with their isomiRs encoded by 44 viruses in the viral miRNA subdatabase ‘VIRmiRNA’ and 7283 of their target genes in ‘VIRmiRtar’. Additionally, there is information on 542 antiviral miRNAs encoded by the host against 24 viruses in the antiviral miRNA subdatabase ‘AVIRmir’. The web interface was developed using the Linux-Apache-MySQL-PHP (LAMP) software bundle. User-friendly browse, search, advanced search and useful analysis tools are also provided on the web interface. VIRmiRNA is the first specialized resource of experimentally proven virus-encoded miRNAs and their associated targets. This database would enhance the understanding of viral/host gene regulation and may also prove beneficial in the development of antiviral therapeutics.
Database URL: http://crdd.osdd.net/servers/virmirna
With the recent progress in complete genome sequencing, mining the increasing amount of genomic information available should in theory provide the means to discover new classes of peptides. However, annotation pipelines often do not consider small reading frames likely to be expressed. BactPepDB, available online at http://bactpepdb.rpbs.univ-paris-diderot.fr, is a database that aims at providing an exhaustive re-annotation of all complete prokaryotic genomes (chromosomal and plasmid DNA) available in RefSeq for coding sequences ranging between 10 and 80 amino acids. The identified peptides are classified as (i) previously identified in RefSeq, (ii) entity-overlapping (intragenic) or intergenic, and (iii) potential pseudogenes, i.e. intergenic sequences corresponding to a portion of a previously annotated larger gene. Additional information relates to homologs within the order, predicted signal sequence, transmembrane segments, disulfide bonds, secondary structure, and the existence of a related 3D structure in the Protein Data Bank. As a result, BactPepDB provides insights about candidate peptides, and provides information about their conservation, together with some of their expected biological/structural features. The BactPepDB interface allows users to search for candidate peptides in the database, or to search for peptides similar to a query, according to the multiple properties predicted or related to genomic localization.
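To make the 10–80 amino acid window concrete, the following Python sketch scans both strands of a DNA sequence for short open reading frames. It is not the BactPepDB pipeline: restricting start codons to ATG, using the three standard stop codons, ignoring nested starts and reporting coordinates on the scanned strand are simplifying assumptions, and `small_orfs` is a hypothetical name.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def small_orfs(dna, min_aa=10, max_aa=80):
    """Return (strand, start, end, n_aa) for ATG-to-stop ORFs of min_aa..max_aa residues.

    Coordinates are 0-based, end-exclusive and relative to the scanned strand;
    n_aa counts the initiator methionine but not the stop codon.
    """
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", dna.translate(COMPLEMENT)[::-1])):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the frame
                elif start is not None and codon in STOPS:
                    n_aa = (i - start) // 3
                    if min_aa <= n_aa <= max_aa:
                        results.append((strand, start, i + 3, n_aa))
                    start = None                   # look for the next ORF
    return results

short_hit = small_orfs("CC" + "ATG" + "GCT" * 11 + "TAA" + "GG")
too_long = small_orfs("ATG" + "GCT" * 90 + "TAA")
```

The first toy sequence contains a single 12-residue ORF on the forward strand; the second encodes 91 residues and is therefore outside the window.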
Microorganisms are known to counteract salt stress through salt influx or by the accumulation of osmoprotectants (also called compatible solutes). Understanding the pathways that synthesize and/or break down these osmoprotectants is of interest to studies of crop halotolerance and to biotechnology applications that use microbes as cell factories for the production of biomass or commercial chemicals. To facilitate the exploration of osmoprotectants, we have developed the first online resource, ‘Dragon Explorer of Osmoprotection associated Pathways’ (DEOP), which gathers and presents curated information about osmoprotectants, complemented by information about reactions and pathways that use or affect them. A combined total of 141 compounds were confirmed as osmoprotectants, which were matched to 1883 reactions and 834 pathways. DEOP can also be used to map genes or microbial genomes to potential osmoprotection-associated pathways, and thus link genes and genomes to other associated osmoprotection information. Moreover, DEOP provides a text-mining utility to search deeper into the scientific literature for supporting evidence or for new associations of osmoprotectants to pathways, reactions, enzymes, genes or organisms. Two case studies are provided to demonstrate the usefulness of DEOP. The system can be accessed at.
MicroRNAs (miRNAs) are a class of small (∼22 nucleotides) non-coding RNAs that post-transcriptionally regulate gene expression by interacting with target mRNAs. A majority of miRNAs is located within intronic or exonic regions of protein-coding genes (host genes), and increasing evidence suggests a functional relationship between these miRNAs and their host genes. Here, we introduce miRIAD, a web-service to facilitate the analysis of genomic and structural features of intragenic miRNAs and their host genes for five species (human, rhesus monkey, mouse, chicken and opossum). miRIAD contains the genomic classification of all miRNAs (inter- and intragenic), as well as classification of all protein-coding genes into host or non-host genes (depending on whether they contain an intragenic miRNA or not). We collected and processed public data from several sources to provide a clear visualization of relevant knowledge related to intragenic miRNAs, such as host gene function, genomic context, names of and references to intragenic miRNAs, miRNA binding sites, clusters of intragenic miRNAs, miRNA and host gene expression across different tissues and expression correlation for intragenic miRNAs and their host genes. Protein–protein interaction data are also presented for functional network analysis of host genes. In summary, miRIAD was designed to help the research community to explore, in a user-friendly environment, intragenic miRNAs, their host genes and functional annotations with minimal effort, facilitating hypothesis generation and in-silico validations.
MetaProx is the database of metagenomic proximons: a searchable repository of proximon objects conceived with two specific goals. The first objective is to accelerate research involving metagenomic functional interactions by providing a database of metagenomic operon candidates. Proximons represent a special subset of directons (series of contiguous co-directional genes) where each member gene is in close proximity to its neighbours with respect to intergenic distance. As a result, proximons represent significant operon candidates where some subset of proximons is the set of true metagenomic operons. Proximons are well suited for the inference of metagenomic functional networks because predicted functional linkages do not rely on homology-dependent information that is frequently unavailable in metagenomic scenarios. The second objective is to explore representations for semistructured biological data that can offer an alternative to the traditional relational database approach. In particular, we use a serialized object implementation and advocate a Data as Data policy where the same serialized objects can be used at all levels (database, search tool and saved user file) without conversion or the use of human-readable markups. MetaProx currently includes 4 210 818 proximons consisting of 8 926 993 total member genes.
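The proximon definition above (a run of contiguous co-directional genes in which each gene lies close to its neighbour) can be sketched in Python as follows. The abstract does not give the intergenic distance cutoff, so `max_gap=60` is a placeholder assumption, as are the tuple layout and the function name; genes are assumed to be pre-sorted by start coordinate on a single contig.

```python
def find_proximons(genes, max_gap=60):
    """Group genes into proximons.

    genes: list of (name, start, end, strand), sorted by start on one contig.
    A group is extended while consecutive genes share a strand and the
    intergenic distance is at most max_gap (hypothetical cutoff, in bp).
    Only groups of two or more genes are returned.
    """
    proximons, current = [], []
    for gene in genes:
        if current:
            prev = current[-1]
            same_strand = gene[3] == prev[3]
            gap = gene[1] - prev[2]            # negative if the genes overlap
            if not (same_strand and gap <= max_gap):
                if len(current) >= 2:
                    proximons.append(current)
                current = []
        current.append(gene)
    if len(current) >= 2:
        proximons.append(current)
    return [[g[0] for g in p] for p in proximons]

genes = [
    ("a", 100, 400, "+"), ("b", 420, 900, "+"),   # close and co-directional
    ("c", 1500, 2000, "+"),                       # too far from b
    ("d", 2010, 2500, "-"), ("e", 2520, 3000, "-"),
]
groups = find_proximons(genes)
```

In the toy input, gene `c` is co-directional with `b` but too distant, and close to `d` but on the opposite strand, so it joins neither group.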