The current 18th Database Issue of Nucleic Acids Research features descriptions of 96 new and 83 updated online databases covering various areas of molecular biology. It includes two editorials, one that discusses COMBREX, an exciting new project aimed at figuring out the functions of the ‘conserved hypothetical’ proteins, and one concerning BioDBcore, a proposed description of the ‘minimal information about a biological database’. Papers from the members of the International Nucleotide Sequence Database Collaboration (INSDC) describe each of the participating databases, DDBJ, ENA and GenBank, principles of data exchange within the collaboration, and the recently established Sequence Read Archive. A testament to the longevity of databases, this issue includes updates on the RNA modification database, Definition of Secondary Structure of Proteins (DSSP) and Homology-derived Secondary Structure of Proteins (HSSP) databases, which have not been featured here in >12 years. There is also a block of papers describing recent progress in protein structure databases, such as Protein DataBank (PDB), PDB in Europe (PDBe), CATH, SUPERFAMILY and others, as well as databases on protein structure modeling, protein–protein interactions and the organization of inter-protein contact sites. Other highlights include updates of the popular gene expression databases, GEO and ArrayExpress, several cancer gene databases and a detailed description of the UK PubMed Central project. The Nucleic Acids Research online Database Collection, available at: http://www.oxfordjournals.org/nar/database/a/, now lists 1330 carefully selected molecular biology databases. The full content of the Database Issue is freely available online at the Nucleic Acids Research web site (http://nar.oxfordjournals.org/).
This current, 18th annual Database Issue of Nucleic Acids Research (NAR) features descriptions of 96 new (Table 1) online databases covering a variety of molecular biology data and 83 data resources that have previously been published in NAR or other journals. The accompanying NAR online Molecular Biology Database Collection (http://www.oxfordjournals.org/nar/database/a/) now includes 1330 data sources.
In addition to this editorial comment, the current issue includes two more editorials. The first of them (1) is a collective statement by a large consortium of scientists, including the authors of this article, who are concerned with the proliferation of new databases that are rarely able to talk to each other. As a result, instead of contributing to building a single body of knowledge, these databases risk functioning increasingly as isolated islands in a sea of disparate biological data. This article proposes creating a community-defined, uniform, generic description of the core attributes of biological databases, BioDBcore, a kind of ‘minimal information about a biological database’, and provides a preliminary checklist to describe basic specifications of each new database (1). We would ask the authors of future submissions to the NAR Database Issue to fill out that checklist (or its latest version posted at http://biocurator.org/biodbcore.shtml) and provide it as Supplementary Data to their manuscripts. In addition, we will explore ways in which the NAR online Molecular Biology Database Collection might ultimately support the standard.
Another editorial (2) describes COMBREX, an exciting project that is aimed at figuring out the functions of the ‘conserved hypothetical’ and poorly or incorrectly annotated proteins, identified through genome sequencing [see also refs (3,4)]. This project is designed to serve as a clearinghouse, collecting functional predictions from specialists in bioinformatics and functional genomics and then sending these predictions for testing by experimentalists. COMBREX offers an entirely new arrangement for research funding, whereby relatively small amounts of money are offered on a competitive basis to the experimental groups that are willing to test those predictions, employing the techniques and equipment that already exist in their laboratories. This arrangement dramatically decreases the costs of functional analysis of the uncharacterized proteins and gives hope that many of them could be assigned a biochemical—and/or general biological—function.
A prime example of databases that do talk to each other is the International Nucleotide Sequence Database Collaboration (INSDC), which consists of three participating databases: the DNA Data Bank of Japan (DDBJ), the European Nucleotide Archive (ENA) at the European Bioinformatics Institute (EMBL-EBI), and GenBank at the US National Center for Biotechnology Information (NCBI). This issue features separate papers from each of these three databases (5–7), as well as a joint paper describing the principles of data maintenance and exchange within the collaboration (8). A separate paper describes the functioning of the Sequence Read Archive (SRA), recently established by the three INSDC partners (9).
Another area where database collaboration has proved extremely successful is the storage and dissemination of published research. This issue features a detailed description of UK PubMed Central, an extremely important project that, in collaboration with the PubMed Central projects in the USA and Canada, provides a permanent online record of the research sponsored by British funding agencies, such as MRC, BBSRC, Wellcome Trust and the National Institute for Health Research (10).
In addition to the archival databases such as those of the INSDC, this issue includes curated databases of DNA sequence motifs, such as AREsite, a collection of AU-rich elements in vertebrate mRNA UTR sequences, and non-B DB, a repository of DNA sequences that form cruciform, triplex, slipped (hairpin) structures, tetraplex (G-quadruplex), left-handed Z-DNA and other DNA structures (11,12).
The RNA database papers featured in this issue include updates on Rfam and miRBase, two gold-standard databases of RNA sequences (13,14), a description of lncRNAdb, a new resource on experimentally characterized long non-coding RNA (15), as well as descriptions of several databases of predicted and/or experimentally validated microRNA targets (16–21). This issue also includes an update on the status of the RNA Modification Database, which was regularly featured in the NAR Database Issue in the 1990s (22–25) but not in the past 12 years. The current version lists 107 types of posttranscriptional modifications of nucleosides in RNA, primarily in various tRNAs (26). Two new databases present data on the RNA-binding proteins [RBPDB, http://rbpdb.ccbr.utoronto.ca/ (27)] and the specific structures of their RNA-binding sites [PRIDB, http://bindr.gdcb.iastate.edu/PRIDB (28)].
This issue also features a block of 15 papers describing recent progress in protein structure databases, such as Protein DataBank (PDB), PDB in Europe (PDBe), CATH, SUPERFAMILY (29–32), as well as a selection of databases on protein building blocks, protein–protein interactions, protein structure modeling, and the organization of inter-protein contact sites (33–38). Among new databases, it is worth mentioning EMDataBank.org, a database of 3D cryo-electron microscopy maps (39), a database of protein circular dichroism data (40) and three databases that are dedicated to the conformational dynamics of proteins (41–43). In addition, a paper from Gert Vriend’s group (44) presents their PDB-facilities web site with several useful PDB-derived databases for the analysis of protein structures. These include the famous Definition of Secondary Structure of Proteins (DSSP) and Homology-derived Secondary Structure of Proteins (HSSP) databases, which were last featured in the NAR Database Issue >12 years ago (45,46).
Progress in the analysis of the human genome prompted the creation of databases that list genes implicated in a variety of human diseases, including coronary artery disease (47), type I diabetes (48) and cancer. Cancer databases in this issue are represented by an update paper on the Catalogue of Somatic Mutations In Cancer [COSMIC, http://www.sanger.ac.uk/cosmic (49)], a description of the University of California Santa Cruz (UCSC) Cancer Genomics Browser [http://genome-cancer.cse.ucsc.edu (50)], a new resource tightly integrated with the popular UCSC Genome Browser and the ENCODE database (51,52), and three more databases, dedicated, respectively, to cervical cancer, prostate cancer and potential cancer drug targets (53–55).
There are many other excellent databases that could not be mentioned here because of space restrictions. In fact, we expect every single database featured in this issue to be useful to a wide audience of students and researchers in various areas of molecular biology.
As explained in last year’s editorial (56), moving to an online-only format for the NAR Database Issue has allowed us to accommodate longer papers and to offer the authors of the most popular data resources an opportunity to describe them in more detail, providing deeper insight into their organization and goals and putting their recent updates into a broader context. This year, such extended papers were invited for a much larger number of databases, resulting in comprehensive descriptions of the PDB, PDBe, EMDataBank, MODBASE, GPCRDB, RegulonDB, STRING and other well-known databases (29,30,35,39,57–59). In some cases, longer papers were also accepted for first-time descriptions of several new databases (36,60,61). We intend to continue accepting long(er) database papers in the future.
Funding. Intramural Research Program of the US National Institutes of Health (to M.Y.G.); European Molecular Biology Laboratory (to G.R.C.). Funding for open access charge: Waived by Oxford University Press.
Conflict of interest statement. The authors' opinions do not necessarily reflect the views of their respective institutions.
The authors thank Sir Richard Roberts, Dr Alex Bateman, Dr David Landsman and Dr Francis Ouellette for helpful comments; Patricia Anderson, Dr Martine Bernardes-Silva and Gail Welsh for excellent editorial assistance; and the Oxford University Press team led by Claire Bird and Jennifer Boyd, and Sheila Plaister at EMBL-EBI, for their help in compiling this issue and the online Molecular Biology Database Collection.