The 20th annual Database Issue of Nucleic Acids Research includes 176 articles, half of which describe new online molecular biology databases, while the other half provide updates on databases previously featured in NAR and other journals. This year’s highlights include two databases of DNA repeat elements; several databases of transcription factors and transcription factor-binding sites; databases on various aspects of protein structure and protein–protein interactions; databases for metagenomic and rRNA sequence analysis; and four databases specifically dedicated to Escherichia coli. The increased emphasis on using the genome data to improve human health is reflected in the development of the databases of genomic structural variation (NCBI’s dbVar and EBI’s DGVa), the NIH Genetic Testing Registry and several other databases centered on the genetic basis of human disease, potential drugs, their targets and the mechanisms of protein–ligand binding. Two new databases present genomic and RNAseq data for monkeys, providing a wealth of data on our closest relatives for comparative genomics purposes. The NAR online Molecular Biology Database Collection, available at http://www.oxfordjournals.org/nar/database/a/, has been updated and currently lists 1512 online databases. The full content of the Database Issue is freely available online on the Nucleic Acids Research website (http://nar.oxfordjournals.org/).
This 1300-page virtual volume represents the 20th annual Database Issue of Nucleic Acids Research (NAR). It includes descriptions of 88 new online databases, 77 update articles on databases that have been previously featured in the NAR Database Issue (Table 1) and 11 articles with updates on database resources whose descriptions have been previously published in other journals (Table 2).
At this point it might be instructive to look back at the origin and evolution of the NAR Database Issue. Its history started from two supplementary issues that were published in NAR in April of 1991 and in May of 1992 and consisted of 18 and 19 articles, respectively (see http://nar.oxfordjournals.org/content/19/supplement.toc and http://nar.oxfordjournals.org/content/20/supplement.toc). These articles offered descriptions of several nucleotide sequence databases, such as GenBank, the EMBL Data Library, compilations of small RNA, tRNA, and 5S, 16S, and 23S rRNA sequences (including the Ribosomal Database Project), DNA sequences from Escherichia coli and a human genome database (GDB). Those first issues also included descriptions of several protein databases, such as Swiss-Prot, PIR, Prosite, Restriction Enzyme Database (REBASE), Transcription Factors Database (TFD) and Histone database. There was also a medical genetics database, Haemophilia B, listing point mutations and indels in the coagulation factor IX (F9) gene that caused this blood clotting disorder, which has affected the royal families of several European countries.
The next issue, published on July 1, 1993, was the first one formally labelled as the Database Issue. It consisted of 24 articles, which added databases of RNA and protein structure and the Enzyme database. It was followed by NAR Database Issues in September 1994, then in January 1996, and each January after that.
In the past 20 years, the Database Issue has gradually grown in size before stabilizing at the level of ~180 articles. However, despite the almost 10-fold increase in the number of published articles, the key topics of the current issue remain largely the same as 20 years ago. This issue again features articles from GenBank and the European Nucleotide Archive (formerly the EMBL Data Library), which, together with the DNA Data Bank of Japan, form the International Nucleotide Sequence Database collaboration, INSDC (1–4). Just as 20 years ago, there are updates from Swiss-Prot and PIR (now combined into UniProt) and Prosite (5,6).
Continuing the tradition of featuring well-curated databases of RNA sequences, this issue includes an update on SILVA, a widely used comprehensive database of bacterial, archaeal and eukaryotic 16S/18S and 23S/28S rRNA sequences (7), and a description of Protist Ribosomal Reference database (PR2), a new database that catalogs small subunit rRNA sequences from unicellular eukaryotes (8). An update on the Ribosomal Database Project, a constant feature of the NAR Database Issue since 1991 (9), was last published in 2009 (10). Other RNA databases in this issue include an update on Rfam (11), the universally acclaimed database of RNA families, as well as several databases on long non-coding RNA, microRNA and their targets. An update of Modomics, a database on RNA modification, is now supplemented by RNApathwaysDB, a database of RNA maturation and decay pathways developed by the same group (12,13).
As before, this issue presents several transcription factor (TF) databases. Two of them cover TFs themselves: TFClass offers a classification of human TFs, while NPIDB presents structural information on DNA–protein and RNA–protein complexes (14,15). Several other databases collect information on the TF-binding sites. These include Factorbook, a database of TF-binding data from the ENCODE project; HOCOMOCO, a collection of human TF-binding sites; CTCFBSDB, a database of CCCTC-binding factor (CTCF)-binding sites; RegulonDB, a database of transcriptional regulation in E. coli; and SwissRegulon, a database of regulatory sites in human, mouse and yeast genomes and in model bacteria (16–20).
The structural databases featured in this issue all show a trend towards better integration and cross-referencing tools. This applies both to the updates of well-known databases, such as the RCSB Protein Data Bank (PDB), CATH and PDBTM, and to such databases as EBI’s SIFTS, a joint effort of UniProt and PDBe to provide a residue-level mapping of their entries and supplement it with annotation from other public databases; Genome3D, a recent collaborative project aiming to provide structural annotation from CATH and SCOP to the genomic sequences; and dcGO, which develops domain-centric ontologies to link protein domains with functions, phenotypes and diseases (21–23).
Likewise, with E. coli remaining the workhorse of molecular biology, this issue includes update articles on the EcoGene (the first one since 2000), EcoCyc and RegulonDB databases, as well as a description of the newly developed E. coli Metabolome Database (20,24–26).
As discussed earlier (27), the original GDB did not survive the influx of the new data and multiple changes of ownership. Nevertheless, we now have a wide variety of databases that cover different aspects of human genome and genomes of model organisms. This issue features annual updates from Ensembl and ENCODE projects and from the UCSC Genome Browser and the Japanese H-InvDB database (28–31). The model organism databases are represented by the updates to FlyBase, Mouse Genome database, Xenbase and ZFIN (32–35).
Two new databases, RhesusBase and NHPRTR, present extensive genome and RNAseq data for non-human primates, including great apes, old world monkeys, new world monkeys and prosimians (36,37). These data could go a long way towards establishing monkeys as model organisms for comparative genomics studies. One more database is dedicated to a more distant relative of humans, the urochordate Oikopleura dioica (38).
A potentially important development is the construction of two new databases of repetitive DNA elements, Dfam and SINEBase (39,40). Along with the industry-standard Repbase Update (41,42) and the monthly RepBase Reports (http://www.girinst.org/repbase/reports/), these databases promise to contribute to a better understanding of eukaryotic repeat elements.
With the abundance of databases providing valuable tools for genome analysis, there is a clear trend towards bringing genomics ‘from the bench to the bedside’, i.e. using genomic data for a better understanding and, hopefully, better treatment of human disease. A number of projects, including ClinSeq (http://www.genome.gov/20519355), DDD (http://www.ddduk.org/) and UK10K (http://www.uk10k.org/) are working towards these goals, and several databases featured in this issue represent important steps in this direction. Last year’s issue introduced the GWASdb database of human genetic variants identified by genome-wide association studies (43). GWAS Central, established in 2007 as HGVbaseG2P (44), has been revamped and now includes data from over 1000 studies. Now, a joint article from NCBI and EBI describes their databases of genomic structural variation, dbVar and DGVa (45). These databases cover diverse variation data including inversions, insertions and translocations that are >50 bp in length. NCBI is also developing ClinVar (http://www.ncbi.nlm.nih.gov/clinvar/), a database of relationships between human gene variation and the observed health status (46). The task of streamlining the genetic tests that provide such information is taken up by the recently created NIH Genetic Testing Registry, a database of genetic tests and laboratories that perform them, with detailed information about what exactly is measured in each test and its analytic and clinical validity (47).
The impact of the genomic data on developing targeted approaches for fighting disease is particularly evident in the case of cancer. This issue features updates from three great databases, the UCSC Cancer Genome Browser (48), the Atlas of Genetics and Cytogenetics in Oncology and Haematology (49) and the TP53 website [(50), the first update of the database on tumor protein p53 mutations since 1997]. In addition, there are two new databases dedicated to studying cancer at the level of specific cell lines. The CellLineNavigator database provides gene expression profiles of different cancer cell lines in different pathological states (51), whereas the Genomics of Drug Sensitivity in Cancer (GDSC) collects the results of high-throughput studies examining the sensitivity to anti-cancer drugs in various cell lines (52).
During the past 20 years, all databases featured in the NAR Database Issues were added to the NAR online Molecular Biology Database Collection, available at http://www.oxfordjournals.org/nar/database/a/. With an annual attrition rate of <5%, this Collection has been steadily growing and, in 2012, exceeded 1400 database entries (53). It was clear that the list was due for a serious clean-up, and one of the authors (XMFS) devised and set in motion a semi-automated procedure to identify obsolete and non-responsive websites. Remarkably, >90% of the databases listed in last year’s release of the online Collection were found to be functional. Corresponding authors of close to a hundred non-responsive resources were contacted, and 44 websites (~3.2% of the total) were approved for deletion. About 100 entries in the Collection were updated with corrected URLs, summaries highlighting recent developments, or other changes in the deposited data.
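The details of the semi-automated procedure are not given here. Purely as an illustration, a first-pass check of this kind could be sketched in Python as follows. The function names (`fetch_status`, `classify`, `triage`), the three categories and the 10-second timeout are assumptions made for this sketch and are not taken from the procedure actually used for the Collection. Entries flagged as non-responsive would still require manual follow-up, such as contacting the corresponding authors.

```python
import http.client
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, List, Optional


def fetch_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if no response was obtained."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error code (e.g. 404 or 500).
        return err.code
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # DNS failure, refused connection, timeout or malformed URL.
        return None


def classify(status: Optional[int]) -> str:
    """Map a status code to one of three hypothetical triage categories."""
    if status is None:
        return "non-responsive"
    if 200 <= status < 400:
        return "functional"
    return "needs review"


def triage(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[int]] = fetch_status,
) -> Dict[str, List[str]]:
    """Group URLs by category; fetch can be replaced by a stub for offline use."""
    report: Dict[str, List[str]] = {
        "functional": [],
        "needs review": [],
        "non-responsive": [],
    }
    for url in urls:
        report[classify(fetch(url))].append(url)
    return report


if __name__ == "__main__":
    # Offline demonstration with made-up entries and a stubbed fetch function.
    fake_statuses = {"db-one": 200, "db-two": 404, "db-three": None}
    print(triage(fake_statuses, fetch=fake_statuses.get))
```

Passing the fetch function as a parameter keeps the classification logic separate from network access, so the grouping step can be checked without contacting any live website.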
Although the deletion of 44 databases was well within the average drop-off rate and was hardly surprising, further analysis revealed that most of these resources were not lost. Instead, in the normal course of database evolution, they have been integrated into larger projects. For example, a couple of segmental duplication databases were merged into the Database of Genomic Variants (54), NAR Database Collection entry no. 655, while NCBI’s Cancer Chromosomes database has been merged into dbVar [described in detail in this issue, (45)]. Further, improved annotation of the human genome made redundant a number of resources that covered specific areas of the genome (e.g. the IXDB with its physical maps of human chromosome X).
In one instance, the ExDom database of exon–intron structures of genes in seven eukaryotic genomes (55) had to be removed from the Collection, as it has taken the commercial route and no longer provides a free version, although the author’s company has offered a discounted version for academic users. Unfortunately, tightening budgets (56) might force other databases to follow the same path.
In total, the NAR online Molecular Biology Database Collection now includes 1512 databases sorted into 14 categories and 41 subcategories. Authors who wish to have databases that were published elsewhere included in the Collection are welcome to contact XMFS directly.
Intramural Research Program of the U.S. National Institutes of Health at the National Library of Medicine [to M.Y.G.]. Funding for open access charge: Waived by Oxford University Press.
Conflict of interest statement. The authors' opinions do not necessarily reflect the views of their respective institutions.
The authors thank Drs Javier Herrero and Michael Schuster for helpful comments and the Oxford University Press team led by Jennifer Boyd and Andrew Malvern for their help in compiling this issue.