One of the most significant scientific events of 2001 was the publication of the initial sequence and analysis of the human genome, resulting from both public (1) and private sector (2) efforts. With these publications, we have entered a new era for modern biology, one in which the majority of biological and biomedical research will use sequence data as its basic underpinning. Such a rich source of information will prove invaluable to basic researchers, whose findings will, in time, lead to improved strategies for the diagnosis, treatment and prevention of diseases having a genetic basis. In short, the stage has been set for genetic medicine to play a prominent role in the delivery of healthcare in the future (3).
A number of significant insights have already been gained into the secrets hidden within the 3 billion bases that comprise the human genome (1). There is marked variation in the distribution of features such as genes, transposable elements, GC content, CpG islands and recombination rate; this uneven distribution may provide important clues about the functions of these features and how they may be involved in regulation. There is a preferential retention of Alu elements in GC-rich regions, correlating them (in a loose sense) with actively transcribed genes. These elements may turn out not to be just ‘junk DNA’, instead providing a tangible benefit to their human hosts. In general, repetitive elements may not have a direct function per se, but may influence chromosome structure. Probably the most telling finding is that the total number of genes in the human genome is only on the order of 30 000 to 35 000. Previously, numbers in the 80 000 range (and as high as 140 000) had been put forward. While the new estimate gives the human about twice the number of genes seen in Caenorhabditis elegans or in Drosophila, the genes themselves have a more complex structure. This sharp downward revision in the number of genes immediately brings into question the one gene–one protein hypothesis: we are now finding more and more examples of alternative splicing generating a larger number of protein products (consistent with a more complex gene structure), as well as cases where identical proteins can be used for different functions, depending on their compartmentalization (4).
While the near-completion of human genome sequencing marks a significant milestone, many other sequence-based efforts currently under way will have just as much impact on the scientific and medical community. The most eagerly anticipated model organism map is that of the mouse. The most recent physical map released on the Ensembl web site (http://mouse.ensembl.org, September 2001) provides an estimated 95% coverage of the mouse genome, with 15 694 genes confirmed over 361 Mb. With respect to human health, single nucleotide polymorphisms (SNPs) continue to be identified at a breakneck pace. Over 1 million SNPs have already been identified, and a random sampling chosen for validation shows that 95% of these are indeed both polymorphic and unique (http://snp.cshl.org/data/). SNP alleles can be used as genetic markers, and often the SNP itself is the variant that causes or contributes to the risk of developing a particular genetic disorder. To increase the power of SNPs as markers for human disease, efforts are currently under way to develop a haplotype map, in which ‘blocks’ of SNPs (rather than individual SNPs) could be used to find chromosomal regions associated with disease.
The sequence data generated by these and other systematic sequencing projects can be browsed and downloaded from a variety of Web sites, with the major portals being located at NCBI (http://www.ncbi.nlm.nih.gov), Ensembl (http://www.ensembl.org) and UCSC (http://genome.cse.ucsc.edu). The problem that many investigators encounter, however, is that these larger databases often do not contain specialized information that would be of interest to specific groups within the scientific community. Many specialized databases have emerged to fill this void. They often provide not just sequence-based information, but also data such as phenotypes, experimental conditions, strain crosses and map features, which might not fit neatly onto a large physical map of a genome. Most importantly, data in these smaller databases tend to be curated by experts in a particular specialty and are often experimentally verified, meaning that they represent the best state of knowledge in that particular area. The savvy user will, therefore, make use of both types of databases in experimental planning and design. For the last several years, this journal has devoted its first issue to documenting the availability and features of these specialized databases in order to better serve its readership and to promote the use of these resources in the design and analysis of experiments. These reviewed databases are collectively listed in the Molecular Biology Database Collection.
The databases included in the current version of the Collection are shown in Table . This year, the total number of databases listed is 335, up from 281 the year before. Several new databases have been added to the Collection, while others that are no longer actively curated or no longer available have been removed. These databases all distinguish themselves by their approach to presenting the underlying data—for example, by adding new value to the underlying data by virtue of curation, by providing new types of data connections or by implementing other innovative approaches that facilitate biological discovery. The individual entries are classified by type, but the reader should recognize that the distinctions between these classes are often arbitrary, and that many of these databases provide more than one type of information to the user.
Molecular Biology Database Collection
In addition to the list presented in this paper, an electronic version of the Database Issue and Collection can be accessed online and is freely available to everyone, regardless of subscription status, at http://nar.oupjournals.org. While the list contains the databases described in the papers comprising the current issue, there are simply not enough pages in this journal to accommodate full-length, printed descriptions of all 335 databases featured here. To address this, the online version of the Collection now includes short summaries of many of the databases, provided directly by the investigators responsible for the individual databases. We have also asked contributors to point out new features of their databases in the Recent Developments section of their entry. It is hoped that this approach will give the reader an additional source of information that will make it easier to find and select the sources of data of most value in addressing a specific biological problem. Contributors will be encouraged to keep their entries up to date.
Suggestions for the inclusion of additional database resources in this collection are encouraged and may be directed to the author (email@example.com).