In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides data analysis and retrieval resources that operate on the data in GenBank and a variety of other biological data made available through NCBI’s Web site. NCBI data retrieval resources include Entrez, PubMed, LocusLink and the Taxonomy Browser. Data analysis resources include BLAST, Electronic PCR, OrfFinder, RefSeq, UniGene, HomoloGene, Database of Single Nucleotide Polymorphisms (dbSNP), Human Genome Sequencing, Human MapViewer, GeneMap’99, Human–Mouse Homology Map, Cancer Chromosome Aberration Project (CCAP), Entrez Genomes, Clusters of Orthologous Groups (COGs) database, Retroviral Genotyping Tools, Cancer Genome Anatomy Project (CGAP), SAGEmap, Gene Expression Omnibus (GEO), Online Mendelian Inheritance in Man (OMIM), the Molecular Modeling Database (MMDB) and the Conserved Domain Database (CDD). Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of the resources can be accessed through the NCBI home page at: http://www.ncbi.nlm.nih.gov.
The National Center for Biotechnology Information (NCBI) at the National Institutes of Health was created in 1988 to develop information systems for molecular biology. In addition to maintaining the GenBank® (1) nucleic acid sequence database, to which data is submitted directly by the scientific community, NCBI provides data retrieval systems and computational resources for the analysis of GenBank data and the variety of other biological data made available through NCBI.
The data accessible from NCBI’s home page (http://www.ncbi.nlm.nih.gov) runs the gamut from short sequences representative of parts of genes to complete genomes, protein structures and clinical descriptions of genetic disorders. NCBI offers an array of computational resources to aid in the analysis of each type of data. For this overview, the NCBI suite of database resources is grouped into seven categories: database retrieval systems, sequence similarity search programs, resources for analysis of gene-level sequences, resources for chromosomal sequences, resources for genome-scale analysis, resources for the analysis of gene expression and phenotypes, and resources for protein structure and modeling. Table 1 provides an at-a-glance summary of these resources.
Entrez (2) is an integrated database retrieval system that accesses DNA and protein sequences, genome maps, population sets, protein structures from MMDB (3) and the biomedical literature via PubMed and Online Mendelian Inheritance in Man (OMIM), with embedded links to the NCBI taxonomy. The sequences in Entrez, especially protein sequences, are obtained from a variety of database sources [including GenBank protein translations, Protein Identification Resource (4), SWISS-PROT (5), Protein Research Foundation, Protein Data Bank (6) and RefSeq (7)], and therefore include more sequence data than GenBank alone. PubMed includes primarily the 10.7 million references and abstracts in MEDLINE®, with links to the full-text of more than 1100 journals available on the Web.
Entrez provides text searching of sequence or bibliographic records using simple Boolean queries, plus extensive links to related information. Some links are simple cross-references, for example, from a sequence to the abstract of the paper in which it was reported, from a protein sequence to its corresponding DNA sequence, or to alignments with other sequences. Other links are based on computed similarities among the sequences or MEDLINE abstracts. These pre-computed ‘neighbors’ allow rapid access for browsing groups of related records. A service called LinkOut expands the range of external links from individual database records to related outside services, including organism-specific genome databases.
The NCBI taxonomy database indexes over 79 000 organisms that are represented in the sequence databases with at least one nucleotide or protein sequence. The Taxonomy Browser can be used to view the taxonomic position or retrieve sequence and structural data for a particular organism or group of organisms. Searches of the NCBI taxonomy may be made on the basis of whole, partial or phonetically-spelled organism names, and direct links to organisms commonly used in biological research are also provided. The new Entrez Taxonomy system adds the ability to display custom taxonomic trees representing user-defined subsets of the full NCBI taxonomy.
The LocusLink database of official gene names and other gene identifiers, described elsewhere in this issue (7), was developed at NCBI in conjunction with several international collaborators, and offers a single query interface to curated sequences and descriptive information about genes.
The BLAST family of search programs (8,9) is provided for the most frequent type of analysis performed on GenBank, the sequence-similarity search. NCBI’s Web interface to the standard BLAST 2.1 program accepts either a sequence or accession number and performs the search using either an identity matrix for blastn (nucleotide) searches or a PAM or BLOSUM scoring matrix for protein searches. BLAST produces a set of gapped alignments, with links to the full document records, accompanied by an alignment score and a measure of statistical significance, called the Expectation Value, for judging the quality of the alignment. Web BLAST provides a graphical overview of the alignments, color-coded by alignment score, which clearly shows the extent and quality of sequence similarities, as well as the disposition of gaps in the alignments. Web BLAST can also generate a taxonomically organized output that emphasizes taxonomic patterns of sequence-similarity.
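The relationship between an alignment score and its Expectation Value can be illustrated with the standard Karlin-Altschul formula, E = K·m·n·e^(−λS). The sketch below is only an illustration of that formula: the function names are hypothetical, the λ and K values in the usage note are placeholders, and a real BLAST search applies further corrections (such as effective sequence lengths) that are omitted here.

```python
import math


def expect_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least raw_score.

    Implements E = K * m * n * exp(-lambda * S), where m is the query
    length and n the total database length. lam and k are statistical
    parameters that depend on the scoring system (assumed inputs here).
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)
```

For example, `expect_value(50, 100, 1_000_000, 0.267, 0.041)` gives the expected number of chance hits for a raw score of 50 under those placeholder parameters; raising the score lowers the value exponentially, which is why low Expectation Values indicate more significant alignments.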
The default databases searched by BLAST are the non-redundant (nr) nucleotide and protein databases constructed from the Entrez databases. Several specialized databases may also be searched, and searches may be restricted to sequences from a particular organism. Query sequences may be filtered for low complexity or human repeats. Customized BLAST pages allow queries against finished human genomic data, microbial genomes or the genomes of malaria-associated pathogens.
Specialized versions of BLAST are offered for the needs of protein similarity searching. Position Specific Iterated BLAST (PSI-BLAST) (9) initially performs a conventional BLAST search to produce alignments from which it constructs a position specific score matrix (PSSM). Subsequent BLAST iterations use this PSSM to find similarities in the database. Pattern Hit Initiated BLAST (PHI-BLAST) (10) requires both a query sequence and a pattern present within the query sequence. The pattern specifies an obligatory match between query and database sequences, about which optimal local alignments are constructed. Another variant, ‘BLAST2Sequences’ (11), compares two DNA or protein sequences and produces a dot-plot representation of the alignments it reports.
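The idea of a position specific score matrix can be sketched in a few lines. The example below is a simplified illustration, not the PSI-BLAST procedure itself: it assumes an ungapped alignment of equal-length sequences, a uniform background distribution and a fixed pseudocount, whereas PSI-BLAST uses sequence weighting and more elaborate pseudocount estimates. The names `build_pssm` and `score` are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Build one log-odds column per alignment position.

    Each column maps a residue to log2(observed frequency / background),
    with a pseudocount so that unseen residues get a finite penalty.
    """
    background = 1.0 / len(ALPHABET)  # assumption: uniform background
    pssm = []
    for col in range(len(aligned_seqs[0])):
        counts = Counter(s[col] for s in aligned_seqs if s[col] in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return pssm


def score(pssm, seq):
    """Sum the position-specific scores of seq against the matrix."""
    return sum(column[aa] for column, aa in zip(pssm, seq))
```

Residues that dominate a column receive positive scores and rarely seen residues negative ones, so a sequence resembling the alignment scores higher than an unrelated one.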
Basic BLAST 2.0 searches can also be performed by email through the address: blast@ncbi.nlm.nih.gov. Documentation can be obtained by sending the word ‘help’ to the server address.
To manage the redundancy of the EST data, NCBI has developed UniGene (12), a system for automatically partitioning GenBank sequences, including ESTs, into a non-redundant set of gene-oriented clusters. There are currently five UniGene databases: for human, mouse, rat, zebrafish and cow. UniGene starts with entries in the appropriate organismic division of GenBank, combines these with ESTs of that organism and creates clusters of sequences that share virtually identical 3′ untranslated regions (3′ UTRs). Each UniGene cluster contains sequences that represent a unique gene, and is linked to related information, such as the tissue types in which the gene is expressed, model organism protein similarities, the LocusLink report for the gene and its map location. In the human UniGene database, over 1.8 million human ESTs in GenBank have been reduced 21-fold in number to approximately 84 000 sequence clusters. In a similar fashion, the mouse, rat, zebrafish and cow ESTs have been organized as 73 000, 37 000, 10 000 and 5500 clusters, respectively. The human UniGene collection has been effectively used as a source of mapping candidates for the construction of a human gene map (13). In this case, the 3′ UTRs of genes and ESTs are converted to sequence-tagged sites (STSs) that are then placed on physical maps and integrated with pre-existing genetic maps of the genome. The UniGene collection has also been used as a source of unique sequences for the fabrication of ‘chips’ for the large-scale study of gene expression (14). UniGene databases are updated weekly with new EST sequences, and bimonthly with newly characterized sequences. UniGene clusters may be searched in several ways: by gene name, chromosomal location, cDNA library, accession number and ordinary text words. Cluster sequences may also be downloaded by FTP.
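Partitioning sequences into clusters on the basis of shared 3′ UTRs can be pictured as finding connected components in a graph of pairwise matches. The following sketch uses a union-find structure for that purpose; it assumes the pairwise links have already been computed, and it is a generic single-linkage illustration rather than a description of the actual UniGene build procedure. The function name and the sequence identifiers in the usage note are hypothetical.

```python
def cluster(seq_ids, links):
    """Group sequence IDs into clusters given pairwise links.

    links is an iterable of (id_a, id_b) pairs, each meaning the two
    sequences were judged to share a virtually identical 3' UTR.
    Returns a sorted list of sorted clusters.
    """
    parent = {s: s for s in seq_ids}

    def find(x):
        # Follow parent pointers to the root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for s in seq_ids:
        clusters.setdefault(find(s), set()).add(s)
    return sorted(sorted(members) for members in clusters.values())
```

With `cluster(["mRNA1", "EST1", "EST2", "EST3"], [("mRNA1", "EST1"), ("EST1", "EST2")])`, the first three sequences fall into one cluster and `EST3` remains a singleton.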
HomoloGene is a database of both curated and calculated orthologs and homologs for the human, mouse, rat, zebrafish and cow genes represented in UniGene and LocusLink. Curated orthologs include gene pairs from the Mouse Genome Database (MGD) at the Jackson Laboratory, the Zebrafish Information Network (ZFIN) database at the University of Oregon and from published reports. Computed orthologs and homologs, which are considered putative, are identified from BLAST nucleotide sequence comparisons between all UniGene clusters for each pair of organisms. HomoloGene also contains a set of triplet ortholog clusters in which orthologous clusters in two organisms are both orthologous to the same cluster in a third organism. For the three organisms human, mouse and rat, there are currently over 7000 of these self-consistent triplets. The HomoloGene database can be queried using UniGene ClusterIDs, LocusLink LocusIDs, gene symbols, gene names and nucleotide accession numbers, as well as terms in UniGene cluster titles. The current datasets for the calculated orthologs and homologs and the Mutually Orthologous Pairs are also available via FTP.
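The notion of a self-consistent triplet can be made concrete with a small sketch. Given three lists of pairwise ortholog assignments (human–mouse, mouse–rat and human–rat), a triplet is retained only when all three pairings agree. The function name and the cluster identifiers used in the usage note are hypothetical, and the code illustrates the definition rather than the method used to build HomoloGene.

```python
def consistent_triplets(human_mouse, mouse_rat, human_rat):
    """Return (human, mouse, rat) triplets supported by all three pair lists.

    Each argument is an iterable of (cluster_a, cluster_b) ortholog pairs.
    A triplet is self-consistent when human~mouse, mouse~rat and human~rat
    are all present.
    """
    rats_by_mouse = {}
    for mouse, rat in mouse_rat:
        rats_by_mouse.setdefault(mouse, set()).add(rat)
    human_rat_pairs = set(human_rat)

    triplets = []
    for human, mouse in human_mouse:
        for rat in rats_by_mouse.get(mouse, ()):
            if (human, rat) in human_rat_pairs:
                triplets.append((human, mouse, rat))
    return sorted(triplets)
```

For instance, with human–mouse pairs `("Hs.1", "Mm.1")` and `("Hs.2", "Mm.2")`, mouse–rat pairs `("Mm.1", "Rn.1")` and `("Mm.2", "Rn.2")`, and the single human–rat pair `("Hs.1", "Rn.1")`, only the triplet containing `Hs.1` is reported, because the second candidate lacks a human–rat link.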
The Reference Sequence (RefSeq) database, described elsewhere in this issue (7), provides curated reference sequences for mRNAs and proteins from human and other organisms.
The database of Single Nucleotide Polymorphisms (dbSNP), described elsewhere in this issue (15), serves as a repository for both single base nucleotide substitutions and short deletion and insertion polymorphisms that are deposited by the research community.
ORF Finder performs a six-frame translation of a nucleotide query and returns a graphic that indicates the location of each open reading frame (ORF) found. Restrictions on the size of the ORFs returned may be set by the user. The sequences of predicted protein products can be submitted directly for BLAST similarity searching or searching against the COGs (see below) database.
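A six-frame ORF scan of this kind can be sketched briefly. The example below assumes an ORF runs from an ATG start codon to the first in-frame stop codon, and exposes a minimum-length setting analogous to the user-defined size restriction; these definitions, the function names and the coordinate convention are assumptions for illustration and are not taken from the ORF Finder implementation.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Scan all six reading frames for ATG...stop ORFs.

    Returns (strand, frame, start, end) tuples. Coordinates are 0-based,
    end-exclusive, and refer to the strand that was scanned. min_codons
    counts codons from ATG up to, but not including, the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # open a new ORF at the first ATG
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None  # close the ORF and keep scanning
    return orfs
```

Calling `find_orfs("CCATGAAATAGCC")` reports a single forward-strand ORF in the third frame, spanning the ATG through the TAG stop codon.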
PCR-based assays for STSs can be used for gene identification and mapping. Electronic PCR (e-PCR) is a tool for locating STSs within a nucleotide sequence by comparing the query against the dbSTS database of STS sequences and primer pairs. The e-PCR application accepts either an accession number or sequence as input, and returns a table of links to matching dbSTS records as well as the primer pairs used to amplify each STS identified.
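The principle behind locating an STS by its primer pair can be sketched as follows. The example searches a sequence for the forward primer and for the reverse complement of the reverse primer, and accepts the pair when the implied product size is close to the expected size. It uses exact string matching only; the record layout, function names and size tolerance are hypothetical and do not describe the e-PCR program's actual matching rules.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_all(text, pattern):
    """Yield every start position of pattern in text (overlaps allowed)."""
    pos = text.find(pattern)
    while pos != -1:
        yield pos
        pos = text.find(pattern, pos + 1)


def electronic_pcr(sequence, sts_records, size_tolerance=50):
    """Report STSs whose primer pair brackets a product of plausible size.

    sts_records maps an STS name to (forward_primer, reverse_primer,
    expected_product_size). Returns (name, start, end) tuples with
    0-based, end-exclusive coordinates on the query sequence.
    """
    sequence = sequence.upper()
    hits = []
    for name, (fwd, rev, expected_size) in sts_records.items():
        rev_site = reverse_complement(rev.upper())
        for start in find_all(sequence, fwd.upper()):
            for rev_pos in find_all(sequence, rev_site):
                end = rev_pos + len(rev_site)
                size_ok = abs((end - start) - expected_size) <= size_tolerance
                if rev_pos > start and size_ok:
                    hits.append((name, start, end))
    return hits
```

A record such as `{"STS_demo": ("ACGTAC", "AAGGTT", 20)}` matches the sequence `TTACGTACGGGGGGGGAACCTTCC` at positions 2 to 22, since the reverse primer's complement site `AACCTT` lies downstream of the forward primer at the expected distance.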
The Human Genome Sequencing (16) site shows chromosome-specific progress of the human sequencing project, provides access to individual contigs and assemblies, and offers chromosome-specific BLAST searches. Links to contributing genome sequencing centers are also provided. Sequence data may be downloaded by contig or chromosome.
The Human Genome MapViewer can display the human genome data using up to seven parallel chromosomal maps simultaneously. The maps displayed can be selected from a set of 19, and include cytogenetic maps, such as chromosomal ideograms, sequence-based maps, such as those showing contigs, genes and SNPs, and radiation hybrid maps, such as the G3 and GB4 maps used to construct GeneMap’99. Queries against the entire human genome or particular chromosomes can be made using gene names or symbols, marker names, SNP identifiers, accession numbers and other identifiers. The Human Genome MapViewer is tightly integrated with other NCBI databases such as LocusLink and dbSNP. A MapViewer similar to the Human Genome MapViewer is also used to display the Drosophila genome data.
An international consortium was formed in 1994 to construct a human gene map by determining the locations of ESTs relative to a framework of well-characterized genetic markers (17). The current version of this map is the radiation hybrid map, GeneMap’99 (13), featuring 30 261 unique gene loci.
The Human–Mouse Homology Maps are tables of genetic loci in homologous segments of DNA from human and mouse. The maps are computed by integrating orthologs curated by the Mouse Genome Database with putative orthologs identified by homology. The maps are linked to GeneMap’99, OMIM, LocusLink, dbSTS, BLAST2Sequences and the Mouse Genome Database at The Jackson Laboratory. Other mouse genome resources can be found on the Mouse Genome Sequencing page, analogous to the Human Genome Sequencing page described above.
The CCAP service is an initiative of the National Cancer Institute (NCI) and NCBI. The data includes a compilation by F. Mitelman, F. Mertens and B. Johansson of recurrent neoplasia-associated chromosomal aberrations from the Cancer Chromosome Aberration Bank at the University of Lund, Sweden (18). Also provided are bacterial artificial chromosome (BAC) human chromosome mapping data provided through CCAP’s fluorescent in situ hybridization (FISH) effort.
The Entrez Genomes database (19) provides access to genomic data contributed by the scientific community for over 900 species whose sequencing and mapping are complete or in progress, and now includes more than 30 complete microbial genomes. Also included is a collection of 169 reference sequences for the complete genomes of eukaryotic organelles. Data can be accessed hierarchically starting from either an alphabetical listing or a phylogenetic tree for complete genomes in each of six principal taxonomic groups. One can follow the hierarchy to a graphical overview for the genome of a single organism, on to the level of a single chromosome and, finally, down to the level of a single gene.
At each level are one or more views, pre-computed summaries and links to analyses appropriate for that level. For instance, at the level of a genome or a chromosome, a Coding Regions view displays the location of each coding region, length of the product, GenBank identification number for the protein sequence and name of the protein product. An RNA Genes view lists the location and gene names for ribosomal and transfer RNA genes. At the level of a single gene, links are provided to pre-computed sequence neighbors for the gene product. Any protein gene product that is a member of a COG (20) is linked to the COGs database. A summary of COG functional groups is also presented in tabular and graphical formats at the genome level.
For complete microbial genomes, pre-computed BLAST neighbors for protein sequences, including their taxonomic distribution and links to 3-D structures, are given in TaxTables and PDBTables, respectively. Pairwise sequence alignments are presented graphically and linked to the Cn3D macromolecular viewer (21), which allows the interactive display of 3-D structures and sequence alignments.
The COGs database, described elsewhere in this issue (20), presents a compilation of orthologous groups of proteins from completely sequenced organisms representing phylogenetically distant clades.
Genotyping retrovirus sequences is important in the characterization of viral genetic diversity, tracking of epidemics and vaccine development. NCBI has developed a Web-based genotyping tool for the analysis of retroviral genomes. The genotyping method employs a blastn comparison between the retroviral sequence to be subtyped and a panel of reference sequences provided by the user. An HIV-1-specific subtyping tool uses a set of reference sequences taken from the principal HIV-1 variants.
CGAP provides access to genetic data on normal, precancerous and malignant cells generated by the NCI’s CGAP initiative. CGAP cDNA library information may be retrieved by text words, gene name, clone ID, tissue type, method of sample preparation, stage of tumor development or by UniGene Cluster ID. Expression profiles of cDNA libraries may be compared using either the Digital Differential Display (DDD) tool or the xProfiler. CGAP also includes a directory of tumor suppressor genes and oncogenes.
Serial Analysis of Gene Expression (SAGE) refers to a technique for taking a snapshot of the messenger RNA population of a cell to obtain a quantitative measure of gene expression. NCBI’s SAGEmap service implements many functions useful in the analysis of SAGE data such as a two-way mapping between SAGE tag and UniGene. SAGEmap can also construct a user-configurable table of data comparing one group of SAGE libraries with another. Groups may be chosen for inclusion in the table on the basis of several expression criteria specified by the user. SAGEmap is updated weekly, immediately following the update of UniGene.
The Gene Expression Omnibus (GEO) is an effort to build a data repository and retrieval system for gene expression data derived from any organism or artificial source. Submissions are being accepted for data from spotted microarrays (microarray), high-density oligonucleotide arrays (HDA), hybridization filters (filter) and serial analysis of gene expression (SAGE). Online tools for the interactive retrieval and analysis of this expression data are under development.
NCBI provides Web access to the OMIM database, a catalog of human genes and genetic disorders authored and edited by Dr Victor A. McKusick at The Johns Hopkins University (22). The database contains information on disease phenotypes and genes, including extensive descriptions, gene names, inheritance patterns, map locations and gene polymorphisms. OMIM currently contains 11 925 entries, including data on 8594 established gene loci and 799 phenotypic descriptions, and is now searchable using the powerful Entrez interface.
Conserved domains are structural modules that have been re-used frequently during the process of evolution. The Conserved Domain Database (CDD) contains domains derived principally from two public protein domain collections, the Simple Modular Architecture Research Tool (SMART) (23) and Pfam (24). NCBI’s Conserved Domain Search (CD-Search) service can be used to search a protein sequence for conserved domains in the CDD.
To produce the CDD, a PSI-BLAST-type PSSM is calculated from each domain alignment in the SMART and Pfam databases. These PSSMs are then combined into a library that can be searched using Reverse Position-Specific BLAST (RPS-BLAST), a BLAST variant that searches a database of PSSMs with a protein sequence query. Wherever possible, CDD hits are linked to structures which, coupled with a multiple sequence alignment of representatives of the domain hit, can be viewed with NCBI’s 3-D molecular structure viewer, Cn3D (21).
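Searching a library of PSSMs with a single protein query reverses the usual roles of query and database. The sketch below illustrates that reversal in its simplest form: each PSSM is slid along the query without gaps, and domains whose best window reaches a threshold are reported. The library contents, the default penalty for unlisted residues and the function names are all hypothetical, and the procedure is far simpler than the heuristics RPS-BLAST uses.

```python
def best_window_score(pssm, query, default=-1.0):
    """Return (best score, offset) over all ungapped placements of pssm.

    pssm is a list of dicts mapping residue -> score, one per position.
    Residues absent from a column receive the default penalty (an
    assumption made for brevity).
    """
    width = len(pssm)
    best = (float("-inf"), -1)
    for start in range(len(query) - width + 1):
        window = query[start:start + width]
        total = sum(col.get(res, default) for col, res in zip(pssm, window))
        if total > best[0]:
            best = (total, start)
    return best


def search_library(library, query, threshold):
    """Score the query against every PSSM and return hits, best first.

    library maps a domain name to its PSSM. Each hit is a
    (name, score, offset) tuple.
    """
    hits = []
    for name, pssm in library.items():
        score, offset = best_window_score(pssm, query)
        if score >= threshold:
            hits.append((name, score, offset))
    return sorted(hits, key=lambda hit: -hit[1])
```

With a toy library containing a three-column matrix favoring `G`, `K`, `T` and a two-column matrix favoring `W`, `W`, the query `MAGKTLL` yields a single hit for the first matrix at offset 2.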
Most of the resources described here include documentation, other explanatory material and references to collaborators and data sources on the respective Web sites. Several tutorials are also offered under the Education link from NCBI’s home page. A Site Map provides a comprehensive table of NCBI resources, and the What’s New feature announces new and enhanced resources. Additional tools to guide users to NCBI’s growing array of services are also being developed. A user support staff is available to answer questions at info@ncbi.nlm.nih.gov.