) is a system for automatically partitioning GenBank sequences, including ESTs, into a non-redundant set of gene-oriented clusters. UniGene clusters are created for all organisms for which there are 70 000 or more ESTs in GenBank and now include ESTs from 16 animals and 13 plants. Each UniGene cluster contains sequences that represent a unique gene, and is linked to related information, such as the tissue types in which the gene is expressed, model organism protein similarities, the LocusLink report for the gene and its map location. In the human UniGene June 2003 release (build 161), over 5.5 million human ESTs in GenBank have been reduced 50-fold in number to ~108 000 sequence clusters. The UniGene collection has been used as a source of unique sequences for the fabrication of microarrays for the large-scale study of gene expression (17). UniGene databases are updated weekly with new EST sequences, and bimonthly with newly characterized sequences.
ProtEST, a tool analogous to BLASTLink, presents pre-computed BLAST alignments between protein sequences from model organisms and the six-frame translations of UniGene nucleotide sequences. Protein sequences that are derived from conceptual translations or model transcripts are excluded. ProtEST links are displayed in UniGene reports with model organism protein similarities. ProtEST reports are updated in tandem with UniGene protein similarities.
The Trace Archive
A newly redesigned Trace Archive interface allows for more flexible searching and download of sequencing traces from a rapidly growing database of over 260 million whole-genome shotgun (WGS), shotgun, EST, clone end and finishing reads from more than 100 organisms.
HomoloGene is a database of both curated and calculated gene orthologs and homologs and now covers 21 organisms. Curated orthologs include gene pairs from the Mouse Genome Database (MGD) at the Jackson Laboratory, the Zebrafish Information Network (ZFIN) database at the University of Oregon and from published reports. Computed orthologs and homologs, which are considered putative, are identified from BLAST nucleotide sequence comparisons between all UniGene clusters for each pair of organisms. The HomoloGene database can be queried using UniGene ClusterIDs, LocusLink LocusIDs, gene symbols, gene names and nucleotide accession numbers, as well as terms found in UniGene cluster titles.
The dbMHC is a new NCBI resource dedicated to clinical applications of, and research on, the Major Histocompatibility Complex (MHC). The resource includes a Reagent Database section and a Clinical section. The Reagent Database provides an open platform for the submission, evaluation and editing of individual DNA typing reagents as well as typing kit information. All reagents are characterized for allele specificity using an updated allele database based on IMGT/HLA. The dbMHC offers several resources for the analysis and display of the MHC and KIR regions, e.g. an interactive formatting sequence retrieval tool, and a sequencing-based typing tool capable of aligning and interpreting heterozygote sequences. Also featured is dbMHCms, a tool to search descriptive information for known short tandem repeats within the MHC.
The Clinical section contains data generated by the 13th international HLA workshop and international HLA working group and includes sections presenting the results of the Anthropology project with global HLA allele frequencies and the human stem cell transplantation project.
Reference Sequence (RefSeq)
The Reference Sequence (RefSeq) database (6) provides curated references for transcripts, proteins and genomic regions, plus computationally derived nucleotide sequences and proteins. The complete RefSeq database is now being provided in the RefSeq directory on the NCBI FTP site. The first release contains over 1 million sequences, including more than 785 000 protein sequences, from about 2000 organisms. To register for the ‘refseq-announce’ mailing list and be informed of new releases or to read more about the RefSeq project, visit the RefSeq home page.
Specialized tools: Open Reading Frame Finder, Spidey and Electronic PCR
OrfFinder performs a six-frame translation of a nucleotide sequence and returns the location of each open reading frame (ORF) it finds within a specified size range. Translations of the ORFs detected can be submitted directly for similarity searching against the standard BLAST or COGs databases.
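The six-frame scan described above can be illustrated with a minimal Python sketch. This is not the OrfFinder implementation: the function name `find_orfs`, its parameters, the use of the standard genetic code, and the rule that an ORF runs from the first ATG to the next in-frame stop codon are all assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, built in TCAG order (assumption: no alternative codes).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_nt=30, max_nt=None):
    """Scan all six reading frames for ATG...stop ORFs (hypothetical helper).

    Lengths include the stop codon. Coordinates are 0-based, half-open,
    and always reported on the input (forward) sequence.
    """
    seq = seq.upper()
    n = len(seq)
    orfs = []
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the currently open ATG, if any
            for i in range(frame, n - 2, 3):
                codon = template[i:i + 3]
                if start is None:
                    if codon == "ATG":
                        start = i
                elif CODON_TABLE.get(codon, "X") == "*":
                    end = i + 3
                    length = end - start
                    if length >= min_nt and (max_nt is None or length <= max_nt):
                        protein = "".join(
                            CODON_TABLE.get(template[j:j + 3], "X")
                            for j in range(start, i, 3)
                        )
                        # Map minus-strand positions back to forward coordinates.
                        lo, hi = (start, end) if strand == "+" else (n - end, n - start)
                        orfs.append({"strand": strand, "frame": frame,
                                     "start": lo, "end": hi, "protein": protein})
                    start = None
    return orfs
```

Calling `find_orfs(sequence, min_nt=100)` returns one record per ORF; the `protein` strings are what would then be passed to a similarity search.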
Spidey is an alignment tool for eukaryotic genomic sequences that takes as input a set of mRNA accessions or FASTA sequences and aligns each to a single genomic sequence. Spidey takes into account predicted splice sites in constructing its alignments and can use one of four splice-site models (vertebrate, Drosophila, Caenorhabditis elegans, plant). Spidey returns exon alignments, protein translations and a summary showing the alignment quality and goodness of match to splice junction patterns for each putative exon.
Electronic PCR (e-PCR) locates Sequence Tagged Sites (STSs) within nucleotide sequences by searching against UniSTS, a non-redundant database of over 155 000 human and 92 000 non-human STSs. OrfFinder, Spidey and e-PCR are available via the ‘Tools’ link on the NCBI home page.
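The idea behind an STS search can be sketched as follows, under the usual definition of an STS as a pair of PCR primers with an expected product size. This is a simplified illustration, not the e-PCR algorithm: the `STS` record, the function `locate_sts` and the `margin` parameter are hypothetical, only exact primer matches are considered, and only the given strand is searched.

```python
from typing import NamedTuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")


class STS(NamedTuple):
    """Hypothetical STS record: two primers and the expected product size."""
    name: str
    forward: str
    reverse: str
    size: int


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_all(text, pattern):
    """Yield every start position of pattern in text, overlaps included."""
    pos = text.find(pattern)
    while pos != -1:
        yield pos
        pos = text.find(pattern, pos + 1)


def locate_sts(seq, markers, margin=50):
    """Report markers whose primers match in the right order and spacing.

    A hit requires the forward primer, then the reverse complement of the
    reverse primer downstream, with the implied product length within
    `margin` bases of the expected size.
    """
    seq = seq.upper()
    hits = []
    for sts in markers:
        fwd = sts.forward.upper()
        rev_site = reverse_complement(sts.reverse.upper())
        for f in find_all(seq, fwd):
            for r in find_all(seq, rev_site):
                product_len = r + len(rev_site) - f
                if r >= f and abs(product_len - sts.size) <= margin:
                    hits.append((sts.name, f, r + len(rev_site), product_len))
    return hits
```

Each hit is a tuple of marker name, 0-based start, end and product length. A practical tool would also tolerate mismatches and search both strands, which this sketch omits for brevity.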
A database of single nucleotide polymorphisms
The database of single nucleotide polymorphisms (dbSNP) (18) is a repository for single base nucleotide substitutions and short deletion and insertion polymorphisms that contains almost 6 million human SNPs as well as about 1.4 million from a variety of other organisms. Now an Entrez database, dbSNP can be queried from the NCBI home page. Searches for SNPs lying between two markers and batch downloads via Entrez are supported. SNP reports link to 3D visualizations of structures from the MMDB via NCBI’s interactive macromolecular viewer Cn3D (19), which highlight amino acid changes implied by SNPs in coding regions.