NCBI maintains several resources that support the rat research community, including the integrated suite of literature, sequence, and BLAST databases; tools to query, retrieve, and display the biological information contained in those databases; reference sequence [14] records; variation data; two whole-genome shotgun assemblies annotated by NCBI; and radiation hybrid and genetic maps. The Rat Genome Resources web page provides an up-to-date portal to these and other rat-specific data (see Supplementary Table 1). The following highlights a subset of resources of interest to rat researchers. More information on NCBI resources is included in the Supplementary data, the online NCBI Handbook, and the NCBI Help book.
Gene is the central resource for rat gene-specific information at NCBI and includes protein- and non-protein-coding genes, pseudogenes, and mapped phenotypes. The database reports assigned gene ontology (GO) terms, cytogenetic locations, names and symbols, pathways, protein interactions, publications including the GeneRIF annotated bibliography, sequences (GenBank and RefSeq), and links to numerous NCBI and other resources. Data is maintained through a combination of computation, collaboration, and ongoing curation by the Gene and RefSeq staff. Collaboration and curation enhance the content of Gene by: a) integrating information from, and establishing links to, databases such as the Rat Genome Database (RGD), RATMAP, Ensembl, and UniProt; b) resolving identified data conflicts and ambiguities; and c) adding content such as sequences, names, phenotypes, and publications. For example, regular updates synchronize rat gene nomenclature in the Gene database with that provided by RGD, add information based on new sequence submissions to GenBank, update GO terms obtained via FTP from the Gene Ontology Consortium, and, in collaboration with UniProtKB, update cross-links between Reference Sequence (RefSeq) proteins and the corresponding Swiss-Prot or TrEMBL proteins.
RefSeq data for rat includes the genomic reference (RGSCv3.4) and Celera assemblies as well as gene-specific products. Accessions are assigned to chromosomes (see Supplementary Table 2), scaffolds, and contigs. Gene-specific RefSeqs for RNAs and proteins include curated records based on submissions to GenBank, and predicted records that are generated as a product of computing annotation for the genome assemblies.
Curation of RefSeq transcripts and proteins for rat is a continuous process and serves to: a) ensure accurate, full-length sequence for the complete set of transcripts and proteins, including loci that use selenocysteine or non-AUG codons; and b) provide additional RefSeq feature annotation such as mature peptides. In addition, the transcript-based curated RefSeq collection is a high-quality complement to genome annotation, because it can be used to identify genes that are not well represented in one or both genomic assemblies. For example, genes missing from the RGSCv3.4 reference assembly include smooth muscle alpha-actin (Acta2, GeneID:81633), thymidine kinase 1 (Tk1, GeneID:24834), and gamma-glutamyltransferase 1 (Ggt1, GeneID:116568).
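As an illustration of how the Gene records cited above might be retrieved programmatically, the following minimal sketch builds a request URL for NCBI's E-utilities ESummary service. E-utilities is not discussed in this section, so its use here is an assumption; the function name `gene_esummary_url` and the `RAT_GENE_IDS` mapping are hypothetical helpers, and only the three GeneIDs come from the text. The sketch only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service (assumed here; not named in the text).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# GeneIDs quoted in the text for genes missing from the RGSCv3.4 assembly.
RAT_GENE_IDS = {"Acta2": 81633, "Tk1": 24834, "Ggt1": 116568}


def gene_esummary_url(gene_ids, retmode="json"):
    """Build an ESummary URL for one or more records in the Gene database."""
    params = {
        "db": "gene",
        "id": ",".join(str(gene_id) for gene_id in gene_ids),
        "retmode": retmode,
    }
    # safe="," keeps the comma-separated ID list readable in the query string.
    return f"{EUTILS_BASE}/esummary.fcgi?{urlencode(params, safe=',')}"


url = gene_esummary_url(RAT_GENE_IDS.values())
print(url)
```

The resulting URL can be opened in a browser or passed to any HTTP client; the structure of the returned summary is not described here and should be checked against the E-utilities documentation.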
NCBI provides a unique service by annotating both available genome assemblies and displaying order of objects in multiple coordinate systems (sequence, centiMorgans). Genome annotation is computed based on alignments of the curated RefSeq collection described above, rat transcript data, and human, mouse, and rat protein data. The results are distributed in the genomic RefSeq collection, in the RefSeq and Map Viewer FTP sites, and are available for browsing and querying by accession, text, or sequence similarity (via BLAST) in the Map Viewer. Sequence-based data presented in the Map Viewer includes the annotated genome at the gene and transcript level plus an array of additional sequence details including repeats, STS markers, CpG islands, alignments of rat genomic records and of human, mouse, and rat transcript sequences, mapped phenotypes (QTLs) based on placement of flanking and peak markers, and variation data from dbSNP. Alternate displays provide tabular reports and download support (Data as Table View), present alignments supporting the annotation (Evidence Viewer), or support using transcript alignments to generate alternative transcript models for further evaluation (Model Maker). Map Viewer also supports comparative displays of human, mouse, and rat annotations as well as review of order and orientation of assemblies based on placement of markers common to the sequence, genetic, and radiation hybrid maps.
A rat-specific BLAST page facilitates access to several custom BLAST databases. Options include the genome assemblies, RefSeq and GenBank RNAs and proteins, and trace reads from the Trace Archive. Query results for transcript and protein databases return links to the Gene, UniGene, and/or GEO databases when an accession is known to that database. Query results for the genome assembly databases include links to view the results in Map Viewer, in the context of the genome annotation.
dbSNP processes submissions of multiple classes of variation (e.g., insertions/deletions, small tandem repeats, and substitutions) and assigns each submission a unique, stable identifier (ss). Submissions are clustered periodically by alignment to the genome, and submissions of the same variant are assigned a shared rs number. The placement of these variants on the genome, and the calculated effect of each variant on an encoded protein, are reported in Map Viewer and the dbSNP GeneView display.
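The relationship between ss and rs identifiers can be pictured with a toy sketch. It assumes that each submission has already been placed on the genome and that two submissions describe the same variant when they share a chromosome, position, and allele set; the actual dbSNP clustering procedure is not specified in this section and is certainly more involved. All identifiers, coordinates, and the function name `cluster_submissions` below are hypothetical.

```python
from collections import defaultdict


def cluster_submissions(submissions, first_rs=1):
    """Group ss submissions that share a placement and allele set.

    submissions: iterable of (ss_id, chrom, pos, alleles) tuples, where
    alleles is a slash-separated string such as "A/G".
    Returns a dict mapping a made-up rs label to a sorted list of ss IDs.
    """
    by_locus = defaultdict(list)
    for ss_id, chrom, pos, alleles in submissions:
        # Allele order is ignored, so "A/G" and "G/A" fall into one cluster.
        key = (chrom, pos, frozenset(alleles.split("/")))
        by_locus[key].append(ss_id)

    # Number clusters in a deterministic order (chromosome, position, alleles).
    ordered = sorted(by_locus, key=lambda k: (k[0], k[1], sorted(k[2])))
    return {
        f"rs{first_rs + offset}": sorted(by_locus[key])
        for offset, key in enumerate(ordered)
    }


# Hypothetical submissions: two reports of one substitution, one indel.
example = [
    ("ss101", "chr1", 1500, "A/G"),
    ("ss102", "chr1", 1500, "G/A"),
    ("ss103", "chr2", 900, "-/CT"),
]
clusters = cluster_submissions(example)
print(clusters)
```

In this toy input, the two chr1 submissions collapse into one cluster while the chr2 indel forms its own, mirroring the many-to-one relationship between ss and rs identifiers described above.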