Annotation of RefSeq records originates from several sources including the original GenBank submission, collaborating groups, NCBI computational analysis, user feedback and manual curation at NCBI. For example, collaboration supports the RefSeq representation of Saccharomyces cerevisiae, Drosophila melanogaster and Arabidopsis thaliana, which are directly contributed by the Saccharomyces Genome Database (SGD) (5), FlyBase (6) and The Institute for Genomic Research (TIGR), respectively. Similarly, the entire viral RefSeq collection is reviewed and curated by the NCBI Viral Genome Advisors group. See the RefSeq Collaborators page for more information about contributions from collaborators (http://www.ncbi.nlm.nih.gov/RefSeq/collaborators.html). All RefSeq records include explicit cross-links between the nucleotide and protein cognates and to Entrez Gene (7), which provides gene-oriented access to the RefSeq collection. Additional links, annotated as ‘db_xref’ notations, are provided on some records to organism-specific genome resources such as Mouse Genome Informatics (MGI) (8) or FlyBase.
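As an illustration of how the ‘db_xref’ cross-links might be read programmatically, the following minimal Python sketch pulls database and identifier pairs out of flat-file text. It assumes the qualifier is written in the GenBank flat-file form /db_xref="Database:Identifier"; the function name extract_db_xrefs and the example feature block (including its placeholder identifiers) are hypothetical and are not taken from any real record.

```python
import re

# Matches flat-file qualifiers of the assumed form /db_xref="Database:Identifier".
# The database name stops at the first colon; the identifier may itself contain colons.
DB_XREF = re.compile(r'/db_xref="([^":]+):([^"]+)"')


def extract_db_xrefs(flatfile_text):
    """Return a list of (database, identifier) pairs found in the text."""
    return DB_XREF.findall(flatfile_text)


# Hypothetical feature block with placeholder identifiers, for illustration only.
example = '''
     gene            1..1500
                     /gene="Abc1"
                     /db_xref="GeneID:00000"
                     /db_xref="MGI:0000000"
'''

xrefs = extract_db_xrefs(example)
for database, identifier in xrefs:
    print(database, identifier)
```

A regular expression is sufficient for a sketch like this; a production workflow would more likely rely on an established flat-file parser.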
For other species, including Apis mellifera (honey bee), Gallus gallus (chicken), Homo sapiens (human), Mus musculus (mouse) and Rattus norvegicus (rat), genome annotation is provided by an NCBI computational process that utilizes transcript alignments, protein support and a hidden Markov model (HMM) ab initio prediction algorithm (see the NCBI Handbook; http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Books). Genomic RefSeq records that are annotated by this process represent genes, transcripts and proteins, and include additional feature annotation to represent STS markers. The available RefSeq transcript dataset, with the ‘NM_’ accession prefix, is an important reagent in this annotation pipeline.
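To make the role of the ‘NM_’ prefix concrete, the sketch below selects transcript accessions from a mixed list by prefix. It assumes that an accession consists of the prefix, a run of digits and an optional ‘.version’ suffix; the function name select_nm_transcripts and all accessions shown are placeholders invented for illustration, and the sketch does not describe how the NCBI pipeline itself selects its input.

```python
import re

# 'NM_' followed by digits, with an optional '.version' suffix (assumed format).
NM_ACCESSION = re.compile(r'^NM_\d+(\.\d+)?$')


def select_nm_transcripts(accessions):
    """Keep only the accessions that carry the 'NM_' transcript prefix."""
    return [acc for acc in accessions if NM_ACCESSION.match(acc)]


# Placeholder accessions, not real records.
sample = ["NM_000000.1", "NP_000000.1", "NM_111111", "XM_222222.3"]
nm_only = select_nm_transcripts(sample)
print(nm_only)
```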
Comprehensive representation of the proteins, explicitly linked to a RefSeq nucleotide record, is a major focus of the RefSeq project. The goal is to represent the full-length protein product; however, partial protein products are represented for some genomes when partial protein annotation is contributed by a collaborator or when proteins are predicted from incomplete genome sequence data. Proteins are annotated by computation and curation. Conserved domains are calculated by an automatic process using data maintained in the NCBI Conserved Domain Database (CDD) (9); this annotation provides hints about possible function. Likewise, variation features that are located in the coding region are automatically calculated from data available in the NCBI dbSNP database (10). Additional features including Enzyme Commission (EC) numbers, other landmark regions of the protein sequence and references may be added by curation either by an external collaborator or by NCBI staff.
Transcript records are provided for a subset of eukaryotic species, including those in the Chordata taxonomic lineage, to represent protein-coding sequences, transcribed pseudogenes, ribosomal RNAs and other small RNAs. Annotation results from a mixture of automated and curatorial analysis. Variation features are calculated automatically from data in the dbSNP database, and the nucleotide regions corresponding to the annotated protein conserved domains are also provided automatically (as a miscellaneous feature, or ‘misc_feat’). Other features, such as polyadenylation signals and sites, alternate transcription start sites and RNA editing sites, are provided by curation.
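The correspondence between a protein conserved domain and its nucleotide region on the transcript can be sketched with simple coordinate arithmetic. The example below is an illustrative assumption rather than the NCBI implementation: it assumes 1-based inclusive coordinates, a single uninterrupted coding region on the transcript and three nucleotides per residue with no frameshift. The function name domain_to_transcript_coords and the numbers used are hypothetical.

```python
def domain_to_transcript_coords(cds_start, aa_start, aa_end):
    """Map a protein range to transcript coordinates.

    cds_start: 1-based transcript position of the first base of the start codon.
    aa_start, aa_end: 1-based inclusive residue range of the domain.
    Returns a 1-based inclusive (nt_start, nt_end) tuple.
    """
    if aa_start < 1 or aa_end < aa_start:
        raise ValueError("residue range must satisfy 1 <= aa_start <= aa_end")
    # First base of the codon encoding residue aa_start.
    nt_start = cds_start + 3 * (aa_start - 1)
    # Last base of the codon encoding residue aa_end.
    nt_end = cds_start + 3 * aa_end - 1
    return nt_start, nt_end


# Hypothetical transcript whose coding region begins at position 101,
# with a domain covering residues 10 to 50.
region = domain_to_transcript_coords(cds_start=101, aa_start=10, aa_end=50)
print(region)  # (128, 250)
```

The resulting span covers 123 nucleotides, three for each of the 41 residues in the domain.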