Entrez Gene (http://www.ncbi.nlm.nih.gov/gene) is the National Center for Biotechnology Information (NCBI) database for gene-specific information. Entrez Gene maintains records from genomes which have been completely sequenced, which have an active research community to submit gene-specific information, or which are scheduled for intense sequence analysis. The content represents the integration of curation and automated processing from NCBI’s Reference Sequence project (RefSeq), collaborating model organism databases, consortia such as Gene Ontology, and other databases within NCBI. Records in Entrez Gene are assigned unique, stable and tracked integers as identifiers. The content (nomenclature, genomic location, gene products and their attributes, markers, phenotypes and links to citations, sequences, variation details, maps, expression, homologs, protein domains and external databases) is available via interactive browsing through NCBI’s Entrez system, via NCBI’s Entrez programming utilities (E-Utilities) and for bulk transfer by FTP.
Entrez Gene is the gene-specific database at the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine, located on the campus of the US National Institutes of Health in Bethesda, MD, USA. Entrez Gene generates unique integers (GeneID) as stable identifiers for genes and other loci for a subset of model organisms. It tracks those identifiers and uses them to integrate multiple types of information including nomenclature, summary descriptions, accessions of gene-specific and gene product-specific sequences, chromosomal localization, reports of pathways and protein interactions, associated markers and phenotypes. Because the GeneID is used to represent gene-specific information in other databases at NCBI, the full Entrez Gene report includes a wealth of links to gene-specific literature citations, sequences, variations, homologs and databases outside of NCBI. Entrez Gene is integrated with NCBI’s Entrez system for interactive query, Linkout and access by E-Utilities (1).
Data in Entrez Gene result from the integration of automated analyses and curation by Reference Sequence project (RefSeq) staff. Gene-specific annotation in sequences from NCBI’s RefSeq (2) or the International Nucleotide Sequence Database Collaboration (INSDC) (3) usually serves as the foundation, with value added by information from collaborating model organism databases, public users and literature review (especially the Gene References into Function, or GeneRIFs, submitted by the public and staff of the National Library of Medicine). Updates are posted daily, and corrections or suggestions are welcomed (http://www.ncbi.nlm.nih.gov/RefSeq/update.cgi).
As of September 2010, there were almost 7 million current records in Entrez Gene, distributed among more than 7300 taxa (Table 1). Not all the taxa are represented comprehensively in Entrez Gene; most of the eukaryotes, for example, have records only for their mitochondrial or plastid genomes. The Gene Statistics site (http://www.ncbi.nlm.nih.gov/projects/Gene/gentrez_stats.cgi) reports both current and historical counts of records by taxonomic node and species. The history reports can be used to track the growth of the database. For example, the history of the eukaryotic node (http://www.ncbi.nlm.nih.gov/projects/Gene/gentrez_stats.cgi?HIS=1&TAXORG=2759) shows that from 2004 until the present the number of genes represented increased more than 10-fold (221,997 to 2,520,683), with an almost 5-fold increase in the number of species (485 to 2265).
A major goal of the database is to facilitate access to gene-specific information, and thus to expedite data exchange. The unique integer identifier assigned to each record (GeneID) is species specific. In other words, the integer assigned to dystrophin in human is different from that in any other species. The GeneID is reported in RefSeq records as a ‘db_xref’ (e.g. /db_xref= “GeneID:1756”, in GenBank format). The GeneID is also used to define genes in multiple files available for FTP, so that the information associated with GeneIDs is provided for unrestricted public use.
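As an illustration of how the db_xref convention supports data exchange, the following minimal sketch collects GeneIDs from GenBank-format text. The function name, the regular expression and the sample fragment are assumptions made for this example (the fragment is hand-written, not copied from a real RefSeq record); only the `/db_xref="GeneID:…"` form itself comes from the text above.

```python
import re
from typing import List

# Matches /db_xref="GeneID:1756". Optional whitespace after '=' and
# typographic quotes are tolerated, since both occur in text copied
# from formatted documents (an assumption, not part of the flat-file format).
GENEID_XREF = re.compile(r'/db_xref=\s*["\u201c]GeneID:(\d+)["\u201d]')


def extract_gene_ids(flatfile_text: str) -> List[int]:
    """Return distinct GeneIDs in order of first appearance."""
    seen = {}
    for match in GENEID_XREF.finditer(flatfile_text):
        seen.setdefault(int(match.group(1)), None)
    return list(seen)


# Hand-written fragment in the style of a GenBank feature table (hypothetical).
sample = '''     gene            1..2000
                     /gene="DMD"
                     /db_xref="GeneID:1756"
     CDS             100..1900
                     /gene="DMD"
                     /db_xref="GeneID:1756"
'''

print(extract_gene_ids(sample))  # [1756]
```

Because the same GeneID is repeated on the gene feature and on its products, the sketch deduplicates while preserving order.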
Entrez Gene is also key to representation of gene-specific information at NCBI. The information conveyed by establishing the relationship between sequence and a GeneID is used by many NCBI resources. For example, the names associated with GeneIDs are used in HomoloGene, UniGene and RefSeqs. The curated gene to sequence relationship reported in Entrez Gene is used to inform automated annotation of genomes and UniGene clustering.
Entrez Gene provides multiple reports. For the interactive user, the defaults are web pages or files to download based on a query result, which are accessed by making selections revealed when ‘Display Settings’ or ‘Send to’ is activated (Figure 1).
In addition to these views from Entrez, Gene provides a complete database extraction as well as several special reports for FTP transfer (ftp://ftp.ncbi.nlm.nih.gov/gene/README). Most of the files on the ftp site are refreshed daily. The data are also available from the programmatic interface to Entrez, namely E-Utilities (1).
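For programmatic access, a minimal sketch of a search against Gene through E-Utilities might look as follows. The base URL, the JSON return mode and the helper names are assumptions not stated in this article; they reflect the publicly documented ESearch service and should be checked against the current E-Utilities documentation before use.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base address of the E-Utilities service (not given in the article).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, retmax=20):
    """Build an ESearch URL that queries the Gene database."""
    params = {"db": "gene", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def search_gene_ids(term, retmax=20):
    """Run the search and return the matching GeneIDs as strings.

    Requires network access; the response layout assumed here is the
    'esearchresult' -> 'idlist' structure of ESearch JSON output.
    """
    with urlopen(esearch_url(term, retmax)) as response:
        payload = json.load(response)
    return payload["esearchresult"]["idlist"]
```

Calling `search_gene_ids("dystrophin AND human")` would return the identifiers of matching records; the URL builder is kept separate so the request can be inspected or logged without contacting the server.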
A GeneID is usually assigned to what is annotated as a gene on a RefSeq record. Exceptions include RefSeqs from bacterial genomes that are annotated whole-genome shotgun sequences. A GeneID may also be assigned when no RefSeq exists. This may occur when an authoritative source for a genome, such as a model organism-specific database, assigns an identifier to what is termed a gene, mapped locus or trait, even though that entity is not completely defined by sequence. When a record in Entrez Gene is established, it is assigned a category (e.g. protein coding, pseudogene, rRNA, unknown) consistent with the molecule types defined by the INSDC. The term ‘unknown’ is used when the category is under review by RefSeq staff, as when some of the sequences defining the gene are annotated with coding regions, but the support for that annotation is inconclusive. The category can change without changing the GeneID.
A full record in Entrez Gene is subdivided into content-specific sections as summarized in its table of contents and the section headers (Figure 2). Each section of the record can be collapsed, and the section divider has both a link (icon: question mark) to documentation and a function to return to the top of the page. Not all records will have content in each category, but all have a GeneID, names and information supporting the creation of the record (either sequence, link to an external database or publications). Some of the content is not reviewed by NCBI staff, but integrated automatically. For example, the content in the Interactions section and in several parts of the General Gene Information section is primarily from external groups [e.g. EcoCyc (4), Gene Ontology Consortium (5), KEGG (6), Reactome (7)]. When genomic RefSeqs annotated with the gene are available, the ‘Genomic regions, transcripts and products’ section includes an embedded, interactive sequence display that can be expanded. To expedite loading of web pages, the default display of the full record often renders only a subset of the bibliographic and interaction information. Links are provided within those sections to navigate to additional pages. To get the full report in one page, the ‘Send to’ option allows saving the record as a text file.
Comprehensive and up-to-date documentation of the contents and maintenance of these sections is provided in the Gene Help Book on NCBI’s bookshelf (http://www.ncbi.nlm.nih.gov/books/NBK3839/).
In addition to the content it displays directly, Entrez Gene provides numerous links to information from other databases within the text and in the Links menu at the right (Figure 2). For example, clicking on ‘RefSeq protein’, ‘RefSeq RNA’ or ‘RefSeqGene’ in the menu at the right takes users to the Nucleotide database where the RefSeq records specific to one gene can be retrieved, reviewed and analyzed. Similarly, users may select HomoloGene or ProteinClusters (8) links for integration of information about homologs, Map Viewer for extended genomic context and comparative maps, GENSAT, UniGene and GEO for expression data, Conserved Domain Database for domain content of proteins, OMIM (9) for human Mendelian disorders, and PubMed and Books for publications. Entrez Gene also provides extensive links to species- or gene-specific databases or gene records in other browsers. Many groups also use the LinkOut (1) method to link their resources to information in Entrez Gene. The integration of explicit content, links to gene-specific reports in other NCBI databases and links to external resources all contribute to making Entrez Gene an effective site to retrieve gene-specific information.
The information in Entrez Gene can be accessed in multiple ways at NCBI (Table 2). The simplest way is to submit an interactive query to Entrez from the NCBI home page and display the results in Gene, or enter a query in any Entrez query bar and restrict the database search to Gene. Starting from Entrez Gene directly, the ‘Limits’ and ‘Advanced Search’ pages make it easier to construct complex queries and submit them. For example, the ‘Limits’ page supports finding genes by chromosome location or in a taxonomic node and the ‘Advanced Search’ page has a query builder, a function to browse all the terms in the database and the fields in which they occur (browse index) and a tool to combine and compare previous query results (search history). All the text in the Entrez Gene record is indexed to support retrieval. For a more comprehensive discussion on how to query Entrez Gene, please refer to the Query Tips section of the help documentation. If the location in the record that matches a query term is not immediately obvious, the text of interest may be in the next page of a paginated section.
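The kind of fielded query described above can be sketched as a small helper that combines terms with Boolean AND. The field tags used here (`sym`, `orgn`, `chr`) and the function names are assumptions for illustration; the authoritative list of fields is the one shown by the browse index and the Query Tips documentation.

```python
def field_term(value, field):
    """Qualify a search value with an Entrez field tag, e.g. DMD[sym]."""
    # Quote multi-word values so they are searched as a phrase.
    text = f'"{value}"' if " " in value else value
    return f"{text}[{field}]"


def build_gene_query(symbol=None, organism=None, chromosome=None):
    """Combine the supplied restrictions into one query string."""
    parts = []
    if symbol:
        parts.append(field_term(symbol, "sym"))       # assumed tag for gene symbol
    if organism:
        parts.append(field_term(organism, "orgn"))    # assumed tag for taxonomic node
    if chromosome:
        parts.append(field_term(chromosome, "chr"))   # assumed tag for chromosome
    if not parts:
        raise ValueError("at least one restriction is required")
    return " AND ".join(parts)


print(build_gene_query(symbol="DMD", organism="Homo sapiens"))
# DMD[sym] AND "Homo sapiens"[orgn]
```

The resulting string can be pasted into the Entrez Gene query bar, which makes it easy to compare a scripted query with one assembled in the ‘Advanced Search’ query builder.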
Another way to access Entrez Gene is to take advantage of links computed by the Entrez system (1). For example, users starting at PubMed may use the ‘Find related data’ or ‘All links from this record’ options to discover records in Entrez Gene connected to the publication(s). The BLAST group uses the GeneID–sequence relationship maintained by Entrez Gene to help users navigate from protein or mRNA accessions matching a sequence query to Entrez Gene via the blue G icon. Map Viewer provides links from annotated genes to Entrez Gene. RefSeq records also include the GeneID as a db_xref in the gene feature. Thus, users can navigate to Entrez Gene not only by text but also by genomic position, RefSeq annotation and sequence data (BLAST, Nucleotide, Protein).
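The computed links from PubMed to Gene can also be followed programmatically. The sketch below assumes the ELink service of E-Utilities, its default XML output and the element names shown; the base URL, the helper names, the placeholder PubMed ID and the hand-written sample response are all illustrative assumptions rather than material from this article.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed base address of the E-Utilities service (not given in the article).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def elink_pubmed_to_gene_url(pmid):
    """Build an ELink URL asking for Gene records linked to one PubMed ID."""
    params = {"dbfrom": "pubmed", "db": "gene", "id": pmid}
    return f"{EUTILS_BASE}/elink.fcgi?{urlencode(params)}"


def linked_gene_ids(elink_xml):
    """Extract target GeneIDs from ELink XML (assumed LinkSetDb/Link/Id layout)."""
    root = ET.fromstring(elink_xml)
    return [int(node.text) for node in root.iterfind(".//LinkSetDb/Link/Id")]


# Hand-written response in the assumed ELink layout; 12345678 is a placeholder PMID.
SAMPLE_RESPONSE = (
    "<eLinkResult><LinkSet><DbFrom>pubmed</DbFrom>"
    "<IdList><Id>12345678</Id></IdList>"
    "<LinkSetDb><DbTo>gene</DbTo><LinkName>pubmed_gene</LinkName>"
    "<Link><Id>1756</Id></Link></LinkSetDb>"
    "</LinkSet></eLinkResult>"
)

print(linked_gene_ids(SAMPLE_RESPONSE))  # [1756]
```

Only identifiers under `LinkSetDb` are collected, so the source PubMed ID echoed in `IdList` is not mistaken for a GeneID.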
Users are encouraged to register for MyNCBI (http://www.ncbi.nlm.nih.gov/books/NBK3843/), which supports registering searches and receiving e-mails when records are created or updated. It also supports customizing the display to identify which subset of records returned by a query has particular attributes.
The number of records in Entrez Gene will continue to increase as new species are sequenced and genes are identified. During 2011, sections will be added to the web interface and/or the content will be enhanced so that users will be provided more information in the full report before navigating to related sites at NCBI. This transition was started in 2010 with the addition of the phenotype section. Finally, as new databases with gene-specific content are implemented at NCBI, content and/or links will be added to Entrez Gene.
We welcome feedback with respect to the Entrez Gene interface or any data contained therein. Please select from the Feedback options on any Gene page (Figure 1).
Funding for open access charge: The Intramural Research Program of the National Institutes of Health; National Library of Medicine.
Conflict of interest statement. None declared.