Biosystems
NCBI Biosystems (www.ncbi.nlm.nih.gov/biosystems/) is a new database within Entrez that collects together molecules that interact in a biological system, such as a biochemical pathway or disease. Currently, Biosystems receives data from two sources: the Kyoto Encyclopedia of Genes and Genomes (2–4) and the EcoCyc subset of the BioCyc database (5). These source databases provide diagrams of pathways that display the various components with their substrates and products, as well as links to relevant literature. In addition to being linked to such literature in PubMed, each component within a Biosystem record is also linked to the corresponding records in Entrez Gene and Protein, while the substrates and products are linked to records in PubChem (see below). The Biosystem record thus centralizes NCBI data related to the pathway, greatly facilitating computation on such systems.
BLAST improvements and updates
There have been three main improvements to the NCBI BLAST web site this year. The first is the addition of Sequence Read Archive (SRA) transcript libraries as a new search set, which includes all public sequences from 454 sequencing systems. These sequences can be searched using the ‘Search SRA transcript libraries’ link in the ‘Specialized BLAST’ section of the BLAST web site. NCBI has also reorganized the page for aligning two sequences using BLAST (bl2seq), which now has a search page consistent with the other BLAST pages. On this page, users can enter multiple query sequences and multiple subject sequences, instead of one each as on the older page. The report for the new page is also a standard BLAST report, although a ‘Dot Matrix View’ is available if only one query and one subject sequence are entered. Finally, the BLASTP report now offers a new ‘Multiple Alignment’ option that uses COBALT (6) to perform a multiple alignment of the query sequence and any subject sequences listed in the BLAST report. If the user selects this link, a separate multiple alignment search is started and displayed in a separate browser window.
COBALT
COBALT (6) is a new multiple alignment algorithm that finds a collection of pairwise constraints derived from both the NCBI Conserved Domain database (CDD) and the sequence similarity programs RPS-BLAST, BLASTP and PHI-BLAST. These pairwise constraints are then incorporated into a progressive multiple alignment. COBALT searches can be launched either from a BLASTP result page or from the main COBALT search page (http://www.ncbi.nlm.nih.gov/tools/cobalt/), where either FASTA sequences or accessions (or a combination thereof) may be entered into the query sequence box. A COBALT report will then be displayed with the input protein titles at the top and the multiple alignment at the bottom. From this page, it is also possible to get a tree view for the multiple alignment or to launch a modified search using the ‘Edit and Resubmit’ link. In the near future, the tool will provide additional display and download options such as gapped FASTA.
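To illustrate what a mixed query of FASTA sequences and accessions might look like, the following Python sketch separates such input into its two kinds of entries. It is not COBALT's own parser: the function name `split_query_input`, the rule that a blank line ends a FASTA record, and the placeholder accession `ACC00001` are all assumptions made for this example.

```python
from typing import List, Tuple


def split_query_input(text: str) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split mixed query text into FASTA records and bare accessions.

    Assumed conventions (hypothetical, for illustration only):
      * a line starting with '>' opens a FASTA record (title line);
      * following non-blank lines are sequence residues for that record;
      * a blank line closes the current record;
      * any other non-blank line outside a record is treated as an accession.
    """
    records: List[List[str]] = []   # each entry is [title, sequence]
    accessions: List[str] = []
    current = None                  # the FASTA record being filled, if any

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            current = None          # blank line ends the open record
            continue
        if line.startswith(">"):
            current = [line[1:].strip(), ""]
            records.append(current)
        elif current is not None:
            current[1] += line      # continuation of the sequence
        else:
            accessions.append(line)

    return [(title, seq) for title, seq in records], accessions


if __name__ == "__main__":
    example = ">seq1 demo protein\nMKT\nAIL\n\nACC00001\n>seq2\nGGH\n"
    fasta, accs = split_query_input(example)
    print(fasta)
    print(accs)
```

Keeping the two kinds of entries apart in this way makes it straightforward to fetch sequences for the accessions separately before combining everything into one query set.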
Discovery components within the Entrez system
Underlying and connecting the several databases within the Entrez system is an extensive network of links and precalculated similarity data that have been relatively inaccessible to users. In an effort to assist researchers in finding these links and using them to discover interesting relationships within the NCBI databases, NCBI is developing three types of ‘discovery components’ on Entrez web pages: sensors, which analyze search queries and display data potentially related to the query terms; database ‘ads’, which promote links to highly relevant data in a different database; and analysis tools, which provide further insight on the record being viewed. Examples of such components released so far include the citation and gene sensors in PubMed that, respectively, activate when citation elements or gene symbols appear in a query; the PubMed Central (PMC) and three-dimensional (3D) structure ads on PubMed abstract pages that provide links to free full-text articles or 3D structures reported by the paper; and BLAST and Primer-BLAST links provided on nucleotide sequence records. As part of this effort, the nucleotide and protein record pages were redesigned to highlight numerous links from sequences to related data including literature, Reference Sequences (RefSeqs), genes, gene homologs, transcript clusters, clones and conserved domains.
GeneReviews and GeneTests
NCBI now hosts GeneReviews and GeneTests, two resources developed by a team led by Roberta A. Pagon, University of Washington. GeneReviews (www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=gene) is a compendium of continually updated, expert-authored and peer-reviewed disease descriptions that relate genetic testing to the diagnosis, management and genetic counseling of patients and families with specific inherited conditions (7,8). These reviews can be searched via the GeneReviews tab at the GeneTests home page (www.ncbi.nlm.nih.gov/sites/GeneTests/), NCBI’s Bookshelf site, NCBI’s All Databases interface or major web search engines.
The GeneTests Laboratory Directory and Clinic Directory list information voluntarily provided by laboratories about their tests and services and by genetics clinics about their clinical genetics services. As appropriate, users can search by a disease name, gene symbol, protein name, clinical genetics service and information about a lab/clinic, such as its name, director and location. Clinics in the USA can also be found via a map-based search. Together, GeneReviews and the GeneTests directories support the integration of information on genetic disorders and genetic testing into a single resource to facilitate the care of patients and families with inherited conditions.
H1N1 influenza sequences
In response to the 2009 H1N1 influenza outbreak, NCBI provided a new web page as part of the NCBI Influenza Virus Resource (described below) that allows direct access to all H1N1 sequences as they are submitted (www.ncbi.nlm.nih.gov/genomes/FLU/SwineFlu.html). From this page, users can download all available sequences (currently 5000) in a single batch. In addition, NCBI has created a record in the Projects database (project ID 37813) to centralize all data related to the H1N1 influenza virus.
MyNCBI updates
MyNCBI allows users to store personal configuration options such as search filters, LinkOut preferences and document delivery providers. After logging into a MyNCBI account, a user can save searches and arrange to receive periodic emails containing updated search results. A MyNCBI feature called ‘Collections’ allows users to save search results and bibliographies indefinitely. Several enhancements have been made to MyNCBI in the past year, particularly regarding sharing information with other users. A new ‘Shared Settings’ panel provides a single interface where a user can select settings to be shared; by constructing a simple URL and providing it to other users, the user gives the entire group access to these common settings. In MyNCBI, both collections and bibliographies can now be set as either private or public, the latter of which can be shared with multiple users. Finally, the Recent Activity feature has been dramatically expanded to include up to 6 months of activity within MyNCBI, rather than only a user's previous five actions.
Peptidome
Peptidome (9) is a new data repository for tandem mass spectrometry peptide and protein identification data generated by the scientific community. Data from all stages of a mass spectrometry experiment are captured, including original mass spectra files, experimental metadata and conclusion-level results. The submission process is facilitated through acceptance of data in commonly used open formats, and all submissions undergo syntactic validation and curation in an effort to uphold data integrity and quality. Peptidome is not restricted to specific organisms, instruments or experiment types; data from any tandem mass spectrometry experiment and from any species are accepted. In addition to data storage, web-based interfaces are available to help users query, browse and explore individual peptides, proteins or entire samples and studies. Metadata for all public samples and studies along with that for the associated proteins in each sample are loaded into Entrez Peptidome.
PubChem 3D and PC3D
PubChem now provides 3D conformers for ~70% of the 25 million records in the PubChem Compound database. Currently, only one conformer is provided for each compound, and these conformers are not necessarily at minimum energy but are low energy conformers selected from a theoretical model (for more information, see pubchem.ncbi.nlm.nih.gov/release3d.html). PubChem also provides precomputed neighboring of all 3D conformers via the ‘Similar Conformers’ link in Entrez. In addition, a new viewer application, PC3D, is available to view both individual conformers and overlays of similar conformers. PC3D is available both as a web application and as a downloadable executable for Windows, Macintosh and Linux platforms.
Sequence Read Archive in Entrez
In 2009, the Sequence Read Archive (SRA, see below) (10), a repository for data generated by next-generation sequencing technologies, was added to the Entrez system of databases, thereby allowing the SRA data to be searched using fielded text queries and more easily linked with related data at NCBI. Within Entrez SRA (www.ncbi.nlm.nih.gov/sra/), the data are organized into four types of records: studies (SRP accessions), experiments (SRX accessions), samples (SRS accessions) and runs (SRR accessions). Studies contain one or more experiments, each of which contains one or more runs, each of which in turn may contain data on tens of millions of individual reads. The various record types representing data from a study are all linked to one another within Entrez, allowing users to browse the data easily on the web.
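The record hierarchy described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SRA data model itself: the class and function names are hypothetical, only the four accession prefixes named in the text are recognized, and the accessions in the usage example are placeholders rather than references to real records.

```python
from dataclasses import dataclass, field
from typing import List

# Accession prefixes named in the text, mapped to their record types.
SRA_PREFIXES = {
    "SRP": "study",
    "SRX": "experiment",
    "SRS": "sample",
    "SRR": "run",
}


def sra_record_type(accession: str) -> str:
    """Return the record type implied by an accession's three-letter prefix."""
    acc = accession.strip().upper()
    prefix, digits = acc[:3], acc[3:]
    if prefix not in SRA_PREFIXES or not digits.isdigit():
        raise ValueError(f"not a recognized SRA accession: {accession!r}")
    return SRA_PREFIXES[prefix]


@dataclass
class Experiment:
    """An experiment (SRX) holding the accessions of its runs (SRR)."""
    accession: str
    runs: List[str] = field(default_factory=list)


@dataclass
class Study:
    """A study (SRP) containing one or more experiments."""
    accession: str
    experiments: List[Experiment] = field(default_factory=list)

    def all_runs(self) -> List[str]:
        """Collect run accessions from every experiment, in order."""
        return [run for exp in self.experiments for run in exp.runs]


if __name__ == "__main__":
    # Placeholder accessions, used only to exercise the structure.
    study = Study(
        "SRP000001",
        [
            Experiment("SRX000001", ["SRR000001", "SRR000002"]),
            Experiment("SRX000002", ["SRR000003"]),
        ],
    )
    print(sra_record_type(study.accession))
    print(study.all_runs())
```

Modeling runs as children of experiments, and experiments as children of studies, mirrors the containment described in the text; samples are classified by prefix here but left out of the hierarchy, since the text does not state where they attach.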
dbVar—Database of genomic structural variation
In 2009, NCBI launched a new database of genomic structural variations called dbVar (www.ncbi.nlm.nih.gov/projects/dbvar/). While the site is not yet fully functional, NCBI is accepting submissions to dbVar and provides FTP access to these data. At the time of this writing, dbVar contained seven studies with >400 000 reported variants.