The Database of Genotypes and Phenotypes (dbGaP)
The correlation of genetic and environmental factors with human disease is vital to the development of diagnostic and therapeutic techniques. Large-scale genotype studies that provide the data for such analysis range from genome-wide association surveys and medical sequencing to molecular diagnostic assays and surveys of association between genotype and non-clinical traits. The Database of Genotypes and Phenotypes (dbGaP) (2) was recently created at NCBI to archive and distribute data that correlates genomic characteristics with observable traits, and to support the submission of such data. This database is an approved NIH repository for genome-wide association study (GWAS) results (grants.nih.gov/grants/gwas/index.htm).
To protect the confidentiality of study subjects, dbGaP accepts only de-identified data and requires investigators to go through an authorization process in order to access individual-level data. Summary metrics for phenotype measures and genotype frequencies, as well as study documents, protocols and subject questionnaires, are available without restriction.
Authorized access data distributed to primary investigators for use in approved research projects includes de-identified phenotypes and genotypes for individual study subjects, pedigrees and some pre-computed associations between genotype and phenotype. The results of several studies, including the National Eye Institute Age-Related Eye Disease Study (3), the NINDS Parkinsonism Study (4), the NHLBI Framingham SHARe and GAIN (2), were released by dbGaP in 2007.
New BLAST databases
Two new Basic Local Alignment Search Tool (BLAST) databases, one for human and one for mouse, were launched over the past year containing a combination of RefSeq transcript and RefSeq genomic sequences arising from NCBI annotations. Searches of the two databases generate a new, interactive tabular display that partitions the BLAST hits by sequence type—genomic or transcript—and allows sorting by BLAST score, percent of query sequence in the alignment, or percent identity within the alignment. Human and mouse ‘genomic + transcript’ MegaBLAST searches use a faster, indexed algorithm that typically reduces run time by two-thirds. The pre-indexed database has been filtered to eliminate matches to low-complexity and repeat sequences.
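The partitioning and sorting behavior of the tabular display can be illustrated with a minimal Python sketch. It is not the BLAST implementation: the `Hit` record, its field names and the accessions are hypothetical, and the split relies only on the standard RefSeq accession prefixes (NM_, NR_, XM_, XR_ for transcripts; NC_, NT_, NW_ for genomic sequences).

```python
from dataclasses import dataclass

# RefSeq accession prefixes used to tell transcript records from genomic ones.
TRANSCRIPT_PREFIXES = ("NM_", "NR_", "XM_", "XR_")
GENOMIC_PREFIXES = ("NC_", "NT_", "NW_")


@dataclass
class Hit:
    """Hypothetical container for one BLAST hit (not a real BLAST output type)."""
    accession: str
    score: float             # BLAST score
    query_coverage: float    # percent of query sequence in the alignment
    percent_identity: float  # percent identity within the alignment


def sequence_type(accession):
    """Classify a RefSeq accession as 'transcript', 'genomic' or 'other'."""
    if accession.startswith(TRANSCRIPT_PREFIXES):
        return "transcript"
    if accession.startswith(GENOMIC_PREFIXES):
        return "genomic"
    return "other"


def partition_hits(hits, sort_key="score"):
    """Group hits by sequence type, then sort each group in descending order."""
    groups = {"genomic": [], "transcript": [], "other": []}
    for hit in hits:
        groups[sequence_type(hit.accession)].append(hit)
    for group in groups.values():
        group.sort(key=lambda h: getattr(h, sort_key), reverse=True)
    return groups


# Made-up accessions and values, for illustration only.
hits = [
    Hit("NM_000001.1", 950.0, 100.0, 99.5),
    Hit("NC_000002.1", 1200.0, 80.0, 98.0),
    Hit("XM_000003.1", 400.0, 60.0, 100.0),
    Hit("NT_000004.1", 300.0, 95.0, 91.0),
]
by_score = partition_hits(hits)
by_identity = partition_hits(hits, sort_key="percent_identity")
```

Passing `sort_key="query_coverage"` would reorder each partition by the percent of the query covered, mirroring the third sort option described above.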
BLAST home page redesign
The BLAST homepage has been redesigned to provide easier navigation and simplified BLAST program selection. The new page highlights options for genomic searches, features automatic parameter optimization for searches with short queries and uses an auto-complete input box for specifying organism limitations. Using the new homepage, users can assign titles to their BLAST searches, review recent BLAST search results and save BLAST forms with custom parameters for indefinite periods via My NCBI. As part of the redesign, BLAST Request IDs (RIDs) have been shortened from 36 to 11 characters.
Short Read Archive
The past year has seen a massive increase in sequencing data generated from a new generation of sequencers, including those from Roche-454 Life Sciences, Illumina Solexa and Applied Biosystems SOLiD. This motivated development of the Short Read Archive (SRA) to accommodate deposits from sequencing experiments using these platforms. The SRA recently entered service and currently holds data from 44 studies.
The SRA offers more extensive associations than can be tracked within the Entrez system by separating the representation of study, experiment and sample parameters from actual instrument data. Indexing of these objects will allow for the presentation of a complete pipeline of scientific results going from instrument data all the way through publication. Auxiliary tools for searching short-read data and for visualizing multiple and pair-wise reference alignments are expected to appear in the coming year.
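The separation of study, experiment and sample descriptions from the instrument data can be pictured with a small Python sketch. All class and field names here are hypothetical and chosen for illustration; they are not the SRA schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sample:
    """Hypothetical description of the biological material that was sequenced."""
    sample_id: str
    description: str


@dataclass
class InstrumentData:
    """Hypothetical stand-in for the raw output of one sequencing run."""
    platform: str
    read_count: int


@dataclass
class Experiment:
    """Links one sample to the instrument data produced from it."""
    experiment_id: str
    sample: Sample
    data: List[InstrumentData] = field(default_factory=list)


@dataclass
class Study:
    """Top-level object grouping the experiments of one project."""
    study_id: str
    title: str
    experiments: List[Experiment] = field(default_factory=list)

    def total_reads(self):
        """Sum read counts over all instrument data attached to the study."""
        return sum(d.read_count for e in self.experiments for d in e.data)


# Made-up identifiers and counts, for illustration only.
sample = Sample("sample-1", "example genomic DNA")
experiment = Experiment(
    "experiment-1",
    sample,
    [InstrumentData("Roche-454", 250_000), InstrumentData("Roche-454", 300_000)],
)
study = Study("study-1", "example resequencing study", [experiment])
```

Because the descriptive objects only reference the instrument data, the same sample or study record could be indexed and linked without touching the bulk read data.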
Entrez Nucleotide database is split to become CoreNucleotide, EST and GSS
An important change in Entrez over the past year is the split of the Nucleotide database into three subset databases called ‘CoreNucleotide’, ‘EST’ and ‘GSS’ (specified as ‘nuccore’, ‘nucest’ and ‘nucgss’, respectively, within the E-Utilities). The CoreNucleotide database contains records for all Entrez Nucleotide sequences that are not found within the Expressed Sequence Tag (EST) or Genome Survey Sequence (GSS) divisions of GenBank. These include sequences from all remaining divisions of GenBank, NCBI Reference Sequences (RefSeqs), Whole Genome Shotgun (WGS) sequences, Third Party Annotation (TPA) sequences and sequences imported from the Entrez Structure database. The EST database contains all records found within the EST division of GenBank. EST records contain first-pass single-read cDNA sequences and include no annotated biological features. The GSS database contains all records found within the GSS division of GenBank. GSS records contain first-pass single-read genomic sequences and rarely include annotated biological features. The partitioning of the Nucleotide database makes it easier for researchers to focus on the segment of interest by separating the most richly annotated sequences from those that are sparsely annotated. During a transition period, searches of the Nucleotide database on the web will return links to search results in the three subsets. However, the Nucleotide database on the web will eventually be phased out entirely in favor of the three subset databases. The Nucleotide database will be retained for E-Utility use.
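As an illustration of how the three database names might be used within the E-Utilities, the following Python sketch builds ESearch request URLs for each subset. The base URL and the `retmax` parameter reflect the publicly documented E-Utilities interface rather than anything stated in this article, the helper name `esearch_url` is hypothetical, and the availability of each database name should be checked against current NCBI documentation. The code only constructs URLs and performs no network request.

```python
from urllib.parse import urlencode

# Assumption: the E-Utilities base URL as publicly documented by NCBI.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Display names of the three subsets and their E-Utilities identifiers.
SUBSET_DBS = {
    "CoreNucleotide": "nuccore",
    "EST": "nucest",
    "GSS": "nucgss",
}


def esearch_url(subset, term, retmax=20):
    """Build an ESearch URL for one Nucleotide subset (hypothetical helper)."""
    query = urlencode({"db": SUBSET_DBS[subset], "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


# One URL per subset for the same query term.
urls = {name: esearch_url(name, "Homo sapiens[Organism]") for name in SUBSET_DBS}
```

Keeping the display-name-to-identifier mapping in one dictionary makes it straightforward to run the same query against each subset in turn.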
The new Entrez Protein Clusters database (www.ncbi.nlm.nih.gov/sites/entrez?db=proteinclusters) contains over 222 000 sets of almost identical RefSeq proteins encoded by complete prokaryotic or chloroplast genomes and organized in a taxonomic hierarchy. These clusters are used as a basis for genome-wide comparison at NCBI as well as to provide simplified BLAST access via Concise Microbial Protein BLAST (www.ncbi.nlm.nih.gov/genomes/prokhits.cgi). Protein Clusters provides annotation information, publications, domains, structures, external links and analysis tools, including multiple alignments. Protein Clusters are also linked to genomic neighborhoods via Genome ProtMap (www.ncbi.nlm.nih.gov/sutils/protmap.cgi?), which maps each protein from a COG (5) or VOG (Viral Orthologous Groups) (www.ncbi.nlm.nih.gov/genomes/VIRUSES/vog.html) back to its genome, and displays the genomic segments coding for members of its group of related proteins.