The Basic Local Alignment Search Tool (BLAST) family of programs (11) performs sequence-similarity searches, beginning with either a query sequence or a GenBank accession number. Successful searches return a set of gapped alignments between the query and similar database sequences, with links to the full database records. Each alignment receives a score and a measure of statistical significance, called the Expectation Value, for judging its quality.
The NCBI BLAST interface has been re-designed and offers several new search options including the specification of an Expectation Value range, rather than a threshold, for reporting alignments, and the specification of a residue range to limit searches to a portion of the query sequence. XML output is now supported. A new alignment format, called the ‘Hit Table’, provides a compact, tabular summary of the BLAST results including, for each database hit, the positions of alignment starts and stops, coupled with scores and Expectation Values. In addition, BLAST can generate a taxonomically organized output that shows the distribution of BLAST hits by organism in three formats.
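As a rough illustration of how a tabular summary of this kind might be consumed, the sketch below parses tab-delimited hit lines and keeps only hits whose Expectation Value falls inside a range. It assumes the 12-column layout used by BLAST tabular output (query ID, subject ID, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, Expectation Value, bit score); the exact columns of the ‘Hit Table’ are not specified here, so that layout, the helper names and the sample identifiers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query_id: str
    subject_id: str
    percent_identity: float
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    evalue: float
    bit_score: float


def parse_hit_table(text):
    """Parse tab-delimited hit lines (assumed 12-column layout) into Hit records."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment/header lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a data row in the assumed layout
        hits.append(Hit(
            query_id=fields[0],
            subject_id=fields[1],
            percent_identity=float(fields[2]),
            q_start=int(fields[6]),
            q_end=int(fields[7]),
            s_start=int(fields[8]),
            s_end=int(fields[9]),
            evalue=float(fields[10]),
            bit_score=float(fields[11]),
        ))
    return hits


def filter_by_evalue_range(hits, low, high):
    """Keep hits whose Expectation Value lies in [low, high], a range rather than a threshold."""
    return [h for h in hits if low <= h.evalue <= high]


# Hypothetical sample data; identifiers and values are invented for illustration.
SAMPLE = (
    "# Fields: query id, subject id, % identity, ...\n"
    "query1\tsubjA\t98.50\t200\t3\t0\t1\t200\t51\t250\t1e-100\t380\n"
    "query1\tsubjB\t45.00\t120\t60\t4\t10\t129\t5\t120\t0.002\t42.1\n"
    "query1\tsubjC\t30.00\t80\t50\t5\t20\t99\t300\t379\t5.0\t25.3\n"
)

hits = parse_hit_table(SAMPLE)
in_range = filter_by_evalue_range(hits, 1e-5, 1.0)
```

Filtering on a range rather than a single cutoff makes it possible, for example, to set aside near-identical matches and inspect only the more distant, borderline hits.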
A particularly powerful feature of the new BLAST interface allows searches to be restricted to a database subset using standard Entrez search strings; the same restrictions may be applied to screen the output of an initially unrestricted search. These features provide the means to effectively construct a custom database for searching, or to parse the output of a search to include only sequences of interest, respectively.
Along with the revised BLAST interface, NCBI has implemented a standard URL-API which allows complete search specifications, including BLAST parameters and search query, to be contained in the URL posted to the web page. A ‘GetURL’ button on the BLAST pages saves the current parameter set, and users can also easily construct URLs for custom searches themselves.
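A search URL of this kind could be assembled programmatically along the following lines. This is a minimal sketch, not the documented interface: the endpoint address and the parameter names (CMD, PROGRAM, DATABASE, QUERY, EXPECT, ENTREZ_QUERY) are assumptions modelled on NCBI's public URL-API documentation and should be checked against it, and the query peptide is an arbitrary illustrative string. The code only builds the URL; it does not contact any server.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed endpoint; verify against the current NCBI documentation.
BASE_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blast_url(query, program="blastp", database="nr",
                    expect=None, entrez_query=None):
    """Encode a complete search specification in a single URL.

    All parameter names are assumptions (see lead-in); optional
    settings are included only when supplied.
    """
    params = {
        "CMD": "Put",          # submit a new search
        "PROGRAM": program,
        "DATABASE": database,
        "QUERY": query,        # sequence or accession number
    }
    if expect is not None:
        params["EXPECT"] = str(expect)
    if entrez_query:
        # Restrict the search to a database subset with an Entrez search string.
        params["ENTREZ_QUERY"] = entrez_query
    return BASE_URL + "?" + urlencode(params)


# Arbitrary illustrative peptide, restricted to mouse sequences.
url = build_blast_url(
    "MKTAYIAKQRQISFVKSHFSRQ",
    expect=0.001,
    entrez_query="Mus musculus[Organism]",
)
decoded = parse_qs(urlsplit(url).query)
```

Because `urlencode` escapes spaces and brackets, an Entrez restriction such as `Mus musculus[Organism]` can be passed as ordinary text and recovered unchanged when the URL is decoded.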
On the algorithmic side, BLAST now takes into account the amino acid composition of the query sequence in its estimation of statistical significance. A composition-based statistical treatment, used in conventional protein BLAST searches as well as PSI-BLAST (12) searches, tends to reduce the number of false-positive database hits (13).
A useful BLAST utility, BLAST2Sequences (14), compares two DNA or protein sequences and produces a dot-plot representation of the alignments it reports. Translated search options, such as blastx, tblastx and tblastn, now extend the program’s range beyond blastn and blastp.
Using a new nucleotide BLAST variant, called MegaBLAST (15), batches of nucleotide sequences can be pasted into a web page or uploaded from a file, and used to search for nearly exact matches in nucleotide databases. MegaBLAST is up to 10 times faster than BLASTn for such searches.
NCBI has developed a semi-automated system for assembling both finished and unfinished human genome sequence on a regular basis so as to incorporate the most current data. The NCBI-generated assembly of the human genome may be searched via a specialized variant called Human Genome BLAST using either nucleotide or protein queries. Human Genome BLAST generates a custom ‘Genome view’ of the BLAST hits, which is integrated with the Human Genome MapViewer so that the hits can be viewed in the context of a combination of the maps used by the MapViewer, such as maps showing confirmed and predicted gene locations, or EST hits.
Finally, a special database, called the Trace Archive, which contains raw data underlying sequences generated by the various genome projects, may be searched using MegaBLAST. The Trace Archive contains Whole Genome Shotgun (WGS) reads from the mouse as well as data from rat, human, zebrafish and worm.
BLAST-Link (Blink) is a new resource that displays pre-computed protein BLAST alignments for each protein sequence in the Entrez databases. Blink allows for the display of subsets of these alignments by taxonomic criteria, by database of origin, by relation to a complete genome, by membership in a COG (16) or by relation to a 3D structure or conserved protein domain. Blink links are displayed for protein records in Entrez as well as within LocusLink reports.