The Basic Local Alignment Search Tool (BLAST) programs (9–11) perform sequence-similarity searches against a variety of sequence databases, returning a set of gapped alignments between the query and database sequences, and links to full database records, to UniGene, Gene, the MMDB or GEO. Sequences appearing in a BLAST alignment may be selected for bulk download. A BLAST variant, BLAST2Sequences (12), compares two DNA or protein sequences and produces a dot-plot representation of the alignments.
Each alignment returned by a BLAST search receives a score and a measure of statistical significance, called the Expectation Value (E-value), for judging its quality. Either an E-value threshold or a range can be specified to limit the alignments returned. BLAST takes into account the amino acid composition of the query sequence in its estimation of statistical significance. This composition-based statistical treatment, used in conventional protein BLAST searches as well as PSI-BLAST (11) searches, tends to reduce the number of false-positive database hits (13).
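To make the relationship between score and E-value concrete, the following minimal sketch applies the standard Karlin-Altschul relation E = K·m·n·exp(−λS), where m and n are the query and database lengths and S is the raw alignment score. The function names and the default values of λ and K are illustrative assumptions, not the parameters of any particular BLAST program, and the composition-based adjustment described above is not modeled.

```python
import math


def expect_value(raw_score, query_len, db_len, lam=0.267, k=0.041):
    """Karlin-Altschul estimate E = K * m * n * exp(-lambda * S).

    lam and k are placeholder statistical parameters (assumed values);
    real searches derive them from the scoring system in use.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


def within_range(evalue, low=0.0, high=10.0):
    """Keep an alignment only if its E-value lies in [low, high]."""
    return low <= evalue <= high


# Higher-scoring alignments receive smaller (more significant) E-values.
weak = expect_value(raw_score=40, query_len=300, db_len=1_000_000)
strong = expect_value(raw_score=120, query_len=300, db_len=1_000_000)
kept = [e for e in (weak, strong) if within_range(e, high=1e-3)]
```

Because E scales linearly with the database length, the same alignment becomes less significant as the database searched grows.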
BLAST offers several output formats including the default ‘pairwise’ alignment, several ‘query-anchored’ multiple sequence alignment formats and a tabular ‘Hit Table’, an easily parsed summary of the BLAST results. Users selecting the ‘new formatter’ option can also view alignments in a ‘Pairwise with identities’ mode that highlights differences between the query and a target sequence. The new formatter also offers an option to display masked characters in lower-case and with different colors rather than simply replacing each with an ‘X’ or an ‘N’. In addition, BLAST can generate a taxonomically organized output that shows the distribution of BLAST hits by organism. A new ‘sequence retrieval’ formatting option allows database sequences to be marked for batch retrieval using check boxes appearing in the BLAST results.
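As an illustration of how easily the tabular output can be parsed, the sketch below reads tab-delimited hit lines and filters them by E-value. It assumes the conventional 12-column layout (query id, subject id, % identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score) with comment lines beginning with ‘#’; this column order is not stated in the text above, and the sample rows are invented for the example.

```python
from typing import Iterator, NamedTuple


class Hit(NamedTuple):
    query_id: str
    subject_id: str
    percent_identity: float
    alignment_length: int
    evalue: float
    bit_score: float


def parse_hit_table(text: str) -> Iterator[Hit]:
    """Yield one Hit per data line; skip blank lines and '#' comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:  # assumed 12-column layout
            continue
        yield Hit(
            query_id=fields[0],
            subject_id=fields[1],
            percent_identity=float(fields[2]),
            alignment_length=int(fields[3]),
            evalue=float(fields[10]),
            bit_score=float(fields[11]),
        )


# Made-up sample rows, for illustration only.
SAMPLE = (
    "# Fields: query id, subject id, % identity, ...\n"
    "q1\ts1\t98.50\t200\t3\t0\t1\t200\t5\t204\t1e-100\t370\n"
    "q1\ts2\t75.00\t80\t20\t0\t10\t89\t1\t80\t0.5\t30.1\n"
)

hits = list(parse_hit_table(SAMPLE))
significant = [h for h in hits if h.evalue <= 1e-5]
```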
The web BLAST interface allows both the initial search and the results displayed to be restricted to a database subset using the Entrez search syntax. Web BLAST uses a standard URL–API that allows a complete search specification, including BLAST parameters such as Entrez restrictions and the search query, to be encoded in a URL posted to the web page.
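A search specification of this kind might be assembled as in the sketch below, which only builds the URL and sends nothing over the network. The endpoint address and the parameter names (CMD, PROGRAM, DATABASE, QUERY, ENTREZ_QUERY, EXPECT) are assumptions based on the publicly documented NCBI URL API rather than on the text above, and should be checked against current NCBI documentation before use. The helper name and the query sequence are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint; not given in the text above.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_search_url(query, program="blastn", database="nt",
                     entrez_query=None, expect=None):
    """Encode a complete search specification in a single URL."""
    params = {
        "CMD": "Put",          # assumed: submit a new search
        "PROGRAM": program,
        "DATABASE": database,
        "QUERY": query,
    }
    if entrez_query:
        # Restrict the search to a database subset with Entrez syntax.
        params["ENTREZ_QUERY"] = entrez_query
    if expect is not None:
        params["EXPECT"] = str(expect)  # E-value threshold
    return BLAST_URL + "?" + urlencode(params)


url = build_search_url(
    query="ATGGCTAAAGGTGAACTG",  # hypothetical query sequence
    entrez_query="Homo sapiens[Organism]",
    expect=1e-5,
)
```

Because `urlencode` escapes the brackets and spaces of the Entrez expression, the restriction survives intact inside the query string.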
A BLAST variant designed to search for nearly exact matches, called MegaBLAST (14), offers a web interface that handles batch nucleotide queries and operates up to 10 times faster than standard nucleotide BLAST. MegaBLAST is the default search program for NCBI's Genomic BLAST pages that search a set of genome-specific databases and generate, where possible, genomic views of the BLAST hits using the Map Viewer. MegaBLAST is also used to search the rapidly growing Trace Archive and is available for the standard BLAST databases as well. For rapid cross-species nucleotide queries of the Trace Archive as well as the standard BLAST databases, NCBI offers Discontiguous MegaBLAST, which uses a non-contiguous word match (15) as the nucleus for its alignments. Discontiguous MegaBLAST is far more rapid than a translated search such as BLASTX, yet maintains a competitive degree of sensitivity when comparing coding regions.
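The idea of a non-contiguous word match can be sketched as follows: a template of 1s and 0s marks which positions of a window must match, so that mismatches at the ignored positions (for example, the third base of each codon) do not prevent a seed from being found. The template below is hypothetical and chosen only for illustration; the templates actually used by Discontiguous MegaBLAST are not given in the text above, and the function names are likewise assumptions.

```python
def masked_word(seq, start, template):
    """Return the bases of a window that fall on '1' template positions."""
    window = seq[start:start + len(template)]
    return "".join(base for base, t in zip(window, template) if t == "1")


def find_seed_hits(query, subject, template):
    """Return (query_pos, subject_pos) pairs whose masked words agree."""
    width = len(template)
    index = {}
    for i in range(len(subject) - width + 1):
        index.setdefault(masked_word(subject, i, template), []).append(i)
    hits = []
    for j in range(len(query) - width + 1):
        for i in index.get(masked_word(query, j, template), []):
            hits.append((j, i))
    return hits


CONTIGUOUS = "1" * 12            # every position must match
DISCONTIGUOUS = "110110110110"   # hypothetical: ignore every third base

# Two made-up coding fragments differing only at third codon positions.
query = "ATGGCTAAAGGT"
subject = "ATGGCCAAGGGA"

exact_hits = find_seed_hits(query, subject, CONTIGUOUS)
spaced_hits = find_seed_hits(query, subject, DISCONTIGUOUS)
```

With these inputs the contiguous word yields no seed, whereas the discontiguous template yields one at position (0, 0), which illustrates why such seeds remain sensitive when comparing diverged coding regions.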
Several recent additions have been made to the suite of standard BLAST databases. Environmental sample data can now be searched within the ‘env_nt’ or ‘env_nr’ databases for nucleotide and protein sequences, respectively. A ‘RefSeq’ database is available for protein searches and ‘RefSeq_rna’ and ‘RefSeq_genomic’ databases are available for nucleotide searches. Also available for nucleotide searches are the ‘wgs’ and ‘chromosome’ databases for Whole Genome Shotgun project sequences and complete genomes, chromosomes, or contigs from RefSeq, respectively.
BLink
BLAST Link (BLink) displays pre-computed protein BLAST alignments for each protein sequence in the Entrez databases. BLink can display subsets of these alignments by taxonomic criteria, by database of origin, by relation to a complete genome, by membership in a COG (16) or by relation to a 3D structure or conserved protein domain. BLink links are displayed for protein records in Entrez as well as within Entrez Gene reports.