The Basic Local Alignment Search Tool (BLAST) programs (8) perform sequence-similarity searches against a variety of sequence databases, beginning with either a query sequence or a GenBank accession number. BLAST returns a set of gapped alignments between the query and similar database sequences, with links to the full database records and to other relevant databases such as UniGene or LocusLink. The sequences of any or all of the database hits appearing in a BLAST alignment may be selected for bulk download. A BLAST variant, BLAST2Sequences (10), compares two DNA or protein sequences using any of the standard BLAST programs and produces a dot-plot representation of the alignments it reports.
Each alignment returned by a BLAST search receives a score and a measure of statistical significance, called the Expectation Value (E-value), for judging its quality. Either an E-value threshold or a range can be specified to limit the alignments returned. BLAST takes into account the amino acid composition of the query sequence in its estimation of statistical significance. This composition-based statistical treatment, used in conventional protein BLAST searches as well as PSI-BLAST (9) searches, tends to reduce the number of false-positive database hits (11).
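To make the relationship between a score and its E-value concrete, the sketch below applies the standard Karlin-Altschul formula, E = K·m·n·e^(-λS), where S is the raw alignment score and m and n are the query and database lengths. It is a simplified illustration, not BLAST's own implementation: it ignores edge-effect length corrections and the composition-based adjustment described above, and the function names are hypothetical. The values of λ and K depend on the scoring system and must be supplied by the caller.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a normalized bit score."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least raw_score.

    Simplified: uses the actual lengths rather than the 'effective'
    lengths a real search would use, and applies no composition-based
    correction.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


def within_range(e, max_e=10.0, min_e=0.0):
    """Keep an alignment only if its E-value lies in [min_e, max_e]."""
    return min_e <= e <= max_e


# Illustrative parameters only (assumed, not taken from the text).
LAM, K = 0.267, 0.041
e = e_value(raw_score=120, query_len=300, db_len=5_000_000, lam=LAM, k=K)
```

Expressed in bits, the same quantity is E = m·n·2^(-S'), which is why bit scores from searches with different scoring systems can be compared directly.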
The default BLAST output format is the ‘pairwise’ alignment; however, several ‘query-anchored’ multiple sequence alignment formats are available. An alignment option called the ‘Hit Table’ provides a compact, tabular, easily parsable summary of the BLAST results including, for each database hit, the positions of alignment starts and stops, scores and Expectation Values. These outputs may be returned as HTML, XML, text or ASN.1. In addition, BLAST can generate a taxonomically organized output that shows the distribution of BLAST hits by organism in three formats.
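As an illustration of how easily the Hit Table can be parsed, the following sketch reads tab-delimited hit lines into named records and filters them by E-value. It assumes the commonly used 12-column layout (query id, subject id, percent identity, alignment length, mismatches, gap openings, query start and end, subject start and end, E-value, bit score) with comment lines beginning with `#`; the column order should be checked against the header of the actual output. The sample rows and identifiers are invented for the example.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    query_id: str
    subject_id: str
    percent_identity: float
    alignment_length: int
    mismatches: int
    gap_opens: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    e_value: float
    bit_score: float


def parse_hit_table(text: str) -> List[Hit]:
    """Parse tab-delimited hit lines, skipping blanks and '#' comments."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        if len(f) != 12:
            raise ValueError(f"expected 12 columns, got {len(f)}: {line!r}")
        hits.append(Hit(
            f[0], f[1], float(f[2]),
            int(f[3]), int(f[4]), int(f[5]),
            int(f[6]), int(f[7]), int(f[8]), int(f[9]),
            float(f[10]), float(f[11]),
        ))
    return hits


# Invented sample rows, for illustration only.
SAMPLE = (
    "# Fields: query id, subject id, % identity, ...\n"
    "query1\tsubjA\t98.50\t200\t3\t0\t1\t200\t51\t250\t1e-105\t380\n"
    "query1\tsubjB\t41.20\t85\t48\t2\t10\t94\t7\t89\t0.5\t32.3\n"
)

hits = parse_hit_table(SAMPLE)
significant = [h for h in hits if h.e_value <= 1e-5]
```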
A particularly powerful feature of the web BLAST interface allows searches to be restricted to a database subset using standard Entrez search strings; the same restrictions may be used to screen the output of an initially unrestricted search. These features provide the means to effectively construct a custom database for searching, or to process the output of a search to include only sequences of interest. Web BLAST uses a standard URL-API that allows complete search specifications, including BLAST parameters, such as Entrez restrictions and the search query, to be contained in a URL posted to the web page.
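A search specification of this kind might be assembled as in the sketch below, which only builds the URL and does not submit it. The parameter names (`CMD`, `PROGRAM`, `DATABASE`, `QUERY`, `ENTREZ_QUERY`, `EXPECT`) follow NCBI's published URL-API; the endpoint address, the helper name `build_search_url` and the example query are assumptions made for illustration and are not taken from the text.

```python
from urllib.parse import urlencode

# Assumed endpoint; check current NCBI documentation before use.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_search_url(query, program="blastp", database="nr",
                     entrez_query=None, expect=None):
    """Encode a complete search specification into a single URL.

    entrez_query restricts the search to a database subset using a
    standard Entrez search string; expect sets the E-value threshold.
    """
    params = {
        "CMD": "Put",          # submit a new search
        "PROGRAM": program,
        "DATABASE": database,
        "QUERY": query,        # a sequence or an accession number
    }
    if entrez_query:
        params["ENTREZ_QUERY"] = entrez_query
    if expect is not None:
        params["EXPECT"] = str(expect)
    return BLAST_URL + "?" + urlencode(params)


# Hypothetical query restricted to human sequences.
url = build_search_url(
    "MKTAYIAKQR",
    entrez_query="Homo sapiens[Organism]",
    expect=0.001,
)
```

Because the Entrez restriction travels with the other parameters, the same URL reproduces both the search and the custom database subset it ran against.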
A recent addition to the BLAST family, called MegaBLAST (12), facilitates batch nucleotide queries, which can be pasted into a web page or uploaded from a file. MegaBLAST is designed to search for nearly exact matches and is up to 10 times faster than standard BLAST for such searches. MegaBLAST is provided to search entire eukaryotic genomes, but it is also used to search a rapidly growing database, called the Trace Archive, which contains over 125 million sequencing traces. The Trace Archive includes whole genome shotgun (WGS), shotgun, EST, clone end and finishing reads from over 30 organisms such as Homo sapiens, Mus musculus, Rattus norvegicus, Danio rerio, Zea mays and Caenorhabditis elegans. To facilitate rapid cross-species nucleotide queries of the Trace Archive, NCBI offers a version of MegaBLAST called Discontinuous MegaBLAST that uses a non-contiguous word match (13) as the nucleus for its alignments. Searches using Discontinuous MegaBLAST are far more rapid than cross-species translated searches such as blastx, but maintain a competitive degree of sensitivity when comparing coding regions.
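The idea of a non-contiguous word match can be sketched as follows: instead of requiring a run of consecutive identical bases, a template marks which positions in a window must match and which are ignored. The template used here, which skips every third position to tolerate changes at codon third positions, is a hypothetical example chosen for clarity; the text does not specify the templates used by Discontinuous MegaBLAST, and the function names are likewise illustrative.

```python
# Hypothetical template: '1' = position must match, '0' = ignored.
TEMPLATE = "110" * 6  # 18-base window, 12 matching positions


def masked_word(seq, start, template):
    """Return the bases of a window that fall on '1' template positions."""
    window = seq[start:start + len(template)]
    return "".join(b for b, t in zip(window, template) if t == "1")


def index_words(seq, template):
    """Map each masked word in seq to the window offsets where it occurs."""
    index = {}
    for i in range(len(seq) - len(template) + 1):
        index.setdefault(masked_word(seq, i, template), []).append(i)
    return index


def find_seeds(query, subject, template=TEMPLATE):
    """Return (query_offset, subject_offset) pairs sharing a masked word."""
    index = index_words(subject, template)
    seeds = []
    for q in range(len(query) - len(template) + 1):
        for s in index.get(masked_word(query, q, template), []):
            seeds.append((q, s))
    return seeds


# Two made-up coding fragments that differ at every third position.
query = "ATGGCTAAAGGTCTGACC"
subject = "ATAGCCAAGGGACTTACA"

spaced = find_seeds(query, subject)                # [(0, 0)]
contiguous = find_seeds(query, subject, "1" * 11)  # []
```

The two fragments share no contiguous 11-base word, yet the spaced template still yields a seed, which suggests why such matches can retain sensitivity on diverged coding regions without the cost of a translated search.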
The NCBI-generated assembly of the human genome, as well as other submitted genomic assemblies such as those of the mouse and zebrafish, may be searched using specialized genome BLAST pages. These pages search a set of genome-specific databases and generate, where possible, genomic views of the BLAST hits using the Map Viewer.
BLink displays pre-computed protein BLAST alignments for each protein sequence in the Entrez databases. BLink allows for the display of subsets of these alignments by taxonomic criteria, by database of origin, by relation to a complete genome, by membership in a Cluster of Orthologous Groups (COG) (14) or by relation to a 3D structure or conserved protein domain. BLink links are displayed for protein records in Entrez as well as within LocusLink reports.