The NCBI Influenza Virus Sequence Database contains nucleotide sequences of all influenza viruses in the EMBL/DDBJ/GenBank databases, as well as protein sequences and their encoding regions derived from the nucleotide sequences. The influenza database is updated, usually within a day or two, after new sequences become available or older sequence records are updated in GenBank. Information for database fields (subtype, segment, host, country, year, etc.) is extracted automatically from GenBank records and examined by NCBI staff. BLAST searches are performed for all new sequences against the influenza virus sequences in GenBank to verify critical information such as subtype, segment, and year. Information that is not available in GenBank records is obtained from the literature, through direct contact with sequence submitters, or by sequence analysis whenever possible.
Figure 2A shows the basic database query interface (http://www.ncbi.nlm.nih.gov/genomes/FLU/Database/select.cgi?go=1). In querying the database, users may choose to search among nucleotide sequences, protein sequences, or coding regions (CDS). Queries may be restricted by the following additional selectable fields: Virus Species (e.g., Influenzavirus A, Influenzavirus B), Host (e.g., Human, Avian), Subtype (e.g., H3N2, H5), Segment (1 through 8), Country/Region (e.g., Australia, Asia), a range of years (From year, To year) during which the viruses were isolated, and a range of sequence lengths.
FIG. 2. (A) Influenza Virus Sequence Database query page. (B) List of sequences retrieved from the query page shown in panel A. (C) Multiple alignment of sequences from those listed in panel B. (D) Graphic view of multiple alignment of sequences from those shown (more ...)
On the other hand, in the advanced database search tool (http://www.ncbi.nlm.nih.gov/genomes/FLU/Database/multiple.cgi), multiple names can be selected simultaneously for species, host, country/region, segment, and subtype. A list of subtypes separated by commas (e.g., H5N1, H3, N2) can be entered in the boxes after “Only these Subtypes” and/or “All Subtypes except.” The number of sequences found by a query will be displayed after the “Update count” button is clicked.
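The comma-separated subtype lists can be illustrated with a short sketch. It assumes that a partial entry such as H3 or N2 matches any subtype carrying that hemagglutinin or neuraminidase component; the database's actual matching rule is not described here, and all function names are hypothetical.

```python
import re

def parse_subtype_list(text):
    """Split a comma-separated subtype string such as 'H5N1, H3, N2'."""
    return [item.strip().upper() for item in text.split(",") if item.strip()]

def split_subtype(subtype):
    """Return the (H, N) components of a subtype; either may be None."""
    match = re.fullmatch(r"(H\d+)?(N\d+)?", subtype.upper())
    if match is None or not subtype:
        raise ValueError(f"unrecognized subtype: {subtype!r}")
    return match.group(1), match.group(2)

def subtype_matches(pattern, subtype):
    """Assumed rule: every component given in the pattern must equal the record's."""
    p_h, p_n = split_subtype(pattern)
    s_h, s_n = split_subtype(subtype)
    return (p_h is None or p_h == s_h) and (p_n is None or p_n == s_n)

def filter_subtypes(subtypes, only=None, exclude=None):
    """Apply 'Only these Subtypes' and/or 'All Subtypes except' style lists."""
    only_patterns = parse_subtype_list(only or "")
    exclude_patterns = parse_subtype_list(exclude or "")
    kept = []
    for subtype in subtypes:
        if only_patterns and not any(subtype_matches(p, subtype) for p in only_patterns):
            continue
        if any(subtype_matches(p, subtype) for p in exclude_patterns):
            continue
        kept.append(subtype)
    return kept
```

Under these assumptions, `filter_subtypes(records, only="H5N1, H3, N2")` keeps H5N1, every H3 subtype, and every N2 subtype, while `exclude="N2"` drops every subtype with an N2 neuraminidase.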
A string of words or a nucleotide/protein sequence (e.g., New York, AGCGAAAGCAGGGGT, or RSKV) can be added to the “Search by a string” box to be included in the search. Search results can be limited to “Full-length sequences only” (does not apply to protein sequences or CDS sequences of segments that encode more than one protein, i.e., the PB1, MP, and NS segments) by checking the appropriate boxes. For nucleotide sequences, “full-length” is defined as not shorter than the complete coding region.
When the box in front of “Remove identical sequences” is checked, all groups of identical sequences in a data set will be represented by the oldest sequence in the group. By checking the box in front of “Sequences from the FLU project only,” search results can be restricted only to sequences from large-scale influenza virus genome sequencing projects, which usually contain complete genomes, detailed source information, and high-quality annotations. Currently, this includes sequences from the NIAID Influenza Genome Sequencing Project (9), the St. Jude Influenza Genome Project (12), the Centers for Disease Control and Prevention, the Air Force Institute for Operational Health, and the University of Hong Kong. Sequences of recombinant or lab strains (those flagged as “LAB” in the country field) are not included in the search by default, but they may be included by checking the box next to “Include Lab strains.”
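The effect of “Remove identical sequences” can be sketched as follows. This is a minimal illustration, not the database's implementation: it assumes each record carries a date by which “oldest” is judged, and the field names (`accession`, `sequence`, `date`) are hypothetical.

```python
from datetime import date

def remove_identical(records):
    """Keep one record per distinct sequence: the one with the earliest date.

    records: iterable of dicts with 'accession', 'sequence', and 'date' keys
    (hypothetical field names; 'date' is a datetime.date).
    """
    oldest = {}
    for rec in records:
        key = rec["sequence"].upper()  # compare sequences case-insensitively
        current = oldest.get(key)
        if current is None or rec["date"] < current["date"]:
            oldest[key] = rec
    return list(oldest.values())

# Usage with made-up records: A and B share a sequence, and B is older.
records = [
    {"accession": "A", "sequence": "AGCGAAAGC", "date": date(2005, 3, 1)},
    {"accession": "B", "sequence": "agcgaaagc", "date": date(1999, 7, 15)},
    {"accession": "C", "sequence": "AGCAAAAGC", "date": date(2001, 1, 1)},
]
representatives = remove_identical(records)
```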
After the “Add to Query Builder” button shown in Fig. 2A is clicked, the selected query and the number of resulting sequences will be shown in “Query Builder.” Nucleotide or protein sequences can also be retrieved by entering an accession number in the box to the left of the “Find sequence by Accession” button. Multiple queries can be built by repeating the above steps. Any combination of queries from the “Query Builder” can be selected to get sequences from the database.
Sequences found by the selected queries will be shown in a separate window (Fig. 2B) once the “Get sequences” button is clicked. The sequence display can be sorted by up to three fields in succession by selecting one field in each of the “Ordered by the following fields” boxes. Sequences of interest can be selected by checking the boxes to the left of the accession numbers. The corresponding protein, coding region, or nucleotide sequences of the selected sequences can be downloaded by selecting the appropriate name in the “Select FASTA sequences to download” drop-down menu. To help users identify the downloaded sequences, the following string is inserted between the GenBank sequence identifier and the sequence title in the FASTA definition lines: /host/segment number(name)/subtype/country/year/month/date/. A list of GenBank accession numbers for selected protein or nucleotide sequences can also be downloaded from the “Select accession list to download” menu.
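The inserted string can be recovered from a downloaded definition line with a small parser. The sketch below follows the /host/segment number(name)/subtype/country/year/month/date/ layout given above; the function name and the sample definition line are made up for illustration, and real definition lines may differ in spacing or identifier format.

```python
import re

# Pattern for the inserted string:
# /host/segment number(name)/subtype/country/year/month/date/
_FLU_FIELDS = re.compile(
    r"/(?P<host>[^/]*)"
    r"/(?P<segment>\d+)\((?P<segment_name>[^)]*)\)"
    r"/(?P<subtype>[^/]*)"
    r"/(?P<country>[^/]*)"
    r"/(?P<year>[^/]*)"
    r"/(?P<month>[^/]*)"
    r"/(?P<day>[^/]*)/"
)

def parse_flu_defline(defline):
    """Return the annotation fields as a dict, or None if the string is absent."""
    match = _FLU_FIELDS.search(defline)
    if match is None:
        return None
    fields = match.groupdict()
    fields["segment"] = int(fields["segment"])
    # Text before the inserted string is the identifier; text after is the title.
    fields["identifier"] = defline[:match.start()].lstrip(">").strip()
    fields["title"] = defline[match.end():].strip()
    return fields

# Hypothetical definition line, not a real GenBank record.
example = (">gi|0|gb|XX000000|/Human/4(HA)/H3N2/USA/2004/12/15/ "
           "Influenza A virus (A/New York/1/2004(H3N2)) segment 4")
parsed = parse_flu_defline(example)
```

Date components are kept as strings because the month or day may be empty when the isolation date is only partially known.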
Further sequence analyses of the selected sequences can be performed by clicking the “Do multiple alignment” or “Build a tree” button. Users' own sequences (of the same sequence type, in FASTA format) can be added to the selected sequences for analysis by clicking the “Add your own sequences” button. The file of added sequences cannot exceed 128 kilobytes in size.
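Before uploading, a user's FASTA text can be checked against the size limit locally. This is a rough sketch under two assumptions: that 1 kilobyte is counted as 1,024 bytes and that the limit applies to the UTF-8 encoded text. The function name is hypothetical.

```python
MAX_UPLOAD_BYTES = 128 * 1024  # assumption: 1 kilobyte = 1,024 bytes

def check_fasta_upload(text, limit=MAX_UPLOAD_BYTES):
    """Return the number of FASTA records, or raise ValueError if unusable."""
    size = len(text.encode("utf-8"))
    if size > limit:
        raise ValueError(f"file is {size} bytes; the limit is {limit} bytes")
    if not text.lstrip().startswith(">"):
        raise ValueError("text does not start with a FASTA definition line")
    # Each record begins with a definition line starting with '>'.
    return sum(1 for line in text.splitlines() if line.startswith(">"))
```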