FLAN (short for FLu ANnotation), the NCBI web server for genome annotation of influenza virus (http://www.ncbi.nlm.nih.gov/genomes/FLU/Database/annotation.cgi), is a tool for annotating user-provided influenza A virus or influenza B virus sequences. It can validate and predict protein sequences encoded by an input flu sequence. The input sequence is BLASTed against a database containing influenza sequences to determine the virus type (A or B), segment (1 through 8) and subtype for the hemagglutinin and neuraminidase segments of influenza A virus. For each segment/subtype of the viruses, a set of sample protein sequences is maintained. The input sequence is then aligned against the corresponding protein set with a ‘Protein to nucleotide alignment tool’ (ProSplign). The translated product from the best alignment to the sample protein sequence is used as the predicted protein encoded by the input sequence. The output can be a feature table that can be used for sequence submission to GenBank (by Sequin or tbl2asn), a GenBank flat file, or the predicted protein sequences in FASTA format. A message showing the length of the input sequence, the predicted virus type, segment and subtype for the hemagglutinin and neuraminidase segments of influenza A virus is also displayed.
The Influenza Genome Sequencing Project (1), funded by the National Institute of Allergy and Infectious Diseases (NIAID), has generated sequence data for nearly 2000 isolates of Influenza virus A and B. As a collaborator of this project, the National Center for Biotechnology Information (NCBI) annotates the sequences and releases them in GenBank as soon as the data are received. Because of the large number of sequences received in a short period of time, an automatic annotation procedure is desired.
The genomes of influenza virus A and B consist of eight RNA segments, each of which encodes one or two proteins. The expression of the MP segment of influenza virus A and of the NS segment of influenza virus A and B involves splicing. The hemagglutinin protein of influenza virus A is further processed into mature peptides. Because of the relatively complicated gene expression patterns in these segments, general viral genome prediction tools such as GeneMark (2), which uses heuristic approaches to find open reading frames, cannot be applied to annotate spliced gene products or mature peptides in influenza viruses.
The Genome Annotation Transfer Utility (3) annotates viral genomes using a closely related reference genome. Although it can handle splicing and mature peptides, users have to maintain a set of reference sequences for all segments and variations of influenza viruses, and select the corresponding one every time a sequence is uploaded for annotation. Since only one reference genome can be used at a time, it is hard for users to select the right reference genome before the annotation.
We developed a program, FLAN (short for FLu ANnotation), to automatically annotate genomes of influenza virus A and B based on existing protein sequences in GenBank. For each segment/subtype of the viruses, a set of sample protein sequences is maintained on the server. The input influenza sequence is then aligned against the corresponding protein set with a ‘Protein to nucleotide alignment tool’ (ProSplign). The translated product from the best alignment to the sample protein sequence is used as the predicted protein encoded by the input sequence. This program has been used for the annotation of more than 21000 published GenBank records of influenza virus A and B sequences generated from the NIAID Influenza Genome Sequencing Project, the St Jude Influenza Genome Project (4) and the Centers for Disease Control and Prevention. Here, we describe the web version of the FLAN program as part of the NCBI Influenza Virus Resource (http://www.ncbi.nlm.nih.gov/genomes/FLU/).
An input sequence is searched by BLAST (5) against a specialized influenza sequences database to determine the virus type (A or B), segment (1 through 8) and subtype for the hemagglutinin and neuraminidase segments of Influenza A virus. The database contains one reference sequence for each virus segment and each subtype of the hemagglutinin and neuraminidase (available at ftp://ftp.ncbi.nih.gov/genomes/INFLUENZA/ANNOTATION/blastDB.fasta). The top hit in the BLAST result is used to determine the virus type/segment/subtype of the input sequence.
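The top-hit classification step can be illustrated with a minimal sketch. It assumes BLAST results in the standard 12-column tabular format (bit score in the last column) and, purely for illustration, a hypothetical subject identifier layout of `type|segment|subtype`; the actual identifiers in blastDB.fasta may be organized differently.

```python
import csv
import io


def top_hit(blast_tabular: str) -> dict:
    """Classify a query from the highest bit-score hit in BLAST tabular text."""
    best = None  # (subject_id, bitscore)
    for row in csv.reader(io.StringIO(blast_tabular), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        bitscore = float(row[11])  # 12th column of the default tabular format
        if best is None or bitscore > best[1]:
            best = (row[1], bitscore)
    if best is None:
        raise ValueError("no BLAST hits")
    # ASSUMPTION: hypothetical subject ID layout "type|segment|subtype",
    # e.g. "A|4|H3" or "B|7|" (empty subtype for non-HA/NA segments).
    vtype, segment, subtype = best[0].split("|")
    return {"type": vtype, "segment": int(segment), "subtype": subtype or None}
```

For a query whose best hit is the subject `A|4|H3`, the function would report type A, segment 4 and subtype H3.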
Representatives of published protein and mature peptide sequences for each virus segment and different subtypes for the hemagglutinin and neuraminidase segments of Influenza A virus are maintained on the server side (available in the PROTEIN-A and PROTEIN-B directories at ftp://ftp.ncbi.nih.gov/genomes/INFLUENZA/ANNOTATION/). For the segments that encode proteins with large variations in amino acid sequences and mature peptide cleavage sites, more than one protein could be chosen to be included. For example, this collection currently has 16 different protein samples for hemagglutinin of influenza A virus. Based on the segment and subtype determined by the BLAST result, a subset of sample protein sequences is selected and aligned against the input sequence.
A special global protein-to-nucleotide alignment tool, ProSplign (manuscript in preparation, available at ftp://ftp.ncbi.nih.gov/genomes/TOOLS/ProSplign), was designed to accurately annotate spliced genes and mature peptides of influenza viruses. ProSplign also handles input sequences with insertions and/or deletions which may cause a frame shift in the coding region.
Annotation of mature peptides is a challenging task because they can be very short. A fragment of an influenza A virus hemagglutinin gene (GenBank accession number CY018949), used as the query sequence, is given in Figure 1A. The annotated mature peptide from the protein (GenBank accession number BAA21644) was used as a sample protein sequence. BLAST could not find any similarity between the two sequences because of the large sequence variation. Our solution is to use the global alignment tool ProSplign. The ProSplign alignment, along with the peptide sequence, is given in Figure 1A. The translation shown is used as the final annotation.
Some segments of influenza viruses have a spliced gene. ProSplign was specially designed to handle alignments with introns. It automatically finds the exact splice site locations. An example of a spliced alignment is given in Figure 1B. The global alignment to the sample protein sequence includes the start and stop codons as well as the GT/AG splice sites. In that case, the translation is taken as the final annotation.
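As a small illustration of the GT/AG splice-site signal mentioned above, the following sketch checks whether a proposed intron has canonical boundaries and joins the flanking exons. It is not how ProSplign locates splice sites; the function names and the 0-based, end-exclusive coordinates are assumptions made for this example.

```python
def has_canonical_splice_sites(seq: str, intron_start: int, intron_end: int) -> bool:
    """True if seq[intron_start:intron_end] starts with GT and ends with AG.

    Coordinates are 0-based and end-exclusive (an assumption of this sketch).
    """
    intron = seq[intron_start:intron_end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def spliced_cds(seq: str, exons: list) -> str:
    """Concatenate exon intervals (0-based, end-exclusive) into one coding sequence."""
    return "".join(seq[start:end] for start, end in exons)
```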
Two types of gaps are possible within the alignment of the input and sample sequences. A gap in the input sequence is considered a deletion because it reflects the loss of sequence compared to a reference genome. A need to insert a gap in the aligned sample sequence is considered an insertion because it reflects additional sequence in the input sequence compared to the reference genomic sequence. If the length of the insertion/deletion is not a multiple of three, it is a frame shift, because the translation changes its frame over the gap. ProSplign gives a severe penalty for a frame shift, so a frame-shifted alignment is produced only when there is a serious reason for it. Such an alignment indicates a sequencing error or a critical mutation. The ProSplign alignment shows the position of the frame shift and its exact length.
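The gap rules described above can be written as a short sketch. The `Gap` type and `classify_gap` function are hypothetical names introduced for illustration; they are not part of ProSplign.

```python
from dataclasses import dataclass


@dataclass
class Gap:
    kind: str         # "deletion" or "insertion", relative to the sample
    length: int       # gap length in nucleotides
    frameshift: bool  # True when the length is not a multiple of three


def classify_gap(gap_in: str, length: int) -> Gap:
    """Classify a gap by the sequence in which it appears ("input" or "sample")."""
    if gap_in == "input":
        kind = "deletion"   # input has lost sequence relative to the sample
    elif gap_in == "sample":
        kind = "insertion"  # input carries extra sequence
    else:
        raise ValueError("gap_in must be 'input' or 'sample'")
    return Gap(kind, length, length % 3 != 0)
```

For example, a 3-nucleotide gap in the input is an in-frame deletion, whereas a 1-nucleotide gap in the sample is a frame-shifting insertion.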
A successful protein-to-nucleotide alignment should pass the following criteria:
If an alignment passes all four criteria shown, FLAN adopts the translated protein from the alignment as the protein prediction. Positions of the start, stop, splice sites (if present) and mature peptide are taken from the alignment. If an alignment fails any of the criteria, FLAN iterates further by aligning the next sample protein from the reference subset. If none of the sample proteins produces an acceptable alignment, the best aligned sample protein (the one with the highest alignment score) is used to generate an error report.
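The iteration described above can be sketched as follows. This is a simplified illustration, not FLAN's code: `Alignment`, `choose_alignment` and the `align` callback are hypothetical, and the single `passes` flag stands in for the acceptance criteria.

```python
from typing import Callable, Iterable, NamedTuple, Optional, Tuple


class Alignment(NamedTuple):
    sample_id: str
    score: float
    passes: bool  # stand-in for "meets every acceptance criterion"


def choose_alignment(
    samples: Iterable[str],
    align: Callable[[str], Alignment],
) -> Tuple[Optional[Alignment], Optional[Alignment]]:
    """Return (accepted, best_failed).

    The first alignment that passes is accepted. If none passes, the
    highest-scoring failed alignment is returned for the error report.
    """
    best_failed = None
    for sample in samples:
        aln = align(sample)
        if aln.passes:
            return aln, None
        if best_failed is None or aln.score > best_failed.score:
            best_failed = aln
    return None, best_failed
```

Stopping at the first acceptable alignment keeps the number of alignment attempts low, which matters because each attempt is computationally expensive.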
The first output of a successful annotation is a feature table (http://www.ncbi.nlm.nih.gov/Sequin/table.html), which is a five-column, tab-delimited table of feature locations and qualifiers (Figure 2). FLAN also creates the ASN.1, XML and GenBank formatted views of the same annotation, using the following NCBI developed utilities: tbl2asn (http://www.ncbi.nlm.nih.gov/Genbank/tbl2asn2.html) and asn2xml (http://www.ncbi.nlm.nih.gov/Web/Newsltr/V14N1/toolkit).
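To make the five-column layout concrete, here is a minimal sketch that writes a feature table: the first interval line of a feature carries start, stop and the feature key, additional intervals (for example, the second exon of a spliced CDS) carry only start and stop, and qualifier lines begin with three empty columns. The function name, the sequence identifier `Seq1` and the coordinates in the usage example are illustrative assumptions, not FLAN output.

```python
def feature_table(seq_id: str, features: list) -> str:
    """Render features as a tab-delimited feature table.

    Each feature is (key, intervals, qualifiers), where intervals is a list of
    (start, stop) pairs and qualifiers is a list of (name, value) pairs.
    """
    lines = [f">Feature {seq_id}"]
    for key, intervals, qualifiers in features:
        for i, (start, stop) in enumerate(intervals):
            # Only the first interval of a feature names the feature key.
            lines.append(f"{start}\t{stop}\t{key}" if i == 0 else f"{start}\t{stop}")
        for name, value in qualifiers:
            lines.append(f"\t\t\t{name}\t{value}")
    return "\n".join(lines) + "\n"


# Illustrative spliced CDS with made-up coordinates.
print(feature_table("Seq1", [
    ("CDS", [(26, 51), (740, 1007)], [("product", "matrix protein 2")]),
]))
```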
The annotation of influenza sequences involves a resource-consuming alignment against a pre-selected protein set. Sometimes up to eight alignment attempts are performed before a good alignment is achieved. Moreover, the pre-selected set of sample proteins may be extended in the future, which would further increase the calculation time.
Internally, FLAN is implemented as a NetSchedule service, an NCBI-developed framework that allows background CGI tasks to run for more than 30 s (the default web front-end timeout).
NetSchedule is designed to work as a queue manager with a poll model of task distribution. The job submitter (in our case, the annotate.cgi CGI) connects to a specific queue, submits a job for execution and receives a special string token (the job key). Later, the user can call the CGI again and check the job status (‘Check status’ button). Jobs are executed by worker nodes that poll the queue, pick up jobs, compute and return the results (the annotation and diagnostic messages, if any). The NetSchedule schema is illustrated in Figure 3.
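The poll model can be illustrated with a small in-memory sketch. This is not the NetSchedule API: the `JobQueue` class, its method names and its status strings are hypothetical and only mirror the submit, poll and check-status steps described above.

```python
import uuid
from collections import deque


class JobQueue:
    """Toy in-memory queue following a poll model (not the NetSchedule API)."""

    def __init__(self):
        self._pending = deque()
        self._status = {}
        self._results = {}

    def submit(self, payload: str) -> str:
        """Submitter side: enqueue a job and return its job key."""
        key = uuid.uuid4().hex
        self._pending.append((key, payload))
        self._status[key] = "pending"
        return key

    def poll(self):
        """Worker side: take the next job, or None if the queue is empty."""
        if not self._pending:
            return None
        key, payload = self._pending.popleft()
        self._status[key] = "running"
        return key, payload

    def put_result(self, key: str, result: str) -> None:
        """Worker side: store the result and mark the job as done."""
        self._results[key] = result
        self._status[key] = "done"

    def status(self, key: str) -> str:
        """Submitter side: check a job by its key."""
        return self._status.get(key, "unknown")

    def result(self, key: str):
        return self._results.get(key)
```

Because the submitter holds only the job key, the web request can return immediately and the result can be fetched in a later request, which is how a long-running annotation avoids the front-end timeout.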
FLAN is available at http://www.ncbi.nlm.nih.gov/genomes/FLU/Database/annotation.cgi. The input data of FLAN is one or multiple sequences of influenza A virus or influenza B virus in FASTA format (http://www.ncbi.nlm.nih.gov/Sequin/sequin.hlp.html#FASTAFormatforNucleotideSequences), either pasted directly into a text box, or uploaded from a local file.
There are no parameters to select or enter to run this tool.
The output format can be selected from a drop-down menu. The formats include a feature table, a GenBank flat file, the predicted protein sequences in FASTA format, and XML. A message showing the predicted virus type, segment, and subtype for the hemagglutinin and neuraminidase segments of influenza A virus is displayed as well. Warning messages are shown along with the feature table if the input sequence does not have a start/stop codon or contains ambiguities. If frameshifts are found, or a stop codon is introduced within the coding region, no feature table is produced and an error message is shown instead, indicating the nature (insertion, deletion or mutation), the length and the location of the error.
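The warning conditions named above (missing start or stop codon, ambiguous bases) can be sketched as simple checks on a coding sequence. The function name and message texts are hypothetical, and the sketch assumes a DNA-alphabet sequence that uses ATG as the start codon and the standard stop codons; FLAN's actual checks are made on the protein-to-nucleotide alignment.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_warnings(cds: str) -> list:
    """Return warning strings for a coding sequence given as DNA letters."""
    cds = cds.upper()
    warnings = []
    if not cds.startswith("ATG"):
        warnings.append("no start codon")
    if cds[-3:] not in STOP_CODONS:
        warnings.append("no stop codon")
    if set(cds) - set("ACGT"):
        warnings.append("ambiguous bases")  # e.g. N, R, Y
    return warnings
```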
There are three major applications for the FLAN web server.
FLAN uses published influenza protein sequences as training sets. It will not annotate putative proteins that are reported in the literature (6,7) but not present in sequence databases, nor will it predict putative novel proteins that arise from mutations. It may not work as expected for some new sequence variations. Please report such cases to us so that we can improve this tool.
The authors would like to acknowledge Anatoliy Kuznetsov for providing information for Figure 3 and Alexander Souvorov for helpful discussion. This research was supported by the Intramural Research Program of the NIH, National Library of Medicine. Funding to pay the Open Access publication charges for this article was provided by the National Institutes of Health.
Conflict of interest statement. None declared.