An input sequence is searched by BLAST (5) against a specialized influenza sequence database to determine the virus type (A or B), the segment (1 through 8) and, for the hemagglutinin and neuraminidase segments of influenza A virus, the subtype. The database contains one reference sequence for each virus segment and for each subtype of hemagglutinin and neuraminidase (available at ftp://ftp.ncbi.nih.gov/genomes/INFLUENZA/ANNOTATION/blastDB.fasta). The top hit in the BLAST result is used to determine the virus type, segment and subtype of the input sequence.
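The top-hit rule can be illustrated with a minimal sketch. It assumes BLAST was run with standard tabular output (`-outfmt 6`, subject ID in column 2 and bit score in column 12) and that each reference sequence carries an ID of the form `A_4_H3` (type, segment, optional subtype). That ID scheme is hypothetical and chosen only for illustration; it is not the actual header format of blastDB.fasta.

```python
from typing import NamedTuple, Optional


class Classification(NamedTuple):
    virus_type: str         # "A" or "B"
    segment: int            # 1 through 8
    subtype: Optional[str]  # e.g. "H3" or "N2"; None for other segments


def parse_reference_id(ref_id: str) -> Classification:
    """Decode a hypothetical reference ID such as 'A_4_H3' or 'B_7'."""
    parts = ref_id.split("_")
    subtype = parts[2] if len(parts) > 2 else None
    return Classification(parts[0], int(parts[1]), subtype)


def classify_from_blast(tabular: str) -> Optional[Classification]:
    """Classify the query from the hit with the highest bit score.

    `tabular` is BLAST tabular output (-outfmt 6): 12 tab-separated
    columns per hit, subject ID in column 2, bit score in column 12.
    Returns None when there are no hits.
    """
    best_id, best_score = None, float("-inf")
    for line in tabular.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        score = float(fields[11])
        if score > best_score:
            best_id, best_score = fields[1], score
    return parse_reference_id(best_id) if best_id is not None else None
```

Selecting by bit score rather than by line order keeps the sketch independent of how the hits happen to be sorted in the report.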
Sample protein sequences
Representatives of published protein and mature peptide sequences for each virus segment, and for the different subtypes of the hemagglutinin and neuraminidase segments of influenza A virus, are maintained on the server side (available in the PROTEIN-A and PROTEIN-B directories at ftp://ftp.ncbi.nih.gov/genomes/INFLUENZA/ANNOTATION/). For segments that encode proteins with large variations in amino acid sequence and mature peptide cleavage sites, more than one sample protein may be included. For example, the collection currently has 16 different protein samples for the hemagglutinin of influenza A virus. Based on the segment and subtype determined from the BLAST result, a subset of sample protein sequences is selected and aligned against the input sequence.
A special global protein-to-nucleotide alignment tool, ProSplign (manuscript in preparation, available at ftp://ftp.ncbi.nih.gov/genomes/TOOLS/ProSplign), was designed to accurately annotate spliced genes and mature peptides of influenza viruses. ProSplign also handles input sequences with insertions and/or deletions which may cause a frame shift in the coding region.
Annotation of mature peptides is a challenging task because they can be very short. A fragment of the influenza A virus hemagglutinin gene (GenBank accession number CY018949) is used as the query sequence in Figure 1A. The annotated mature peptide from the protein (GenBank accession number BAA21644) was used as the sample protein sequence. BLAST could not find any similarity between the two sequences because of the large sequence variation. Our solution is to use the global alignment tool ProSplign. The ProSplign alignment, along with the peptide sequence, is given in Figure 1A. The translation shown is used as the final annotation.
Figure 1. (A) A fragment of the ProSplign alignment of the query influenza A virus segment 4 (at the top) against a signal peptide (the first 16 amino acids of BAA21644, at the bottom). Similarity is too low for BLAST to find a significant hit. Translation in the middle becomes ...
Some segments of influenza viruses contain a spliced gene. ProSplign was specially designed to handle alignments with introns and automatically finds the exact splice site locations. An example of a spliced alignment is given in Figure 1B. The global alignment of the sample protein sequence includes the start and stop codons as well as the GT/AG splice sites. In that case the translation is taken as the final annotation.
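To make the GT/AG and start/stop conditions concrete, the following sketch checks them on exon coordinates that an aligner has already produced. It does not reproduce ProSplign; the function names and the 0-based, end-exclusive coordinate convention are assumptions made for this example.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}

Exon = Tuple[int, int]  # 0-based start, end-exclusive


def splice(seq: str, exons: List[Exon]) -> str:
    """Concatenate the exon regions of `seq` into one coding sequence."""
    return "".join(seq[start:end] for start, end in exons)


def introns_are_canonical(seq: str, exons: List[Exon]) -> bool:
    """True if every intron between consecutive exons is GT...AG."""
    seq = seq.upper()
    for (_, donor), (acceptor, _) in zip(exons, exons[1:]):
        intron = seq[donor:acceptor]
        if len(intron) < 4:
            return False  # too short to hold both GT and AG
        if not (intron.startswith("GT") and intron.endswith("AG")):
            return False
    return True


def looks_like_complete_cds(cds: str) -> bool:
    """True if `cds` starts with ATG, ends with a stop codon and has a
    length that is a multiple of three (internal stops are not checked)."""
    cds = cds.upper()
    return len(cds) % 3 == 0 and cds.startswith("ATG") and cds[-3:] in STOP_CODONS
```

For a toy sequence made of exon `ATGGCA`, intron `GTAAGTTTTCAG` and exon `GGTTAA`, the exon list `[(0, 6), (18, 24)]` passes both checks.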
Two types of gaps are possible within the alignment of the input and sample sequences. A gap in the input sequence is considered a deletion because it reflects the loss of sequence compared to a reference genome. A gap that must be inserted in the aligned sample sequence is considered an insertion because it reflects additional sequence in the input compared to the reference genomic sequence. If the length of the insertion or deletion is not a multiple of three, it is a frameshift, because the translation changes its frame across the gap. ProSplign assigns a severe penalty to a frameshift, so it produces a frameshifted alignment only when the sequences strongly support one. Such an alignment indicates a sequencing error or a critical mutation. The ProSplign alignment shows the position of the frameshift and its exact length.
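The gap rules above can be sketched in a few lines. As a simplification, the example assumes that both alignment rows are given at nucleotide resolution as equal-length strings with `-` marking gap columns; the `Indel` record and `find_indels` function are illustrative names, not part of ProSplign.

```python
from typing import List, NamedTuple


class Indel(NamedTuple):
    kind: str         # "deletion" (gap in input) or "insertion" (gap in sample)
    position: int     # 0-based alignment column where the gap run starts
    length: int       # length of the gap run in nucleotides
    frameshift: bool  # True when length is not a multiple of three


def find_indels(input_row: str, sample_row: str) -> List[Indel]:
    """Report each run of gap columns and whether it shifts the frame."""
    if len(input_row) != len(sample_row):
        raise ValueError("alignment rows must have equal length")
    indels: List[Indel] = []
    i, n = 0, len(input_row)
    while i < n:
        if input_row[i] == "-":
            kind, row = "deletion", input_row
        elif sample_row[i] == "-":
            kind, row = "insertion", sample_row
        else:
            i += 1
            continue
        start = i
        while i < n and row[i] == "-":
            i += 1  # extend over the whole gap run
        length = i - start
        indels.append(Indel(kind, start, length, length % 3 != 0))
    return indels
```

For the rows `ATGGC--AAATTT` (input) and `ATGGCTGAAA---` (sample), the sketch reports a 2-nucleotide deletion at column 5 that is a frameshift and a 3-nucleotide insertion at column 10 that is not.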
Interpreting alignment result and creating outputs
A successful protein-to-nucleotide alignment should pass the following criteria:
- The input sequence should start with a correct start codon (or span the beginning of input sequence in case of partial 5′ end)
- The input sequence should end with one of the stop codons (or span the end of input sequence in case of partial 3′ end)
- The input sequence should have no frameshifts or internal stop codons
- The number of exon(s) must be correct (two for the second protein of segments 7 and 8 of influenza A virus and segment 8 of Influenza B virus, one exon for all other segments/proteins)
If an alignment passes all four criteria, FLAN adopts the translated protein from the alignment as the protein prediction. Positions of the start, stop, splice sites (if present) and mature peptide are taken from the alignment. If an alignment fails any of the criteria, FLAN iterates further by aligning the next sample protein from the reference subset. If none of the sample proteins produces an acceptable alignment, the best aligned sample protein (the one with the highest alignment score) is used to generate an error report.
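The acceptance loop described above might look roughly like the sketch below. It is not FLAN's code: the `Alignment` fields, the `align` callback and the message strings are all hypothetical, and the alignment itself is assumed to be computed elsewhere.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Alignment:
    sample_id: str
    score: float
    has_start: bool      # correct start codon, or partial 5' end
    has_stop: bool       # stop codon, or partial 3' end
    frameshifts: int
    internal_stops: int
    exons: int


def failed_criteria(aln: Alignment, expected_exons: int) -> List[str]:
    """Return the list of criteria the alignment fails (empty if none)."""
    problems = []
    if not aln.has_start:
        problems.append("missing start codon")
    if not aln.has_stop:
        problems.append("missing stop codon")
    if aln.frameshifts or aln.internal_stops:
        problems.append("frameshift or internal stop codon")
    if aln.exons != expected_exons:
        problems.append("wrong exon count")
    return problems


def annotate(
    samples: Iterable[str],
    align: Callable[[str], Alignment],
    expected_exons: int,
) -> Tuple[Optional[Alignment], List[str]]:
    """Try each sample protein in turn and accept the first clean alignment.

    If none passes, return the highest-scoring alignment together with
    its problems so that an error report can be generated from it.
    """
    best: Optional[Tuple[Alignment, List[str]]] = None
    for sample in samples:
        aln = align(sample)
        problems = failed_criteria(aln, expected_exons)
        if not problems:
            return aln, []
        if best is None or aln.score > best[0].score:
            best = (aln, problems)
    if best is None:
        return None, ["no sample proteins supplied"]
    return best
```

An empty problem list signals an accepted annotation; a non-empty one carries the diagnostics for the best failed attempt.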
Figure 2. A sample output of the FLAN tool. The top part is a feature table showing feature locations (for gene and CDS) and qualifiers (gene and product). The lower part shows the diagnostic information about the sequence annotation.
The annotation of influenza sequences involves resource-consuming alignment against a pre-selected protein set. Sometimes up to eight alignment attempts are performed before a good alignment is achieved. Moreover, the pre-selected set of sample proteins could be extended in the future, which would further increase the calculation time.
Internally, FLAN is implemented as a NetSchedule service, an NCBI-developed framework that allows background CGI tasks to run for more than 30 s (the default web front-end timeout).
NetSchedule is designed to work as a queue manager with a polling model of task distribution. A job submitter (in our case, the annotate.cgi CGI) connects to a specific queue, submits a job for execution and receives a special string token (the job key). After a while, the user can call the CGI again and check the job status (the ‘Check status’ button). Jobs are executed by worker nodes that poll the queue, pick up jobs, compute and return the results (annotation and diagnostic messages, if any). A NetSchedule schema is illustrated in Figure 3.
Figure 3. A NetSchedule (NS) schema. The client (end user) submits data to the CGI at the NCBI web server. The CGI connects and sends the data to the NetCache (NC) server. NC stores the data in a blob and returns the blob_id to the CGI. The CGI connects to the NS server, submits a request to execute ...
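The polling model can be illustrated with a small in-memory toy. This is not the NetSchedule API: the class, method names and job-key format below are invented for the example, and a real deployment would involve separate server and worker processes rather than a single Python object.

```python
import itertools
from collections import deque
from typing import Callable, Deque, Dict, Optional, Tuple


class ToyQueue:
    """In-memory stand-in for a queue manager with polling workers."""

    def __init__(self) -> None:
        self._pending: Deque[Tuple[str, str]] = deque()
        self._status: Dict[str, str] = {}
        self._results: Dict[str, str] = {}
        self._counter = itertools.count(1)

    def submit(self, payload: str) -> str:
        """Submitter side: enqueue a job and return its job key."""
        key = f"JOB_{next(self._counter)}"
        self._pending.append((key, payload))
        self._status[key] = "pending"
        return key

    def poll(self) -> Optional[Tuple[str, str]]:
        """Worker side: take the next job, or None if the queue is empty."""
        if not self._pending:
            return None
        key, payload = self._pending.popleft()
        self._status[key] = "running"
        return key, payload

    def put_result(self, key: str, result: str) -> None:
        """Worker side: store the result and mark the job as done."""
        self._results[key] = result
        self._status[key] = "done"

    def check(self, key: str) -> Tuple[str, Optional[str]]:
        """Submitter side: return (status, result or None) for a job key."""
        return self._status.get(key, "unknown"), self._results.get(key)


def worker_step(queue: ToyQueue, compute: Callable[[str], str]) -> bool:
    """Poll once; run `compute` on the job if one was available."""
    job = queue.poll()
    if job is None:
        return False
    key, payload = job
    queue.put_result(key, compute(payload))
    return True
```

In this toy, `submit` plays the role of the CGI handing off a sequence, `worker_step` is one polling cycle of a worker node, and `check` corresponds to the user pressing the status button with the job key.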