Gene annotation in viruses often relies upon similarity search methods. These methods possess high specificity but some genes may be missed, either those unique to a particular genome or those highly divergent from known homologs. To identify potentially missing viral genes we have analyzed all complete viral genomes currently available in GenBank with a specialized and augmented version of the gene finding program GeneMarkS. In particular, by implementing genome-specific self-training protocols we have better adjusted the GeneMarkS statistical models to sequences of viral genomes. Hundreds of new genes were identified, some in well studied viral genomes. For example, a new gene predicted in the genome of the Epstein–Barr virus was shown to encode a protein similar to α-herpesvirus minor tegument protein UL14 with heat shock functions. Convincing evidence of this similarity was obtained only after 12 PSI-BLAST iterations. In another example, several iterations of PSI-BLAST were required to demonstrate that a gene predicted in the genome of Alcelaphine herpesvirus 1 encodes a BALF1-like protein which is thought to be involved in apoptosis regulation and, potentially, carcinogenesis. New predictions were used to refine annotations of viral genomes in the RefSeq collection curated by the National Center for Biotechnology Information. Importantly, even in those cases where no sequence similarities were detected, GeneMarkS significantly reduced the number of primary targets for experimental characterization by identifying the most probable candidate genes. The new genome annotations were stored in VIOLIN, an interactive database which provides access to similarity search tools for up-to-date analysis of predicted viral proteins.
Currently, the complete genome of a virus can be sequenced within days. The next step towards understanding the details of a virus life cycle is to identify the whole complement of viral genes and proteins. This information can provide critical insights on many occasions. For instance, for a team working on an antiviral drug design, promising drug targets would be those viral proteins that are basically identical in all major strains of a virus and are significantly different from the proteins in the host, e.g. human.
At the time of this study, the GenBank database (1) contained ~3000 annotated complete viral genome sequences. In most cases, research groups providing the original annotation are unable to detect and confirm all genes experimentally by the time of submission. Computational approaches have therefore been commonly used since the time of pioneer projects such as the sequencing and annotation of phage λ (2).
There are two major approaches to gene identification, intrinsic and extrinsic (3). The intrinsic approach, which can be also called an ab initio statistical approach, uses statistical patterns of nucleotide frequencies and nucleotide ordering observed in a given genome. These patterns are not the same in protein-coding and non-coding DNA sequences; hence a properly trained intrinsic method can recognize protein-coding regions. Extrinsic methods seek to identify evolutionarily conserved sequences in protein-coding regions. These sequences can be detected by similarity searches. The extrinsic method is thus dependent on external information residing outside the sequence of interest.
Intrinsic and extrinsic methods have complementary strengths. Tests of their predictive power performed with sets of sequences containing known genes show that the intrinsic methods have higher sensitivity than the extrinsic methods which usually have higher specificity. Using intrinsic and extrinsic methods in concert is therefore a worthwhile approach (3).
So far, the use of computational gene identification methods in viral genomes by the groups of researchers submitting genomic data to GenBank was primarily restricted to similarity searches. To reduce the risk of missing real genes, a simple statistics-based rule is frequently applied to take into account the difference in length distributions of real genes and random open-reading frames (ORFs). This rule suggests annotating ‘long enough’ ORFs as genes. For instance, in the rat cytomegalovirus genome any ORF longer than 300 nt not overlapping an adjacent ORF to an extent larger than 60% was annotated as a gene (4). Such a simplistic rule, however, could cause substantial over-annotation, especially in genomes with high G+C content.
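The 'long enough' ORF rule can be illustrated with a minimal sketch. The thresholds (300 nt, 60%) come from the rat cytomegalovirus example above; everything else is an assumption made for illustration: ORFs are given as 1-based inclusive `(start, end)` pairs, the overlap is measured as a fraction of the ORF's own length, and the function names `overlap_fraction` and `long_orf_rule` are hypothetical. The original annotation may have resolved overlapping pairs differently.

```python
def overlap_fraction(orf, other):
    """Fraction of `orf` covered by `other`; both are 1-based inclusive (start, end)."""
    start, end = orf
    o_start, o_end = other
    shared = min(end, o_end) - max(start, o_start) + 1
    return max(shared, 0) / (end - start + 1)


def long_orf_rule(orfs, min_len=300, max_overlap=0.60):
    """Keep ORFs longer than min_len nt that overlap each adjacent ORF by at most max_overlap.

    Assumption: 'adjacent' means the immediate neighbours after sorting by start coordinate.
    """
    ordered = sorted(orfs)
    kept = []
    for i, orf in enumerate(ordered):
        length = orf[1] - orf[0] + 1
        if length <= min_len:
            continue
        neighbours = ordered[max(i - 1, 0):i] + ordered[i + 1:i + 2]
        if all(overlap_fraction(orf, n) <= max_overlap for n in neighbours):
            kept.append(orf)
    return kept
```

Because the rule looks only at length and overlap, it ignores the statistical composition of the sequence, which is why it can over-annotate in genomes where long random ORFs are common.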
Another frequently used simplification is the annotation of a gene start by the ‘longest ORF’ rule (assignment of a gene start to the 5′-most ATG codon). A screening of GenBank identified 26 complete viral genomes with a total of 4400 genes, all annotated using this rule. It was nevertheless shown earlier that the true start may not be pinpointed by this rule in ~25% of cases (5).
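As a rough illustration of the 'longest ORF' rule, the sketch below walks upstream from a stop codon in the same frame and reports the 5′-most ATG reached before the previous in-frame stop. It is an assumption-laden toy: it considers ATG only (as in the rule stated above), works on one strand, and the function name `longest_orf_start` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_start(seq, stop_index):
    """Return the 0-based index of the 5'-most in-frame ATG upstream of the stop
    codon starting at `stop_index`, or None if the ORF contains no ATG."""
    start = None
    i = stop_index - 3
    while i >= 0:
        codon = seq[i:i + 3]
        if codon in STOP_CODONS:
            break              # previous in-frame stop: the ORF cannot extend further
        if codon == "ATG":
            start = i          # keep overwriting, so the 5'-most ATG wins
        i -= 3
    return start
```

The rule always picks the most upstream candidate, so any gene whose true start is an internal ATG is annotated too long, which is the source of the errors cited above.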
Viral genomes are different from the genomes of their hosts in several aspects that hamper immediate successful application of the gene finding methods developed for their hosts. An important factor is the rather small size of a viral genomic sequence. Currently, the RefSeq collection (19) contains 891 viral genomes shorter than 10 kb with a total of 2900 genes annotated, 169 genomes with lengths between 10 and 100 kb (3500 genes) and 47 genomes longer than 100 kb (7900 genes). A rather short genome size makes it either impossible to apply previously developed training procedures to derive parameters of high order statistical models (for the shortest viral genomes) or significantly limits the accuracy of these models (even in the case of the longest viral genomes). Another important feature of viral genome organization is the high frequency of gene overlaps that occur in viruses of both prokaryotic and eukaryotic hosts. The gene overlaps in viral genomes appear to be considerably longer than those seen in prokaryotic and, much more rarely, eukaryotic genomes. Furthermore, some annotated and experimentally confirmed viral genes are completely overlapped by others. Repetitive DNA may occupy a large portion of a viral genome; for example, in the Epstein–Barr virus genome (NC_001345) repetitive regions amount to ~30% of the genomic sequence (6), thus making model training more complicated.
In spite of the difficulties mentioned above, several groups have attempted to apply earlier developed statistical gene prediction programs for viral genome annotation. For instance, the GeneMark program (7) was used to identify genes in the genomes of Bovine herpesvirus 4 (8), bacteriophage φKZ of Pseudomonas aeruginosa (9), Mycoplasma virus P1 (10), Mycobacteriophage D29 (11), Stx 2e-encoding phage φP27 (12), coliphage T4 and the marine cyanophage S-PM2 (13), as well as to identify genes in genomes of virulence plasmids in Rhodococcus equi (14), Shigella flexneri (15) and Escherichia coli (16). Still, these initial attempts did not use a tool developed specifically for the problem at hand (except perhaps the case of T4, where the GeneMark models were adjusted to the genomic T4 sequence).
A significant difference may exist sometimes between the GenBank record and the original publication. For instance, the annotation of the white spot bacilliform virus (GenBank record AF332093) lists 531 protein-coding genes in comparison with only 181 genes mentioned in the original publication (17). On the other hand, only 23 genes are annotated in Rana tigrina ranavirus (GenBank record AF389451), while the original publication (18) describes 105 genes. In order to improve the quality of DNA sequence annotation, the National Center for Biotechnology Information (NCBI) has created the RefSeq collection. While the original GenBank genomic record is maintained as suggested by the authors, the RefSeq record of the same sequence is continuously updated with regard to new relevant data that become available. There were 1191 RefSeq records for complete genomes of viruses of prokaryotic and eukaryotic hosts as of August 2002.
Several attempts have been made to organize data on viral genomes in interactive databases providing tools for analysis of viral genes and proteins (20–22). These projects have been typically focused on specific classes of viruses.
To provide a tool for accurate ab initio gene identification in viral genomes we have modified the earlier developed GeneMarkS program (5) to make it suitable for analysis and gene prediction in viral genomes of different types. As a result of the application of this tool, we have created new annotation records for viral genomes present in GenBank (including its RefSeq part). These records have been compiled in the database VIOLIN (viral genomes online) accessible online at http://opal.biology.gatech.edu/GeneMark/VIOLIN/.
A set of 2945 complete viral genome records was downloaded from GenBank. Since several genomic variants (strains, mutants, isolates) were determined for many viral species, many viral genome records had several other almost identical entries. To filter out this redundancy we have specifically focused on the analysis of viral genomes from the RefSeq collection containing 1191 complete genomic records of viruses of eukaryotic (1071) and prokaryotic (120) hosts. RefSeq contains only one record for each virus species. Notably, these 1191 RefSeq viral genome annotations included 86 records that had been updated with the aid of our new predictions. In what follows, these 86 records have been treated differently in terms of comparison of predicted and annotated genes.
For phage genomes with prokaryotic-type gene organization, computer methods of prokaryotic gene finding could be adjusted rather easily. The prokaryotic version of GeneMark.hmm as well as its self-training version GeneMarkS were previously shown to possess high accuracy both in detecting prokaryotic genes as a whole and in exactly pinpointing gene starts (23,24). Therefore, GeneMarkS was the natural choice as the tool to be applied and adjusted for the analysis of phage genomes. For viruses of eukaryotic hosts, the situation is more complex. Current eukaryotic gene finding algorithms are unable to predict the gene overlaps frequently seen in genomes of viruses of eukaryotic hosts. On the other hand, according to the RefSeq annotation of ~11 000 genes in 1015 genomes of viruses of eukaryotic hosts, only ~300 genes have introns. Therefore, use of the program able to predict overlapping genes provides more benefits than the one predicting exon–intron structures. The program suitable for immediate use and further modifications was again the prokaryotic GeneMarkS, which could identify overlapping protein-coding ORFs while rarely occurring exons would be predicted as separate ORFs.
A viral genomic sequence might not provide enough training data to determine parameters of Markov chain models used in GeneMark.hmm. We turned, therefore, to the heuristic training technique described earlier (24), which is able to derive the parameters of the required models from a DNA sequence as short as 400 nt.
For larger viral genomes, the statistical models initially defined by the heuristic procedure could be iteratively refined further by the unsupervised training procedure implemented in GeneMarkS (24). This iterative procedure used simultaneous training and gene prediction to build models of protein-coding and non-coding sequences. For larger phage genomes, GeneMarkS also derived a model for the ribosomal binding site (RBS) and its spacer (the sequence between the rightmost nucleotide of the RBS and the first nucleotide of the start codon). Parameters of both models were determined from the multiple alignment of the nucleotide sequences situated upstream of the predicted gene starts, with the alignment constructed by the Gibbs Motif Sampler (25). For large enough genomes of viruses of eukaryotic hosts, parameters of a model for the Kozak pattern associated with the translational initiation site were determined by GeneMarkS with yet another modification. This GeneMarkS version allowed the use of the Kozak model for gene start prediction. Further modifications were done to adjust the program to different types of viral genome organization.
Since a linear viral genome cannot have a partial coding region at either terminus, a specific restriction imposed at the program initialization stage excluded this possibility. Conversely, an additional post-processing step was implemented for circular viral genomes to detect genes possibly divided by the split point chosen in the original annotation. For the single-stranded RNA (ssRNA) positive strand viruses whose genes are located in one strand only, an additional procedure identified the strand where gene predictions clustered predominantly and the opposing strand was assigned as completely non-coding.
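The strand restriction for positive-strand ssRNA viruses might be sketched as follows. This is only an illustration: the text does not say how the predominant strand was measured, so the sketch assumes a simple count of predicted genes per strand, and the tuple layout `(start, end, strand)` and the name `keep_dominant_strand` are hypothetical.

```python
from collections import Counter


def keep_dominant_strand(predictions):
    """Keep only predictions on the strand carrying most predicted genes.

    `predictions` is a list of (start, end, strand) tuples with strand '+' or '-'.
    Assumption: the dominant strand is chosen by gene count; ties go to the strand
    encountered first.
    """
    counts = Counter(strand for _, _, strand in predictions)
    if not counts:
        return []
    dominant = counts.most_common(1)[0][0]
    return [p for p in predictions if p[2] == dominant]
```

Predictions on the other strand are simply discarded here, which corresponds to treating that strand as completely non-coding.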
For every viral genome the training procedure had to determine whether the sequence data were only sufficient for obtaining heuristic models or if a full training cycle of GeneMarkS could be initiated. If GeneMark.hmm with the initially defined heuristic models predicted fewer than a certain number of genes, Nr, then the procedure stopped and these initial predictions were not refined further. Otherwise, the full cycle of GeneMarkS training was initiated. The number 50 was assigned as the default Nr number.
In the training process, if several repetitive copies of some predicted protein-coding ORFs were identified, all copies but one were excluded from the training set of protein-coding regions to reduce bias in the protein-coding sequence model. Predicted ORFs longer than 500 nt that appeared in predicted intergenic regions were excluded from the set of non-coding regions to exclude possible ‘contamination’ of the non-coding training set. For viral genomes with a total size of predicted non-coding regions <10 kb, the training set of non-coding regions was augmented with an additional 10 kb sequence generated by the simplest multinomial model, simulating a sequence with the frequencies of the four nucleotides identical to those observed in the native non-coding region (26).
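The augmentation of a small non-coding training set can be sketched as below. The 10 kb threshold and the 10 kb of synthetic sequence follow the description above; the zero-order (independent nucleotide) sampling is one reading of 'the simplest multinomial model'. The function name `augment_noncoding`, the fixed random seed and the uniform fallback for an empty input are assumptions added for illustration.

```python
import random
from collections import Counter


def augment_noncoding(noncoding_seqs, min_total=10_000, extra_len=10_000, seed=0):
    """Append a synthetic sequence when the native non-coding set is shorter than min_total.

    The synthetic sequence is drawn position by position from the nucleotide
    frequencies observed in the native non-coding regions.
    """
    total = sum(len(s) for s in noncoding_seqs)
    if total >= min_total:
        return list(noncoding_seqs)
    counts = Counter("".join(noncoding_seqs).upper())
    weights = [counts.get(base, 0) for base in "ACGT"]
    if sum(weights) == 0:
        weights = [1, 1, 1, 1]  # assumption: fall back to uniform if nothing was observed
    rng = random.Random(seed)
    synthetic = "".join(rng.choices("ACGT", weights=weights, k=extra_len))
    return list(noncoding_seqs) + [synthetic]
```

A zero-order model reproduces only single-nucleotide frequencies, so the synthetic sequence adds volume to the training set without introducing any higher-order structure of its own.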
The step-wise diagram of GeneMarkS self-training and gene prediction for the genome of a virus of prokaryotic host is shown in Figure 1. For a virus of a eukaryotic host, a reference to the Kozak model should replace the reference to the RBS model. The evaluation of the RBS model fitness was done by assessing both the variance of the RBS signal localization and the information content of the RBS model derived by the Gibbs Sampler. The Kozak model was evaluated in a similar manner. The self-training procedure was terminated as soon as two subsequent iterations produced the same gene predictions. However, in some cases exact convergence was not achieved due to small cyclic variations observed in subsequent iterations. In these cases the self-training was stopped and the reported sequence parse into coding and non-coding regions was the one with the larger number of predicted genes.
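The control flow of the training procedure (the Nr threshold, the convergence test and the handling of cyclic variations) can be summarized in a short sketch. It is not the GeneMarkS implementation: `predict` and `retrain` are hypothetical callables standing in for GeneMark.hmm prediction and model re-estimation, and `max_iter` is an assumed safety cap that the text does not mention.

```python
def self_train(sequence, heuristic_model, predict, retrain, n_r=50, max_iter=20):
    """Skeleton of heuristic-then-iterative training.

    predict(sequence, model) -> frozenset of predicted genes   (hypothetical)
    retrain(sequence, genes) -> new model                      (hypothetical)
    """
    genes = predict(sequence, heuristic_model)
    if len(genes) < n_r:
        return genes                      # too little data: keep heuristic predictions
    history = [genes]
    for _ in range(max_iter):
        model = retrain(sequence, genes)
        new_genes = predict(sequence, model)
        if new_genes == genes:
            return genes                  # two subsequent iterations agree
        if new_genes in history:
            # cyclic variation: report the parse with the larger number of genes
            cycle = history[history.index(new_genes):]
            return max(cycle, key=len)
        history.append(new_genes)
        genes = new_genes
    return max(history, key=len)          # assumption: same tie-break if the cap is hit
```

Keeping the full history makes it possible to detect cycles of any period, not only the alternation between two parses.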
Assessment of the accuracy of computer gene prediction is a critically important issue. To characterize errors of two sorts, false positive and false negative, we used two parameters of accuracy, sensitivity and specificity. The value of sensitivity (Sn) is defined as the ratio of the number of true predictions to the number of genes in a test set. The fewer the number of false negatives, the higher the sensitivity. The value of specificity (Sp) is defined as the ratio of the number of true predictions to the total number of predictions made. The fewer the number of false positives, the higher the specificity. To determine sensitivity and specificity values for a particular gene prediction method, one needs a test set of nucleotide sequences with experimentally verified genes. To further define the terms we say that a gene is ‘detected’ if its 3′ end coincides with the 3′ end of a verified one. Additionally, a gene is ‘predicted exactly’ if the positions of both ends coincide with the verified gene ends. The accuracy of ‘exact prediction’ in our terms is the same as the accuracy of the ‘gene start prediction’. This value is defined by the fraction of ‘exactly predicted genes’ among ‘detected’ genes.
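The definitions above translate directly into a small calculation. The sketch assumes that genes are represented as `(five_prime, three_prime, strand)` tuples, that no two verified genes share a 3′ end on the same strand, and that both lists are non-empty; the function name `evaluate` is hypothetical.

```python
def evaluate(predicted, verified):
    """Return (Sn, Sp, start accuracy) for predicted versus verified genes.

    A gene is 'detected' if its 3' end (and strand) matches a verified gene,
    and 'predicted exactly' if its 5' end matches as well.
    """
    verified_by_3p = {(g[1], g[2]): g for g in verified}
    detected = [p for p in predicted if (p[1], p[2]) in verified_by_3p]
    exact = [p for p in detected if verified_by_3p[(p[1], p[2])][0] == p[0]]
    sn = len(detected) / len(verified)    # true predictions / genes in test set
    sp = len(detected) / len(predicted)   # true predictions / all predictions
    start_accuracy = len(exact) / len(detected) if detected else 0.0
    return sn, sp, start_accuracy
```

For example, with four verified genes and five predictions of which three share a verified 3′ end and two of those also share the 5′ end, the function returns Sn = 0.75, Sp = 0.6 and a start accuracy of 2/3.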
The BLAST searches used to characterize newly predicted proteins were conducted using standard parameters: BLOSUM62; penalty for gap ‘10’; penalty for gap extension ‘1’; low-complexity filtering ‘on’. In PSI-BLAST searches, the parameters were the same with the exception that the low-complexity filtering was ‘off’.
The overall statistics of the results of our analysis of complete viral genomes from GenBank is shown in Table 1. Our major focus here is on the genomes from the RefSeq collection. Those 86 viral genomes that had previously been reannotated in RefSeq with the aid of our analysis were excluded from our comparisons.
As shown in the RefSeq section of Table 1, 8011 protein-coding genes predicted in 1015 complete genomes of viruses of eukaryotic hosts matched the earlier annotation exactly. However, 1047 gene predictions did not match any previously annotated gene, and for 332 out of these 1047 new predictions, hits to known proteins with E-values <10⁻⁵ were found by BLASTP search (27). Interestingly, 135 out of these 332 similarity search supported predictions overlapped with annotated genes but the reading frames were different. A rather large number of 2231 genes in the RefSeq annotated genomes of viruses of eukaryotic hosts were not confirmed by our analysis.
In 92 RefSeq phage genomes, 2414 gene predictions matched the existing annotation exactly. There were 313 entirely new predictions, and 103 of them were corroborated by the BLASTP search with hits to known proteins (E-value <10⁻⁵). Again, approximately one-third of predictions corroborated by the similarity search (36 out of 103) overlapped already annotated genes with different reading frames. Our analysis did not confirm 489 genes annotated in phage genomes from the RefSeq collection.
Those 2720 (2231 + 489) genes that were annotated in the RefSeq viral genomes but were not predicted in this study are of special interest. Subsequent BLASTP searches of the protein products of these genes against the non-redundant database detected similarity to other known proteins for only 848 out of the 2231 genes annotated in genomes of viruses of eukaryotic hosts and for 137 out of the 489 genes annotated in phages. Overall, this gives 985 as the total number of genes that were not predicted by the ab initio method although they had significant similarity to other known proteins. Therefore, given the whole number of 14 076 genes annotated in 1107 viral genomes, the false negative rate of the ab initio prediction method might be estimated at <10%. Interestingly, in 620 RefSeq viral genomes no annotated gene was missed in predictions.
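The arithmetic behind this estimate can be checked in a few lines. All numbers are taken from the paragraph above; the variable names are illustrative, and treating only the similarity-supported missed genes as false negatives is the assumption made in the text.

```python
# Annotated genes that were not predicted (eukaryotic-host viruses + phages)
not_predicted = 2231 + 489

# Of those, genes whose products have significant BLASTP similarity
missed_with_similarity = 848 + 137

annotated_total = 14076  # genes annotated in the 1107 RefSeq genomes analyzed

false_negative_rate = missed_with_similarity / annotated_total
print(f"{not_predicted} not predicted, {missed_with_similarity} with similarity support")
print(f"estimated false negative rate: {false_negative_rate:.1%}")
```

The ratio 985/14 076 is about 7%, which is consistent with the stated bound of <10%.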
As is indicated in Table 1, analysis of the original GenBank genomic records produced a larger fraction of newly predicted genes than determined in the genomes from the RefSeq collection. In turn, a larger fraction (28%) of these new genes produced significant BLASTP hits in comparison with the fraction of new genes in RefSeq (10%) supported by BLASTP search.
The gene prediction results for the RefSeq complete viral genomes were grouped together by virus length and type (Tables 2 and 3). Interestingly, a large number of new genes were identified in genomes shorter than 10 kb (892 genomes). For example, in the 8454 nt long genome of single-stranded DNA (ssDNA) enterobacteria phage IF1 (NC_001954) we identified a new 192 nt long gene coding for a homolog of Vibrio cholerae RasR protein. In contrast to all other known genes of this phage, this new gene was located in the DNA strand complementary to the ssDNA present in the virion. The largest numbers of newly identified genes or genes with new start predictions turned out to reside in 193 genomes of double-stranded DNA (dsDNA) viruses and 418 genomes of ssRNA viruses (Table 3).
Quite a few new predictions among those that had no BLASTP search support were found to overlap already annotated genes. This occurred 274 times (20% of newly predicted genes) in the RefSeq genomes. In 117 of these cases the product of the annotated gene showed similarity to a protein in another species. Nevertheless, the fact of overlap does not indicate a likely false positive prediction per se. Gene overlap is a quite frequent phenomenon in viral genomes, as 52% of viral genes annotated in RefSeq overlap each other.
Ideally, the characteristics of gene prediction accuracy, sensitivity and specificity (defined in Methods), should be determined for a test set of sequences containing experimentally verified genes. However, any given viral genome, except perhaps several of tiny size, would not have a large fraction of genes annotated experimentally. For this reason, we have compiled sets of so-called ‘trustable’ genes and used them as the test sets. For instance, in nine genomes of human herpesviruses (Table 4) we identified as trustable the genes both annotated and ab initio predicted. Also, we included in this category those genes that were either annotated or predicted and possessed additional ‘extrinsic’ evidence for being a real gene. This could be an experimentally characterized function or statistically significant sequence similarity to previously characterized proteins. For this compiled set of trustable genes of human herpesviruses, we obtained the average values of Sn = 91% and Sp = 84% as the estimates of the accuracy of our method.
Length comparison between newly predicted genes and genes annotated but missed in predictions indicated that the newly predicted genes tend to be shorter than the ones supposedly real but missed in predictions (Fig. 2). The ratio of newly predicted genes to missed genes decreased from 3.81 for genes shorter than 300 nt to 0.49 for genes longer than 300 nt. This observation seems to be related to a preference in the original records to have longer ORFs annotated as genes. The longer ORFs are generally assumed to be more likely to be real genes while ORFs shorter than 300 nt are difficult to discriminate from random non-coding ORFs and are more risky to annotate as genes. This conventional wisdom could lead to over-annotation of ORFs longer than 300 nt as genes while some short genes could be missed. As Figure 2 shows, many ‘long’ annotated genes were indeed not confirmed while quite a few new ‘short’ genes were predicted.
Assessing and improving the gene start prediction accuracy is another important issue. As described above, for more precise gene start prediction we used the RBS model for long enough viruses of prokaryotic hosts and the Kozak model for viruses of eukaryotic hosts. To give an example, the positional frequency matrices of RBS models specific for phage T4 and phage λ are visualized in ‘logo’ images (28) in Figure 3b and c. Notably, these images emphasize the similarity of the nucleotide frequency patterns existing in the RBS of phages to the pattern known for E.coli (Fig. 3a). This observation could be expected given that T4 and λ use the E.coli translational mechanism. While the positional frequency matrix of the RBS model has a fixed length and variable pattern of positional frequencies, the model of the RBS spacer allows for sequences of variable lengths (distances between RBS and start codon) with an invariant positional frequency pattern of the non-coding region.
The logos for the Kozak model determined for the Epstein–Barr virus (HHV4) and for Kaposi’s sarcoma herpesvirus (HHV8) shown in Figure 3e and f clearly indicate that the information content of these signals is lower than that of RBS. However, the Kozak patterns observed in these viruses are still similar to the Kozak pattern known for the genome of the human host (Fig. 3d). Accurate evaluation of the gene start prediction accuracy requires a set of genes with experimentally verified gene starts. Evaluation of GeneMarkS performance was done earlier on the test set of E.coli genes with 5′ ends verified by sequencing of the N-termini of the encoded proteins (29). In this test the accuracy of start prediction was observed to be as high as 94% (5). A comparison of predictions for phage T4 both with and without the use of the RBS model was carried out (Supplementary Material, Table 1). This comparison showed that predictions made with the use of the RBS model matched the annotation almost 10% better; we consider the annotation sufficiently accurate for this well studied phage genome.
Considering viruses of eukaryotic hosts, we compiled a set of genes from nine human herpesviruses with translation starts confirmed by similarity search on a protein level. The 5′ end of the protein having the highest BLASTP hit (excluding one or several self hits) was compared with the 5′ end of the query protein to assess the accuracy of the gene start prediction. After selection of the most unambiguous cases, we obtained an estimate of the accuracy of start prediction as 85% (Supplementary Material, Table 2).
The whole set of newly predicted genes was used further to search for similarity and reconstruct possible orthologous relationships. A database of 1360 newly predicted proteins was compiled and was cross-searched using BLASTP. We found that 237 predicted proteins had some similarity to other members in the database and could be further grouped into 106 protein clusters (Supplementary Material, Table 3). Some of these clusters show highly conserved regions; for instance, a cluster of protein products of new genes identified in poxviruses.
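One simple way to turn pairwise BLASTP hits into protein clusters is single-linkage grouping, sketched below with a union-find structure. The text does not state which grouping criterion was actually used, so single linkage is an assumption, as are the function name `cluster_by_hits` and the input format (pairs of protein identifiers that passed some E-value threshold, self hits already removed).

```python
def cluster_by_hits(hits):
    """Group proteins into clusters: two proteins share a cluster if they are
    connected by a chain of pairwise hits (single linkage)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for protein in list(parent):
        clusters.setdefault(find(protein), set()).add(protein)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])
```

Proteins with no hit to any other member never enter the structure, which mirrors the observation that only a subset of the predicted proteins could be grouped.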
Now we take a closer look at several individual gene predictions. In the well studied genome of Bacteriophage λ (J02459) we identified as many as five new genes. These genes have already been included in the RefSeq version of the phage λ annotation (NC_001416). Two genes, coding for a putative envelope protein (NP_597781) and Bor protein precursor (NP_597780), are similar to genes in prophage CP-933X, which is a part of the E.coli O157 genome (NC_002655). A gene for superinfection exclusion protein B (NP_597779) must have been known for some time since its protein product had been included in the PIR database (P03762). The other two genes were classified as hypothetical.
Our predictions of 16 new genes in Porcine adenovirus A (NC_001997) were corroborated by similarity search. For instance, the protein encoded by predicted ORF6 is a member of a family of DNA polymerases present in 39 other adenoviruses.
A potentially important finding was a gene located in positions 10443–11138 of the genome of Alcelaphine herpesvirus 1 (NC_002531) coding for a 231 amino acid long putative protein (NP_597933). Initially, the new protein was shown to be similar to the uncharacterized putative protein ORF E4 (NP_042601, AAC13792) of unclassified γ-herpesvirus Equine herpesvirus 2. A subsequent PSI-BLAST search revealed a striking similarity between these two proteins and recently discovered antagonists of the lymphocryptovirus antiapoptotic BCL-2 proteins (30). Later, the sequence of a third non-lymphocryptovirus protein, hypothetical v-BCL2 of another unclassified γ-herpesvirus (Porcine lymphotropic herpesvirus 1) was released (31) and we have found its sequence to be very similar to the newly identified protein (NP_597933). The PSI-BLAST search profile built from the three proteins further identified similarity with ORF1 protein of Callitrichine herpesvirus 3 (a lymphocryptovirus BALF1-like BCL-2 like protein) and with the BALF1 protein (AAK01916) of Callitrichine herpesvirus 3 (a lymphocryptovirus) with E-values of 8 × 10⁻⁴ and 0.007, respectively. This range of E-values has been characterized as being indicative of significant sequence similarity (32,33). The output of the third iteration of PSI-BLAST included all the BALF1-like proteins at the top of the list. Human GRS protein and other BCL-2-like non-viral proteins were also present in the list at a substantial score distance.
In the next round of analysis, the RPS-BLAST (the NCBI program comparing protein sequences with the Conserved Domain Database) readily detected a BCL motif in all three non-lymphocryptovirus proteins. Moreover, multiple alignment by hierarchical clustering (34) of the newly predicted protein (NP_597933) with proteins NP_042601, AAM22111 and all the lymphocryptovirus BALF1 proteins (Fig. 4) further supported the probable functional significance of the observed pairwise similarity by making evident the patterns of amino acids conserved in all sequences. Interestingly enough, a TBLASTN search failed to reveal additional un-annotated homologs of NP_597933. It is tempting to speculate that, given the function of BALF1 (30), the newly identified BALF1-like protein may be involved in a complex regulation of the host cell apoptosis, presumably as an antagonist of the herpesvirus antiapoptotic BCL-2 proteins, and, perhaps, as a part of a gene network involved in carcinogenesis.
Another interesting new finding was a gene (ORF65) predicted in the genome of Epstein–Barr virus (HHV-4, NC_001345). Initially, the protein product of this gene was found to be significantly similar (with an E-value of <10⁻⁵) to uncharacterized ORF26/ORF35 proteins of other γ-herpesviridae. The subsequent PSI-BLAST search revealed after four iterations a similarity (with an E-value of 6 × 10⁻⁴) to the ORF26/ORF35 protein family and the ORF48 protein of Equine herpesvirus 4, an α-herpesvirus. The ORF48 protein belongs to the UL14 family of proteins which are present in a minor component of the virion tegument and possess heat shock protein-like functions (35). Eight further PSI-BLAST iterations brought up all the members of this family. Multiple alignment of the ORF26/ORF35 and UL14-like protein sequences (Fig. 5) highlights common features that could not be readily seen in pair-wise alignments, particularly, similar patterns of distribution of charged residues. The observed sequence similarity strongly indicates a common function which remains to be determined by direct experiments. It is likely that these proteins play an important role since the members of the ORF26/ORF35 protein family are now confirmed to be present in all complete genomes of γ-herpesviruses. Interestingly, none of the β-herpesvirus genomes has a TBLASTN detectable homolog of ORF26/ORF35 or UL14, which indicates that ORF26/ORF35 proteins are likely to fulfill a subfamily-level function.
Some coding regions in viral genomes were missed in the earlier annotation because of their unusual organization. For instance, some viral genes contain a weak, read-through stop codon, which in the original annotation is considered the end of the gene; thus, a part of the real gene (and protein) is missed. In Barmah Forest virus a GeneMarkS prediction (ORF2) recovers the second part of the non-structural polyprotein gene in positions 5679–7298, missed in the original record U73745. Only after these two parts are combined does the protein (NC_001786) show full-length similarity to the complete polyprotein encoded, for instance, in Ross River virus.
The vast majority of genes in viral genomes have no introns. There are, however, a few genes with introns and even some with whole separate genes located inside introns, such as an IE glycoprotein gene, HCMVUL37, in Human herpesvirus 5 (NC_001347). Genes interrupted by introns were identified by GeneMarkS as series of separate protein-coding ORFs. For instance, in Enterobacteria phage T4 (introns may appear not only in viruses of eukaryotic hosts but in phages as well) a gene for DNA topoisomerase small subunit protein (NC_000866) consists of two exons both predicted by GeneMarkS as separate ORFs. Developing an ab initio approach for exact prediction of introns in viral genes is a challenging problem. However, quite frequently the combination of data obtained by intrinsic and extrinsic methods becomes easily amenable to further delineation of exon–intron structure by expert analysis. For instance, in the complete genome of Human adenovirus D (Human adenovirus type 17), GeneMarkS revealed 32 potential genes or gene fragments missed in the original annotation (AF108105). Only 11 of them appeared to be complete genes while the other 21 predicted coding regions were manually assembled into nine genes in the RefSeq record (NC_002067).
The examples discussed above, in which new ab initio predictions were confirmed and functionally characterized by subsequent application of an extrinsic method, make it quite plausible that many not yet confirmed ab initio predictions will be supported extrinsically as more DNA and protein data become available. Still, the absence of similarity to known proteins may also indicate the uniqueness of a protein whose expression and function might be established only by direct experiments.
Newly defined genome annotations were compiled in the VIOLIN database (http://opal.biology.gatech.edu/GeneMark/VIOLIN/). This database currently has a flat text file architecture. Differences between the VIOLIN and GenBank annotations are visualized by color codes (Fig. 6). The VIOLIN web site provides hypertext links to the NCBI similarity search programs directly from a genome annotation record. For a gene exactly matching an already known one, the line citing its coordinates is linked to the original gene record in GenBank as well as to the BLink program, which provides up-to-date information on the protein product (the BLink program, ‘BLAST Link’, displays the prerecorded results of BLAST searches that have been done for every protein sequence in the Entrez proteins data domain). For a predicted gene with no exact or partial match to the previous annotation, links to the programs PSI-BLAST and RPS-BLAST allow one to proceed with further up-to-date characterization of the putative protein. Genes annotated in a GenBank record but not confirmed by our analysis are shown at the bottom of the VIOLIN record with links to the BLink, PSI-BLAST and RPS-BLAST programs to help re-analyze the previously annotated genes.
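The three categories described above (exact match, partial match, and no match), together with the list of annotated genes not confirmed by prediction, can be sketched as a simple coordinate comparison. This is a minimal illustration under stated assumptions: the `Gene` type, the function names and the coordinates are hypothetical, and a 'partial' match is assumed here to mean a shared 3' end (same stop codon and strand) with a different start, which the text does not define explicitly.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    start: int   # 1-based, inclusive; start < end on either strand
    end: int
    strand: str  # "+" or "-"


def three_prime_end(gene):
    """Position of the 3' end (stop codon side), with strand."""
    return (gene.end if gene.strand == "+" else gene.start, gene.strand)


def compare(predicted, annotated):
    """Label each predicted gene and list unconfirmed annotated genes."""
    annotated_set = set(annotated)
    annotated_ends = {three_prime_end(g) for g in annotated}
    labels = {}
    for g in predicted:
        if g in annotated_set:
            labels[g] = "exact"      # identical coordinates and strand
        elif three_prime_end(g) in annotated_ends:
            labels[g] = "partial"    # assumed: same 3' end, different start
        else:
            labels[g] = "new"        # no counterpart in the annotation
    predicted_ends = {three_prime_end(g) for g in predicted}
    unconfirmed = [g for g in annotated
                   if three_prime_end(g) not in predicted_ends]
    return labels, unconfirmed


# Hypothetical coordinates, for illustration only.
annotated = [Gene(100, 400, "+"), Gene(500, 900, "+"), Gene(1200, 1500, "-")]
predicted = [Gene(100, 400, "+"), Gene(560, 900, "+"), Gene(2000, 2300, "+")]
labels, unconfirmed = compare(predicted, annotated)
```

In a record organized this way, the 'new' genes would be the ones linked to PSI-BLAST and RPS-BLAST for further characterization, and the unconfirmed list would correspond to the genes shown at the bottom of the record.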
VIOLIN has been regularly used by the NCBI curators to improve the annotation of viral genomes in the RefSeq collection (36). Gene predictions have been subjected to additional analysis and manual curation by NCBI staff for quality control and functional assignment. Some of the new findings that originally appeared in VIOLIN and that are now included in the annotations of 86 viral genomes in the RefSeq collection are shown in Table 5. For example, in Fowl adenovirus D (NC_000899) 14 proteins have been added to the 15 existing in the original GenBank record AF083975. This was a particularly difficult case because many of the newly added genes were disrupted by frameshifts that likely resulted from sequencing errors. The new tentative protein sequences were assembled from fragments predicted by GeneMarkS using the ORF Finder (R. Tatusov and T. Tatusova, unpublished results) and BLASTP searches. In another example, in Lymphocystis disease virus (NC_001824) 110 coding regions were identified while the original GenBank record (AF083975) contained only one gene for a major capsid protein.
We have demonstrated that GeneMarkS, an ab initio gene finding method, can be adjusted for analysis of viral genomes of different types and can generate useful information. In small viral genomes, any single missed gene could be of significant interest, and the reliable identification of a narrow set of putative proteins to work with by extrinsic and experimental methods saves a considerable amount of time and effort. As the never-ending discovery of new viruses brings about new names such as Mimivirus (37) or SARS (38), accurate ab initio computer methods for viral gene identification will remain of great value.
Supplementary Material is available at NAR Online.
We thank Dr Yiming Bao for valuable comments on his experience of using the GeneMarkS program for annotating viral genomes. We are grateful to Dr John Besemer, Dr Dwight Hall and Dr Chris Klausmeier for useful remarks on the manuscript. M.B., A.L. and R.M. were supported in part by grant HG00783 from the US National Institutes of Health as well as by a grant jointly awarded by Georgia Tech and the US Centers for Disease Control and Prevention.