The overall statistics of our analysis of complete viral genomes from GenBank are shown in Table . Our major focus here is on the genomes from the RefSeq collection. The 86 viral genomes that had previously been reannotated in RefSeq with the aid of our analysis were excluded from our comparisons.
Summary of the results of the analysis of viral genomes currently available in GenBank and those viral genomes for which reference sequences (RefSeq collection) have already been created at NCBI
As shown in the RefSeq section of Table , 8011 protein-coding genes predicted in 1015 complete genomes of viruses of eukaryotic hosts matched the earlier annotation exactly. However, 1047 gene predictions did not match any previously annotated gene, and for 332 of these 1047 new predictions a BLASTP search (27) found hits to known proteins with E-values below the significance threshold. Interestingly, 135 of these 332 predictions supported by similarity search overlapped annotated genes but in different reading frames. A rather large number of genes (2231) annotated in the RefSeq genomes of viruses of eukaryotic hosts were not confirmed by our analysis.
In 92 RefSeq phage genomes, 2414 gene predictions matched the existing annotation exactly. There were 313 entirely new predictions, and 103 of them were corroborated by the BLASTP search with hits to known proteins (E-value < 10⁻⁵). Again, approximately one-third of predictions corroborated by the similarity search (36 out of 103) overlapped already annotated genes with different reading frames. Our analysis did not confirm 489 genes annotated in phage genomes from the RefSeq collection.
Those 2720 (2231 + 489) genes that were annotated in the RefSeq viral genomes but were not predicted in this study are of special interest. Subsequent BLASTP searches of the protein products of these genes against the non-redundant database detected similarity to other known proteins for only 848 of the 2231 genes annotated in genomes of viruses of eukaryotic hosts and for 137 of the 489 genes annotated in phages. Overall, 985 annotated genes were not predicted by the ab initio method even though they had significant similarity to other known proteins. Therefore, given the total of 14 076 genes annotated in 1107 viral genomes, the false negative rate of the ab initio prediction method might be estimated at <10%. Interestingly, in 620 RefSeq viral genomes no annotated gene was missed in predictions.
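The arithmetic behind this estimate can be checked with a short sketch. The counts are the ones quoted above; the variable names are illustrative, and counting only the missed genes with BLASTP support as likely real genes follows the reasoning of the text rather than any code from the study.

```python
# Counts quoted in the text for the RefSeq genomes.
missed_euk, missed_phage = 2231, 489        # annotated but not predicted
supported_euk, supported_phage = 848, 137   # of those, with BLASTP similarity
total_annotated = 14076                     # genes annotated in 1107 genomes

missed_total = missed_euk + missed_phage
supported_missed = supported_euk + supported_phage

# Only missed genes with similarity support are treated as likely real genes.
false_negative_rate = supported_missed / total_annotated

print(missed_total, supported_missed)
print(f"{false_negative_rate:.1%}")
```

The resulting rate is about 7%, consistent with the stated bound of <10%.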
As indicated in Table , analysis of the original GenBank genomic records produced a larger fraction of newly predicted genes than was found in the genomes from the RefSeq collection. In turn, a larger fraction (28%) of these new genes produced significant BLASTP hits than the fraction of new genes in RefSeq supported by BLASTP search (10%).
The gene prediction results for the RefSeq complete viral genomes were grouped by genome length and by virus type (Tables and ). Interestingly, a large number of new genes were identified in genomes shorter than 10 kb (892 genomes). For example, in the 8454 nt long genome of single-stranded DNA (ssDNA) enterobacteria phage IF1 (NC_001954) we identified a new 192 nt long gene coding for a homolog of Vibrio cholerae RasR protein. In contrast to all other known genes of this phage, this new gene was located in the DNA strand complementary to the ssDNA present in the virion. The largest numbers of newly identified genes or genes with new start predictions were found in 193 genomes of double-stranded DNA (dsDNA) viruses and 418 genomes of ssRNA viruses (Table ).
Distribution of the results of the comparative analysis of gene prediction and annotation for viral genomes from the RefSeq collection with the three sets of viruses clustered by genome length
Distribution of the results of the comparative analysis of gene prediction and annotation for viral genomes from the RefSeq collection joined in classes defined by viral classification
Quite a few new predictions among those that had no BLASTP search support were found to overlap already annotated genes. This occurred 274 times (20% of newly predicted genes) in the RefSeq genomes. In 117 of these cases the product of the annotated gene showed similarity to a protein in another species. Nevertheless, overlap per se does not indicate a likely false positive prediction. Gene overlap is quite frequent in viral genomes: 52% of viral genes annotated in RefSeq overlap one another.
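How an overlap between a predicted and an annotated gene, and a difference in reading frame, might be detected is sketched below. The section does not state the exact criterion used in the study, so the `Gene` record, the 1-based inclusive coordinates and the definition of frame as (strand, coordinate of the 5' end modulo 3) are all assumptions made for illustration.

```python
from typing import NamedTuple

class Gene(NamedTuple):
    start: int   # 1-based, inclusive, start < end (hypothetical convention)
    end: int
    strand: str  # '+' or '-'

def overlap_length(a: Gene, b: Gene) -> int:
    """Number of nucleotide positions shared by two genes (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)

def frame(g: Gene) -> tuple:
    """Reading frame as (strand, phase of the 5' end of the coding strand)."""
    anchor = g.start if g.strand == "+" else g.end
    return (g.strand, anchor % 3)

def overlaps_in_different_frame(a: Gene, b: Gene) -> bool:
    return overlap_length(a, b) > 0 and frame(a) != frame(b)

# Made-up coordinates, not taken from any genome discussed here.
annotated = Gene(100, 399, "+")
predicted = Gene(350, 550, "+")
print(overlap_length(annotated, predicted))
print(overlaps_in_different_frame(annotated, predicted))
```

Genes on opposite strands always count as being in different frames under this definition.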
Ideally, the characteristics of gene prediction accuracy, sensitivity and specificity (defined in Methods), should be determined for a test set of sequences containing experimentally verified genes. However, any given viral genome, except perhaps a few very small ones, would not have a large fraction of its genes annotated experimentally. For this reason, we have compiled sets of so-called ‘trustable’ genes and used them as the test sets. For instance, in nine genomes of human herpesviruses (Table ) we identified as trustable the genes that were both annotated and predicted ab initio. We also included in this category those genes that were either annotated or predicted and possessed additional ‘extrinsic’ evidence for being a real gene. This could be an experimentally characterized function or statistically significant sequence similarity to previously characterized proteins. For this compiled set of trustable genes of human herpesviruses, we obtained the average values of Sn = 91% and Sp = 84% as the estimates of the accuracy of our method.
Gene prediction accuracy assessment for nine human herpesviruses
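The way per-genome Sn and Sp values could be computed and averaged is sketched below. The Methods section with the actual definitions is not reproduced here, so the sketch assumes the convention common in gene finding, Sn = TP/(TP + FN) and Sp = TP/(TP + FP); the counts are invented and are not those of the herpesvirus table.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of trustable genes that were predicted."""
    return tp / (tp + fn)

def specificity(tp: int, fp: int) -> float:
    """Fraction of predictions that are trustable genes (assumed definition)."""
    return tp / (tp + fp)

# Hypothetical (TP, FN, FP) counts for two genomes.
genomes = [(80, 10, 20), (45, 5, 5)]

sn_values = [sensitivity(tp, fn) for tp, fn, _ in genomes]
sp_values = [specificity(tp, fp) for tp, _, fp in genomes]

mean_sn = sum(sn_values) / len(sn_values)
mean_sp = sum(sp_values) / len(sp_values)
print(f"Sn = {mean_sn:.0%}, Sp = {mean_sp:.0%}")
```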
Length comparison between newly predicted genes and genes annotated but missed in predictions indicated that the newly predicted genes tend to be shorter than the ones supposedly real but missed in predictions (Fig. ). The ratio of newly predicted genes to missed genes decreased from 3.81 for genes shorter than 300 nt to 0.49 for genes longer than 300 nt. This observation seems to be related to a preference in the original records to have longer ORFs annotated as genes. The longer ORFs are generally assumed to be more likely to be real genes while ORFs shorter than 300 nt are difficult to discriminate from random non-coding ORFs and are more risky to annotate as genes. This conventional wisdom could lead to over-annotation of ORFs longer than 300 nt as genes while some short genes could be missed. As Figure shows, many ‘long’ annotated genes were indeed not confirmed while quite a few new ‘short’ genes were predicted.
Figure 2 Length distributions of several categories of genes predicted or annotated in 1047 RefSeq viral genomes. Dark gray bars are used for genes annotated but not predicted; light gray bars are used for predicted but not annotated genes whose protein products ...
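The ratio of newly predicted to missed genes on either side of the 300 nt cutoff can be computed as in the following sketch. The length lists are invented, the function name is hypothetical, and placing genes of exactly 300 nt in the 'long' bin is an assumption, since the text does not say where they fall.

```python
def new_to_missed_ratio(new_lengths, missed_lengths, cutoff=300):
    """Return (ratio for genes shorter than cutoff, ratio for the rest).

    Raises ZeroDivisionError if a bin contains no missed genes.
    """
    def count(lengths, short):
        return sum(1 for n in lengths if (n < cutoff) == short)

    short_ratio = count(new_lengths, True) / count(missed_lengths, True)
    long_ratio = count(new_lengths, False) / count(missed_lengths, False)
    return short_ratio, long_ratio

# Made-up gene lengths in nucleotides.
new_genes = [150, 210, 240, 270, 450]
missed_genes = [180, 600, 900]
ratios = new_to_missed_ratio(new_genes, missed_genes)
print(ratios)
```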
Assessing and improving the gene start prediction accuracy is another important issue. As described above, for more precise gene start prediction we used the RBS model for sufficiently long genomes of viruses of prokaryotic hosts and the Kozak model for viruses of eukaryotic hosts. To give an example, the positional frequency matrices of RBS models specific for phage T4 and phage λ are visualized as ‘logo’ images (28) in Figure b and c. Notably, these images emphasize the similarity of the nucleotide frequency patterns in the RBS of phages to the pattern known for E.coli (Fig. a). This observation could be expected given that T4 and λ use the E.coli translational mechanism. While the positional frequency matrix of the RBS model has a fixed length and a variable pattern of positional frequencies, the model of the RBS spacer allows for sequences of variable lengths (distances between RBS and start codon) with an invariant positional frequency pattern of the non-coding region.
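The interplay of a fixed-length positional frequency matrix and a variable-length spacer can be illustrated with a minimal sketch. This is not the GeneMarkS implementation: the matrix, the spacer length distribution and the function names are hypothetical, and spacer nucleotides are assumed to follow the background composition, so that they contribute nothing to the log-odds score.

```python
import math

BASES = "ACGT"

def motif_log_odds(site, matrix, background):
    """Log-odds of a site under the positional matrix versus background."""
    score = 0.0
    for column, base in zip(matrix, site):
        i = BASES.index(base)
        score += math.log(column[i] / background[i])
    return score

def best_rbs(upstream, matrix, spacer_prob, background=(0.25,) * 4):
    """Scan the region just upstream of a start codon.

    Returns (score, motif offset, spacer length) of the best placement,
    or None if no placement has an allowed spacer length.
    """
    width = len(matrix)
    best = None
    for offset in range(len(upstream) - width + 1):
        spacer = len(upstream) - (offset + width)
        if spacer not in spacer_prob:
            continue
        site = upstream[offset:offset + width]
        score = motif_log_odds(site, matrix, background)
        score += math.log(spacer_prob[spacer])
        if best is None or score > best[0]:
            best = (score, offset, spacer)
    return best

# Hypothetical 4-position matrix (rows: positions; columns: A, C, G, T).
matrix = [
    [0.85, 0.05, 0.05, 0.05],
    [0.05, 0.05, 0.85, 0.05],
    [0.05, 0.05, 0.85, 0.05],
    [0.85, 0.05, 0.05, 0.05],
]
spacer_prob = {5: 0.2, 6: 0.3, 7: 0.3, 8: 0.2}  # hypothetical distribution
result = best_rbs("TCAGGACTTTCA", matrix, spacer_prob)
print(result)
```

In this toy example the best placement puts the motif at offset 2, leaving a spacer of 6 nt before the start codon.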
Figure 3 The positional nucleotide frequency patterns of the GeneMarkS models of the RBS pattern for phage T4 (b) and phage λ (c) are shown in the logo form (27), as compared with the RBS pattern of E.coli shown in (a). Similarly, the Kozak pattern for ...
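Sequence logos of this kind scale each position by its information content, so the strength of two signals can be compared by summing that quantity over the positions of each matrix. The sketch below uses the standard definition for a four-letter alphabet (2 bits minus the Shannon entropy) without any small-sample correction; both matrices are invented and do not reproduce the models in the figure.

```python
import math

def position_information(freqs):
    """Information content of one position in bits (DNA alphabet)."""
    entropy = -sum(p * math.log2(p) for p in freqs if p > 0)
    return 2.0 - entropy

def total_information(matrix):
    """Sum of per-position information over the whole matrix."""
    return sum(position_information(column) for column in matrix)

# Hypothetical matrices (rows: positions; columns: A, C, G, T).
strong_signal = [
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.85, 0.05],
    [0.10, 0.10, 0.70, 0.10],
]
weak_signal = [
    [0.40, 0.20, 0.20, 0.20],
    [0.25, 0.25, 0.25, 0.25],
    [0.30, 0.20, 0.30, 0.20],
]
print(total_information(strong_signal), total_information(weak_signal))
```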
The logos for the Kozak model determined for the Epstein–Barr virus (HHV4) and for Kaposi’s sarcoma herpesvirus (HHV8) shown in Figure e and f clearly indicate that the information content of these signals is lower than that of RBS. However, the Kozak patterns observed in these viruses are still similar to the Kozak pattern known for the genome of the human host (Fig. d). Accurate evaluation of the gene start prediction accuracy requires a set of genes with experimentally verified gene starts. GeneMarkS performance was evaluated earlier on a test set of E.coli genes with 5′ ends verified by N-terminal sequencing of the encoded proteins (29). In this test the accuracy of start prediction was as high as 94% (5). Predictions for phage T4 made with and without the RBS model were also compared (Supplementary Material, Table 1). Predictions made with the RBS model agreed almost 10% better with the annotation, which we consider sufficiently accurate for this well studied phage genome.
Considering viruses of eukaryotic hosts, we compiled a set of genes from nine human herpesviruses with translation starts confirmed by similarity search on a protein level. The 5′ end of the protein having the highest BLASTP hit (excluding one or several self hits) was compared with the 5′ end of the query protein to assess the accuracy of the gene start prediction. After selection of the most unambiguous cases, we obtained an estimate of the accuracy of start prediction as 85% (Supplementary Material, Table 2).
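One way to automate this comparison of 5′ ends is sketched below. The record layout, the field names and the `tolerance` parameter are hypothetical, and the agreement rule (the alignment starts at the same distance from the N-terminus in query and subject) is an assumption; the manual selection of unambiguous cases described above is not modeled.

```python
def top_non_self_hit(hits, self_ids):
    """Highest-scoring hit whose subject is not the query itself."""
    candidates = [h for h in hits if h["subject_id"] not in self_ids]
    if not candidates:
        return None
    return max(candidates, key=lambda h: h["bit_score"])

def start_agrees(hit, tolerance=0):
    """True if query and subject N-termini line up in the alignment."""
    return abs(hit["q_start"] - hit["s_start"]) <= tolerance

# Hypothetical BLASTP hits for one predicted protein (1-based positions).
hits = [
    {"subject_id": "self", "bit_score": 500, "q_start": 1, "s_start": 1},
    {"subject_id": "homolog_A", "bit_score": 300, "q_start": 1, "s_start": 1},
    {"subject_id": "homolog_B", "bit_score": 120, "q_start": 15, "s_start": 1},
]
best = top_non_self_hit(hits, self_ids={"self"})
print(best["subject_id"], start_agrees(best))
```

A query alignment that begins well downstream of the subject's first residue, as for `homolog_B`, would suggest that the predicted start lies too far upstream.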
The whole set of newly predicted genes was used further to search for similarity and reconstruct possible orthologous relationships. A database of 1360 newly predicted proteins was compiled and was cross-searched using BLASTP. We found that 237 predicted proteins had some similarity to other members in the database and could be further grouped into 106 protein clusters (Supplementary Material, Table 3). Some of these clusters show highly conserved regions; for instance, a cluster of protein products of new genes identified in poxviruses.
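The grouping of proteins into clusters from pairwise hits can be sketched as follows. The clustering procedure actually used is not described in this section; the sketch assumes single-linkage clustering, that is, connected components of the graph whose edges are significant BLASTP hits, and the protein identifiers are invented.

```python
def cluster_by_similarity(pairs):
    """Group proteins into connected components of the similarity graph."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for protein in list(parent):
        clusters.setdefault(find(protein), set()).add(protein)
    return list(clusters.values())

# Hypothetical pairs of proteins linked by a significant hit.
pairs = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]
clusters = cluster_by_similarity(pairs)
print(sorted(sorted(c) for c in clusters))
```

Proteins without any hit to another database member never enter the graph, which matches the distinction drawn above between the full database and the subset that could be clustered.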
Now we take a closer look at several individual gene predictions. In the well studied genome of Bacteriophage λ (J02459) we identified as many as five new genes. These genes have already been included in the RefSeq version of the phage λ annotation (NC_001416). Two genes, coding for a putative envelope protein (NP_597781) and Bor protein precursor (NP_597780), are similar to genes in prophage CP-933X, which is part of the E.coli O157 genome (NC_002655). A gene for superinfection exclusion protein B (NP_597779) must have been known for some time since its protein product had been included in the PIR database (P03762). The other two genes were classified as hypothetical.
Our predictions of 16 new genes in Porcine adenovirus A (NC_001997) were corroborated by similarity search. For instance, the protein encoded by predicted ORF6 is a member of a family of DNA polymerases present in 39 other adenoviruses.
A potentially important finding was a gene located in positions 10443–11138 of the genome of Alcelaphine herpesvirus 1 (NC_002531) coding for a 231 amino acid long putative protein (NP_597933). Initially, the new protein was shown to be similar to the uncharacterized putative protein ORF E4 (NP_042601, AAC13792) of unclassified γ-herpesvirus Equine herpesvirus 2. A subsequent PSI-BLAST search revealed a striking similarity between these two proteins and recently discovered antagonists of the lymphocryptovirus antiapoptotic BCL-2 proteins (30). Later, the sequence of a third non-lymphocryptovirus protein, hypothetical v-BCL2 of another unclassified γ-herpesvirus (Porcine lymphotropic herpesvirus 1), was released (31) and we found its sequence to be very similar to the newly identified protein (NP_597933). The PSI-BLAST search profile built from the three proteins further identified similarity with ORF1 protein of Callitrichine herpesvirus 3 (a lymphocryptovirus BALF1-like BCL-2 like protein) and with the BALF1 protein (AAK01916) of Allitrichine herpesvirus 3 (a lymphocryptovirus) with E-values of 8 × 10⁻⁴ and 0.007, respectively. This range of E-values has been characterized as being indicative of significant sequence similarity (32). The output of the third iteration of PSI-BLAST included all the BALF1-like proteins at the top of the list. Human GRS protein and other BCL-2-like non-viral proteins were also present in the list at a substantial score distance.
In the next round of analysis, RPS-BLAST (the NCBI program comparing protein sequences with the Conserved Domain Database) readily detected a BCL motif in all three non-lymphocryptovirus proteins. Moreover, multiple alignment by hierarchical clustering (34) of the newly predicted protein (NP_597933) with proteins NP_042601, AAM22111 and all the lymphocryptovirus BALF1 proteins (Fig. ) further supported the probable functional significance of the observed pairwise similarity by making evident the patterns of amino acids conserved in all sequences. Interestingly enough, a TBLASTN search failed to reveal additional un-annotated homologs of NP_597933. It is tempting to speculate that, given the function of BALF1 (30), the newly identified BALF1-like protein may be involved in a complex regulation of the host cell apoptosis, presumably as an antagonist of the herpesvirus antiapoptotic BCL-2 proteins, and, perhaps, as a part of a gene network involved in carcinogenesis.
Figure 4 MultAlin alignment of (putative) BALF1-like proteins (33). The variable N- and C-termini are shown in lower case. Protein names are abbreviated as follows: AHV-1 BALF1, BALF1 homolog (NP_597933) predicted by GeneMarkS in the genome of Alcelaphine ...
Another interesting new finding was a gene (ORF65) predicted in the genome of Epstein–Barr virus (HHV-4, NC_001345). Initially, the protein product of this gene was found to be significantly similar (with an E-value of < 10⁻⁵) to uncharacterized ORF26/ORF35 proteins of other γ-herpesviridae. The subsequent PSI-BLAST search revealed after four iterations a similarity (with an E-value of 6 × 10⁻⁴) to the ORF26/ORF35 protein family and the ORF48 protein of Equine herpesvirus 4, an α-herpesvirus. The ORF48 protein belongs to the UL14 family of proteins which are present in a minor component of the virion tegument and possess heat shock protein-like functions (35). Eight further PSI-BLAST iterations brought up all the members of this family. Multiple alignment of the ORF26/ORF35 and UL14-like protein sequences (Fig. ) highlights common features that could not be readily seen in pairwise alignments, in particular similar patterns of distribution of charged residues. The observed sequence similarity strongly indicates a common function which remains to be determined by direct experiments. It is likely that these proteins play an important role since the members of the ORF26/ORF35 protein family are now confirmed to be present in all complete genomes of γ-herpesviruses. Interestingly, none of the β-herpesvirus genomes has a TBLASTN detectable homolog of ORF26/ORF35 or UL14, which indicates that ORF26/ORF35 proteins are likely to fulfill a subfamily-level function.
Figure 5 Alignment of the sequences of ORF26/ORF35 and UL14-like proteins. For most sequences, the N- and C-termini are not shown. The coloring is as in Figure . The protein gi numbers and the organism names are: HHV-4 GeneMark_65 prediction ...
Some coding regions in viral genomes were missed in the earlier annotation because of their unusual organization. For instance, some viral genes contain a weak, read-through stop codon, which the original annotation treats as the end of the gene; thus, a part of the real gene (and protein) is missed. In Barmah Forest virus a GeneMarkS prediction (ORF2) recovers the second part of the non-structural polyprotein gene in positions 5679–7298, missed in the original record U73745. Only after these two parts are combined does the protein (NC_001786) show full-length similarity to the complete polyprotein encoded, for instance, in Ross River virus.
The vast majority of genes in viral genomes have no introns. There are, however, a few genes with introns and even some with whole separate genes located inside introns, such as an IE glycoprotein gene, HCMVUL37, in Human herpesvirus 5 (NC_001347). Genes interrupted by introns were identified by GeneMarkS as series of separate protein-coding ORFs. For instance, in Enterobacteria phage T4 (introns may appear not only in viruses of eukaryotic hosts but in phages as well) a gene for DNA topoisomerase small subunit protein (NC_000866) consists of two exons both predicted by GeneMarkS as separate ORFs. Developing an ab initio approach for exact prediction of introns in viral genes is a challenging problem. However, quite frequently the combination of data obtained by intrinsic and extrinsic methods becomes easily amenable to further delineation of exon–intron structure by expert analysis. For instance, in the complete genome of Human adenovirus D (Human adenovirus type 17), GeneMarkS revealed 32 potential genes or gene fragments missed in the original annotation (AF108105). Only 11 of them appeared to be complete genes while the other 21 predicted coding regions were manually assembled into nine genes in the RefSeq record (NC_002067).
The examples discussed above, in which new ab initio predictions were confirmed and functionally characterized by subsequent application of an extrinsic method, make it quite plausible that many not yet confirmed ab initio predictions will be supported extrinsically as more DNA and protein data become available. Still, the absence of similarity to known proteins may also indicate the uniqueness of a protein whose expression and function might be established only by direct experiments.
The VIOLIN database
Newly defined genome annotations were compiled in the VIOLIN database (http://opal.biology.gatech.edu/GeneMark/VIOLIN/). This database currently has a flat text file architecture. Differences between the VIOLIN and GenBank annotations are visualized by color codes (Fig. ). The VIOLIN web site provides hypertext links to the NCBI similarity search programs directly from a genome annotation record. For a gene exactly matching an already known one, the line citing its coordinates is linked to the original gene record in GenBank as well as to the BLink program providing up-to-date information on the protein product (the BLink program, ‘BLAST Link’, displays the prerecorded results of BLAST searches that have been done for every protein sequence in the Entrez proteins data domain). For a predicted gene with no exact or partial match to the previous annotation, links to the programs PSI-BLAST and RPS-BLAST allow one to proceed with further up-to-date characterization of the putative protein. Genes annotated in a GenBank record but not confirmed by our analysis are shown at the bottom of the VIOLIN record with links to the BLink, PSI-BLAST and RPS-BLAST programs to help re-analyze the previously annotated genes.
Snapshot of a sample viral genome record as it appears at the VIOLIN web site.
VIOLIN has been regularly used by the NCBI curators to improve the annotation of viral genomes in the RefSeq collection (36). Gene predictions have been subjected to additional analysis and manual curation by NCBI staff for quality control and functional assignment. Some of the new findings that originally appeared in VIOLIN and that are now included in annotations of 86 viral genomes in the RefSeq collection are shown in Table . For example, in Fowl adenovirus D (NC_000899) 14 proteins have been added to the 15 existing in the original GenBank record AF083975. This was a particularly difficult case because many of the newly added genes were disrupted by frameshifts that likely resulted from sequencing errors. The new tentative protein sequences were assembled from fragments predicted by GeneMarkS using the ORF Finder (R. Tatusov and T. Tatusova, unpublished results) and BLASTP searches. In another example, in Lymphocystis disease virus (NC_001824) 110 coding regions were identified while the original GenBank record (AF083975) contained only one gene for a major capsid protein.
Sample of the newly added RefSeq genes identified by the statistical gene finding methods described in this work