In the last 15 years, since the determination of the complete sequence of the Haemophilus influenzae strain Rd genome (1), there has been a rapid increase in the number of prokaryotic genomes sequenced each year. With the cost of DNA sequencing continuing to drop, this has led to an explosion in the number of genes that are predicted computationally, but for which no solid functional annotation can be provided (2). This is illustrated by the distribution of annotated genes in selected genomes, which shows that, at best, perhaps 70% of the genes in a genome either have known, experimentally validated functions or can be assigned a function computationally on the basis of sequence similarity, often with varying or unknown degrees of confidence. With each new genome typically containing anywhere from 500 to 1000 new genes of unknown function, we face the daunting challenge of determining those functions so that the annotation of new genome sequences can be carried out computationally with just a few key functions being tested experimentally. This means that our ability to predict function computationally will need to be quite accurate and must include all genes.
Distribution of annotated genes in selected genomes
Currently, the quality of computational predictions of function is far from perfect. Indeed, for many of the genes in GenBank the present annotations are either incorrect or so general as to be of little value to the user (3–6). The reason is that by far the most common way of making predictions is to check each newly predicted gene for similarity to genes annotated in the INSDC (International Nucleotide Sequence Database Consortium) databases (7–9). When a new gene shares high sequence similarity with an annotated gene, it is assigned the same function as that presumed known gene. If the two are identical or nearly so, this method is quite reliable. However, when the degree of sequence similarity is poor, or even when it is reasonably high but with a few key amino acid differences, problems can arise, because one can be less sure that the new gene really is an ortholog of the known gene. Perhaps the new gene encodes a protein whose function is similar to the known one but differs in some subtle way, such as a changed substrate preference. Unless the sequence differences are interpreted properly, so that the new protein's function is not declared identical to that of the known gene product, a mis-annotation ensues and will likely be propagated (3–6).
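The similarity-based transfer described above can be sketched in a few lines. This is only an illustration under stated assumptions: the two sequences are taken to be already aligned (gaps written as `-`), the function names `percent_identity` and `transfer_annotation` are hypothetical, and the 90% and 40% cutoffs are arbitrary placeholders rather than values taken from the text or from any annotation pipeline.

```python
from typing import Tuple


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences carry a residue.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def transfer_annotation(identity: float, known_function: str,
                        high: float = 90.0, low: float = 40.0) -> Tuple[str, str]:
    """Return (annotation, confidence) for a new gene.

    The cutoffs `high` and `low` are illustrative assumptions only.
    """
    if identity >= high:
        # Near-identical sequences: transfer is usually reliable.
        return known_function, "high"
    if identity >= low:
        # Intermediate similarity: orthology is uncertain, so mark the call as putative.
        return f"putative {known_function}", "low"
    # Too dissimilar to transfer anything.
    return "hypothetical protein", "none"


identity = percent_identity("MKT-AL", "MKTGAV")
annotation, confidence = transfer_annotation(identity, "protein methyltransferase")
print(identity, annotation, confidence)
```

Recording the confidence next to the transferred label is one way to keep a weak assignment from being read, and later propagated, as if it were an experimentally validated one.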
A number of now classic examples of such mis-annotations have been noted in the literature, and only when biochemical experiments were carried out could the annotation be corrected. One such example is the family of genes labeled hemK. These genes had been annotated as encoding either a protoporphyrinogen oxidase or a DNA methyltransferase. It later turned out that the hemK gene in Escherichia coli actually encodes a protein methyltransferase, a finding of considerable interest because the hemK gene is widely conserved from humans all the way to bacteria (10), although further testing on remote homologs would still be appropriate. This emphasizes the need for biochemical characterization of gene products whenever possible, and certainly when the sequence similarity to a known gene product is not high enough to be certain of the assignment. The degree of caution necessary varies, depending on the level of conservation and on where in the protein the differences lie. In some cases, one or a few amino acid changes in a region of a protein responsible for substrate recognition can completely alter its function. Unfortunately, we often do not know in which regions of a protein we should look for such changes, and the computer blindly labels the new gene incorrectly.
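Where the substrate-recognition region of a characterized protein is known, a simple check can flag a new sequence whose residues differ at those positions even though overall similarity is high. The sketch below assumes pre-aligned sequences and a caller-supplied set of 0-based alignment columns; the function name `key_site_differences` and the example sequences and column are hypothetical and do not describe any real protein.

```python
from typing import Iterable, List, Tuple


def key_site_differences(aligned_known: str, aligned_new: str,
                         key_columns: Iterable[int]) -> List[Tuple[int, str, str]]:
    """List (column, known_residue, new_residue) for each key column that differs.

    `key_columns` are 0-based alignment columns assumed to be involved in
    substrate recognition; in practice these are often unknown.
    """
    if len(aligned_known) != len(aligned_new):
        raise ValueError("aligned sequences must have equal length")
    return [
        (col, aligned_known[col], aligned_new[col])
        for col in sorted(key_columns)
        if aligned_known[col] != aligned_new[col]
    ]


# Toy example: five of six residues match, but the single change is at a key column.
diffs = key_site_differences("MKTDAL", "MKTNAL", key_columns={3})
if diffs:
    print("do not transfer function without experimental testing:", diffs)
```

A non-empty result is a reason to withhold the transferred annotation pending biochemical characterization. An empty result proves little, since the relevant positions may simply not be in the supplied set.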