Even considering all of the issues above, one might reasonably expect that as protein databases have grown, annotation has improved and that recently annotated genomes (at least) will be of the highest quality. This is not quite true. What is true is that a BLAST search of a protein that is run today will yield far more results than it would have five or ten years ago, and these results in turn should lead to better annotation. Not all software is equally good, however, and the annotation pipelines vary considerably in their quality. There is also wide variation in the skills and experience of those operating the pipelines. Further complicating matters, some genomes are subjected to careful curation and review, whereas others receive only automated annotation. In the early days of sequencing, the sequencing teams included experts on the biology of each genome, and their manual curation dramatically improved the annotation of those species. Today that is no longer true: high-throughput sequencing centers are large, efficient factories with unique expertise in the methods necessary for sequencing, but they sometimes have very little expertise on the biology of the species they are sequencing. The inconvenient truth is that, as a result of these factors and others, some genomes are poorly annotated even today.
There are several ways in which genome annotation can be erroneous. The first and most fundamental is simply that the gene models may be wrong. Although bacterial gene-finding systems [6] are highly accurate, finding 98-99% of protein-coding genes in most species, they still occasionally miss genes. Their accuracy at placing the start site is a bit lower, probably closer to 90%, which is excellent but far from the perfect accuracy that some might expect. In the past, the accuracy of (bacterial) start-site prediction was closer to 80%, and many of the genomes in GenBank were predicted with earlier versions of gene finders. Note that all these accuracy figures are much lower for eukaryotic annotation. Some annotation pipelines include algorithms to adjust start sites, which can be done by looking closely at the boundaries of alignments to homologous proteins.
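The idea of adjusting a start site from alignment boundaries can be illustrated with a minimal sketch. It is not the algorithm of any particular pipeline: the function name `adjust_start`, the codon-offset convention, and the use of a median are assumptions made purely for illustration. The reasoning is that if homologs consistently begin aligning several residues into the predicted protein, the true start probably lies downstream, whereas if the alignments begin partway into the homologs, it probably lies upstream.

```python
from statistics import median


def adjust_start(candidate_offsets, alignments):
    """Pick the candidate start codon best supported by homolog alignments.

    candidate_offsets: in-frame start-codon positions, in codons relative to
        the predicted start (0 = predicted start, negative = upstream).
    alignments: (query_start, subject_start) pairs, the 1-based residue
        positions where each alignment begins in the predicted protein
        (query) and in the homolog (subject).
    Returns the chosen offset from candidate_offsets.
    """
    if not alignments:
        return 0  # no homology evidence: keep the predicted start

    # If a homolog's residue s aligns to query residue q, the homolog's
    # first residue would fall q - s codons from the predicted start.
    implied = [q - s for q, s in alignments]
    target = median(implied)

    # Choose the nearest real start codon; on ties prefer the smaller shift.
    return min(candidate_offsets, key=lambda off: (abs(off - target), abs(off)))


# Hypothetical case: three homologs start aligning about 10 residues in.
downstream = adjust_start([-12, 0, 9, 30], [(11, 1), (10, 1), (12, 2)])

# Hypothetical case: alignments start about 11 residues into the homologs.
upstream = adjust_start([-12, 0, 9, 30], [(1, 12), (1, 13), (1, 12)])

print(downstream, upstream)
```

A real implementation would also have to weigh alignment quality and check that no in-frame stop codon lies between the old and new start; this sketch ignores both.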
False positives represent another type of erroneous annotation: when the prediction of a gene-finding program does not match any previously known protein, the annotators (or the annotation pipeline software) must decide whether or not to include that prediction in the gene list. Over the years, annotation groups have used a variety of rules to make this decision, and they have inevitably included thousands of false predictions in the publicly available genome annotation. These predictions are mostly harmless unless they result in effort being expended trying to verify them. In some cases, too, they might 'hide' functional RNA genes or true genes in a different reading frame from that of the false prediction.
Perhaps the biggest problem with genome annotation is erroneous and inconsistent naming of genes. Much of this is due to the simple fact that our knowledge of genes has improved but the annotation has remained static. Thus a gene labeled 'hypothetical protein' a few years ago might now have a known function. A second problem is what's known as transitive catastrophe: the phenomenon whereby a name is transferred from one gene to another on the basis of sequence similarity (usually from a BLAST search) but where the original name is incorrect. As more genomes are annotated, and more BLAST searches are run, the name gets transferred to other proteins, and the original source of the name quickly becomes lost. It is well known in the genomics community that thousands of such transitive errors have propagated through sequence databases, and efforts are under way to try to clean up some of the mess. In the meantime, though, many genes remain incorrectly annotated.
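How quickly a single wrong name can spread may be easier to see in a toy simulation. The sketch below is hypothetical throughout: the protein identifiers, the name, and the rule "copy the name of the best hit" are assumptions, and the rule is a deliberate simplification of what real pipelines do.

```python
def propagate_names(seed_names, best_hits):
    """Transfer names along best-hit links and track how far each has travelled.

    seed_names: protein id -> name, for the proteins annotated first.
    best_hits: (new_protein, best_hit_protein) pairs, in the order in which
        the new proteins are annotated.
    Returns protein id -> (name, number of transfers since the seed).
    """
    annotation = {pid: (name, 0) for pid, name in seed_names.items()}
    for new_protein, hit in best_hits:
        if hit not in annotation:
            # No annotated homolog: fall back to the default label.
            annotation[new_protein] = ("hypothetical protein", 0)
            continue
        name, hops = annotation[hit]
        # The name is copied without consulting its original source.
        annotation[new_protein] = (name, hops + 1)
    return annotation


# Hypothetical example: P1 was mislabeled when it was first annotated.
seed = {"P1": "enzyme X (incorrect)"}
hits = [("P2", "P1"), ("P3", "P2"), ("P4", "P3"), ("P5", "P2"), ("P6", "Q9")]

result = propagate_names(seed, hits)
affected = [pid for pid, (name, _) in result.items() if name == seed["P1"]]

print(len(affected), "of", len(result), "proteins carry the incorrect name")
print("P4 is", result["P4"][1], "transfers away from the original annotation")
```

In this toy run, five of the six proteins end up with the incorrect name, and P4 sits three transfers away from its source, which is why the origin of a name is so hard to trace once it has spread.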
Let us consider just one example, selected more or less at random from the bacterium H. influenzae. The gene fdxH encodes formate dehydrogenase, β subunit, GenBank accession number NP438180. When the genome was sequenced in 1995, this gene (encoding a 312 amino acid protein) was similar to very few other genes; even the orthologous Escherichia coli gene was not yet sequenced. It is very difficult today to reconstruct what the best BLAST hit was back then, but today there are 197 highly significant BLAST hits to 123 distinct species. Thus, it is pretty clear that this gene today should be well-annotated because of the multitude of highly similar proteins. Yet if we look at the list of matching proteins, we find a variety of names given, including not only the name found on NP438180 itself, but also: formate dehydrogenase-O β subunit; formate dehydrogenase, nitrate-inducible, iron-sulfur subunit; HybA protein; formate dehydrogenase-N, Fe-S β subunit, nitrate-inducible; hypothetical protein PaerPA_01004979; hypothetical protein Bpse11_03005113; 4Fe-4S ferredoxin, iron-sulfur binding; and Twin-arginine translocation pathway signal. Some of these names seem to be synonymous, but others clearly are not. To decide properly among them, we need to look at the source of each annotation and at the species to which it is attached.
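As a rough illustration of how inconsistent these names are, the sketch below groups the nine distinct names quoted above by a crude keyword rule. The grouping function `coarse_label` and its three categories are assumptions made for illustration, not an established classification, and the tally counts distinct names rather than BLAST hits, because the number of hits per name is not given here.

```python
from collections import Counter

# Distinct names quoted in the text for NP438180 and its BLAST matches.
hit_names = [
    "formate dehydrogenase, β subunit",  # the name on NP438180 itself
    "formate dehydrogenase-O β subunit",
    "formate dehydrogenase, nitrate-inducible, iron-sulfur subunit",
    "HybA protein",
    "formate dehydrogenase-N, Fe-S β subunit, nitrate-inducible",
    "hypothetical protein PaerPA_01004979",
    "hypothetical protein Bpse11_03005113",
    "4Fe-4S ferredoxin, iron-sulfur binding",
    "Twin-arginine translocation pathway signal",
]


def coarse_label(name):
    """Assign a name to a coarse group by simple substring matching."""
    lowered = name.lower()
    if "formate dehydrogenase" in lowered:
        return "formate dehydrogenase"
    if lowered.startswith("hypothetical protein"):
        return "hypothetical protein"
    return "other"


counts = Counter(coarse_label(name) for name in hit_names)
for label, n in counts.most_common():
    print(f"{label}: {n} distinct name(s)")
```

Even this crude grouping leaves four differently worded formate dehydrogenase names, two uninformative "hypothetical protein" labels, and three names that describe something else. Substring matching cannot tell which of these are true synonyms, which is why the source and species of each annotation still have to be examined.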