A major challenge facing bioinformatics today is how to effectively annotate an exponentially increasing body of publicly available sequence data. While using expert curators to assign functions to sequences might be considered the least error-prone approach, it is far slower than annotation by automated software. On the other hand, automated function annotators often rely on curated sources of information from which to make predictions.
It is a commonly held view that curated sequence annotations are of better quality than automated annotations; however, the error rate of curated annotations can be significant. Estimates of the error rate of curated protein and gene-name annotations in bacterial genome sequences lie between 6.8% and 8% [1, 2]. The error rate of curated eukaryotic sequence annotations is far higher. Artamonova et al. (2005) [3] examined UniProt/SwissProt database annotations, consisting of five distinct types of annotation entries, and found an error rate of between 33% and 43%. As this database is widely considered to have a very high standard of curation, we might infer that other sequence databases have at least this annotation error rate, if not higher.
The quality of existing sequence annotations affects the quality of future sequence annotations through the common practice of basing annotations on sequence similarity. Errors in the use of sequence-similarity-based annotation strategies have been implicated in a number of commonly described annotation errors [4-6]. A common error is to place too much weight on the best-matching sequence without reviewing the significance of the match. Often this leads to false annotations due to a failure to recognise significant differences in protein domains [4, 6] or open reading frames [5]. Overall, problems such as these lead to an increase in error rates of 5%–40% in annotations based on sequence similarity to previously annotated proteins [7].
There has been some discussion in the literature highlighting the importance of annotation error propagation [4, 8]. Errors made by curators during the initial annotation of sequences can generate further errors in other data sources, owing to the widespread use of sequence-similarity-based annotation methods. For instance, misannotation of proteins as IMP dehydrogenase was found to propagate from the PIR-PSD database to SwissProt/TrEMBL, GenBank, and RefSeq [6]. The initial misannotation was caused by an error made while inferring the protein's name from sequence similarity. In some cases where IMP dehydrogenase was erroneously assigned to sequences, the IMPDH domain was missing, whereas the CBS domains commonly found in IMP dehydrogenases were present [6]. Errors such as these may never be corrected. Meanwhile, new annotations may be based on erroneous annotations, which in turn may have been based on other erroneous annotations, and so on. Such 'chains of misannotation' [8] can lead to a progressive increase in annotation error rates.
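The compounding effect of such chains can be illustrated with a deliberately simple model. The sketch below is not the model of [8]; it assumes, purely for illustration, that the source annotations have an error rate `source_error`, that each similarity-based transfer independently introduces a new error with probability `transfer_error`, and that an erroneous annotation is never corrected downstream. The function name and the numeric values are hypothetical.

```python
def chain_error_rate(source_error: float, transfer_error: float, steps: int) -> float:
    """Error rate after `steps` successive similarity-based annotation transfers.

    Illustrative assumption: an annotation is correct after a transfer only if
    it was correct before the transfer and the transfer introduced no new error.
    """
    if not (0.0 <= source_error <= 1.0 and 0.0 <= transfer_error <= 1.0):
        raise ValueError("error rates must lie in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    # Probability that the annotation is still correct at the end of the chain.
    correct = (1.0 - source_error) * (1.0 - transfer_error) ** steps
    return 1.0 - correct


if __name__ == "__main__":
    # Hypothetical values chosen only to show the trend along a chain.
    for n in range(5):
        print(n, round(chain_error_rate(0.07, 0.05, n), 3))
```

Under these assumptions the error rate can only grow with the length of the chain, which is the qualitative behaviour described above.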
Sequence annotation data generated by numerous projects has been submitted to the Gene Ontology (GO) Consortium and is available for download in various database releases [9]. A common use of this data source is to predict the function of novel proteins by using BLAST to find similar annotated sequences in the database. This may be done by a biologist to find candidate GO terms for a sequence, or automatically by a growing body of electronic annotators [10-15]. The reliability of such inductive reasoning is determined by the correctness of the original sequence annotations. If the error rate of the source annotations is high, then we would expect the error rate of predictions based on them to be at least as high. The current widespread use of sequence-similarity-based annotation methods simply assumes that the input sequence annotations are correct. Without an understanding of the error rate of GO sequence database annotations, it is not possible to assess the validity of making new sequence annotations based on evidence from existing ones.
In contrast to other forms of annotation, such as gene or protein name annotations, GO terms are used to describe the biological context of sequences. Indeed, GO term annotation has become the standard method by which functional information is attributed to sequence data. As far as the authors are aware, at the time of writing there is no published account systematically examining the error rate of curated GO term annotations. However, case studies [4-7] and mathematical models [8] have shown that using sequence similarity to infer a new annotation is likely to be error-prone. As each sequence annotation in the GO database has an evidence code, we can directly determine the impact of sequence-similarity-based annotation on the error rate of curated sequence annotations.
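The role of evidence codes in such a comparison can be sketched as follows. This is a minimal illustration, not the method used in this study: it assumes each annotation has already been judged correct or erroneous by some means, and it treats the GO evidence code `ISS` (Inferred from Sequence or Structural Similarity) as marking similarity-based annotations. The records, the function name, and the grouping are all hypothetical.

```python
from collections import defaultdict

# Toy records: (GO evidence code, whether the annotation was judged erroneous).
# These values are invented for illustration and are not results from the study.
TOY_ANNOTATIONS = [
    ("ISS", True), ("ISS", False), ("ISS", False), ("ISS", True),
    ("IDA", False), ("IDA", False), ("IPI", False), ("TAS", True),
]


def error_rate_by_group(annotations, similarity_codes=frozenset({"ISS"})):
    """Return {group: error rate} for similarity-based versus other annotations."""
    counts = defaultdict(lambda: [0, 0])  # group -> [errors, total]
    for code, is_error in annotations:
        group = "similarity" if code in similarity_codes else "other"
        counts[group][0] += int(is_error)
        counts[group][1] += 1
    return {group: errors / total for group, (errors, total) in counts.items()}


if __name__ == "__main__":
    print(error_rate_by_group(TOY_ANNOTATIONS))
```

Comparing the two rates indicates whether similarity-based annotations are more error-prone than the rest; with real data, the comparison would also need a measure of statistical uncertainty.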
The aims of this study are therefore to a) develop an approach to estimating the error rate of GO term annotations, b) use this method to estimate the error rate of GO term sequence annotations submitted to the GOSeqLite database, and c) determine the impact, if any, of using sequence-similarity-based annotation methods on the error rate of annotations.