Assignment of terms ‘inferred from sequence or structural similarity’ (ISS) is a potential source of errors within existing annotations. In a sequence analysis, for example, if the top hits are all kinases it may seem reasonable to assume that the query sequence is also a kinase. Chances are, however, that none of the matching sequences is an experimentally characterized kinase; each simply resembles other sequences called kinases, and so on. With the expansion of genomic data, these inferences based upon inferences are becoming increasingly common and are potentially misleading. To eliminate errors from such transitive annotations, the reference genome group have agreed to limit GO annotation based on sequence similarity to cases where the similar gene product has been experimentally characterized. The gene product identified in the supporting evidence (‘with’ column of the gene-association file) must itself be annotated with the same (or a more specific) term, assigned with an experimental evidence code. FlyBase has reviewed its current set of ISS annotations and found that, for annotations where the similar sequence is recorded, just over 100 were made to gene products that did not have a GO annotation based on experimental evidence codes. These have all been revised to conform to the new annotation standards. The second part of the ongoing ISS review involves checking the terms for old annotations where the existence of a similar sequence was not recorded.
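The ISS rule described above can be expressed as a simple check. The following Python sketch is illustrative only and is not FlyBase's curation code: the record layout, the gene names and the toy term hierarchy are assumptions made for the example, while the experimental evidence codes listed (EXP, IDA, IPI, IMP, IGI, IEP) are the standard GO codes. An ISS annotation passes only if the gene product in its ‘with’ field carries the same or a more specific term with an experimental code.

```python
from typing import NamedTuple, Optional

# Standard GO experimental evidence codes.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


class Annotation(NamedTuple):
    """Simplified annotation record (hypothetical layout, not the full file format)."""
    gene: str
    term: str
    evidence: str
    with_gene: Optional[str] = None  # content of the 'with' column, if recorded


# Toy is_a hierarchy (child -> parents); an assumption for illustration only.
TOY_PARENTS = {
    "protein kinase activity": {"kinase activity"},
    "kinase activity": {"catalytic activity"},
}


def ancestors_and_self(term, parents):
    """Return the term together with all of its ancestors in the hierarchy."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def unsupported_iss(annotations, parents):
    """List ISS annotations whose 'with' gene lacks experimental support."""
    # Index the experimentally supported terms of every gene.
    experimental = {}
    for a in annotations:
        if a.evidence in EXPERIMENTAL_CODES:
            experimental.setdefault(a.gene, set()).add(a.term)

    flagged = []
    for a in annotations:
        if a.evidence != "ISS":
            continue
        support = experimental.get(a.with_gene, set())
        # Supported if an experimental term on the 'with' gene is the same as,
        # or more specific than (a descendant of), the ISS term.
        if not any(a.term in ancestors_and_self(t, parents) for t in support):
            flagged.append(a)
    return flagged


# Hypothetical example data.
ANNOTATIONS = [
    Annotation("geneB", "protein kinase activity", "IDA"),
    Annotation("geneA", "kinase activity", "ISS", "geneB"),  # supported
    Annotation("geneD", "kinase activity", "ISS", "geneB"),  # supported
    Annotation("geneC", "kinase activity", "ISS", "geneD"),  # transitive: geneD is ISS only
    Annotation("geneF", "kinase activity", "IDA"),
    Annotation("geneE", "protein kinase activity", "ISS", "geneF"),  # experimental term too general
]

for bad in unsupported_iss(ANNOTATIONS, TOY_PARENTS):
    print(bad.gene, bad.term, "with", bad.with_gene)
```

In this sketch an ISS annotation with no recorded similar sequence is also flagged, which corresponds to the older annotations that require manual review.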
In the interests of focusing on experimental data and attributing terms directly to publications that contain that data, FlyBase no longer curates new GO annotations based on review articles. As GO annotations for each gene are revised, existing terms based on author statements are traced to the original publication (where possible). Occasionally no experimental support for the term can be found in Drosophila and the term is removed. Similarly, we no longer assign GO terms based on the names of gene products in records submitted to DNA or protein sequence databases. Finally, no new GO terms will be assigned to gene products based on meeting abstracts; this information is better captured from subsequent publications where the data are presented in full.
FlyBase supplements manual GO curation with electronically predicted terms. We have recently standardized our ‘inferred from electronic annotation’ (IEA) data such that it is based on a single source: a mapping between InterPro protein domains and GO terms (6). This manually generated mapping is under constant revision, partly based on feedback from GO curators, and is considered to be 91–100% accurate (8). GO annotations based on InterPro domains are now updated for every new release of FlyBase and the InterPro domain ID is now included in each annotation. IEA data from other sources, which were several years old and frequently redundant (e.g. terms based on Panther protein signatures which are now incorporated in InterPro), have now been removed. This has also eliminated potentially confusing discrepancies between the GO annotation sets available from FlyBase and the GOC. The GOC recommends that electronically predicted data be revised annually and, in an effort to enforce good practice, removes any annotations from submitted data sets with IEA evidence codes that are >1 year old.
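The age limit on IEA annotations can be illustrated with a short filter. This Python sketch is not the GOC's actual procedure: the tuple layout, the example records and the 365-day interpretation of ‘1 year’ are assumptions, and dates are written as YYYYMMDD strings as in gene-association files. Annotations with other evidence codes are retained regardless of age.

```python
from datetime import date, datetime


def drop_stale_iea(annotations, today, max_age_days=365):
    """Keep all annotations except IEA ones older than max_age_days.

    Each annotation is a (gene, term, evidence, date_string) tuple;
    this layout is a simplification chosen for the example.
    """
    kept = []
    for gene, term, evidence, date_string in annotations:
        annotated_on = datetime.strptime(date_string, "%Y%m%d").date()
        age_in_days = (today - annotated_on).days
        if evidence == "IEA" and age_in_days > max_age_days:
            continue  # electronic prediction not refreshed within the limit
        kept.append((gene, term, evidence, date_string))
    return kept


# Hypothetical example records.
RECORDS = [
    ("geneA", "kinase activity", "IEA", "20070101"),  # stale IEA
    ("geneB", "kinase activity", "IEA", "20080501"),  # recent IEA
    ("geneC", "kinase activity", "IDA", "20030101"),  # experimental, any age
]

for record in drop_stale_iea(RECORDS, today=date(2008, 9, 1)):
    print(record)
```

Passing the reference date explicitly, rather than reading the system clock inside the function, keeps the result reproducible for a given release.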
The accompanying table shows a summary of our current GO data for D. melanogaster. Although many of the changes in GO annotation policy are quite recent, we can already see an improvement. While absolute annotation numbers are relatively unchanged because of deleted IEA data, the number of annotations based on experimental evidence has improved dramatically compared to FlyBase release FB2006_01.
Comparison of total GO annotations in FlyBase releases FB2006_01 and FB2008_08 for all D. melanogaster genes (including those not yet located to the genome) by evidence type