Specification of taxonomic constraints
The mainstay of this inconsistency detection system is the capture of taxon specificity of GO classes using two new relationships. Where a GO class should only be used for annotation of gene products from a given taxonomic grouping, the relationship used is only_in_taxon. Conversely, where a GO class should never be used for annotation of gene products from a given taxonomic grouping, the relationship is never_in_taxon. The syntax in which this information is recorded, and that of the other associated files, can be viewed at the locations noted in the methods section.
Where a GO class X has the only_in_taxon relationship to a taxonomic group Y, this indicates that the GO class and its sub-types and parts should only be used for annotation of gene products from organisms of that taxonomic group and its sub-types. There may be some sub-types of the taxonomic group that do not carry out the process, but there will certainly be no examples of the process outside of the named taxonomic group. To give an example, if the class 'lactation' is restricted to use with Mammalia (lactation only_in_taxon Mammalia - Figure ), then this class may only be used for annotation of mammalian gene products. As the relationship is inherited by all sub-types of Mammalia, the class can be used for annotation of gene products from species such as Ornithorhynchus anatinus (platypus) and Desmalopex leucopterus (white-winged flying fox), but not for species outside of Mammalia such as Arabidopsis thaliana (thale cress) and Gallus gallus (chicken). The constraint is inherited by sub-types and parts of the GO class, and it can be seen in Figure that 'lactation' inherits this constraint from the GO class 'mammary gland development'. The only_in_taxon relationship corresponds to the previously published specificity relationship [8]. The checking system currently contains 443 only_in_taxon constraints (January 2010). We anticipate that there will be scope for a great expansion in the number of constraints; however, these are added as the terms are spotted by curators, so the number will continue to build up gradually for some time.
Where a GO class X has the never_in_taxon relationship to a given taxonomic group, this indicates that the GO class and its sub-types and parts should never be used for annotation of gene products from organisms of that taxonomic group or its sub-types. It also indicates that there is no restriction on using the GO class for annotation of gene products from any taxonomic group outside of the one mentioned. To give an example, if the cellular component class 'secretory granule' has the relationship never_in_taxon to the taxonomic group Ascomycota, then the class cannot be used for annotation of gene products from any of the Ascomycota, including Schizosaccharomyces pombe (fission yeast) and Saccharomyces cerevisiae (baker's yeast). This relationship does not place any restriction on using the class outside of this taxonomic grouping. The never_in_taxon relationship is particularly useful in cases where gene products of some taxa are known to be inappropriate for annotation to a given GO class, but where we do not yet have enough information to make an only_in_taxon grouping, or in situations where it would be inappropriate to make an only_in_taxon relationship because the class is widely applicable, having just a few exceptions. The checking system currently contains only two never_in_taxon constraints, as we try to use the more comprehensive only_in_taxon relationship where possible.
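The semantics of the two relationships can be illustrated with a minimal Python sketch. It is not the checking system described here: the taxonomy is a small hand-made fragment rather than the NCBI hierarchy, the constraint triples are taken from the examples in the text, and the names TAXON_PARENT, CONSTRAINTS, taxon_lineage and is_permitted are hypothetical. Inheritance of constraints through the GO hierarchy is left out for brevity.

```python
# Toy taxonomy (child -> parent). A simplified, hand-made fragment used only
# for illustration; it is not the NCBI taxonomy hierarchy.
TAXON_PARENT = {
    "Ornithorhynchus anatinus": "Mammalia",
    "Gallus gallus": "Aves",
    "Mammalia": "Vertebrata",
    "Aves": "Vertebrata",
    "Saccharomyces cerevisiae": "Ascomycota",
    "Schizosaccharomyces pombe": "Ascomycota",
    "Ascomycota": "Fungi",
}

# (GO class, relationship, taxonomic group) triples from the examples in the text.
CONSTRAINTS = [
    ("lactation", "only_in_taxon", "Mammalia"),
    ("secretory granule", "never_in_taxon", "Ascomycota"),
]


def taxon_lineage(taxon):
    """Yield the taxon itself, then each of its ancestors in turn."""
    while taxon is not None:
        yield taxon
        taxon = TAXON_PARENT.get(taxon)


def is_permitted(go_class, taxon):
    """Return True if no constraint on go_class forbids annotating this taxon."""
    lineage = set(taxon_lineage(taxon))
    for constrained_class, relationship, group in CONSTRAINTS:
        if constrained_class != go_class:
            continue
        # only_in_taxon: the organism must fall inside the named group.
        if relationship == "only_in_taxon" and group not in lineage:
            return False
        # never_in_taxon: the organism must fall outside the named group.
        if relationship == "never_in_taxon" and group in lineage:
            return False
    return True
```

Under these assumptions, is_permitted("lactation", "Gallus gallus") is False, whereas is_permitted("secretory granule", "Gallus gallus") is True, because never_in_taxon places no restriction outside Ascomycota.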
Taxon classes are drawn from the NCBI taxonomy hierarchy and supplemented with union classes created for use in-house. For example, to capture the set of organisms carrying out photosynthesis in any form we have created the union class 'Bacteria or Archaea or Viridiplantae or Euglenozoa' (Figure ). This is necessary because sub-types of all of these classes carry out photosynthesis, but in the NCBI taxonomy hierarchy there is no common super-class that includes all of these groups. Where sub-types of a taxon-restricted GO class have narrower implicit taxon specificity than the ancestor class, this is asserted by applying a stricter relationship. For example, photosynthesis is restricted to use with gene products of the group that is the union of 'Bacteria or Archaea or Viridiplantae or Euglenozoa'. However, the sub-type of photosynthesis known in GO as 'PEP carboxykinase C4 photosynthesis' is restricted to use with the smaller Viridiplantae group (Figure ). This narrower taxonomic group further constrains the applicability of the class relative to the ancestor GO class.
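Union classes and the narrowing of a constraint on a sub-type can be sketched in Python as follows. This is an illustration only: the taxonomy is a toy fragment, the GO ancestry is collapsed to a single hypothetical link between the two classes named in the text, and the names UNION_CLASSES, GO_PARENT, ONLY_IN_TAXON, lineage, in_group and allowed are all assumptions. A GO class is treated as usable for a taxon only if every only_in_taxon constraint on the class or on any of its ancestors is satisfied.

```python
# Union class from the text, expanded to its member groups.
UNION_CLASSES = {
    "Bacteria or Archaea or Viridiplantae or Euglenozoa":
        {"Bacteria", "Archaea", "Viridiplantae", "Euglenozoa"},
}

# Toy taxonomy (child -> parent); heavily simplified, not the NCBI hierarchy.
TAXON_PARENT = {
    "Arabidopsis thaliana": "Viridiplantae",
    "Synechocystis": "Bacteria",
    "Gallus gallus": "Metazoa",
}

# Toy GO ancestry (child -> parent), collapsed to one link for illustration.
GO_PARENT = {"PEP carboxykinase C4 photosynthesis": "photosynthesis"}

# only_in_taxon constraints: GO class -> taxonomic group or union class.
ONLY_IN_TAXON = {
    "photosynthesis": "Bacteria or Archaea or Viridiplantae or Euglenozoa",
    "PEP carboxykinase C4 photosynthesis": "Viridiplantae",
}


def lineage(node, parent_map):
    """Return the node followed by all of its ancestors."""
    path = []
    while node is not None:
        path.append(node)
        node = parent_map.get(node)
    return path


def in_group(taxon, group):
    """True if the taxon or one of its ancestors belongs to the group."""
    members = UNION_CLASSES.get(group, {group})
    return any(t in members for t in lineage(taxon, TAXON_PARENT))


def allowed(go_class, taxon):
    """Check the constraints on the class itself and every inherited one."""
    return all(
        in_group(taxon, ONLY_IN_TAXON[c])
        for c in lineage(go_class, GO_PARENT)
        if c in ONLY_IN_TAXON
    )
```

In this sketch a bacterial gene product passes the check for 'photosynthesis' but fails it for 'PEP carboxykinase C4 photosynthesis', because the sub-type carries the stricter Viridiplantae constraint in addition to the inherited one.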
Consistency checking using taxon constraints
The main utility of this set of formalized constraints is in checking for inconsistencies between the annotations and the ontologies. A script is run once a week to check for annotations that contravene the constraints (see methods). For example, one of the checks is to see if any gene products from species outside of the taxon Mammalia have been annotated to the GO class 'lactation' or to any of its sub-types. Discovery of an annotation contravening such a constraint would give a clear indication that work was required to improve either the ontology or the annotation. All annotations in the GO central repository are checked with each of the constraints, and a set of the inconsistencies flagged is made available to the groups that produced the annotations.
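The shape of such a check can be sketched in Python. This is not the script referred to in the methods; it is a minimal illustration that assumes annotations are available as dictionaries with the hypothetical keys "gene_product", "taxon", "go_class" and "assigned_by", and it uses a toy taxonomy and a single only_in_taxon constraint placed on 'mammary gland development', from which 'lactation' inherits.

```python
from collections import defaultdict

# Toy taxonomy and GO ancestry (child -> parent); simplified for illustration.
TAXON_PARENT = {"Mus musculus": "Mammalia", "Gallus gallus": "Aves"}
GO_PARENT = {"lactation": "mammary gland development"}

# only_in_taxon constraints: GO class -> taxonomic group.
ONLY_IN_TAXON = {"mammary gland development": "Mammalia"}


def self_and_ancestors(node, parent_map):
    """Yield the node itself, then each ancestor reachable via parent_map."""
    while node is not None:
        yield node
        node = parent_map.get(node)


def find_inconsistencies(annotations):
    """Group annotations that contravene a constraint by the annotating group."""
    flagged = defaultdict(list)
    for ann in annotations:
        taxa = set(self_and_ancestors(ann["taxon"], TAXON_PARENT))
        # Walk up the GO ancestry so that inherited constraints are applied.
        for go_class in self_and_ancestors(ann["go_class"], GO_PARENT):
            required = ONLY_IN_TAXON.get(go_class)
            if required is not None and required not in taxa:
                flagged[ann["assigned_by"]].append(ann)
                break  # one report per annotation is enough
    return dict(flagged)
```

Grouping the output by the "assigned_by" value mirrors the practice of returning each set of flagged inconsistencies to the group that produced the annotations.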
There are several beneficial outcomes of this regular checking. Problems in the ontology structure or annotation set are quickly spotted and corrected. A common type of error is an inaccuracy in the inheritance path down the long series of relationships in the ontology. Though these are hard to spot by eye, they are easy to automatically detect with this new checking system. Another frequently occurring problem is an ambiguity in a GO class definition that may have led annotators to interpret and use classes in a very different way from that intended by the editors. Prompt detection and reporting of such problems greatly enhances the accuracy of the ontology and the speed of correction. One of the most common errors that we have found with the checks is the annotation of a viral gene product to a cellular component term rather than the equivalent 'host' cellular component term. This can particularly be seen with the EXP and TAS annotations (Table ). In these cases many viral gene products were annotated to terms such as 'endosome lumen' instead of 'host endosome lumen'. As this appears to be a significant issue, we are reviewing our policies on the annotation of viral gene products to these terms. On closer examination we discovered that the majority of these EXP and TAS viral annotations are sourced from Reactome [9] (in fact the only annotations to use the generic EXP code are those sourced from Reactome). We are exploring the possibility of automatically fixing these annotations to use the "host" term.
| Table 1. Numbers of annotation inconsistencies found, classified by evidence code. |
The following section shows further specific examples of improvements that have been made to the annotation sets. A summary of the numbers of annotation inconsistencies being flagged by a selection of the constraints is shown in Table . It is important to note that inconsistencies may reflect problems in either the annotations or ontologies, even though they are flagged as inconsistent annotations. We have been able to make extensive improvements to both datasets as a result of these checks. A summary of the number of annotation inconsistencies that have been found and fixed is shown in Table sorted by evidence code, and in Table sorted by ontology. These tables do not include annotations from the GOA UniProtKB electronic annotation dataset, as we have not yet been able to fully check this very large dataset. We would like to stress that only a very tiny minority of annotations and GO classes are problematic, reflecting the diligence of GO annotators and ontology developers, and the quality of our electronic annotation methods. This is indicated by the very low percentage error rate shown in the last column of each table. However, even a small number of errors can cause problems for our users, and so we consider this checking system to be a valuable contribution to quality control in the GO dataset.
| Table 2. Numbers of annotation inconsistencies found by certain rules. |
| Table 3. Numbers of annotation inconsistencies found, classified by ontology. |
Inconsistencies found and fixed -- Electronic annotations
Automated pipelines can quickly produce large volumes of annotation for a diverse set of species. In situations where there is no funded manual annotation program such methods are extremely valuable, but generation methods must be strictly controlled to reduce production of incorrect annotations. A large proportion of the queries returned by this checking system were triggered by automatically generated annotation, and so we conclude that implementation of the system is a valuable contribution to quality control in this area.
As examples of this, two Drosophila IEA annotations to GO:0019684 'photosynthesis, light reaction' and eight annotations to GO:0009288 'bacterial-type flagellum' have been caught and removed, prompting a review of the FlyBase automatic annotation pipeline. These spurious annotations arose because of low probability matches between Drosophila proteins and short InterPro domains. The automated InterPro2GO pipeline mapped these false positive domain hits to GO classes. By increasing the stringency for InterPro domain to protein mapping, these taxon errors have been eliminated and the confidence level of all IEA-based GO assignments has improved in FlyBase.
Similarly, automatic transfers of annotations to orthologs needed to be further restricted when the class GO:0001701 'in utero embryonic development' was found to have been transferred from a mammalian gene product to an avian gene product by Ensembl Compara [3] for 144 annotations.
Inconsistencies found and fixed -- Manual annotation or ontology development
Excluding the viral EXP annotations, the majority (77%) of remaining inconsistencies found were derived from unvetted automated prediction programs, but errors were also found in experimentally derived and manually checked annotations. Some problems in manual annotations were found to have resulted from misunderstandings of the meanings of GO classes between the ontology editors who wrote the class definitions and the annotators who were using them. For example, the class 'sensory perception' was originally defined as 'The series of events required for an organism to receive a sensory stimulus, convert it to a molecular signal, and recognize and characterize the signal.'. To an annotator reading the class name and definition it would seem that this class could be used for annotation of bacterial gene products that enable the bacterium to sense and recognize outside influences. However, the GO class has in its ancestry the class 'cognition', indicating that this is a neurological process and therefore not suitable for annotation of bacterial gene products. To avoid future annotation errors, the definition was clarified by the addition of the sentence: 'This is a neurological process.'. The incorrect bacterial annotations were removed from the source database.
In some cases the class names and definitions can be quite subtle and gene products can accidentally be annotated to classes that are almost, but not quite correct. For example the fungal microtubule organizing center is called the 'spindle pole body', whilst in mammals the microtubule organizing center is called the 'centrosome'. In GO we have classes for 'centrosome organization' and for 'spindle pole body organization', and only fungal gene products should be annotated to the 'spindle pole body organization' class.
Application of a taxon constraint has enabled annotations applied to this class in error to be caught and corrected. Having caught this kind of error once, the ontology developers can improve the definition so that in future the meaning will be more apparent to annotators. This kind of check is particularly useful where a constraint has been applied to a fairly high-level class, showing up ambiguity and consequent errors in the use of the sub-types of the class. The advantage here is that all the sub-types do not need to be individually considered for application of constraints, but that they can be caught using a single high-level constraint.
A small number of other inconsistencies were found to have been brought about by typing errors in accession numbers, and these have been fixed.
Novel electronic annotations
In addition to preventing errors, the new system enables us to produce a large volume of new electronic annotations. In previous years many mappings have been omitted from the InterPro2GO mapping files, because they would not be applicable to all species. However, now such mappings can be used in conjunction with the taxon constraints to ensure that annotations are only transferred to gene products from appropriate species. The new combined system will enable generation of a very large body of novel electronic annotation.
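One way such a combination could work is sketched below in Python. It is an assumption-laden illustration rather than the InterPro2GO pipeline itself: the domain identifier "IPR_EXAMPLE" is a placeholder and not a real InterPro accession, the taxonomy is a toy fragment, the constraint is written directly on 'photosynthesis, light reaction' using the member groups of the union class described earlier, and the function names are hypothetical.

```python
# Toy taxonomy (child -> parent); simplified, not the NCBI hierarchy.
TAXON_PARENT = {
    "Drosophila melanogaster": "Metazoa",
    "Arabidopsis thaliana": "Viridiplantae",
}

# only_in_taxon constraints: GO class -> set of permitted taxonomic groups.
# Assumed here to be inherited from 'photosynthesis' in the text's example.
ONLY_IN_TAXON = {
    "photosynthesis, light reaction":
        {"Bacteria", "Archaea", "Viridiplantae", "Euglenozoa"},
}


def taxon_lineage(taxon):
    """Yield the taxon itself, then each of its ancestors."""
    while taxon is not None:
        yield taxon
        taxon = TAXON_PARENT.get(taxon)


def permitted(go_class, taxon):
    """True if the GO class is unconstrained or the taxon lies in a permitted group."""
    groups = ONLY_IN_TAXON.get(go_class)
    if groups is None:
        return True
    return any(t in groups for t in taxon_lineage(taxon))


def transfer_annotations(domain_hits, domain_to_go):
    """Create IEA annotations only where the taxon constraint is satisfied.

    domain_hits: iterable of (protein, taxon, domain) tuples.
    domain_to_go: mapping from domain identifier to GO class.
    """
    annotations = []
    for protein, taxon, domain in domain_hits:
        go_class = domain_to_go.get(domain)
        if go_class is not None and permitted(go_class, taxon):
            annotations.append((protein, go_class, "IEA"))
    return annotations
```

With this filter in place, a mapping that is valid only for some species no longer has to be left out of the mapping file: it simply produces no annotation for gene products from taxa outside the permitted groups.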