The number of associations in the GO has grown exponentially since its inception. There were 30 654 associations on 1 July 2000 and 7 954 associations on 1 July 2003 [6]. This number had grown to more than 16 million in 2007 [8] and more than 55 million in 2010. Due to the inference methods used, most of the growth has been from IEA associations. In contrast, the curated associations component has only grown linearly. The ontology itself has also been steadily growing, from fewer than 5000 terms in 2000 [6] to more than 30 000 in 2010. The Reference Genome Project was initiated to focus the annotation efforts of various groups on a number of predetermined homologous genes [10]. This will not only help in seeding the ontology; the concentrated effort on certain branches should also improve the overall structure of the ontology.
One shortcoming of the GO is that annotations only describe the normal, healthy functioning of genes [12]. In addition, data on functional coordination between multi-function genes are not explicitly stored [31]. Another shortcoming is that until recently no relationships between the three ontologies were recorded [7]. Although inter-ontology relationships are now recorded, they appear only in the full GO, which not all analysis tools use, so two versions of the GO have to be maintained.
The structure of the GO is predominantly the result of painstaking manual curation over the past 10 years. Through many additions and changes the GO has grown to be quite large, and in many cases the structure is no longer optimal. More specific subsets are available, in the form of a prokaryote subset and GO slims. Although quite a large number of GO slims are available on the GO website, only seven of them are actively maintained. Of the seven GO slims maintained by the GO consortium, two are for specific organisms (Schizosaccharomyces pombe and Candida albicans), two are for broader classes of organisms (the Yeast and Plant slims) and one is a generic GO slim. In addition, there are the UniProtKB-GOA and whole proteome analysis slim and the Protein Information Resource slim. These GO slims are included as part of the GO flat file, but can also be downloaded individually from the website. The manual creation of GO slims is a painstaking process, as the information loss from both the graph structure and the gene-product annotation needs to be minimized [53]. A recent paper discusses the automatic creation of GO slims based on an information theoretic approach [53]. The analysis in that paper shows that the terms chosen for inclusion in existing GO slims are not always ideal and are often subject to bias. Recently, researchers have also used techniques from information theory to automatically organize and optimize the structure of the GO [54]. It is likely that in the future such approaches will be used more frequently for the construction and curation of both the full GO and GO slims.
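To make the notion of information loss in a GO slim concrete, the following minimal sketch maps terms of a toy hierarchy onto a toy slim by climbing parent links until a slim term is reached. The term names, the hierarchy and the function `map_to_slim` are hypothetical illustrations; this is neither the procedure used by the GO consortium nor the information theoretic approach of [53].

```python
# Toy hierarchy (child -> parents). All term names are hypothetical.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "B1": ["B"],
}

# Toy slim: a hand-picked subset of the terms above (an assumption).
SLIM = {"root", "A"}


def map_to_slim(term, parents, slim):
    """Return the slim terms reached first when climbing upwards from `term`.

    Simplification: with several paths to the root, the result may contain
    a slim term that is an ancestor of another returned slim term.
    """
    found = set()
    seen = set()
    stack = [term]
    while stack:
        t = stack.pop()
        if t in seen:
            continue
        seen.add(t)
        if t in slim:
            found.add(t)  # stop climbing along this path
        else:
            stack.extend(parents[t])
    return found


# Map every term of the toy hierarchy onto the toy slim.
mapped = {t: map_to_slim(t, PARENTS, SLIM) for t in PARENTS}
```

In this toy case both B and B1 collapse onto the root, so the distinction between them is lost, whereas A1 retains the more specific slim term A. Choosing slim terms so that such collapses cost as little as possible is what makes manual slim construction laborious.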
A number of other ontologies and schemes for cataloguing genes are available to researchers. In order to centralize the data, projects have been initiated to clean up and integrate ontologies [4]. The most important example is the Open Biomedical Ontologies (OBO) group, which seeks to standardize bio-ontologies and is guided by a set of principles similar to the ones the GO was built upon [4]. As part of its efforts the OBO developed the OBO biological ontology file format for specifying ontologies. These efforts also include the OBO Foundry, a group devoted to the integration of ontologies according to the OBO principles. This group is also concerned with removing redundant ontologies and aligning the development of ontologies by separate communities. An important tool in the standardization of ontologies is the OBO-Edit ontology editor (www.obo-edit.org), which is developed and maintained by the GO consortium.
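As an illustration of the OBO file format, the sketch below reads `[Term]` stanzas from a small embedded example and collects the `id`, `name` and `is_a` tags. The identifiers with the `TOY:` prefix and the function `parse_obo_terms` are hypothetical, and the parser deliberately ignores most of the format (other stanza types, escaping, trailing modifiers), so it should be read as a sketch rather than a complete reader.

```python
# A tiny OBO-style document; the TOY identifiers are made up for illustration.
OBO_TEXT = """\
format-version: 1.2

[Term]
id: TOY:0000001
name: toy root term

[Term]
id: TOY:0000002
name: toy child term
is_a: TOY:0000001 ! toy root term
"""


def parse_obo_terms(text):
    """Collect id, name and is_a tags from the [Term] stanzas of `text`."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are collected.
            current = {"is_a": []} if line == "[Term]" else None
            continue
        if current is None or ":" not in line:
            continue  # header lines, blank lines and non-Term stanzas
        tag, _, value = line.partition(":")
        value = value.strip()
        if tag == "id":
            current["id"] = value
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            # Drop the trailing "! comment" that repeats the parent's name.
            current["is_a"].append(value.split("!", 1)[0].strip())
    return terms


terms = parse_obo_terms(OBO_TEXT)
```

Each stanza becomes a dictionary keyed by its identifier, and the `is_a` lists give the child-to-parent links from which the graph structure of an ontology can be rebuilt.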
Linking ontologies will increase their usefulness and power, but will also create many more pitfalls for inexperienced users. Probably the most challenging aspects will be integrating associations made from different types of evidence and blending the contents of the different ontologies to give maximal information while still remaining clear and concise. These steps will be necessary to ensure that both inter- and intra-ontology comparisons return meaningful results.
- The GO is a structured and controlled vocabulary of terms and relationships for cataloguing gene function.
- Annotations in the GO can be experimentally or computationally derived; different classes of annotations have different levels of confidence.
- The vast majority of annotations in the GO are automatically inferred and not curated.
- Terms in the GO can be compared based on their information content, which is inversely related to the probability of a term.
- Genes can be compared based on the terms that they are annotated with in the GO.
- The GO is a powerful tool for data analysis, but its usage is fraught with pitfalls for inexperienced users, which could lead to false conclusions being drawn.
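
The points on information content and gene comparison can be illustrated with a minimal sketch. It assumes the commonly used definition IC(t) = -log2 p(t), where p(t) is the fraction of genes annotated with t or one of its descendants, and it uses Resnik's measure (the information content of the most informative common ancestor) as one of several possible term similarities. The hierarchy, the gene names and the annotations are hypothetical toy data.

```python
import math
from collections import Counter

# Toy hierarchy (child -> parents) and annotations; all names are hypothetical.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "A2": ["A"],
}
ANNOTATIONS = {
    "gene1": {"A1"},
    "gene2": {"A2"},
    "gene3": {"B"},
    "gene4": {"A1", "B"},
}


def ancestors(term):
    """Return `term` together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS[t])
    return seen


def term_probabilities(annotations):
    """p(t): fraction of genes annotated with t or any descendant of t."""
    counts = Counter()
    for terms in annotations.values():
        propagated = set()
        for t in terms:
            propagated |= ancestors(t)  # annotations propagate upwards
        counts.update(propagated)
    n = len(annotations)
    return {t: c / n for t, c in counts.items()}


probs = term_probabilities(ANNOTATIONS)
# Information content: rare (specific) terms score high, the root scores 0.
ic = {t: -math.log2(p) for t, p in probs.items()}


def resnik(t1, t2):
    """IC of the most informative common ancestor of two terms."""
    return max(ic[t] for t in ancestors(t1) & ancestors(t2))


def gene_similarity(g1, g2):
    """Best-scoring term pair between two genes (one of many possible choices)."""
    return max(resnik(a, b) for a in ANNOTATIONS[g1] for b in ANNOTATIONS[g2])
```

In this toy data set the root term is attached to every gene and therefore carries no information, whereas the term used by a single gene out of four has an information content of 2 bits. Two genes whose terms share only the root as a common ancestor receive a similarity of zero.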