Gene Ontology (GO) provides a controlled vocabulary to describe the attributes of genes and gene products in any organism. Although one might initially wonder what relevance a ‘controlled vocabulary’ might have for cardiovascular science, such a resource is proving highly useful for researchers investigating complex cardiovascular disease phenotypes as well as those interpreting results from high-throughput methodologies. GO enables the current functional knowledge of individual genes to be used to annotate genomic or proteomic datasets. In this way, GO data provide a very effective way of linking biological knowledge with the analysis of the large datasets of post-genomics research. Consequently, users of high-throughput methodologies such as expression arrays or proteomics will be the main beneficiaries of such annotation sets. However, as GO annotations increase in quality and quantity, groups using small-scale approaches will gradually begin to benefit too. For example, genome-wide association scans for coronary heart disease are identifying novel genes with previously unknown connections to cardiovascular processes, and the comprehensive annotation of these novel genes might provide clues to their cardiovascular link. At least 4000 genes, to date, have been implicated in cardiovascular processes and an initiative is underway to focus on annotating these genes for the benefit of the cardiovascular community. In this article we review the current uses of Gene Ontology annotation to highlight why Gene Ontology should be of interest to all those involved in cardiovascular research.
Until recently, the study of specific pathways or individual molecules has been the major approach to understanding the intricate molecular and cellular details associated with cardiovascular processes and disease, with thousands of publications each year adding to our accumulated knowledge of these systems. However, genome-sequencing projects have led to the identification of thousands of genes in higher vertebrates, the majority of which are only characterised by their sequence and genomic location, with their potential involvement in cardiovascular systems awaiting experimental investigation. High-throughput methodologies, such as expression arrays or proteomics, are providing substantial information about the properties of these newly identified genes, through the detailed characterisation of the molecular composition of entire tissues, cells or organelles at specific developmental and disease states, or through protein binding or cellular location studies. Consequently, such investigations provide researchers with the potential to rapidly increase our understanding of complex interactions and biological functions within the cardiovascular system. However, integrating such high-throughput data with the detailed published experimental knowledge about the function of individual genes is essential to ensure that all experimental approaches make an impact on current research projects. Fortunately, the Gene Ontology Consortium (GOC) has been developing terms to describe the functional attributes of gene products, across all species, in a consistent and computer-friendly manner to enable the integration of all of these data. This system of terms, called Gene Ontology (GO), enables the accumulated knowledge about individual gene products and their functional domains to be included in individual gene records, in biological sequence databases, and within high-throughput analysis software.
This information can then be applied by high-throughput analysis software to aid in the interpretation of large datasets. By providing current functional knowledge in a format that can be exploited by high-throughput technologies, the GOC provides a major freely available public annotation resource that can help bridge the gap between data collation and data analysis (www.geneontology.org).
The success of GO rests on the philosophy behind it; GO was designed by biologists to improve data integration and consequently enables genes to be classified and grouped together according to their functional properties [2–4]. At times the English language can be rather vague, with the majority of words having a variety of subtly different meanings. Similarly, scientific terms or phrases can have dual meanings. Consequently, one of the primary aims of GO is to create a single, explicit definition for each biological term so that these terms can be applied and interpreted consistently by all biologists. All such terms are provided as three structured vocabularies of terms (ontologies) that describe the molecular functions that gene products normally carry out, the biological processes that gene products are involved in and lastly the subcellular locations (cellular components) where gene products are active. For example, the annotations for cholesteryl ester transfer protein (CETP) include the Molecular Function term: ‘cholesterol transporter activity’, the Biological Process term: ‘reverse cholesterol transport’ and the Cellular Component term: ‘high-density lipoprotein particle’; whereas the annotations for troponin C type 1 (TNNC1) include the Molecular Function term: ‘troponin I binding’, the Biological Process term: ‘regulation of muscle contraction’ and the Cellular Component term: ‘troponin complex’.
The terms in GO are structured as directed acyclic graphs, where each term can have multiple relationships to broader ‘parent’ and more specific ‘child’ terms (Fig. 1). This hierarchical structure produces a representation of biology that allows a greater amount of flexibility in data analysis than would be afforded by a format based on a simple list of terms. Users can manipulate the structure to see either a broad overview of the general functional attributes presented by a set of data, or focus in on specific sections in the ontology to investigate in greater detail.
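As a rough illustration of the graph structure described above, the following Python sketch stores a handful of terms together with their parents and collects every broader term reachable from a given term. The term names and parent links are simplified assumptions chosen for illustration; they are not copied from the ontology itself, and the function names are hypothetical.

```python
# Toy fragment of an ontology stored as: term -> set of parent terms.
# The names and links below are illustrative assumptions, not real GO content.
PARENTS = {
    "biological_process": set(),
    "transport": {"biological_process"},
    "lipid transport": {"transport"},
    "sterol transport": {"lipid transport"},
    "cholesterol transport": {"sterol transport"},
    "lipid homeostasis": {"biological_process"},
    # A term may have more than one parent, which is why the structure
    # is a directed acyclic graph rather than a simple tree.
    "reverse cholesterol transport": {"cholesterol transport", "lipid homeostasis"},
}


def ancestors(term, parents=PARENTS):
    """Return every broader term reachable from `term` through parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def children(term, parents=PARENTS):
    """Return the more specific terms that list `term` as a direct parent."""
    return {child for child, direct in parents.items() if term in direct}


if __name__ == "__main__":
    print(sorted(ancestors("reverse cholesterol transport")))
    print(sorted(children("transport")))
```

Walking up the parent links gives the broad overview of a dataset, while walking down to the children focuses on a specific section of the ontology.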
The second resource supplied by the GOC is a collection of datasets of GO terms associated with the appropriate genes and their products, thus providing diverse, detailed functional annotation for many different species (www.geneontology.org/GO.current.annotations.shtml). These annotations are created by 13 different annotation groups, including Gene Ontology Annotation @ EBI (GOA), FlyBase, and the Mouse Genome Database. Depending on the amount of published data available, gene/protein identifiers can be annotated with multiple GO terms from any, or all, of the three gene ontologies (Fig. 1). Annotations can be produced either by a curator reading published scientific papers and manually creating each association or by a software engineer applying computational techniques to predict associations. These two broad categories of techniques have their own advantages and disadvantages, but both require skilled biologists and software engineers to ensure that conservative, high-quality annotations are created. The annotation of each gene is therefore a potentially long, laborious process, which for a highly studied gene like TNF could take several days (Fig. 2) or, for a more recently described gene like CDKN2B, may only take a few hours (Fig. 1).
As the number of high-throughput methodologies has increased, so has the number of ways in which GO annotation data has been exploited to link experimental results to current functional knowledge.
Proteomes and differentially regulated mRNAs can be analysed with GO data to provide an overview of the predominant activities the constituent proteins are involved in or where they are normally located. For example, Ashley et al. used GO to compare the genes up-regulated in de novo atherosclerosis with those associated with in-stent restenosis. They found that a significant proportion of genes up-regulated following de novo atherosclerosis were associated with inflammatory processes, whereas a high proportion of in-stent restenosis up-regulated genes had GO terms indicating an involvement with cell growth and association with the extracellular matrix.
Often the generation of hypotheses to explain proteome-wide alterations in response to certain diseases, such as cardiac hypertrophy, or stress states, such as hypoxia, relies on the use of GO annotation data. In such studies an indication of underlying cellular mechanisms that may account for an observed phenotype can be obtained by using GO to cluster subsets of proteins that share related GO annotation and are found to be similarly over- or under-expressed in the disease or stress state. For example, Pan et al. found that over-expressed cardiac microsomal membrane proteins in mouse hyper- or hypocontractile hearts were enriched with GO terms describing fat and carbohydrate metabolism and G-protein-dependent signalling pathways. Enrichment of these GO terms validated the investigators' proteomic method and was consistent with the suggestion that the deregulation of calcium-dependent cardiac contractility resulted in compensatory growth activities.
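To make the idea of a GO term being 'enriched' in a protein subset more concrete, the sketch below computes a one-sided hypergeometric p-value, which is one common way of testing for over-representation. It is an assumption that this statistic suits a given dataset; the studies cited here may have used other methods, and the function name and the counts in the example are hypothetical. A real analysis would also correct for testing many GO terms at once.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, annotated_in_sample):
    """Probability of seeing at least `annotated_in_sample` annotated proteins
    when `sample` proteins are drawn at random, without replacement, from a
    background of `population` proteins of which `annotated` carry the term."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(annotated_in_sample, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical counts: 1000 proteins detected, 50 annotated with the term,
    # 40 proteins over-expressed, 8 of which carry the term.
    print(enrichment_p_value(1000, 50, 40, 8))
```

A small p-value suggests that the term occurs in the over-expressed subset more often than chance alone would explain, which is the sense in which a GO term is described as enriched.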
The ability to review experimental results with respect to known functional information has also proved useful when investigators need to select a subset of proteins to analyse in greater depth in order to identify new sets of biomarkers for a certain disease. This approach has enabled investigators of buccal carcinoma, Parkinson disease and chronic kidney disease to identify new biomarkers for these diseases. Furthermore, in all of these reports the enriched GO categories indicated disease-associated deregulated processes.
GO can also be used to provide a link between the protein binding network and the activities/locations of the participant proteins. Use of cellular component GO annotations can aid data visualisation or confirm whether a particular set of interactions is likely to occur in vivo. Dyer et al. used GO data to investigate interactions of human proteins with viral pathogens and found that many different pathogens target the same processes in the human cell, such as regulation of apoptosis, even though they may interact with different proteins. Similarly, many studies have focused on a ‘guilt-by-association’ hypothesis, where the involvement of proteins in a particular pathway can be hypothesised in relation to the processes their interacting proteins carry out. To this end GO annotations are integrated in the GEOMI and Cytoscape network visualization tools.
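The ‘guilt-by-association’ idea can be sketched in a few lines of Python: the process terms annotated to a protein's interaction partners are counted, and terms shared by several partners are proposed as hypotheses for the unannotated protein. The protein identifiers, the interaction list and the `min_support` threshold are all hypothetical, and this is a simplified illustration rather than the method used by GEOMI, Cytoscape or any of the cited studies.

```python
from collections import Counter

# Hypothetical interaction data: protein -> list of interaction partners.
INTERACTIONS = {
    "P_unknown": ["P1", "P2", "P3"],
}

# Hypothetical annotations: protein -> set of biological process terms.
ANNOTATIONS = {
    "P1": {"regulation of apoptosis"},
    "P2": {"regulation of apoptosis", "signal transduction"},
    "P3": {"lipid transport"},
}


def suggest_terms(protein, interactions, annotations, min_support=2):
    """Return terms annotated to at least `min_support` partners of `protein`,
    ordered from most to least frequently observed."""
    counts = Counter()
    for partner in interactions.get(protein, []):
        counts.update(annotations.get(partner, set()))
    return [term for term, n in counts.most_common() if n >= min_support]


if __name__ == "__main__":
    print(suggest_terms("P_unknown", INTERACTIONS, ANNOTATIONS))
```

Any term suggested this way is only a hypothesis and would still need experimental support before being recorded as an annotation.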
A number of proteomic investigations have found that GO data provides an indispensable resource to indicate the success of a particular subcellular enrichment strategy or large-scale confocal microscopy analyses [15–18]. Kislinger et al. used GO data to verify that their subcellular fractionation protocol efficiently isolated subcellular compartments. For example, of the nearly 600 proteins detected exclusively in the nuclear fractions, nearly half were either annotated solely to the nucleus or had a function known to be localized within the nucleus (e.g. transcription factors). Barbe et al. applied GO annotation to validate the protein subcellular locations identified using protein-specific antibodies and large-scale confocal microscopy analysis, and in this case 80% of the subcellular locations identified in human cell lines were supported by existing GO annotation data.
Despite the wide variety of applications that GO is used for, there are many aspects of biological processes that are not addressed by this database. In particular, GO only describes the normal, physiological function of a protein, rather than the pathological function of a protein in a diseased situation. Furthermore, the dynamic relationships between protein function, its cellular location (including its intracellular location, cell specificity and developmental specific expression) and how this relationship influences the biological processes a protein is involved in are not currently represented by GO. Protein interaction databases such as BioGrid, Biomolecular Interaction Network Database, Human Protein Reference Database and IntAct enable complex protein interaction networks to be investigated. However, at present there is no single database that enables complex biological relationships to be investigated. The content of the GOC database is influenced by the curation groups who are submitting data; therefore, some groups of organisms, such as viruses, are not well represented by this database. Details about some of the other available ontologies, such as cell type or human disease ontologies, are available at the Open Biomedical Ontologies web site (www.obofoundry.org).
A large range of applications have been developed specifically for the visualization of GO and its associated annotation data, and the computational and statistical analysis of large datasets with GO [4,23]. Currently there are 48 tools for gene expression and microarray analysis and 20 GO browsers, all of which are listed on the GOC tools web page (www.geneontology.org/GO.tools.shtml). In addition to their inclusion in dedicated GO browsers, GO annotations are also imported into many of the top biological databases, including UniProtKB, Ensembl, EntrezGene and GeneCards.
To enable the scientific community to effectively use the GO vocabularies and annotations, a number of web-based tools have been developed by both members of the GOC and third parties to search, browse and view the GO hierarchy and annotations (Table 1). Such browsers include AmiGO (amigo.geneontology.org/cgi-bin/amigo/go.cgi), QuickGO (www.ebi.ac.uk/ego) and the MGI GO browser (www.informatics.jax.org/searches/GO_form.shtml). Each of these browsers has its own unique features and varies slightly in the functionality available to users. All three of these browsers have a variety of hyperlinks from the protein records to the GO term records and to the appropriate publication references, and AmiGO and QuickGO both provide a variety of filtering options to modify the search query, enabling users to tailor their download set. In AmiGO, the GOC browser, the GO term records clearly show the number of proteins annotated with that term by all GOC databases; however, the protein records only include manual annotation data. The QuickGO browser, produced by the GOA group at the European Bioinformatics Institute (EBI), provides a simple view of the GO annotation data available, including data derived from both manual and electronic pipelines. Users can also tailor the annotation set displayed by specifying lists of sequence identifiers or GO terms, and use QuickGO to map between sequence identifier types (www.ebi.ac.uk/GOA/annotationexample.html). The MGI database is the primary source of mouse GO annotations, and its ‘Gene Ontology Classifications’ records provide three different views of the associated terms: a text summary, a tabulated list and a graphical view.
Alternatively, for high-throughput analysis tools, entire sets of annotations can be downloaded from the GOC (www.geneontology.org/GO.current.annotations.shtml) or EBI (ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/). Following the launch of the Cardiovascular GO Annotation Initiative (www.cardiovasculargeneontology.com), funded by the British Heart Foundation, a cardiovascular-specific dataset is now also available as a single download (ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/bhf-ucl/).
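Once an annotation set has been downloaded, it can be read with a few lines of code. The Python sketch below assumes the tab-delimited gene association file layout, in which comment lines begin with `!` and the third, fifth, seventh and ninth columns hold the gene symbol, GO identifier, evidence code and ontology aspect; the exact layout should be checked against the header of the file actually downloaded. The sample rows are invented, truncated to the first nine columns, and use made-up identifiers, and the function name is hypothetical.

```python
import csv
import io
from collections import defaultdict


def make_row(object_id, symbol, go_id, evidence, aspect):
    """Build an invented annotation row holding only the first nine columns."""
    return "\t".join(
        ["UniProtKB", object_id, symbol, "", go_id, "PMID:0000000", evidence, "", aspect]
    )


# Invented sample data; identifiers are placeholders, not real records.
SAMPLE = "\n".join(
    [
        "!gaf-version: 2.0",
        make_row("X00001", "GENE1", "GO:9999991", "IDA", "P"),
        make_row("X00001", "GENE1", "GO:9999992", "IEA", "C"),
        make_row("X00002", "GENE2", "GO:9999993", "TAS", "F"),
    ]
) + "\n"


def parse_annotations(handle, manual_only=False):
    """Group (GO id, aspect, evidence code) triples by gene symbol.

    With manual_only=True, rows carrying the electronic evidence code IEA
    are skipped so that only manually curated annotations remain."""
    by_gene = defaultdict(set)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("!"):
            continue  # skip blank lines and header comments
        symbol, go_id, evidence, aspect = row[2], row[4], row[6], row[8]
        if manual_only and evidence == "IEA":
            continue
        by_gene[symbol].add((go_id, aspect, evidence))
    return dict(by_gene)


if __name__ == "__main__":
    print(parse_annotations(io.StringIO(SAMPLE), manual_only=True))
```

To read a downloaded file instead of the sample, the same function can be given an open file handle in place of the `io.StringIO` object.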
Many tools have been developed to allow users to perform a bulk query of GO using a list of gene, protein or probe identifiers that have been identified from proteomics or other high-throughput experiments [4,23]. A full list of these tools is available at the GOC web site (www.geneontology.org/GO.tools.shtml#micro).
In theory, text-mining software can be used to extract Gene Ontology terms from the scientific literature and associate these with individual gene records [25,26]. However, the written style of scientific papers represents a substantial obstacle that needs to be overcome for such ‘automated’ methods to be successful. Papers are written as interesting text rather than as a list of results and conclusions. Consequently, a single function for one gene product may be discussed using a series of similar yet non-overlapping descriptions. The result is that although text mining is proving to be a useful curation aid for locating papers, it is unable to provide correct, detailed descriptions of the functions and processes with which a gene product is involved.
Manual gene annotation is an expensive and time-consuming alternative, but does ensure high-quality annotation is achieved. The most accurate, detailed source of functional information for annotation would ideally be provided by scientists working in relevant fields, rather than by a curator who may be unfamiliar with the specific details of a set of genes. However, as annotation guidelines tend to be highly detailed, and annotation a highly time-consuming activity for many bench scientists, bringing the research community closer to the annotation process is a difficult yet valuable goal for curation projects [28,29].
The Cardiovascular GO Annotation Initiative is prioritising the manual annotation of human genes implicated in cardiovascular processes. In order to do this, a list of over 4000 cardiovascular-associated genes has been compiled, drawing on the expertise of several advisors for the project (www.cardiovasculargeneontology.com). A concentrated effort to improve the information content provided in the manual GO annotation of genes involved in the cardiovascular system is now underway; this will ensure that a comprehensive, up-to-date summary of the current literature for each gene is available.
For the Cardiovascular GO Annotation Initiative to have a substantial impact in this area of biology, it is important that experts from the cardiovascular community are consulted to ensure that the current accumulated knowledge has been comprehensively reviewed and correctly summarised by the dedicated curation team. Consequently, a variety of online facilities have been put in place to encourage scientists to contribute to this initiative (Table 1). Scientists interested in contributing can either simply supply the curators with details of key experimental publications for curation or can provide more detailed information or commentary, such as reviewing particular annotation sets and pointing out any experimental data that might be missing, wrong or controversial. This information can be sent to GO curators by direct email to the GO Annotation teams based at UCL (GOannotation@ucl.ac.uk) or EBI (firstname.lastname@example.org), through the online user feedback form (www.ebi.ac.uk/GOA/contactus.html), or by editing the wiki pages set up for this purpose (wiki.geneontology.org/index.php/Cardiovascular). Submitted data is regularly reviewed and scientists are notified when their contributions have been incorporated.
The Cardiovascular GO Annotation Initiative is currently the only GO annotation project focused on a specific field of human biology. The analysis of the large datasets of post-genomics research offers the potential to rapidly advance our understanding of both the basic and clinical aspects of atherosclerosis. Hopefully the cardiovascular community will recognise the potential of this resource, actively contribute to it and reap the rewards in the near future!
Thanks to Varsha Khodiyar for critical reading of this manuscript. The Cardiovascular GO Annotation Initiative, funded by the British Heart Foundation (SP/07/007/23671) and the GOA Project, funded by a P41 grant from the National Human Genome Research Institute (NHGRI, grant HG002273) and the British Heart Foundation (SP/07/007/23671), are both GO Consortium members. GO annotations made by the authors are included in the GOC database.