Each of these tools uses one or more annotation databases to create a list of the functional categories in which the genes from the input list are known to be involved. The functional categories that are over-represented in a statistically significant way in the list of differentially regulated genes are inferred to be meaningfully related to the condition under study. However, this approach of translating a list of differentially expressed genes into a list of functional categories using annotation databases suffers from a few important limitations. Since these limitations are related to the approach itself, all current tools exhibit them.
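As an illustration of the significance analysis described above, the following minimal sketch computes an over-representation p-value for a single category using the hypergeometric distribution. This is only one common choice of statistical model, not necessarily the one used by any particular tool; the function name and all numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(N, M, n, k):
    """Return P(X >= k) for a hypergeometric draw.

    N: number of genes on the array (reference set)
    M: number of reference genes annotated to the category
    n: number of differentially expressed genes
    k: number of differentially expressed genes annotated to the category
    """
    upper = min(M, n)  # the overlap cannot exceed either set size
    total = comb(N, n)  # number of possible gene lists of size n
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical numbers: 10,000 genes on the array, 100 annotated to the
# category, 200 differentially expressed, 8 of which carry the annotation.
p = hypergeom_enrichment_pvalue(N=10_000, M=100, n=200, k=8)
print(f"p = {p:.3g}")
```

With these numbers only about two annotated genes would be expected by chance, so observing eight yields a small p-value. In practice such a test is repeated for every category, and a correction for multiple comparisons is then needed.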
Firstly, the existing annotation databases are incomplete. For virtually all sequenced organisms, only a subset of the known genes is functionally annotated (King et al., 2003). Furthermore, most annotation databases are built by curators who manually review the existing literature. Although unlikely, it is possible that certain known facts are temporarily overlooked. For instance, we found references in literature published in the early 1990s for 65 functional annotations that are not yet included in the current functional annotation databases. As an example, the gene HMOX2 was shown to be involved in the process of pigment biosynthesis in 1992 (McCoubrey et al., 1992) and is still not annotated as such today. More commonly, recent annotations are not yet in the databases because of the time lag inherent in the manual curation process.
Secondly, certain pieces of information may be imprecise or incorrect. In GO, out of 19 490 total biological process annotations available for Homo sapiens, 11 434 associations are inferred exclusively from electronic annotations (i.e. without any expert human involvement) (http://www.geneontology.org/GO.current.annotations.shtml). The vast majority of such electronic annotations are reasonably accurate (Camon et al., 2005). However, many of them are made at very high-level GO terms, which limits their usefulness. Furthermore, some of these inferences are incorrect (King et al., 2003; Wang et al., 2004). Even though in some cases the error is conspicuous to a human expert, there are currently no automated techniques that can analyze, discover and correct such erroneous assignments. At present, none of the tools allows any weighting by the type of evidence, which is a limitation since experimentally derived annotations are more trustworthy than electronically inferred ones.
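To make the idea of weighting by evidence type concrete, the sketch below accumulates, for each GO term, a support score in which every gene contributes the weight of its most trustworthy annotation. The evidence codes are standard GO codes, but the numeric weights, the function name and the toy gene and term labels are assumptions chosen purely for illustration; no existing tool is being described.

```python
# Hypothetical weights per GO evidence code; the values are illustrative only.
EVIDENCE_WEIGHT = {"IDA": 1.0, "IMP": 1.0, "TAS": 0.8, "ISS": 0.5, "IEA": 0.2}
DEFAULT_WEIGHT = 0.5  # assumed weight for evidence codes not listed above


def weighted_category_support(annotations, gene_list):
    """Sum, per GO term, the best evidence weight contributed by each gene.

    annotations: iterable of (gene, go_term, evidence_code) tuples.
    gene_list: genes of interest (e.g. the differentially expressed genes).
    """
    genes = set(gene_list)
    best = {}  # (go_term, gene) -> highest weight seen for that pair
    for gene, term, code in annotations:
        if gene not in genes:
            continue
        weight = EVIDENCE_WEIGHT.get(code, DEFAULT_WEIGHT)
        key = (term, gene)
        best[key] = max(best.get(key, 0.0), weight)
    support = {}
    for (term, _gene), weight in best.items():
        support[term] = support.get(term, 0.0) + weight
    return support


# Toy data with made-up gene and term labels.
annotations = [
    ("geneA", "term_1", "IDA"),
    ("geneA", "term_1", "IEA"),  # duplicate pair: only the best weight counts
    ("geneB", "term_1", "IEA"),
    ("geneC", "term_2", "IEA"),
    ("geneD", "term_2", "IDA"),  # geneD is not in the list, so it is ignored
]
support = weighted_category_support(annotations, ["geneA", "geneB", "geneC"])
print(support)
```

Here term_1 is supported by one experimental and one electronic annotation, whereas term_2 rests on a single electronic annotation; an unweighted count would treat each annotated gene identically.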
The current approach used for ontological analysis is limited to looking up existing annotations and performing a significance analysis for the categories found. This approach cannot discover previously unknown functions for known genes, even if there is data justifying such inferences. For example, the gene SLC13A2 [solute carrier family 13 (sodium-dependent dicarboxylate transporter), member 2 (H.sapiens)] encodes the human Na(+)-coupled citrate transporter and is annotated in GO for the molecular function organic anion transporter activity. However, it is not annotated for the corresponding biological process, organic anion transport. This is not a problem for the curator or for a human expert querying GO for this specific gene: for them, it is obvious that a gene that has organic anion transporter activity will be involved in organic anion transport. However, a query that tries to find all genes involved in the process of organic anion transport will fail to retrieve this gene. Similarly, any ontological analysis software trying to determine which underlying processes are represented by a list of genes containing this gene will either fail to consider organic anion transport, if no other genes in the list are involved in it, or will calculate its statistical significance incorrectly by ignoring this gene.
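The SLC13A2 example above suggests one simple form of inference: extending process annotations with processes implied by function annotations. The sketch below assumes a hand-made function-to-process table (hypothetical; in practice such links would require expert curation) and a simplified annotation record in which SLC13A2 has no process terms at all.

```python
# Hypothetical table linking a molecular function to the biological process
# it implies. Only the pair discussed in the text is included.
FUNCTION_IMPLIES_PROCESS = {
    "organic anion transporter activity": "organic anion transport",
}


def infer_process_annotations(function_annots, process_annots):
    """Return process annotations extended with function-implied processes.

    Both arguments map a gene symbol to a set of term names.
    The input dictionaries are not modified.
    """
    extended = {gene: set(terms) for gene, terms in process_annots.items()}
    for gene, functions in function_annots.items():
        for function in functions:
            implied = FUNCTION_IMPLIES_PROCESS.get(function)
            if implied is not None:
                extended.setdefault(gene, set()).add(implied)
    return extended


# Simplified toy record for the gene discussed in the text.
function_annots = {"SLC13A2": {"organic anion transporter activity"}}
process_annots = {"SLC13A2": set()}

extended = infer_process_annotations(function_annots, process_annots)
print(extended["SLC13A2"])
```

After this step, a query for all genes involved in organic anion transport would retrieve SLC13A2, and the gene would be counted when the significance of that process is calculated.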
Another limitation is related to genes that are involved in several biological processes. For such genes, all current tools weight all the biological processes equally. At the moment, it is not possible to single out the most relevant process by using the context provided by the other genes differentially expressed in the current experiment. BRCA1, for instance, is a well-known tumor suppressor but is also known to be involved in carbohydrate metabolism. If most other genes found to be changed in the current experiment are involved in processes such as DNA damage response, apoptosis, induction of apoptosis and signal transduction, it is perhaps more likely that in this experiment BRCA1 is playing its usual tumor suppressor role. However, if most other genes are involved in carbohydrate-mediated signaling, carbohydrate transport and metabolism, etc., then it is perhaps more likely that BRCA1's role in carbohydrate metabolism is the more relevant one.
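One naive way to use such context, sketched below, is to weight each process of a multi-process gene by the number of other genes in the list that share it. The scoring rule (add-one smoothing followed by normalization), the function name and the placeholder gene names g1 to g3 are assumptions made for illustration; this is not a method implemented by any of the tools discussed.

```python
from collections import Counter


def context_weights(gene, annotations, gene_list):
    """Weight each process of `gene` by how many other listed genes share it.

    annotations: dict mapping a gene to the set of its process names.
    Returns a dict mapping process -> weight, with weights summing to 1.
    """
    others = [g for g in gene_list if g != gene]
    counts = Counter(p for g in others for p in annotations.get(g, ()))
    processes = annotations.get(gene, set())
    # Add-one smoothing: a process with no supporting genes keeps a small weight.
    raw = {p: counts[p] + 1 for p in processes}
    total = sum(raw.values())
    return {p: value / total for p, value in raw.items()}


# Toy experiment: g1, g2 and g3 are placeholder names for other changed genes.
annotations = {
    "BRCA1": {"DNA damage response", "carbohydrate metabolism"},
    "g1": {"DNA damage response"},
    "g2": {"DNA damage response", "apoptosis"},
    "g3": {"DNA damage response"},
}
weights = context_weights("BRCA1", annotations, ["BRCA1", "g1", "g2", "g3"])
print(weights)
```

In this toy list the DNA damage response receives most of the weight for BRCA1; a list dominated by carbohydrate-related genes would shift the weight in the other direction.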
The existing GO-based functional profiling approaches are currently decoupled from the gene expression data obtained from the microarray experiment in the previous step. In any given biological phenomenon, different genes are regulated to different extents. Information about the amount of regulation of one gene versus another can be useful in assigning different weights to the biological processes in which those genes are involved and hence can help in inferring whether one biological process is more relevant than the other(s).
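As a rough sketch of how expression data could enter the analysis, the example below scores each process by the summed absolute log2 fold change of its genes, so that strongly regulated genes contribute more than weakly regulated ones. The scoring rule, the function name and the toy values are assumptions for illustration only.

```python
def expression_weighted_scores(log2_fold_changes, annotations):
    """Sum |log2 fold change| over the genes annotated to each process.

    log2_fold_changes: dict mapping a gene to its log2 fold change.
    annotations: dict mapping a gene to the set of its process names.
    """
    scores = {}
    for gene, lfc in log2_fold_changes.items():
        for process in annotations.get(gene, ()):
            scores[process] = scores.get(process, 0.0) + abs(lfc)
    return scores


# Toy data with placeholder gene and process names.
log2_fold_changes = {"g1": 2.0, "g2": -1.0, "g3": 0.5}
annotations = {"g1": {"process_A"}, "g2": {"process_A", "process_B"}, "g3": {"process_B"}}

scores = expression_weighted_scores(log2_fold_changes, annotations)
print(scores)
```

Both processes contain two genes, so a count-based analysis cannot separate them, whereas the expression-weighted score favors process_A.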
The usefulness of the existing functional profiling approaches is also affected by the annotation bias present in the ontological annotation databases. Some biological processes (e.g. apoptosis) are studied in more detail than others, thus generating more data. If more data about a specific biological process are available, more of the genes associated with it will be known, and hence the process is more likely to appear as significant than others.
An important issue related to ontological analysis is the name–space mapping from one resource to another. At the moment, the existing knowledge about known genes is spread out over a number of databases and other resources. Different databases are maintained by independent groups that often have very different interests and research foci. Each such resource often uses its own type of identifier. For instance, GenBank uses accession numbers, UniGene uses cluster identifiers (IDs), Entrez Gene uses gene IDs, SWISSPROT uses protein IDs, TrEMBL uses accession numbers, etc. Furthermore, genes are also represented by various company-specific gene IDs. A typical example is Affymetrix, which uses its own probe IDs to represent various genes. Various resources try to address the problem by maintaining other types of IDs together with their own and by providing ad hoc tools able to map from one type of ID to another. For instance, besides its own gene names, the Entrez Gene database also contains UniGene cluster IDs, and Affymetrix's NetAffx provides RefSeq and GenBank accession numbers besides its own array-specific probe IDs. To illustrate, the gene beta actin in mouse is referred to as MGI:87904 in Mouse Genome Informatics (MGI), Actb (Gene ID: 11461) in Entrez Gene, Mm.297 in UniGene, ACTB_MOUSE (primary accession number: P60710) in UniProt, and TC1242885 in the TIGR gene index. In addition, the beta actin gene in mouse is referred to by 29 mRNA sequences and 4552 ESTs in dbEST, 5 secondary accession numbers in UniProt, 4 other accession IDs in MGI, and 5 probe IDs on 4 different Affymetrix mouse arrays. The burden of mapping the various types of IDs onto each other is left entirely on the shoulders of the researchers, who often have to resort to cutting-and-pasting lists of IDs from one database to another.
The name–space issue becomes crucial when trying to translate from lists of differentially regulated genes to functional profiles because the mapping from one type of identifier to another is not one-to-one. Consequently, the type of IDs used to specify the list of differentially regulated genes can potentially affect the results of the analysis (Drăghici, 2003; Khatri et al., 2004). While GO represents a viable, long-term solution to the problem of inconsistent vocabulary, the name–space problem is yet to be solved.
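The effect of a mapping that is not one-to-one can be seen in the small sketch below, which translates a list of probe IDs into gene IDs and reports the probes that could not be mapped or that map to more than one gene. All identifiers and the mapping table are hypothetical; real mappings would come from resources such as those named above.

```python
def map_ids(ids, mapping):
    """Translate IDs through a one-to-many mapping.

    mapping: dict from a source ID to the set of target IDs it maps to.
    Returns (mapped target IDs, unmapped source IDs, ambiguous source IDs).
    """
    mapped, unmapped, ambiguous = set(), [], []
    for identifier in ids:
        targets = mapping.get(identifier, set())
        if not targets:
            unmapped.append(identifier)
            continue
        if len(targets) > 1:
            ambiguous.append(identifier)
        mapped.update(targets)
    return mapped, unmapped, ambiguous


# Hypothetical probe-to-gene table: two probes share a gene, one probe maps
# to two genes, and one probe has no known mapping.
probe_to_gene = {
    "probe_a": {"gene_1"},
    "probe_b": {"gene_1"},
    "probe_c": {"gene_2", "gene_3"},
}
probes = ["probe_a", "probe_b", "probe_c", "probe_d"]

genes, unmapped, ambiguous = map_ids(probes, probe_to_gene)
print(len(probes), "probes ->", len(genes), "genes")
print("unmapped:", unmapped, "ambiguous:", ambiguous)
```

Four input probes become three genes, one of which is supported by two probes and two of which stem from a single ambiguous probe. Any category count, and therefore any significance value, depends on whether it is computed at the probe level or at the gene level.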