High-throughput experimental techniques such as DNA microarrays or proteomics are allowing researchers to study biologic systems from a global perspective. In many cases, the net result of these experiments is a large list of genes or proteins that are potentially interesting for the analyzed system, for example genes that are differentially expressed between normal and pathologic tissues. A logical further step in the analysis workflow is to translate such lists of significant genes into functional descriptors that help researchers to elucidate the biologic meaning of their experimental results.
Since Khatri and coworkers introduced Onto-Express [1], several methods have been proposed within this context, aimed at interpreting and extracting biologic knowledge from large lists of genes or proteins. Most of these applications find biologic annotations that are significantly enriched in a list of genes with respect to a reference set, usually the whole genome or those genes used in a microarray. Using a specific source of information, for example Gene Ontology (GO) [2], those tools first find all of the GO terms associated with the set of analyzed genes. The number of appearances of each term is then determined in the input and reference lists, and a statistical test (usually the hypergeometric, χ², binomial, or Fisher's exact test) is used to compute p values, which are subsequently adjusted for multiple testing. The result of this analysis is a list of single biological annotations from a given ontology (for instance, GO terms) with their corresponding p values. Those terms with p values indicating statistical significance are representative of the analyzed list of genes and can provide information about the underlying biologic processes. Good reviews of such methods are available elsewhere [3,4].
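The single-annotation procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of any particular tool: the function names, the dictionary-based annotation format, and the choice of the Benjamini-Hochberg procedure for the multiple-testing adjustment are assumptions made for the example. The hypergeometric upper tail is used as the test statistic, one of the options listed in the text.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which carry the annotation (hypergeometric upper tail)."""
    total = comb(N, n)
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p values (an assumed choice of correction)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def enrich(gene_list, reference, annotations):
    """Return {term: (raw_p, adjusted_p)} for every term found in gene_list.

    `annotations` is a hypothetical mapping gene -> set of terms.
    """
    reference = set(reference)
    gene_list = set(gene_list) & reference
    terms = sorted({t for g in gene_list for t in annotations.get(g, ())})
    raw = []
    for term in terms:
        # Count appearances of the term in the input and reference lists.
        k = sum(term in annotations.get(g, ()) for g in gene_list)
        K = sum(term in annotations.get(g, ()) for g in reference)
        raw.append(hypergeom_upper_tail(k, len(gene_list), K, len(reference)))
    return dict(zip(terms, zip(raw, benjamini_hochberg(raw))))
```

Each term is tested independently here, which is exactly the property discussed next: the output is a flat list of terms and p values, with no information about which terms co-occur in the same genes.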
Most of the currently available tools, however, are designed to evaluate single annotations, which means that they provide a list of annotations with their corresponding p values without taking into account the potential relationships among them. Finding relationships among annotations based on co-occurrence patterns can extend our understanding of the biologic events associated with a given experimental system. For example, a set of differentially expressed genes may be associated with the activation of biologic processes that are restricted to certain cellular organelles. Retrieving such associations provides meaningful and additional information for the interpretation of the experimental results.
In addition, the analysis of single annotations has limitations in some cases, which a hypothetical example using GO terms illustrates. Categories such as 'signal transduction', although related to concrete aspects of cell physiology, are associated with genes that are involved in disparate biologic processes, and such genes may therefore also be annotated with other terms such as 'cell proliferation' or 'apoptosis'. In this scenario, in a list of genes annotated as 'signal transduction' and 'cell proliferation', we may find that neither term is significant on its own, because a large number of the genes in the genome that belong to each of these categories are not included in the analyzed set. In contrast, the co-occurrence of both categories might be significant if most of the genes simultaneously annotated with both terms are included in the list. This co-occurrence information reveals that a significant proportion of genes in the set are involved in specific signaling pathways related to cell proliferation. Therefore, relevant associations might be underestimated if only single annotations are taken into account.
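The hypothetical case above can be made concrete with invented numbers. In the sketch below, every count is an assumption chosen purely for illustration (a genome of 10,000 genes, 2,000 genes per single category, 25 genes carrying both terms, and a list of 100 genes that contains 20 of those 25 and no other gene from either category). The hypergeometric upper tail is used as the test, and no multiple-testing adjustment is applied.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) for n draws without replacement from N genes, K annotated."""
    total = comb(N, n)
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


# Hypothetical counts, invented for this example only.
GENOME = 10_000          # genes in the reference set
LIST_SIZE = 100          # genes in the analyzed list
SIGNAL_GENOME = 2_000    # genome genes annotated 'signal transduction'
PROLIF_GENOME = 2_000    # genome genes annotated 'cell proliferation'
BOTH_GENOME = 25         # genome genes annotated with both terms
BOTH_LIST = 20           # list genes annotated with both terms

# The 20 co-annotated genes are the only list members carrying either term,
# so each single term also appears 20 times in the list.
p_signal = hypergeom_upper_tail(BOTH_LIST, LIST_SIZE, SIGNAL_GENOME, GENOME)
p_prolif = hypergeom_upper_tail(BOTH_LIST, LIST_SIZE, PROLIF_GENOME, GENOME)
p_both = hypergeom_upper_tail(BOTH_LIST, LIST_SIZE, BOTH_GENOME, GENOME)

print(f"signal transduction alone: p = {p_signal:.3g}")
print(f"cell proliferation alone:  p = {p_prolif:.3g}")
print(f"both terms together:       p = {p_both:.3g}")
```

With these counts, each single term appears in the list about as often as expected by chance (20 observed against 20 expected), so neither reaches significance, whereas the combination is observed 20 times against an expectation of 0.25 and is strongly enriched.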
These observations prompted us to develop GENECODIS, a web-based tool for finding sets of biological annotations that frequently appear together and are significant in a set of genes. It allows the integrated analysis of annotations from different sources (for example, KEGG pathways, Swiss-Prot keywords, GO, and InterPro motifs) and generates statistical rank scores for single annotations and their combinations. We believe that GENECODIS is an important extension of existing tools for the functional analysis of gene lists. GENECODIS is publicly available from the application website [5].