Genes do not operate alone within the cell, but in an intricate network of interactions that we have only recently started to envisage [1]. It is widely accepted that coexpressed genes tend to fulfil common roles in the cell [4]. Moreover, coexpression seems to occur, in many cases, in contiguous chromosomal regions [6], and recent evidence suggests that functionally related genes map close together in the genome, even in higher eukaryotes [7]. Many higher-order levels of interaction are continuously being discovered, and even complex traits, including diseases, have started to be considered from a systems biology perspective [8]. In this scenario, a clear need exists for methods and tools that can help to understand large-scale experiments (microarrays, proteomics, etc.) and to formulate genome-scale hypotheses (evolution, architecture of the interactome, etc.) from a systems biology perspective [9]. Thus, the functional interpretation of genome-scale data must be undertaken within a systems biology framework, in which the collective properties of groups of functionally related genes are considered.
DNA microarray technology can be considered a paradigm among genome-scale experimental methodologies. Its extensive use has fuelled the development of tools for the functional interpretation of such experiments. These tools study the enrichment of functional terms in groups of genes defined by experimentally determined gene expression levels. Programs such as Onto-Express [10], FatiGO [11], GOMiner [12], etc., can be considered representatives of a family of methods designed for this purpose [13]. The difficulties in defining repeatable lists of genes of interest across laboratories and platforms using common experimental and statistical methods [15] have led several researchers to propose different approaches that aim to select blocks of genes with known common functional properties.
Thus, Gene Set Enrichment Analysis (GSEA) [16], although not free of criticism [18], pioneered a family of methods conceived to search for groups of functionally related genes with coordinated over- or under-expression across a list of genes ranked by their differential expression in microarray experiments. Different tests have recently been proposed for this purpose [19] and also for ESTs [25]. Nevertheless, it is surprising that, despite the abundance and availability of genome-scale data, the notion of testing entities more complex than single genes (such as blocks of functionally related genes) has not been applied in fields other than microarray data analysis. In fact, any genome-scale data in which some measurement is available for individual genes can be analysed in a conceptually similar way.
Here we formally present the FatiScan program, which implements a segmentation test [19] that allows the study of many relevant functional terms, including Gene Ontology (GO) [26], KEGG pathways [27] and many others, along with a sophisticated system for the visualisation of results. Although FatiScan has been mentioned in previous papers dealing with generalities of the GEPAS [28] and Babelomics [29] program packages, a proper detailed description of FatiScan and its possibilities has not been available to date. FatiScan can deal with ordered lists of genes independently of the nature of the experiment that originated the data or the method used to rank the genes. This property allows its application to other types of data apart from microarrays. We show how FatiScan can be applied to different genome-scale datasets, such as protein-protein interaction networks, or used to test functional evolutionary hypotheses. We also show how conclusions on the molecular roles fulfilled by the genes can be reached by taking into account the functional interplay of genes in the cell as defined by their shared biological properties.
Threshold-based functional profiling
The interpretation of genome-scale data is usually performed in two steps. First, genes of interest are selected (for example, in microarray experiments, because they are significantly over- or under-expressed when two classes of experiments are compared). Then, the enrichment of any type of biologically relevant term in these genes with respect to a background (typically the rest of the genes) is studied. In the active field of microarray data analysis, different tools are available, such as Onto-Express [10], FatiGO [11] and others [13], that use functionally relevant terms taken from different curated repositories (GO [26], KEGG pathways [27], etc.). It has been noted that this strategy causes an enormous loss of information due to the large number of false negatives that are accepted in order to preserve a low ratio of false positives (and the noisier the data, the worse the effect) [16].
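The two-step procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the enrichment statistic is taken to be a one-sided Fisher's exact test (a hypergeometric tail), the selection rule is an absolute cut-off on a per-gene score, and the names `hypergeom_upper_tail` and `threshold_enrichment` are hypothetical. The tools cited in the text may use different statistics and selection rules.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which are annotated with the functional term."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)


def threshold_enrichment(scores, annotated, threshold):
    """Step 1: select genes whose |score| reaches the threshold.
    Step 2: test whether the term is over-represented among them,
    using the full gene list as the sampling universe.

    scores    -- dict mapping gene name to a per-gene statistic (assumed input)
    annotated -- set of gene names carrying the functional term
    Returns (annotated genes selected, genes selected, p-value).
    """
    selected = {g for g, s in scores.items() if abs(s) >= threshold}
    k = len(selected & annotated)
    K = len(annotated & set(scores))
    p = hypergeom_upper_tail(k, len(selected), K, len(scores))
    return k, len(selected), p


# Toy data: three strongly changed genes, all carrying the same term.
toy_scores = {"g0": 3.1, "g1": 2.8, "g2": 2.5, "g3": 0.4, "g4": -0.2,
              "g5": 0.1, "g6": -0.5, "g7": 0.3, "g8": -0.1, "g9": 0.2}
toy_term = {"g0", "g1", "g2"}
hits, n_selected, p_value = threshold_enrichment(toy_scores, toy_term, 2.0)
```

Note that the result depends entirely on the chosen threshold: genes of the same functional class that fall just below it contribute nothing to the test, which is the loss of information referred to above.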
Threshold-free functional profiling
From a systems biology perspective, a threshold-based approach to understanding the molecular basis of a genome-scale experiment is far from efficient. Methods that draw inspiration from systems biology focus on functional classes, that is, blocks of genes that act cooperatively, rather than on single entities such as genes. These strategies use lists of genes ranked by any biological criterion (e.g. differential expression when comparing cases and healthy controls, genes with different evolutionary rates, etc.) and directly examine the distribution of blocks of functionally related genes across such a list [16]. Any macroscopic observation that causes this ranking in the list of genes will be a consequence of the cooperative action of genes arranged into functional classes (GO, pathways, etc.). Each functional class "responsible" for the macroscopic observation will, consequently, be found at the extremes of the ranking with the highest probability. Figure 1 illustrates this concept. Suppose that a list of genes is ranked by differential expression between two experimental conditions (A and B in the figure). If the position of the genes belonging to different functional classes is studied (columns 1, 2 and 3 in Figure 1), it is evident that the functional class represented in the first column is completely uncorrelated with the arrangement, while the other two are clearly associated with high expression in experimental conditions B and A, respectively. If, for example, the two experimental conditions were diseased cases versus healthy controls, column 1 could correspond to a functional class related to housekeeping processes. Consequently, the genes corresponding to this functional class would be active in both conditions (healthy and diseased) and would be scattered across the list. Conversely, columns 2 and 3 would correspond to biological processes much more active in diseased cases (B) or in healthy controls (A), respectively. If thresholds were imposed to select differentially expressed genes (dotted lines in Figure 1), and the genes beyond these thresholds were compared to the rest for enrichment in these functional classes, the chance of finding a significant enrichment in this pre-selection of genes would be much lower, if not null. The imposition of a prior threshold based on experimental values that ignores the cooperation among genes is thus avoided under this threshold-free perspective.
Figure 1. Threshold-free functional analysis. A list of genes is ranked by their differential expression between two experimental conditions (A and B) using, for example, a t-test which is applied individually to each gene. Columns 1, 2 and 3 represent the position ...
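The threshold-free idea can be illustrated with a small scan over a ranked gene list. The sketch below is not FatiScan's implementation, whose details are not given in this section. It assumes evenly spaced cut points, a one-sided hypergeometric test at each cut, and no correction for multiple testing, and the function names are hypothetical. At every cut it asks whether the genes of one functional class are concentrated in the upper or the lower part of the ranking.

```python
from math import comb


def upper_tail(k, n, K, N):
    """P(X >= k) for a hypergeometric draw of n genes out of N,
    K of which belong to the functional class."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)


def segmentation_scan(ranked_genes, annotated, n_partitions=4):
    """Test one functional class at evenly spaced cuts of a ranked list.

    ranked_genes -- gene names ordered by any criterion (assumed input)
    annotated    -- set of gene names belonging to the functional class
    Returns one (cut_index, side, p_value) tuple per cut, where side is
    the end of the list in which the class is more concentrated.
    """
    N = len(ranked_genes)
    flags = [g in annotated for g in ranked_genes]
    K = sum(flags)
    results = []
    for j in range(1, n_partitions + 1):
        cut = round(j * N / (n_partitions + 1))
        k_top = sum(flags[:cut])
        p_top = upper_tail(k_top, cut, K, N)
        p_bottom = upper_tail(K - k_top, N - cut, K, N)
        side, p = ("top", p_top) if p_top <= p_bottom else ("bottom", p_bottom)
        results.append((cut, side, p))
    return results


# Toy ranking of 12 genes; one class clustered at the top of the list,
# one scattered across it (in the manner of column 1 in Figure 1).
ranking = [f"g{i}" for i in range(12)]
clustered = segmentation_scan(ranking, {"g0", "g1", "g2"}, n_partitions=3)
scattered = segmentation_scan(ranking, {"g1", "g5", "g10"}, n_partitions=3)
```

Because no gene is discarded before testing, a class whose members are shifted towards one end of the ranking can be detected even when few of them would individually pass a selection threshold. In a real analysis the p-values would need adjusting for the number of classes and cuts tested.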