Functional annotations, such as GO or KEGG pathways, have been used for the definition of modules of genes in functional enrichment methods [
1,
9,
10]. The detection of such functional modules within lists of genes by means of different tests relies upon the implicit assumption that common functionality implies a high degree of co-expression among all the members of each module [
25]. While this assumption can be considered true as a general observation, it does not necessarily imply that the conventional definitions of functional classes used for this purpose (GO, KEGG, etc.) do all correspond to co-expressing sets of genes. It was previously reported that a large number of functional modules showed a low degree of internal co-expression, contradicting thus the expected cooperation among the genes to carry out their functions together [
17,
18]. Despite this observation, a systematic study on the degree of internal co-expression of the most commonly used functional modules and the impact of this bias on real biological data has not been carried out to date. Here we aimed a redefinition of functional modules, understood as groups of genes carrying out, cooperatively, a function in the cell. It is widely recognized that the biological circumstance of coexpression of two genes is properly defined by the coefficient of correlation among them [
26]. So, we use it here to measure gene coordinate activity within a functional module. In this paper we present a general methodology to quantify the strength of the internal correlation of a functional module and we propose a simple way of using this information for functional profiling purposes that allows finding functional modules activated or deactivated that would remain otherwise unnoticed.
We have derived the correlation structure of the largest possible fraction of the human transcriptome, estimating its parameters from measurements from 3034 DNA microarrays stored in public data repositories. One of the strengths of the present study is, precisely, the big sample size (especially large if the difficulties in finding comparable microarrays in the databases are considered [
27]) on which all estimations relay on. Of not less importance is the wide range of biological conditions considered in the study which includes several types of normal tissues, different kinds of cancer cells, male and female individuals as well as different cell lines. In order to ensure as much as possible the compatibility of the data gathered for the analysis, we have used one of the more extensively used expression arrays currently available (Affymetrix HG U133 Plus 2.0). For the same reasons, we have only collected datasets for which raw data were available so we could normalize and pre-process all of them together with the same method. This collection of samples constitutes a large dataset that allows us to perform a robust profiling of a large fraction of the human transcriptome, covering an ample spectrum of clinically and biologically relevant conditions.
The correlation structure of the transcriptome has been used to derive a coherence score which measures the internal co-expression of 173 KEGG pathways and 2221 GO terms. Our estimations indicate that only 57% of the KEGG pathways and just 32% of the GO terms can be considered to have internal correlation stronger than random modules of functionally unrelated genes of the same size. We also provided separate estimates for each of the Gene Ontologies (30% in BP; 30% in MF; 46% in CC), showing that, in general, GO Biological Processes or Molecular Function have a weaker internal correlation than KEGG pathways or GO Cellular Component. Another interesting finding was the fact that many modules have high internal correlation but also high variability.
Different reasons may account for these observations. In some cases there are functional modules defined in GO that are composed by independent or even antagonistic sub-modules and, consequently, their genes will never be found co-expressing in any experiment. Examples are transporters, which are composed by different independent types of sub-modules or any GO term starting by "regulation of", which usually has two antagonistic descendants called "positive regulation of" and "negative regulation of". In other cases, there are functional modules that require of a core of genes for properly carrying out the function and other genes of the module are only activated under particular physiological conditions, stresses, etc., displaying a lower degree of correlation. Modules composed by sub-modules can also exist, and many other situations can be imagined. In any case, the vision of a functional module as a discrete class, to which genes belong or do not belong to, is definitively not supported by the observations made. Thus, it is urgent to take a new approach that accounts for the non-discrete nature of the functional modules as defined by the most commonly used functional annotations (GO and KEGG).
In addition we highlighted how the level of annotation of a GO term in the ontology structure may not be the most suitable indicator, at least in terms of co-regulation, of the described function, despite being often used as a measure of its specificity.
Under the above mentioned considerations, most currently used functional profiling methods which model functional modules as groups of co-expressing genes, seem clearly inappropriate. The need of new methodologies for functional profiling and, above all, the essential requirement of a new notion of membership of a gene to a functional module is still an open issue. The proposed coherence score can be used in a first instance as a filtering criterion when the aim is to relate functionality to gene expression by discarding functional modules that will never be found as co-expressing units. Beyond this obvious use, this index can also be used to derive a weighting scheme that introduces the idea of non-discrete functional modules within the context of functional profiling methodologies in a straightforward manner. The proposed weighting scheme has the desirable property of using information on gene coexpression in the algorithm when such information is available but not introducing any bias when the information is missing. Relying on this new concept and using gene expression correlation information, we have shown with two examples how the proposed weighted approach discovers GO terms and pathways unnoticed under the equivalent standard un-weighted functional profiling method.
The approach shown here is quite general and could easily be extended to any other species or different platforms just by calculating the corresponding correlation matrix in a straightforward manner. The methodology could also be easily extended to any other types of modules defined by functionality, regulatory motifs, etc. Obviously, the use of newer strategies for functional profiling such as the different versions of gene set enrichment analysis [
11-
14,
16], would benefit of considering this weighted definition of functional modules instead of using the classical categorical, un-weighted definitions.
Although the weighting schema proposed is quite simple, it proves efficient in finding functional modules in a standard functional enrichment analysis framework [
1,
10], as shown by the examples. Obviously, these examples have only an illustrative purpose of the application of the method that uses information on gene coexpression to improve functional module detection. However, in the worst scenario in absence of such information, this approach would be strictly equivalent to a conventional functional enrichment test and, therefore, its application would be equally valid. The use of most sophisticated weighting schemes, in which the continuous distribution of values of co-expression of all the genes in the module (and possibly outside the module) were taken into account, would probably improve even more the results. Also, a similar philosophy could also be applied to improve the detection of modules in gene-set enrichment methods although it falls beyond the scope of this manuscript.