Genome-wide analyses provide an unprecedented amount of data, leading to new interpretation challenges. Classical microarrays can monitor the expression of potentially all genes within a cell or a tissue sample. More recently, new applications have been developed. They include chromatin-immunoprecipitation-chip (ChIP-on-Chip), analysis of alternative splicing (Exon array), characterization of the methylome, polymorphism genotyping (SNP array), copy-number measurements (CGH array) and genome resequencing (for a review, see [1]). Great interest in the statistical analysis of these 'Omics' data has emerged, and many methodologies have been developed. However, while inferential statistical analyses are now guided by consensus methods [3], descriptive analysis is often cursory, if not neglected. Two reasons can be advanced: (i) the sheer volume of information makes the results difficult to interpret, and (ii) heterogeneous data and multiple sources of information are difficult to integrate in a global analysis. Methods that overcome these difficulties are needed, as the understanding of a biological phenomenon would greatly benefit from considering several types of 'Omics' data simultaneously, particularly together with biological knowledge. This could be done with a multidimensional exploratory approach.
In a multidimensional exploratory approach, a microarray data set is usually analyzed with multivariate analysis (MVA) methods, among which Principal Components Analysis (PCA) is the most widely used. PCA is well suited to 'Omics' data because it can handle data sets with many more variables (genes) than samples (arrays). To analyze several data sets simultaneously, the appropriate approach is to use MVA methods dedicated to multi-way data tables, the reference method being generalized canonical analysis (GCA) [5]. In the microarray field, however, GCA is limited by the problem of multicollinearity. To bypass this limitation, only two alternatives have been proposed so far: generalized co-inertia analysis (CIA) [6] and the recently applied regularized canonical correlation analysis (RCCA) [9].
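The claim that PCA accommodates many more variables than samples can be illustrated with a minimal sketch. It assumes NumPy and simulated data; the table sizes and variable names are illustrative assumptions, not taken from the study. PCA is computed here through a thin singular value decomposition of the centered table, so the number of components is bounded by the number of samples rather than by the number of genes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genes = 12, 5000          # hypothetical sizes: far more genes than arrays
X = rng.normal(size=(n_samples, n_genes))

# Center each gene (column) so that PCA describes variation around the mean.
Xc = X - X.mean(axis=0)

# Thin SVD: only min(n_samples, n_genes) components are computed.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U * s                         # sample coordinates on the principal components
explained = s**2 / np.sum(s**2)        # share of total variance per component

# After centering, at most n_samples - 1 components carry variance.
rank = int(np.sum(s > 1e-10 * s[0]))
print(rank, scores.shape)
```

The gene loadings are the rows of `Vt`; no 5000 x 5000 covariance matrix is ever formed.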
The need to integrate external information into MVA to ease the interpretation of microarray data has also been pointed out. As proposed by Busold et al. [ ], Fagan et al. [ ] superimposed Gene Ontology (GO) terms as supplementary elements onto CIA projections. In that study, GO terms are formalized as boolean vectors that are projected onto CIA plots after matrix transformations. Although the CIA approach provides good results in combining molecular data sets, the way GO terms are added is not straightforward and appears incomplete. Indeed, this method only codes the links between genes and GO terms and does not take into account the microarray values or molecular data of the genes related to each GO term. Other computational methods, such as gene set enrichment analysis (GSEA) [13], have shown the importance of focusing on groups of genes, as opposed to individual genes, when incorporating biological information and gene sets into microarray data analysis. Following this philosophy, the integration of biological information in MVA should gain in accuracy by grouping genes into knowledge-related modules, and thus by adopting a 'modular approach' [14]. Such an approach studies the behavior and structure of a biological process as a whole, in addition to analyzing its components (genes and/or gene products) individually.
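The difference between the boolean coding of GO terms and a modular coding can be made concrete with a small sketch. It assumes NumPy and simulated values; the gene identifiers, term names and annotation map are hypothetical placeholders, not real GO annotations. A boolean vector records only which genes belong to a term, whereas a module keeps the measured values of those genes for every sample.

```python
import numpy as np

genes = ["g1", "g2", "g3", "g4", "g5"]        # hypothetical gene identifiers
annotations = {                                # hypothetical term -> annotated genes
    "term_A": {"g1", "g3", "g4"},
    "term_B": {"g2", "g5"},
}

rng = np.random.default_rng(1)
expression = rng.normal(size=(6, len(genes)))  # 6 samples x 5 genes, simulated values

# Boolean coding: one vector per term, recording gene membership only.
membership = {
    term: np.array([g in members for g in genes])
    for term, members in annotations.items()
}

# Modular coding: one sub-table per term, keeping the measured values
# of the annotated genes for every sample.
modules = {term: expression[:, mask] for term, mask in membership.items()}

for term, block in modules.items():
    print(term, membership[term].astype(int), block.shape)
```

Each module is a samples-by-genes table, so it can be treated as a group of variables in its own right.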
In this article, we propose using Multiple Factor Analysis (MFA) in the sense of Escofier-Pagès [16] to integrate bio-molecular data sets as well as information on the genes structured in modules. MFA is dedicated to the simultaneous exploration of multi-way data sets in which the same individuals are described by several groups of variables. MFA is commonly applied to sensory and ecology data, and it has already been applied to the analysis of metabonomic data [18]. MFA can be related to GCA and CIA, since it can be considered a particular generalized canonical analysis in which the inertia criterion replaces the correlation criterion. These methods display a low-dimensional projection of the data that highlights the main sources of variability. Results should therefore be interpreted with caution, as sources of variability are not always due to the specific biological factors of interest. It is also important to note that, at the sample level, the structures provided by MFA and CIA are highly similar [6]. The assets of MFA appear when integrating both numerical and categorical groups of variables, and when supplementary groups of data need to be added to the analysis. Here, we present our approach by introducing the basis of MFA, and we explain why MFA is particularly well suited to integrating formalized biological knowledge. We illustrate our method with a glioma study [19] performed with both CGH arrays and expression microarrays on the same tumor samples. The results show that both the DNA copy number alteration and the transcriptome data sets induce a good separation of the gliomas according to the WHO classification. The superimposition of the gene modules built from GO annotations identifies regulatory mechanisms implicated in gliomagenesis. We also show that our approach can handle a single data set with associated GO annotations and can therefore be used as an exploratory tool in a classical single 'Omics' study. Finally, we present another illustration, focused on a nutrition study in mice, that integrates microarray and lipidomic data.
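As a rough illustration of how MFA explores several groups of variables measured on the same individuals, the sketch below follows the usual weighting scheme: each centered group is rescaled by its first singular value, then a global PCA is run on the juxtaposed groups. It assumes NumPy, simulated data and continuous variables only; the block sizes, the names `cgh` and `expr`, and the helper `first_singular_value` are hypothetical, and dividing by the first singular value matches the classical 1/λ1 weighting only up to a constant common to all groups.

```python
import numpy as np

def first_singular_value(block):
    """Largest singular value of a (centered) data block."""
    return np.linalg.svd(block, compute_uv=False)[0]

rng = np.random.default_rng(0)
n_samples = 10                                   # same samples in both tables (simulated)
cgh = rng.normal(size=(n_samples, 40))           # hypothetical copy-number block
expr = 5.0 * rng.normal(size=(n_samples, 300))   # hypothetical expression block, larger scale

weighted = []
for block in (cgh, expr):
    centered = block - block.mean(axis=0)
    # Group weighting: rescale so the group's first singular value equals 1,
    # which prevents a large or high-variance group from dominating.
    weighted.append(centered / first_singular_value(centered))

# Global analysis: PCA (via SVD) of the juxtaposed, weighted groups.
Z = np.hstack(weighted)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
sample_coords = U * s            # common representation of the samples
first_eigenvalue = s[0] ** 2     # lies between 1 and the number of groups

print(round(first_eigenvalue, 3), sample_coords.shape)
```

A first global eigenvalue close to the number of groups suggests that the groups share their main direction of variability, while a value close to 1 suggests that they do not.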