Recent technological developments confront data analysts in many domains with increasingly complex data. A special case of complex data is coupled data, which consist of different data matrices for the same set of variables or the same set of experimental units. In systems biology, an example of matrices sharing the same set of variables is the study of the expression profile of a certain organism (e.g., Saccharomyces cerevisiae) on the basis of, on the one hand, different microarray compendia that can be downloaded from public repositories and, on the other hand, ChIP-chip or motif data [1]. An example of data matrices with shared experimental units are metabolomics data (e.g., the metabolome of Escherichia coli) gathered from different fermentations using mass spectrometry (MS), with different MS data sets being available from different separation methods (e.g., gas chromatography and liquid chromatography [3]). In the first example each of the data matrices provides information on the same transcriptome, and in the second example on the same set of metabolites, with some parts of the information being common to the different data matrices and some parts being specific. For example, gas chromatography mass spectrometry (GC/MS) and liquid chromatography MS (LC/MS) in general both measure a few classes of common compounds and many classes of compounds that are measured by only one of the two methods [3].
A major challenge for researchers dealing with such coupled data is to represent them in such a way that both the shared and the specific information contained in the different data matrices is captured (with all information in question pertaining to variance within each of the matrices under study). For example, in the case of coupled gene expression and ChIP-chip data, one may wish to retrieve modules of genes that have the same transcription factors and that are co-regulated under the same conditions, which is information common to the transcriptome and ChIP-chip data matrices. In the metabolomics example, a coupled analysis of gas and liquid chromatography MS data should make it possible to highlight the classes of compounds that are measured by both separation methods, as well as those that are measured by only one of them.
Several tools are available for the analysis of coupled data. Here we focus on methods that simultaneously extract components from all data blocks. Examples of such methods include SUM-PCA [5], unrestricted PCovR (Gurden: Multiway covariates regression, unpublished), SCA-P [6], multiple factor analysis [7], and STATIS [8]. Although all these methods are based on the idea of simultaneous component extraction, they were developed independently in different disciplines (including chemometrics and psychometrics) and rely on different terminologies and mathematical frameworks. As a consequence, comparing them is not straightforward. The primary objective of this paper is to provide a structured overview in which all the methods fit, and to highlight their common core and their particularities.
The paper starts by introducing some terminology to delineate the types of data to which the methods are applicable. Next, a general framework is introduced that encompasses all the different simultaneous component methods and that frames them mathematically in a principal components setting. Each of the methods is then discussed with respect to this framework. Finally, an application is presented on simultaneous component analyses of gas and liquid chromatography MS data; in this application we compare the results obtained by applying the different methods and discuss how to interpret the results obtained by one of them (multiple factor analysis).
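
To make the idea of simultaneous component extraction concrete, the following is a minimal sketch and not an implementation of any of the specific methods named above. It assumes two data blocks that share their experimental units (rows), centers each variable, scales each block to unit sum of squares (one possible weighting choice, taken here as an assumption), and extracts common components by a singular value decomposition of the concatenated blocks. The function name `simultaneous_components` and the simulated blocks `gc_block` and `lc_block` are hypothetical and serve illustration only.

```python
import numpy as np


def simultaneous_components(blocks, n_components=2):
    """Extract common component scores from blocks that share their rows.

    Hypothetical helper for illustration; the block weighting used here
    (unit sum of squares per block) is an assumption, not a prescription.
    """
    processed = []
    for X in blocks:
        Xc = X - X.mean(axis=0)        # center every variable (column)
        norm = np.linalg.norm(Xc)      # Frobenius norm of the centered block
        processed.append(Xc / norm if norm > 0 else Xc)

    # Concatenate the blocks side by side: rows are the shared units.
    X_concat = np.hstack(processed)

    # PCA via SVD of the concatenated matrix.
    U, s, Vt = np.linalg.svd(X_concat, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # shared by all blocks
    loadings = Vt[:n_components].T                     # one row per variable

    # Split the loadings back into one matrix per block.
    split_points = np.cumsum([X.shape[1] for X in blocks])[:-1]
    block_loadings = np.split(loadings, split_points, axis=0)
    return scores, block_loadings


# Simulated data: 20 shared units, one common source of variation plus noise.
rng = np.random.default_rng(0)
common = rng.normal(size=(20, 1))
gc_block = common @ rng.normal(size=(1, 5)) + 0.1 * rng.normal(size=(20, 5))
lc_block = common @ rng.normal(size=(1, 8)) + 0.1 * rng.normal(size=(20, 8))

scores, block_loadings = simultaneous_components([gc_block, lc_block])
```

In this sketch the component scores are common to both blocks, whereas each block receives its own loadings; how the blocks are weighted before concatenation is a design choice that this sketch fixes arbitrarily.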