In systems biology, it is particularly important to analyze different types of data sets simultaneously, especially when the different kinds of biological variables are measured on the same samples. Such an analysis enables a real understanding of the relationships between these different types of variables, for example when analyzing transcriptomics, proteomics or metabolomics data generated on different platforms. Few approaches exist to deal with these high throughput data sets. The application of linear multivariate models such as Partial Least Squares regression (PLS, [1]) and Canonical Correlation Analysis (CCA, [2]) is often limited by the size of the data set (ill-posed problems, CCA), the noisy and multicollinear characteristics of the data (CCA), but also by a lack of interpretability (PLS). However, these approaches remain extremely interesting for integrating data sets. First, because they allow for the compression of the data into 2 or 3 dimensions for a more powerful and global view. And second, because their resulting components and loading vectors capture dominant and latent properties of the studied process. They may hence provide a better understanding of the underlying biological systems, for example by revealing groups of samples that were previously unknown or uncertain. PLS is an algorithmic approach that has often been criticized for its lack of theoretical justification. Much work still needs to be done to demonstrate the statistical properties of PLS (see for example [3], who recently addressed some theoretical developments of PLS). Nevertheless, this computational and exploratory approach is extremely popular thanks to its efficiency.
Recent integrative biological studies applied Principal Component Analysis or PLS [5], but in a regression framework, where prior biological knowledge indicates which type of omic data is expected to explain the other type (for example transcripts and metabolites). Here, we specifically focus on a canonical correlation framework, where there is either no assumption on the relationship between the two sets of variables (exploratory approach), or where a reciprocal relationship between the two sets is expected (e.g. cross-platform comparisons). Our interest lies in integrating these two high dimensional data sets and performing variable selection simultaneously. Some sparse integrative approaches have recently been developed to include a built-in selection procedure. They adapt the lasso penalty [7] or combine lasso and ridge penalties (Elastic Net, [8]) for feature selection in integration studies.
In this study, we propose to apply a sparse canonical approach called "sparse PLS" (sPLS) to the integration of high throughput data sets. Methodological aspects and an evaluation of sPLS in a regression framework were presented in [9]. This novel computational method performs variable selection on two-block data sets in a one-step procedure, while integrating variables of two types.
When applying canonical correlation-based methods, most validation criteria used in a regression context are not statistically meaningful. Instead, the biological relevance of the results should be evaluated during the validation process. In this context, we compare sparse PLS with two other canonical approaches: penalized CCA adapted with Elastic Net (CCA-EN, [10]), a sparse method that was applied to relate gene expression with gene copy numbers in human gliomas, and Co-Inertia Analysis (CIA, [11]), which was first developed for ecological data and later applied to canonical high-throughput biological studies [12]. This latter approach does not include feature selection, which has to be performed in a two-step procedure. This comparative study has two aims: first, to better understand the main differences between these approaches and to identify which method is appropriate to answer a given biological question; second, to highlight how each method reveals the underlying biological processes inherent to the data. This type of comparative analysis makes biological interpretation mandatory to strengthen the statistical hypothesis, especially when statistical criteria to assess the validity of the results are lacking. We first recall some canonical correlation-based methods, among which the two sparse methods, sPLS and CCA-EN, will be compared with CIA on the NCI60 cell lines data set. We propose appropriate graphical representations to discuss the results. The different gene lists are assessed, first with some statistical criteria, and then with a detailed biological interpretation. Finally, we discuss the pros and cons of each approach before concluding.
Canonical correlation-based methods
We focus on two-block data matrices denoted X (n × p) and Y (n × q), where the p variables xj and the q variables yk are of two types and are measured on the same n samples or individuals, for j = 1 ... p and k = 1 ... q. Prior biological knowledge of these data allows us to settle into a canonical framework, i.e. there exists a reciprocal relationship between the X variables and the Y variables. In the case of high throughput biological data, the large number of variables may affect the exploratory method, due to numerical issues (as is the case for example with CCA) or lack of interpretability (PLS).
We next recall three types of multivariate methods (CCA, PLS, CIA). For CCA and PLS, we describe the associated sparse approaches that were proposed, either to select variables from each set or to deal with the ill-posed problem commonly encountered in high dimensional data sets.
Canonical Correlation Analysis [2] studies the relationship between two sets of data. The CCA n-dimensional score vectors (Xah, Ybh) come in pairs to solve the objective function:

(ah, bh) = argmax cor(Xah, Ybh),

where the p- and q-dimensional vectors ah and bh are called canonical factors, or loading vectors, and h is the chosen CCA dimension, h = 1 ... H. As cor(Xah, Ybh) = cov(Xah, Ybh)/[var(Xah) var(Ybh)]^{1/2}, the aim of CCA is to simultaneously maximize cov(Xah, Ybh) and minimize the variances of Xah and Ybh. It is known that the CCA loadings are not directly interpretable [13]. It is however very instructive to interpret these components by calculating the correlations between the original data set X and the score vectors {Xa1, ..., XaH}, and similarly between Y and {Yb1, ..., YbH}, so as to project the variables onto correlation circles. More easily interpretable graphics are then obtained, as shown in the R package cca [14].
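To make the objective concrete, here is a minimal numpy sketch of classical CCA for the well-posed case (n larger than p and q), solved through the SVD of the whitened cross-covariance matrix. This is an illustrative implementation written for this text, not the code of the packages cited here; the function name and defaults are ours.

```python
import numpy as np

def cca(X, Y, n_comp=2):
    """Classical CCA sketch: find loading vectors (ah, bh) maximizing
    cor(X ah, Y bh). Well-posed case only (n > p and n > q)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Lx = np.linalg.cholesky(X.T @ X)      # X'X = Lx Lx'
    Ly = np.linalg.cholesky(Y.T @ Y)
    # The canonical correlations are the singular values of the
    # whitened cross-covariance matrix Lx^{-1} X'Y Ly^{-T}.
    M = np.linalg.solve(Lx, X.T @ Y) @ np.linalg.inv(Ly).T
    U, rho, Vt = np.linalg.svd(M, full_matrices=False)
    A = np.linalg.solve(Lx.T, U[:, :n_comp])     # canonical factors ah
    B = np.linalg.solve(Ly.T, Vt.T[:, :n_comp])  # canonical factors bh
    return A, B, rho[:n_comp]
```

With this normalization, cor(X A[:, 0], Y B[:, 0]) equals rho[0] up to sign, which illustrates why the scores maximize correlation rather than covariance.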
In the p ≫ n framework, CCA suffers from high dimensionality, as it requires the computation of the inverses of the two covariance matrices X'X and Y'Y, which are singular. This implies numerical difficulties, since the canonical correlation coefficients are not uniquely defined. One solution proposed by [15] was to introduce l2 penalties in a ridge CCA (rCCA) on the covariance matrices, so as to make them invertible. rCCA was recently applied to genomic data [16], but was not suited to our study as it does not perform feature selection. We focused instead on another variant called CCA with Elastic Net penalization (see below).
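The ridge fix can be sketched in a few lines: adding l2 penalties to the diagonals of the covariance matrices makes them invertible even when p and q exceed n. This is a hypothetical illustration (function name and penalty values are ours; in practice the penalties would be tuned).

```python
import numpy as np

def ridge_cca(X, Y, lam1=0.1, lam2=0.1, n_comp=2):
    """rCCA sketch: l2 penalties lam1, lam2 regularize the singular
    covariance matrices so a CCA solution is defined when p, q > n."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Lx = np.linalg.cholesky(X.T @ X + lam1 * np.eye(X.shape[1]))
    Ly = np.linalg.cholesky(Y.T @ Y + lam2 * np.eye(Y.shape[1]))
    M = np.linalg.solve(Lx, X.T @ Y) @ np.linalg.inv(Ly).T
    U, rho, Vt = np.linalg.svd(M, full_matrices=False)
    return (np.linalg.solve(Lx.T, U[:, :n_comp]),
            np.linalg.solve(Ly.T, Vt.T[:, :n_comp]),
            rho[:n_comp])
```

Without the penalties, the Cholesky factorization above would fail whenever p > n, since X'X is then rank-deficient.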
Partial Least Squares regression [1] is based on the simultaneous decomposition of X and Y into latent variables and associated loading vectors. Latent variable methods (e.g. PLS, Principal Component Regression) assume that the studied system is driven by a small number of n-dimensional vectors called latent variables. These may correspond to underlying biological phenomena related to the study [17]. Like CCA, the PLS latent variables are linear combinations of the variables, but the objective function differs, as it is based on the maximization of the covariance:

(ah, bh) = argmax cov(Xh-1 ah, Yh-1 bh), with ||ah|| = ||bh|| = 1,

where Xh-1 is the residual (deflated) X matrix for each PLS dimension h, and similarly Yh-1 for Y. We denote by ξh and ωh the n-dimensional vectors called "latent variables", which are associated with each loading vector ah and bh. Contrary to CCA, the loading vectors (ah, bh) are interpretable and can give information about how the xj and yk variables combine to explain the relationships between X and Y. Furthermore, the PLS latent variables (ξh, ωh) indicate the similarities or dissimilarities between the individuals, in relation to the loading vectors.
Many PLS algorithms exist, not only for different shapes of data (SIMPLS [18], PLS1 and PLS2 [1], PLS-SVD [19]) but also for different aims (predictive, like PLS2, or modelling, like PLS-mode A, see [10]). In this study we especially focus on a modelling aim ("canonical mode") between the two data sets, by deflating X and Y in a symmetric way (see Additional file 1 for more details).
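As an illustration of the canonical mode, the following numpy sketch extracts each (ah, bh) pair from the SVD of the current cross-product matrix and deflates X and Y symmetrically, each on its own latent variable. This is our reading of the symmetric deflation step, not the authors' exact algorithm (see Additional file 1 of the original paper for that).

```python
import numpy as np

def pls_canonical(X, Y, H=2):
    """PLS sketch, canonical mode: at each dimension h the loadings
    (ah, bh) maximize cov(X_{h-1} a, Y_{h-1} b); both blocks are then
    deflated symmetrically, each on its own latent variable."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    A, B, Xi, Om = [], [], [], []
    for _ in range(H):
        U, s, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
        a, b = U[:, 0], Vt[0, :]        # dominant singular vector pair
        xi, om = X @ a, Y @ b           # latent variables xi_h, omega_h
        # Symmetric (canonical) deflation: X on xi, Y on omega.
        X = X - np.outer(xi, xi @ X) / (xi @ xi)
        Y = Y - np.outer(om, om @ Y) / (om @ om)
        A.append(a); B.append(b); Xi.append(xi); Om.append(om)
    return (np.array(A).T, np.array(B).T, np.array(Xi).T, np.array(Om).T)
```

Deflating each block on its own latent variable is what makes the two data sets play symmetric roles; in a regression mode, Y would instead be deflated on the X-scores ξh.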
[10] proposed a sparse penalized variant of CCA using Elastic Net [8] for a canonical framework. To do so, the authors used the PLS-mode A formulation [20] to introduce penalties. Note that Elastic Net is well adapted to this particular context. It combines the advantages of ridge regression, which penalizes the covariance matrices X'X and Y'Y so that they become non-singular, and of the lasso [7], which allows variable selection, in a one-step procedure. However, when p is very large, the resolution of the optimization problem requires intensive computation, and [8] proposed instead to perform a univariate thresholding, which leaves only the lasso estimates to compute (see Additional file 1 for more details).
[9] proposed a sparse PLS approach (sPLS) based on a PLS-SVD variant, so as to penalize both loading vectors ah and bh.
For any matrix M (p × q) of rank r, the SVD of M is given by:

M = A Δ B',

where the columns of A (p × r) and B (q × r) are orthonormal and contain the eigenvectors of MM' and M'M respectively, and Δ (r × r) is a diagonal matrix containing the square roots of the eigenvalues of MM' or M'M (the singular values). Now if M = X'Y, then the column vectors of A (resp. B) correspond to the PLS loading vectors ah (resp. bh). Sparsity can then be introduced by iteratively penalizing ah and bh with a soft-thresholding penalization, as [23] proposed for a sparse PCA using SVD computation. Both regression and canonical deflation modes were proposed for sPLS [9]. In this paper, we will focus on the canonical mode only (see Additional file 1 for more details of the algorithm). The regression mode has already been discussed in [9], with a thorough biological interpretation of the results.
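The penalization idea can be sketched as follows: starting from the dominant singular vector pair of M = X'Y, the loading vectors are alternately updated and soft-thresholded, so that small weights are set exactly to zero. This is an illustrative simplification of one sPLS dimension, not the exact algorithm of [9]; the function name and penalty values lam_a, lam_b are ours.

```python
import numpy as np

def soft(x, lam):
    """Soft-thresholding: shrinks entries towards zero, zeroing small ones."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def sparse_pls_first_dim(X, Y, lam_a, lam_b, n_iter=50):
    """First sPLS dimension (sketch): sparse loading vectors obtained by
    alternating soft-thresholded updates of the dominant singular
    vectors of M = X'Y. lam_a, lam_b control the degree of sparsity."""
    M = (X - X.mean(axis=0)).T @ (Y - Y.mean(axis=0))
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    a, b = U[:, 0], Vt[0, :]      # dense SVD solution as starting point
    for _ in range(n_iter):
        a = soft(M @ b, lam_a)
        a /= np.linalg.norm(a)    # assumes lam_a leaves some nonzero entry
        b = soft(M.T @ a, lam_b)
        b /= np.linalg.norm(b)
    return a, b
```

In practice the penalties (or equivalently the number of variables to keep) are the tuning parameters discussed later in the text.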
Co-Inertia Analysis (CIA) was first introduced by [11] in the context of ecological data, before being applied to high throughput biological data by [12]. CIA is suitable for a canonical framework, as it is adapted to a symmetric analysis. It involves analyzing each data set separately, either with principal component analyses or with correspondence analyses, such that the covariance between the two new sets of projected score vectors (which maximize either the projected variability or inertia) is maximal. This results in two sets of axes, where the first pair of axes are maximally co-variant and are orthogonal to the next pair [24]. CIA does not propose a built-in variable selection, but we can perform instead a two-step procedure, by ordering the weight vector (loadings) for each CIA dimension and by selecting the top variables.
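The two-step selection used with CIA amounts to ranking variables by the absolute value of their weight on a given axis and keeping the top k. A generic helper (written for this text, not part of the CIA software) might look like:

```python
import numpy as np

def top_variables(loadings, names, k):
    """Rank variables by absolute loading weight on one axis, keep top k."""
    order = np.argsort(-np.abs(np.asarray(loadings)))
    return [names[i] for i in order[:k]]
```

Applied per dimension, this mimics the built-in selection of the sparse methods, but without any penalization of the loadings themselves.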
Differences between the approaches
These three canonical correlation-based approaches, CCA-EN, sPLS and CIA, profoundly differ in their construction, and hence in their aims. On the one hand, CCA-EN looks for canonical variate pairs (Xah, Ybh) such that a penalized version of the canonical correlation is maximized. This explains why a non-monotonic decreasing trend in the canonical correlation can sometimes be obtained [10]. On the other hand, sPLS (canonical mode) and CIA aim at maximizing the covariance between the score vectors, so that there is a strong symmetric relationship between both sets. However, CIA is based on the construction of two Correspondence Analyses, whereas sPLS is based on a PLS analysis.
In CCA-EN, the authors proposed to tune the penalty parameters for each dimension, such that the canonical correlation cor(Xah, Ybh) is maximized. In practice, they showed that the correlation did not change much when more variables were added to the selection. Therefore, an appropriate way of tuning the parameters is instead to choose the degree of sparsity (i.e. the number of variables to select), as previously proposed for sparse PCA by [22] (see the elasticnet R package for example), and hence to rely on the biologists' needs. Thus, depending on the aim of the study (focus on a few genes, or on groups of genes such as whole pathways) and on the ability to perform follow-up studies, the size of the selection can be adapted. When focusing on groups of genes (e.g. pathways, transcription factor targets, variables involved in the same biological process), we believe that the selection should be large enough to avoid missing specific functions or annotations. The same strategy will be used for sPLS (see also [9], where the issue of tuning the sPLS parameters is addressed). No parameter other than the number of selected variables is needed in CIA either.
Graphical representations are crucial to help interpret the results. We therefore propose to homogenize all outputs to enable their comparison.
Samples are represented with the score or latent variable vectors in a superimposed manner, as proposed in the R package ade4 [25], first to show how samples cluster based on their biological characteristics, and second to measure whether both data sets strongly agree according to the applied approach. In these graphical representations, each sample is indicated by an arrow. The start of the arrow indicates the location of the sample in the X data set in one plot, and the tip of the arrow its location in the Y data set in the other plot. Thus, short (long) arrows indicate that the two data sets strongly agree (disagree).
Variables are represented on correlation circles, as previously proposed by [14]. Correlations between the original data sets and the score or latent variable vectors are computed, so that highly correlated variables cluster together in the resulting graphics. Only the variables selected in each dimension are represented. This type of graphic not only allows for the identification of interactions between the two types of variables, but also of the relationships between variable clusters and associated sample clusters. Note that for large variable selections, the use of interactive plotting, color codes or representations limited to user-selected variables may be required to simplify the outputs.
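The coordinates plotted on such a correlation circle can be computed directly: each variable is placed at its Pearson correlations with the first two score vectors, so all points fall inside the unit circle. A minimal sketch (the function name is ours):

```python
import numpy as np

def correlation_circle_coords(X, scores):
    """Pearson correlation of every column of X with the first two
    score (latent variable) vectors; each row gives the coordinates
    of one variable on the correlation circle."""
    Xc = X - X.mean(axis=0)
    Sc = scores[:, :2] - scores[:, :2].mean(axis=0)
    num = Xc.T @ Sc
    denom = np.outer(np.linalg.norm(Xc, axis=0), np.linalg.norm(Sc, axis=0))
    return num / denom   # shape (p, 2), all entries in [-1, 1]
```

Variables lying close together near the circle's edge are strongly correlated with each other, which is what makes the clusters on these plots interpretable.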
Data sets and relevance for a canonical correlation analysis
We chose to compare the three canonical correlation-based methods (CCA-EN, CIA and sPLS) for their ability to highlight the relationships between two gene expression data sets, both obtained on a panel of 60 cell lines (NCI60) from the National Cancer Institute (NCI). This panel consists of human tumor cell lines derived from patients with leukaemia (LE), melanomas (ME) and cancers of ovarian (OV), breast (BR), prostate (PR), lung (LU), renal (RE), colon (CO) and central nervous system (CNS) origin. The NCI60 panel is used by the Developmental Therapeutics Program (DTP) of the NCI to screen thousands of chemical compounds for growth inhibition activity, and it has been extensively characterized at the DNA, mRNA, protein and functional levels. The data sets considered here were generated using Affymetrix [26] or spotted cDNA [28] platforms. These data sets are highly relevant to an analysis in a canonical framework since 1) there is some degree of overlap between the genes measured by the two platforms, but also a large degree of complementarity through the screening of different gene sets representing common pathways or biological functions [12], and 2) they play fully symmetric roles, as opposed to a regression framework where one data set is explained by the other. We assume that the data sets are correctly normalized, as described below.
The Ross Data set
Spotted cDNA microarrays containing 9,703 human cDNAs were used to profile each of the 60 cell lines in the NCI60 panel [28]. Here, we used a subset of 1,375 genes selected using both the non-specific and specific filters described in [29]. In particular, genes with more than 15% missing values were removed, and the remaining missing values were imputed by k-nearest neighbours [12]. The pre-processed data set, containing log-ratio values, is available in [12].
The Staunton Data set
Hu6800 Affymetrix microarrays containing 7,129 probe sets were used to screen each of the 60 cell lines in another study [26]. Pre-processing steps are described in [27] and [12]. They include 1) replacing average difference values lower than 100 by an expression value of 100, 2) eliminating genes whose expression was invariant across all 60 cell lines, and 3) selecting the subset of genes displaying a minimum change in expression of at least 500 average difference units across all 60 cell lines. The final analyzed data set contained the average difference values for 1,517 probe sets and is available in [12].
Application of the three sparse canonical correlation-based methods
We applied CCA-EN, CIA and sPLS to the Ross (X) and Staunton (Y) data sets. For each dimension h, h = 1 ... 3, we selected 100 genes from each data set. The number of dimensions was limited to three because, for higher dimensions (H ≥ 4), the cell lines no longer clustered by their tissue of origin and the large number of graphical outputs made interpretation difficult. The size of the selection (100) was judged small enough to allow for the identification of individually relevant genes, and large enough to reveal gene groups belonging to the same functional category or pathway.