With the development of high-throughput technologies such as microarrays and next-generation sequencing, exploring high-throughput data sets has become necessary to unveil the relevant information they contain. Efficient exploratory tools are therefore needed, not only to assess the quality of the data, but also to give a comprehensive overview of the system, extract significant information and cope with the high dimensionality. Indeed, many statistical approaches fail or perform poorly for two main reasons: the number of samples (or observations) is much smaller than the number of variables (the biological entities that are measured), and the data are extremely noisy.
In this study, we are interested in applying unsupervised approaches to discover novel biological mechanisms and reveal insightful patterns while reducing the dimension of the data. Amongst the different categories of unsupervised approaches (clustering, model-based and projection methods), we are specifically interested in projection-based methods, which linearly decompose the data into components with a desired property. These exploratory approaches project the data into a new subspace spanned by the components. They allow dimension reduction without loss of essential information, and visualization of the data in a smaller subspace.
Principal component analysis (PCA) [1] is a classical tool to reduce the dimension of expression data, to visualize the similarities between the biological samples, and to filter noise. It is often used as a pre-processing step for subsequent analyses. PCA projects the data into a new space spanned by the principal components (PCs), which are uncorrelated and orthogonal. The PCs can successfully extract relevant information in the data. Through sample and variable representations, they can reveal experimental characteristics, as well as artefacts or bias. Sometimes, however, PCA can fail to accurately reflect our knowledge of biology for the following reasons: a) PCA assumes that gene expression follows a multivariate normal distribution, whereas recent studies have demonstrated that microarray gene expression measurements instead follow a super-Gaussian distribution [2]; b) PCA decomposes the data based on the maximization of its variance, and in some cases the biological question may not be related to the highest variance in the data [6].
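The projection described above can be illustrated with a minimal NumPy sketch. It assumes PCA is computed from the singular value decomposition of the column-centred data; the function name `pca_svd` and the simulated matrix are hypothetical and only serve the illustration.

```python
import numpy as np

def pca_svd(X, n_components):
    """PCA of an (n_samples, n_features) matrix via the SVD of the centred data."""
    Xc = X - X.mean(axis=0)                        # centre each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components].T                 # (p, k), orthonormal columns
    components = Xc @ loadings                     # (n, k), the principal components
    explained_var = s[:n_components] ** 2 / (X.shape[0] - 1)
    return components, loadings, explained_var

# Simulated example with far fewer samples than variables (n = 20, p = 500).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 500))
pcs, loadings, var = pca_svd(X, n_components=3)

# Loading vectors are orthonormal and the components are uncorrelated.
print(np.round(loadings.T @ loadings, 6))
print(np.round(np.cov(pcs, rowvar=False), 6))
```

The components are ordered by decreasing variance, which is exactly the criterion that may not match the biological question of interest.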
A more plausible assumption about the underlying distribution of high-throughput biological data is that feature measurements following Gaussian distributions represent noise: most genes conform to this distribution, as they are not expected to change at a given physiological or pathological transition [7]. Recently, an alternative approach called Independent Component Analysis (ICA) [8] has been introduced to analyze microarray and metabolomics data [2]. In contrast to PCA, ICA identifies non-Gaussian components, which are modelled as linear combinations of the biological features. These components are statistically independent, i.e. there is no overlapping information between the components. ICA therefore involves higher-order statistics, while PCA constrains the components to be mutually orthogonal, which involves only second-order statistics [14]. As a result, PCA and ICA often choose different subspaces onto which the data are projected. As ICA is a blind source separation method, it is used to reduce the effects of noise or artefacts in the signal since, usually, noise is generated from independent sources [10]. Recent literature has shown that the independent components from ICA are better at separating different biological groups than the principal components from PCA [2]. However, although ICA has been found to be a successful alternative to PCA, it faces limitations related to instability, the choice of the number of components to extract, and high dimensionality. As ICA is a stochastic algorithm, it needs to be run several times and the results averaged in order to obtain robust results [5]. Choosing the number of independent components to extract is a hard, outstanding problem, and the convention has been to use a fixed number of components [2]. Moreover, ICA does not order its components by 'relevance'. Therefore, some authors have proposed to order them either with respect to their kurtosis values [9], or with respect to their l2 norm, or by using Bayesian frameworks to select the number of components [15]. In the case of high-dimensional data sets, PCA is often applied as a pre-processing step to reduce the number of dimensions [2]. In that particular case, ICA is applied to a subset of the data summarized by a small number of principal components from PCA.
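To make the notion of non-Gaussianity and kurtosis-based ordering concrete, here is a small sketch. It assumes excess kurtosis (fourth standardized moment minus 3) as the measure, so that a Gaussian signal scores close to 0 and a super-Gaussian one scores above 0; the helper `excess_kurtosis` and the simulated signals are illustrative only.

```python
import numpy as np

def excess_kurtosis(v):
    """Fourth standardized moment minus 3 (approximately 0 for Gaussian data)."""
    v = np.asarray(v, dtype=float)
    z = (v - v.mean()) / v.std()
    return float(np.mean(z ** 4) - 3.0)

rng = np.random.default_rng(0)
signals = {
    "gaussian": rng.normal(size=100_000),           # noise-like, kurtosis near 0
    "laplace": rng.laplace(size=100_000),           # super-Gaussian, heavy tails
    "uniform": rng.uniform(-1, 1, size=100_000),    # sub-Gaussian, flat
}

kurt = {name: excess_kurtosis(x) for name, x in signals.items()}

# Rank the signals from most to least super-Gaussian.
ranked = sorted(kurt, key=kurt.get, reverse=True)
print(ranked)
```

Ordering components by such a score puts the heavy-tailed, potentially informative ones first and the noise-like, near-Gaussian ones last.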
In this paper, we propose to use ICA as a denoising step for PCA, since ICA is good at separating mixed signals, i.e. noise from signal. The aim is to generate denoised loading vectors. These vectors are crucial in PCA or ICA, as each of them indicates the weights assigned to each biological feature in the linear combination that leads to the component. The goal is therefore to obtain independent components that better reflect the underlying biology in a study and achieve better dimension reduction than PCA or ICA.
Independent Principal Component Analysis (IPCA) makes the assumption that biologically meaningful components can be obtained if most noise has been removed in the associated loading vectors.
In IPCA, PCA is used as a pre-processing step to reduce the dimension of the data and to generate the loading vectors. The FastICA algorithm [9] is then applied to the previously obtained PCA loading vectors, which subsequently generates the Independent Principal Components (IPCs). We use the kurtosis measure of the loading vectors to order the IPCs. We also propose a sparse variant with a built-in variable selection procedure by applying soft-thresholding on the independent loading vectors [16].
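The steps just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the FastICA step is a simplified symmetric variant with a tanh nonlinearity written in NumPy, the components are assumed to be obtained by projecting the centred data onto the independent loading vectors, and all function names, the threshold `lam` and the simulated data are hypothetical.

```python
import numpy as np

def excess_kurtosis(v):
    z = (v - v.mean()) / v.std()
    return float(np.mean(z ** 4) - 3.0)

def soft_threshold(v, lam):
    """Shrink entries towards zero; entries with |v| <= lam become exactly 0."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def fastica_rows(Z, n_iter=200, tol=1e-6, seed=0):
    """Simplified symmetric FastICA (tanh) on the rows of Z (k signals x p points)."""
    k, p = Z.shape
    Z = Z - Z.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(Z @ Z.T / p)
    Zw = (E / np.sqrt(d)) @ E.T @ Z                  # whitened signals
    rng = np.random.default_rng(seed)
    U, _, Vt = np.linalg.svd(rng.normal(size=(k, k)))
    W = U @ Vt                                       # random orthogonal start
    for _ in range(n_iter):
        G = np.tanh(W @ Zw)
        W_new = G @ Zw.T / p - np.diag((1.0 - G ** 2).mean(axis=1)) @ W
        U, _, Vt = np.linalg.svd(W_new)
        W_new = U @ Vt                               # symmetric decorrelation
        converged = np.max(np.abs(np.abs(np.sum(W_new * W, axis=1)) - 1.0)) < tol
        W = W_new
        if converged:
            break
    return W @ Zw

def ipca_sketch(X, n_components, lam=0.0):
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)    # PCA loading vectors
    S = fastica_rows(Vt[:n_components])                  # independent loadings (k, p)
    S /= np.linalg.norm(S, axis=1, keepdims=True)
    order = np.argsort([-excess_kurtosis(s) for s in S]) # highest kurtosis first
    L = S[order].T                                       # (p, k)
    if lam > 0:
        L = soft_threshold(L, lam)                       # sparse variant
    return Xc @ L, L

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 200))
X[:15, :10] += 3.0                                       # simulated group effect
ipcs, loadings = ipca_sketch(X, n_components=3)
_, sparse_loadings = ipca_sketch(X, n_components=3, lam=0.1)
print(ipcs.shape, int((sparse_loadings != 0).sum(axis=0).max()))
```

In the sparse variant, the features whose loadings remain non-zero after thresholding are the selected variables; the value of `lam` here is arbitrary and would need tuning in practice.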
In the 'Results and Discussion' Section, we first compare the classical PCA and ICA methodologies to IPCA in a simulation study. On three real biological datasets (microarray and metabolomics datasets), we demonstrate the satisfactory sample clustering ability of IPCA. We then illustrate the usefulness of variable selection with sIPCA and compare it with the results obtained from the sparse PCA of [18]. In the 'Methods' Section, we present the PCA, ICA and IPCA methodologies and describe how to perform variable selection with sIPCA.