Article sections

- Abstract
- 1. Introduction
- 2. Sparse CCA
- 3. Sparse multiple CCA
- 4. Sparse supervised CCA
- 5. Discussion
- References


Stat Appl Genet Mol Biol. 2009 January 1; 8(1): 28.

Published online 2009 June 9. doi: 10.2202/1544-6115.1470

PMCID: PMC2861323

Copyright © 2009 The Berkeley Electronic Press. All rights reserved


In recent work, several authors have introduced methods for sparse canonical correlation analysis (sparse CCA). Suppose that two sets of measurements are available on the same set of observations. Sparse CCA is a method for identifying sparse linear combinations of the two sets of variables that are highly correlated with each other. It has been shown to be useful in the analysis of high-dimensional genomic data, when two sets of assays are available on the same set of samples. In this paper, we propose two extensions to the sparse CCA methodology. (1) Sparse CCA is an unsupervised method; that is, it does not make use of outcome measurements that may be available for each observation (e.g., survival time or cancer subtype). We propose an extension to sparse CCA, which we call sparse supervised CCA, which results in the identification of linear combinations of the two sets of variables that are correlated with each other and associated with the outcome. (2) It is becoming increasingly common for researchers to collect data on more than two assays on the same set of samples; for instance, SNP, gene expression, and DNA copy number measurements may all be available. We develop sparse multiple CCA in order to extend the sparse CCA methodology to the case of more than two data sets. We demonstrate these new methods on simulated data and on a recently published and publicly available diffuse large B-cell lymphoma data set.

*Canonical correlation analysis* (CCA), due to Hotelling (1936), is a classical method for determining the relationship between two sets of variables. Given two data sets **X**_{1} and **X**_{2} of dimensions *n* × *p*_{1} and *n* × *p*_{2} on the same set of *n* observations, CCA seeks linear combinations of the variables in **X**_{1} and the variables in **X**_{2} that are maximally correlated with each other. That is, **w**_{1} ∈ $\mathbb{\text{R}}$^{p1} and **w**_{2} ∈ $\mathbb{\text{R}}$^{p2} maximize the *CCA criterion*, given by

$${\text{maximize}}_{{\text{w}}_{1},{\text{w}}_{2}}\ {\text{w}}_{1}^{T}{\text{X}}_{1}^{T}{\text{X}}_{2}{\text{w}}_{2}\ \text{subject to}\ {\text{w}}_{1}^{T}{\text{X}}_{1}^{T}{\text{X}}_{1}{\text{w}}_{1}={\text{w}}_{2}^{T}{\text{X}}_{2}^{T}{\text{X}}_{2}{\text{w}}_{2}=1,$$

(1)

where we assume that the columns of **X**_{1} and **X**_{2} have been standardized to have mean zero and standard deviation one. In this paper, we will refer to **w**_{1} and **w**_{2} as the canonical vectors (or weights), and we will refer to **X**_{1}**w**_{1} and **X**_{2}**w**_{2} as the canonical variables.
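As an illustration, the criterion (1) can be solved by whitening each data matrix and taking the leading singular vectors of the whitened cross-product. The numpy sketch below (the function name `cca_first_pair` is ours, and it assumes *n* > *p*_{1}, *p*_{2} so that the whitening matrices exist) demonstrates this on simulated data with a shared latent factor.

```python
import numpy as np

def cca_first_pair(X1, X2):
    """First canonical vectors for criterion (1); columns of X1, X2 are
    assumed already standardized, and n must exceed p1 and p2."""
    def inv_sqrt(S):
        # (S)^{-1/2} for a symmetric positive definite matrix, via eigh
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

    K1 = inv_sqrt(X1.T @ X1)
    K2 = inv_sqrt(X2.T @ X2)
    # Leading singular vectors of the whitened cross-product matrix
    U, s, Vt = np.linalg.svd(K1 @ X1.T @ X2 @ K2)
    w1 = K1 @ U[:, 0]      # satisfies w1^T X1^T X1 w1 = 1
    w2 = K2 @ Vt[0, :]     # satisfies w2^T X2^T X2 w2 = 1
    return w1, w2, s[0]    # s[0] is the canonical correlation

# Toy data: both matrices driven by a common latent factor Z
rng = np.random.default_rng(0)
n, p1, p2 = 100, 5, 4
Z = rng.standard_normal((n, 1))
X1 = Z + 0.5 * rng.standard_normal((n, p1))
X2 = Z + 0.5 * rng.standard_normal((n, p2))
X1 = (X1 - X1.mean(0)) / X1.std(0)
X2 = (X2 - X2.mean(0)) / X2.std(0)
w1, w2, rho = cca_first_pair(X1, X2)
```

The returned objective value equals the correlation between the canonical variables **X**_{1}**w**_{1} and **X**_{2}**w**_{2}, and both constraints in (1) hold at the solution.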

In recent years, CCA has gained popularity as a method for the analysis of genomic data. It has become common for researchers to perform multiple assays on the same set of patient samples; for instance, DNA copy number (or comparative genomic hybridization, CGH), gene expression, and single nucleotide polymorphism (SNP) data might all be available. Examples of studies involving two or more genomic assays on the same set of samples include Hyman et al. (2002), Pollack et al. (2002), Morley et al. (2004), Stranger et al. (2005), and Stranger et al. (2007). In the case of, say, DNA copy number and gene expression measurements on a single set of patient samples, one might wish to perform CCA in order to identify genes whose expression is correlated with regions of genomic gain or loss. However, genomic data is characterized by the fact that the number of features generally greatly exceeds the number of observations; for this reason, CCA cannot be applied directly.

To circumvent this problem, Parkhomenko et al. (2007), Waaijenborg et al. (2008), Parkhomenko et al. (2009), Le Cao et al. (2009), and Witten et al. (2009) have proposed methods for *penalized CCA*. In this paper, we will restrict ourselves to the criterion proposed in Witten et al. (2009), which takes the form

$$\begin{array}{l}{\text{maximize}}_{{\text{w}}_{1},\hspace{0.17em}{\text{w}}_{2}}\ {\text{w}}_{1}^{T}{\text{X}}_{1}^{T}{\text{X}}_{2}{\text{w}}_{2}\\ \text{subject to}\ \parallel {\text{w}}_{1}{\parallel}^{2}\le 1,\parallel {\text{w}}_{2}{\parallel}^{2}\le 1,{P}_{1}({\text{w}}_{1})\le {c}_{1},{P}_{2}({\text{w}}_{2})\le {c}_{2},\end{array}$$

(2)

where *P*_{1} and *P*_{2} are convex penalty functions. Since *P*_{1} and *P*_{2} are generally chosen to yield **w**_{1} and **w**_{2} sparse, we call this the *sparse CCA criterion*. This criterion follows from applying penalties to **w**_{1} and **w**_{2} and also from assuming that the covariance matrix of the features is diagonal; that is, we replace ${\text{w}}_{1}^{T}{\text{X}}_{1}^{T}{\text{X}}_{1}{\text{w}}_{1}$ and ${\text{w}}_{2}^{T}{\text{X}}_{2}^{T}{\text{X}}_{2}{\text{w}}_{2}$ in the CCA criterion with ${\text{w}}_{1}^{T}{\text{w}}_{1}$ and ${\text{w}}_{2}^{T}{\text{w}}_{2}$. The sparse CCA criterion yields unique **w**_{1} and **w**_{2}, even when *p*_{1}, *p*_{2} ≫ *n*, for appropriate choices of *P*_{1} and *P*_{2}.

It has been shown that sparse CCA can be used to identify genes that have expression that is correlated with regions of DNA copy number change (Waaijenborg et al. 2008, Witten et al. 2009), to identify genes that have expression that is correlated with SNPs (Parkhomenko et al. 2009), and to identify sets of genes on two different microarray platforms that have correlated expression (Le Cao et al. 2009). However, some questions remain:

- Sometimes, in addition to data matrices **X**_{1} ∈ $\mathbb{\text{R}}$^{n×p1} and **X**_{2} ∈ $\mathbb{\text{R}}$^{n×p2}, a vector of outcome measurements in $\mathbb{\text{R}}$^{n} is also available. For instance, a survival time might be known for each patient. CCA and sparse CCA are *unsupervised* methods; that is, they do not make use of an outcome. However, if outcome measurements are available, then one might seek sets of variables in the two data sets that are correlated with each other and associated with the outcome.
- More than two sets of variables on the same set of observations might be available. For instance, it is becoming increasingly common for researchers to collect gene expression, SNP, and DNA copy number measurements on the same set of patient samples. In this case, an extension of sparse CCA to the case of more than two data sets is required.

In this paper, we develop extensions to sparse CCA that address these situations and others.

The rest of this paper is organized as follows. Section 2 contains methods for sparse CCA when the data consist of matrices **X**_{1} and **X**_{2}. In Section 2.1, we present details of the sparse CCA method from Witten et al. (2009), and in Section 2.2, we explain the connections between that method and those of Waaijenborg et al. (2008), Le Cao et al. (2009), and Parkhomenko et al. (2009). The remainder of Section 2 contains some extensions of sparse CCA for two sets of features on a single set of observations. Section 3 contains an explanation of *sparse multiple CCA*, an extension of sparse CCA to the case of *K* data sets **X**_{1}, ..., **X**_{K} with features on a single set of samples. In Section 4, we present *sparse supervised CCA*, an extension of sparse CCA that makes use of an available outcome, and Section 5 contains the discussion.

The sparse CCA criterion was given in Equation (2) for general penalty functions *P*_{1} and *P*_{2}. We will be interested in two specific forms of these penalty functions:

- *P*_{1} is an *L*_{1} (or *lasso*) penalty; that is, *P*_{1}(**w**_{1}) = ||**w**_{1}||_{1}. This penalty will result in **w**_{1} sparse for *c*_{1} chosen appropriately. We assume that $1\le {c}_{1}\le \sqrt{{p}_{1}}$.
- *P*_{1} is a *fused lasso* penalty (see e.g. Tibshirani et al. 2005), of the form *P*_{1}(**w**_{1}) = ∑_{j} |*w*_{1j}| + ∑_{j} |*w*_{1j} – *w*_{1(j–1)}|. This penalty will result in **w**_{1} sparse and smooth, and is intended for cases in which the features in **X**_{1} have a natural ordering along which smoothness is expected.

In order to indicate the form of the penalties *P*_{1} and *P*_{2} in use, we will refer to the method as sparse CCA(*P*_{1}, *P*_{2}). That is, if both penalties are *L*_{1}, then we will call this sparse CCA(*L*_{1}, *L*_{1}), and if *P*_{1} is an *L*_{1} penalty and *P*_{2} a fused lasso penalty, then we will call it sparse CCA(*L*_{1}, FL) (where “FL” indicates fused lasso). Note that when *P*_{1} and *P*_{2} are *L*_{1} or fused lasso penalties, the resulting canonical vectors are unique, even when *p*_{1}, *p*_{2} ≫ *n*. Witten et al. (2009) propose the use of sparse CCA(*L*_{1}, FL) in the case where **X**_{1} corresponds to gene expression measurements and **X**_{2} corresponds to copy number measurements (ordered by position along the chromosomes); this is related to the proposal of Tibshirani & Wang (2008) for estimating copy number for a single CGH sample.

Now, consider the criterion (2) with *P*_{1} and *P*_{2} convex penalty functions. With **w**_{1} fixed, the criterion is convex in **w**_{2}, and with **w**_{2} fixed, it is convex in **w**_{1}. The objective function of this *biconvex* criterion increases in each step of a simple iterative algorithm.

- Initialize **w**_{2} to have *L*_{2} norm 1.
- Iterate the following two steps until convergence:
  - **w**_{1} ← arg max_{w1} ${\text{w}}_{1}^{T}{\text{X}}_{1}^{T}{\text{X}}_{2}{\text{w}}_{2}$ subject to ||**w**_{1}||^{2} ≤ 1, *P*_{1}(**w**_{1}) ≤ *c*_{1}.
  - **w**_{2} ← arg max_{w2} ${\text{w}}_{1}^{T}{\text{X}}_{1}^{T}{\text{X}}_{2}{\text{w}}_{2}$ subject to ||**w**_{2}||^{2} ≤ 1, *P*_{2}(**w**_{2}) ≤ *c*_{2}.

If *P*_{1} is an *L*_{1} penalty, then the update has the form

$${\text{w}}_{1}\leftarrow \frac{S({\text{X}}_{1}^{T}{\text{X}}_{2}{\text{w}}_{2},{\Delta}_{1})}{\parallel S({\text{X}}_{1}^{T}{\text{X}}_{2}{\text{w}}_{2},{\Delta}_{1}){\parallel}_{2}},$$

(3)

where Δ_{1} = 0 if this results in ||**w**_{1}||_{1} ≤ *c*_{1}; otherwise, Δ_{1} > 0 is chosen so that ||**w**_{1}||_{1} = *c*_{1}. Here, *S*(·) denotes the soft-thresholding operator; that is, *S*(*a*, *c*) = sgn(*a*)(|*a*| – *c*)_{+}. Soft-thresholding arises in the update due to the *L*_{1} penalty and the assumption that the covariance matrix of each set of features is diagonal. Δ_{1} can be chosen by a binary search. If *P*_{1} is instead a fused lasso penalty, then a slightly modified version of the sparse CCA criterion yields the update step

$${\text{w}}_{1}\leftarrow {\text{argmin}}_{{\text{w}}_{1}}\{\frac{1}{2}\parallel {\text{X}}_{1}^{T}{\text{X}}_{2}{\text{w}}_{2}-{\text{w}}_{1}{\parallel}^{2}+{\lambda}_{1}\sum _{j}\left|{w}_{1j}\right|+{\lambda}_{2}\sum _{j}|{w}_{1j}-{w}_{1(j-1)}|\},$$

(4)

which can be computed using software implementing fused lasso regression. **w**_{2} can be updated analogously.
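The *L*_{1} update (3), including the binary search for Δ_{1}, can be sketched as follows in numpy. The helper names `soft` and `l1_update` are ours; the sketch assumes 1 ≤ *c* ≤ √*p* and a generic input vector (a unique largest entry), as in the discussion above.

```python
import numpy as np

def soft(a, c):
    """Soft-thresholding operator: S(a, c) = sgn(a)(|a| - c)_+."""
    return np.sign(a) * np.maximum(np.abs(a) - c, 0.0)

def l1_update(a, c):
    """Update (3): w = S(a, Delta)/||S(a, Delta)||_2, where a plays the
    role of X1^T X2 w2. Delta = 0 if that already gives ||w||_1 <= c;
    otherwise Delta > 0 is found by binary search so that ||w||_1 = c."""
    w = a / np.linalg.norm(a)
    if np.abs(w).sum() <= c:
        return w                       # Delta = 0 suffices
    lo, hi = 0.0, np.abs(a).max()      # ||w||_1 shrinks as Delta grows
    for _ in range(60):
        mid = (lo + hi) / 2.0
        w = soft(a, mid)
        w = w / np.linalg.norm(w)
        if np.abs(w).sum() > c:
            lo = mid
        else:
            hi = mid
    w = soft(a, hi)                    # hi end of the bracket is feasible
    return w / np.linalg.norm(w)
```

Alternating this update between **w**_{1} (with input **X**_{1}^{T}**X**_{2}**w**_{2}) and **w**_{2} (with input **X**_{2}^{T}**X**_{1}**w**_{1}) gives the full iterative algorithm; smaller *c* yields sparser canonical vectors.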

Methods for selecting tuning parameter values and assessing significance of the resulting canonical vectors are presented in Appendix A. The above algorithm is easily extended to obtain multiple canonical vectors, as described in Witten et al. (2009) and summarized in Appendix B. However, to simplify interpretation of the examples presented in this paper, we will only consider the first canonical vectors **w**_{1} and **w**_{2}, as given in the criterion (2).

This paper extends the sparse CCA proposal of Witten et al. (2009). As mentioned earlier, the Witten et al. (2009) method is closely related to a number of other methods for sparse CCA. We briefly review those methods here.

Waaijenborg et al. (2008) first recast classical CCA as an iterative regression procedure; then an elastic net penalty is applied in order to obtain penalized canonical vectors. An approximation of the iterative elastic net procedure results in an algorithm that is similar to that of Witten et al. (2009) in the case of *L*_{1} penalties on **w**_{1} and **w**_{2}. However, Waaijenborg et al. (2008) do not appear to be exactly optimizing a criterion.

Parkhomenko et al. (2009) develop an iterative algorithm for estimating the singular vectors of ${\text{X}}_{1}^{T}{\text{X}}_{2}$. At each step, they regularize the estimates of the singular vectors by soft-thresholding. Though they do not explicitly state a criterion, it appears that they are approximately optimizing a criterion that is related to (2) with *L*_{1} penalties. However, they use the Lagrange form, rather than the bound form, of the constraints on **w**_{1} and **w**_{2}. Their algorithm is closely related to that of Witten et al. (2009), though extra normalization steps are required due to computational problems with the Lagrange form of the constraints. The algorithm of Le Cao et al. (2009) is also closely related to those of Parkhomenko et al. (2009) and Witten et al. (2009), though again Le Cao et al. (2009) use the Lagrange form, rather than the bound form, of the penalties.

Hence, the Waaijenborg et al. (2008), Parkhomenko et al. (2009), Le Cao et al. (2009) and Witten et al. (2009) methods are all closely related; we pursue the criterion (2) in this paper.

The sparse CCA method will result in canonical vectors **w**_{1} and **w**_{2} that are sparse, if the penalties *P*_{1} and *P*_{2} are chosen appropriately. However, the nonzero elements of **w**_{1} and **w**_{2} may be of different signs. In some cases, one might seek a sparse weighted average of the features in **X**_{1} that is correlated with a sparse weighted average of the features in **X**_{2}. Then one will want to additionally restrict the elements of **w**_{1} and **w**_{2} to be nonnegative (or nonpositive). If we require the elements of **w**_{1} and **w**_{2} to be nonnegative, the sparse CCA criterion becomes

$$\begin{array}{ll}{\text{maximize}}_{{\text{w}}_{1},{\text{w}}_{2}}& {\text{w}}_{1}^{T}{\text{X}}_{1}^{T}{\text{X}}_{2}{\text{w}}_{2}\ \text{subject to}\ \parallel {\text{w}}_{1}{\parallel}^{2}\le 1,\parallel {\text{w}}_{2}{\parallel}^{2}\le 1,\\ & {w}_{1j}\ge 0,{w}_{2j}\ge 0,{P}_{1}({\text{w}}_{1})\le {c}_{1},{P}_{2}({\text{w}}_{2})\le {c}_{2},\end{array}$$

(5)

and the resulting algorithm is as follows:

- Initialize **w**_{2} to have *L*_{2} norm 1.
- Iterate the following two steps until convergence:
  - **w**_{1} ← arg max_{w1} ${\text{w}}_{1}^{T}{\text{X}}_{1}^{T}{\text{X}}_{2}{\text{w}}_{2}$ subject to ||**w**_{1}||^{2} ≤ 1, *w*_{1j} ≥ 0, *P*_{1}(**w**_{1}) ≤ *c*_{1}.
  - **w**_{2} ← arg max_{w2} ${\text{w}}_{1}^{T}{\text{X}}_{1}^{T}{\text{X}}_{2}{\text{w}}_{2}$ subject to ||**w**_{2}||^{2} ≤ 1, *w*_{2j} ≥ 0, *P*_{2}(**w**_{2}) ≤ *c*_{2}.

Consider the criterion (5) with **w**_{1} fixed; writing ${\text{X}}_{2}^{T}{\text{X}}_{1}{\text{w}}_{1}=\text{a}$, the optimization problem for **w**_{2} is

$${\text{minimize}}_{{\text{w}}_{2}}-{\text{a}}^{T}{\text{w}}_{2}\ \text{subject to}\ \parallel {\text{w}}_{2}{\parallel}^{2}\le 1,{w}_{2j}\ge 0,{P}_{2}({\text{w}}_{2})\le {c}_{2}.$$

(6)

Assume that *P*_{2} is an *L*_{1} penalty. If *a*_{j} ≤ 0, then *w*_{2j} = 0 at the solution, since a positive *w*_{2j} cannot decrease the objective and would only use up the constraint budget. Hence, the criterion (6) reduces to

$${\text{minimize}}_{{w}_{2j}:{a}_{j}>0}-\sum _{j:{a}_{j}>0}{a}_{j}{w}_{2j}\ \text{subject to}\ \sum _{j:{a}_{j}>0}{w}_{2j}^{2}\le 1,\sum _{j:{a}_{j}>0}\left|{w}_{2j}\right|\le {c}_{2}.$$

(7)

This can be solved using the following update for **w**_{2}:

$${\text{w}}_{2}\leftarrow \frac{S({({\text{X}}_{2}^{T}{\text{X}}_{1}{\text{w}}_{1})}_{+}{,\Delta}_{2})}{\parallel S({({\text{X}}_{2}^{T}{\text{X}}_{1}{\text{w}}_{1})}_{+},{\Delta}_{2}){\parallel}_{2}},$$

(8)

where Δ_{2} = 0 if this results in *||***w**_{2}*||*_{1} ≤ *c*_{2}; otherwise, Δ_{2} *>* 0 is chosen so that *||***w**_{2}*||*_{1} = *c*_{2}. An analogous update step can be derived for **w**_{1} if *P*_{1} is an *L*_{1} penalty.
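A minimal numpy sketch of the nonnegative update (8); the function name `nonneg_update` is ours, and the sketch assumes *c* ≥ 1 and that the input vector has at least one positive entry (otherwise the update is degenerate).

```python
import numpy as np

def nonneg_update(a, c):
    """Update (8): negative entries of a = X2^T X1 w1 are dropped, and the
    positive part is soft-thresholded and renormalized; Delta is chosen by
    binary search so that ||w||_1 <= c."""
    a = np.maximum(a, 0.0)            # w_{2j} = 0 wherever a_j <= 0
    w = a / np.linalg.norm(a)
    if w.sum() <= c:                  # entries nonnegative: sum = L1 norm
        return w                      # Delta = 0 suffices
    lo, hi = 0.0, a.max()
    for _ in range(60):
        mid = (lo + hi) / 2.0
        w = np.maximum(a - mid, 0.0)  # soft-thresholding of nonnegative a
        w = w / np.linalg.norm(w)
        if w.sum() > c:
            lo = mid
        else:
            hi = mid
    w = np.maximum(a - hi, 0.0)
    return w / np.linalg.norm(w)
```

By construction the output has unit *L*_{2} norm, nonnegative entries, and *L*_{1} norm at most *c*, matching the constraints in (5).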

We demonstrate the sparse CCA method on the lymphoma data set of Lenz et al. (2008), which consists of gene expression and array CGH measurements on 203 patients with DLBCL. The data set is publicly available at http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE11318. There are 17350 gene expression measurements and 386165 copy number measurements. (In the raw data set, more gene expression measurements are available. However, we limited the analysis to genes for which we knew the chromosomal location, and we averaged expression measurements for genes for which multiple measurements were available.) For computational reasons, sets of adjacent CGH spots on each chromosome were averaged before all analyses were performed. In previous research, gene expression profiling has been used to define three subtypes of DLBCL, called germinal center B-cell-like (GCB), activated B-cell-like (ABC), and primary mediastinal B-cell lymphoma (PMBL) (Alizadeh et al. 2000, Rosenwald et al. 2002). For each of the 203 observations, survival time and DLBCL subtype are known.

For chromosome *i*, we performed sparse CCA(*L*_{1}, FL) using **X**_{1} equal to expression data of genes on all chromosomes and **X**_{2} equal to DNA copy number data on chromosome *i*. Tuning parameter values were chosen by permutations; details are given in Appendix A. P-values obtained using the method in Appendix A, as well as the chromosomes on which the genes corresponding to nonzero **w**_{1} weights are located, can be found in Table 1. Canonical vectors found on almost all chromosomes were significant, and for the most part, *cis* interactions were found. Cis interactions are those for which the regions of DNA copy number change and the sets of genes with correlated expression are located on the same chromosome. The presence of cis interactions is not surprising because copy number gain on a given chromosome could naturally result in increased expression of the genes that were gained.

We used the CGH and expression canonical variables as features in a multivariate Cox proportional hazards model to predict survival. Note that **X**_{1}**w**_{1} and **X**_{2}**w**_{2} are vectors in $\mathbb{\text{R}}$^{n}. We also used the canonical variables as features in a multinomial logistic regression to predict cancer subtype. The resulting p-values are shown in Table 1. The Cox proportional hazards models predicting survival from the canonical variables were not significant on most chromosomes. However, on many chromosomes, the canonical variables were highly predictive of DLBCL subtype. Boxplots showing the canonical variables as a function of DLBCL subtype are displayed in Figure 1 for chromosomes 6 and 9. For chromosome 9, Figure 2 shows the samples with the highest and lowest absolute values of the CGH canonical variable, along with the corresponding CGH canonical vector.

*Figure 1.* Sparse CCA was performed using CGH data on a single chromosome and all gene expression measurements. For chromosomes 6 and 9, the gene expression and CGH canonical variables, stratified by cancer subtype, are shown.

*Figure 2.* Sparse CCA was performed using CGH data on chromosome 9, and all gene expression measurements. The samples with the highest and lowest absolute values in the CGH canonical variable are shown, along with the canonical vector corresponding to the CGH data.

We can also compare the sparse CCA canonical variables obtained on the DLBCL data to the first principal components that arise if principal components analysis (PCA) is performed separately on the expression data and on the copy number data. PCA and sparse CCA were performed using all of the gene expression data, and the CGH data on chromosome 3; Figure 3 shows the resulting canonical variables and principal components. Sparse CCA results in CGH and expression canonical variables that are highly correlated with each other, due to the form of the sparse CCA criterion. PCA results in principal components that are far less correlated with each other, and, as a result, may yield better separation between the three subtypes. But PCA does not allow for an integrated interpretation of the expression and CGH data together.

*Figure 3.* Sparse CCA and PCA were performed using CGH data on chromosome 3, and all gene expression measurements. The resulting canonical variables and principal components are shown. The CGH and expression canonical variables are highly correlated with each other.

In this section, we assessed the association between the canonical variables found using sparse CCA and the clinical outcomes in order to determine if the results of sparse CCA have biological significance. However, in general, if a clinical outcome of interest is available, then the sparse sCCA approach of Section 4 may be appropriate.

Consider now a new setting in which we have *n* observations on *p* features, and each observation belongs to one of two classes. Let **X**_{1} denote the *n* × *p* matrix of observations by features, and let **X**_{2} be a binary *n* × 1 vector indicating class membership of each observation of **X**_{1}. In this section, we will show that sparse CCA applied to **X**_{1} and **X**_{2} yields a canonical vector **w**_{1} that is closely related to the nearest shrunken centroids solution (NSC, Tibshirani et al. 2002, Tibshirani et al. 2003).

Assume that each column of **X**_{1} has been standardized to have mean zero and pooled within-class standard deviation equal to one. NSC is a high-dimensional classification method that involves defining “shrunken” class centroids based on only a subset of the features; each test set observation is then classified to the nearest shrunken centroid. We first explain the NSC method, applied to data **X**_{1}. For 1 ≤ *k* ≤ 2, we define vectors **d*** _{k}*,

$${\text{d}}_{k}=\frac{{\overline{\text{X}}}_{1k}}{{m}_{k}},{\text{d}}_{k}^{\prime}=S({\text{d}}_{k},\delta ),{\overline{\text{X}}}_{1k}^{\prime}={m}_{k}{\text{d}}_{k}^{\prime}.$$

(9)

Here, **X̄**_{1k} is the mean vector of the features in class *k*, and *m*_{k} is a class-specific scaling factor; the tuning parameter δ ≥ 0 controls the amount of soft-thresholding, and the shrunken centroids **X̄**′_{1k} are used in place of the ordinary class means for classification.

Now, consider the effect of applying sparse CCA with *L*_{1} penalties to data **X**_{1} and **X**_{2}. Rescale **X**_{2} so that the class 1 values are $\frac{1}{{n}_{1}}$ and the class 2 values are $-\frac{1}{{n}_{2}}$. The sparse CCA criterion is

$$\begin{array}{l}{\text{maximize}}_{{\text{w}}_{1},{\text{w}}_{2}}\ {\text{w}}_{1}^{T}{\text{X}}_{1}^{T}{\text{X}}_{2}{\text{w}}_{2}\\ \text{subject to}\ \parallel {\text{w}}_{1}{\parallel}^{2}\le 1,\parallel {\text{w}}_{2}{\parallel}^{2}\le 1,\parallel {\text{w}}_{1}{\parallel}_{1}\le {c}_{1},\parallel {\text{w}}_{2}{\parallel}_{1}\le {c}_{2}.\end{array}$$

(10)

Since **w**_{2} ∈ $\mathbb{\text{R}}$^{1}, the constraints on its value result in **w**_{2} = 1. The criterion reduces to

$${\text{maximize}}_{{\text{w}}_{1}}\ {\text{w}}_{1}^{T}{\text{X}}_{1}^{T}{\text{X}}_{2}\ \text{subject to}\ \parallel {\text{w}}_{1}{\parallel}^{2}\le 1,\parallel {\text{w}}_{1}{\parallel}_{1}\le {c}_{1},$$

(11)

which can be rewritten as

$${\text{maximize}}_{{\text{w}}_{1}}\ {({\overline{\text{X}}}_{11}-{\overline{\text{X}}}_{12})}^{T}{\text{w}}_{1}\ \text{subject to}\ \parallel {\text{w}}_{1}{\parallel}^{2}\le 1,\parallel {\text{w}}_{1}{\parallel}_{1}\le {c}_{1}.$$

(12)

The solution to (12) is

$${\text{w}}_{1}=\frac{S({\overline{\text{X}}}_{11}-{\overline{\text{X}}}_{12},\Delta )}{\parallel S({\overline{\text{X}}}_{11}-{\overline{\text{X}}}_{12},\Delta ){\parallel}_{2}}=\frac{S((1+\frac{{n}_{1}}{{n}_{2}}){\overline{\text{X}}}_{11},\Delta )}{\parallel S((1+\frac{{n}_{1}}{{n}_{2}}){\overline{\text{X}}}_{11},\Delta ){\parallel}_{2}},$$

(13)

where Δ = 0 if that results in ||**w**_{1}||_{1} ≤ *c*_{1}; otherwise, Δ > 0 is chosen so that ||**w**_{1}||_{1} = *c*_{1}. So sparse CCA yields a canonical vector that is proportional to the shrunken centroid **X̄**′_{11} when the tuning parameters for NSC and sparse CCA are chosen appropriately.
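The key identity behind this equivalence can be checked numerically. In the sketch below (which uses plain column centering as a simplification of the mean-zero, pooled within-class standardization assumed in the text; all names are ours), the vector **X**_{1}^{T}**X**_{2} computed with the rescaled class indicator coincides exactly with the difference of class mean vectors, so the sparse CCA update soft-thresholds the mean difference, just as NSC does.

```python
import numpy as np

# Two-class data: n1 + n2 samples, p features, class 1 shifted upward
# in the first three features
rng = np.random.default_rng(1)
n1, n2, p = 30, 20, 10
X1 = rng.standard_normal((n1 + n2, p))
X1[:n1, :3] += 2.0
X1 = X1 - X1.mean(axis=0)     # center columns (simplified standardization)

# X2 rescaled as in the text: 1/n1 for class 1, -1/n2 for class 2
x2 = np.concatenate([np.full(n1, 1.0 / n1), np.full(n2, -1.0 / n2)])

# X1^T X2 equals the difference of the class mean vectors
a = X1.T @ x2
mean_diff = X1[:n1].mean(axis=0) - X1[n1:].mean(axis=0)

# Solution (13) with an illustrative threshold Delta = 1: a soft-thresholded,
# normalized version of the mean difference
w1 = np.sign(a) * np.maximum(np.abs(a) - 1.0, 0.0)
w1 = w1 / np.linalg.norm(w1)
```

The three shifted features survive the thresholding, while most pure-noise features are zeroed out, mirroring NSC's feature selection.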

CCA and sparse CCA can be used to perform an integrative analysis of two data sets with features on a single set of samples. But what if more than two such data sets are available? A number of approaches for generalizing CCA to more than two data sets have been proposed in the literature, and some of these extensions are summarized in Gifi (1990). We will focus on one of these proposals for multiple-set CCA.

Let the *K* data sets be denoted **X**_{1}, ..., **X**_{K}; data set **X**_{k} is an *n* × *p*_{k} matrix of measurements on the same *n* observations. The multiple CCA criterion that we consider maximizes

$$\sum _{i<j}{\text{w}}_{i}^{T}{\text{X}}_{i}^{T}{\text{X}}_{j}{\text{w}}_{j}\ \text{subject to}\ {\text{w}}_{k}^{T}{\text{X}}_{k}^{T}{\text{X}}_{k}{\text{w}}_{k}=1\ \forall k,$$

(14)

where **w**_{k} ∈ $\mathbb{\text{R}}$^{pk}. By analogy with (2), we replace each constraint ${\text{w}}_{k}^{T}{\text{X}}_{k}^{T}{\text{X}}_{k}{\text{w}}_{k}=1$ with $\parallel {\text{w}}_{k}{\parallel}^{2}\le 1$ and impose convex penalties, which yields the *sparse multiple CCA* (sparse mCCA) criterion

$${\text{maximize}}_{{\text{w}}_{1},\dots ,{\text{w}}_{K}}\sum _{i<j}{\text{w}}_{i}^{T}{\text{X}}_{i}^{T}{\text{X}}_{j}{\text{w}}_{j}\ \text{subject to}\ \parallel {\text{w}}_{i}{\parallel}^{2}\le 1,{P}_{i}({\text{w}}_{i})\le {c}_{i}\ \forall i,$$

(15)

where the *P*_{i} are convex penalty functions. Then, for *K* = 2, the criterion (15) reduces to the sparse CCA criterion (2).

It is not hard to see that just as (2) is *biconvex* in **w**_{1} and **w**_{2}, (15) is *multiconvex* in **w**_{1}, ..., **w**_{K}. That is, with **w**_{j} held fixed for all *j* ≠ *i*, the criterion is convex in **w**_{i}. This leads to the following iterative algorithm:

- For each *i*, fix an initial value of **w**_{i} ∈ $\mathbb{\text{R}}$^{pi}.
- Repeat until convergence: For each *i*, let
  $${\text{w}}_{i}\leftarrow {\text{argmax}}_{{\text{w}}_{i}}{\text{w}}_{i}^{T}{\text{X}}_{i}^{T}(\sum _{j\ne i}{\text{X}}_{j}{\text{w}}_{j})\ \text{subject to}\ \parallel {\text{w}}_{i}{\parallel}^{2}\le 1,{P}_{i}({\text{w}}_{i})\le {c}_{i}.$$
  (16)

For instance, if *P*_{i} is an *L*_{1} penalty, then the update for **w**_{i} takes the form

$${\text{w}}_{i}\leftarrow \frac{S({\text{X}}_{i}^{T}({\sum}_{j\ne i}{\text{X}}_{j}{\text{w}}_{j}),{\Delta}_{i})}{\parallel S({\text{X}}_{i}^{T}({\sum}_{j\ne i}{\text{X}}_{j}{\text{w}}_{j}),{\Delta}_{i}){\parallel}_{2}},$$

(17)

where Δ_{i} = 0 if this results in ||**w**_{i}||_{1} ≤ *c*_{i}; otherwise, Δ_{i} > 0 is chosen so that ||**w**_{i}||_{1} = *c*_{i}.
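The multiconvex iterations for (15) with *L*_{1} penalties can be sketched as follows on data generated from a model like (18) (smaller dimensions and a shared factor **u**; the function names and the fixed number of sweeps used as a stopping rule are our simplifications).

```python
import numpy as np

def soft(a, c):
    """Soft-thresholding: S(a, c) = sgn(a)(|a| - c)_+."""
    return np.sign(a) * np.maximum(np.abs(a) - c, 0.0)

def l1_step(a, c):
    """Update (17) for one data set: soft-threshold and renormalize, with
    Delta_i found by binary search so the L1 norm is at most c."""
    w = a / np.linalg.norm(a)
    if np.abs(w).sum() <= c:
        return w
    lo, hi = 0.0, np.abs(a).max()
    for _ in range(60):
        mid = (lo + hi) / 2.0
        w = soft(a, mid)
        w = w / np.linalg.norm(w)
        if np.abs(w).sum() > c:
            lo = mid
        else:
            hi = mid
    w = soft(a, hi)
    return w / np.linalg.norm(w)

def sparse_mcca(Xs, cs, n_iter=50):
    """Sparse mCCA sketch: cycle through the data sets, updating each w_i
    with the others held fixed, per the multiconvex algorithm above."""
    ws = [np.ones(X.shape[1]) / np.sqrt(X.shape[1]) for X in Xs]
    for _ in range(n_iter):
        for i, Xi in enumerate(Xs):
            rest = sum(Xs[j] @ ws[j] for j in range(len(Xs)) if j != i)
            ws[i] = l1_step(Xi.T @ rest, cs[i])
    return ws

# Three data sets sharing one latent factor u, with sparse true weights
rng = np.random.default_rng(2)
u = rng.standard_normal(50)
shapes = [(30, 6), (40, 8), (50, 10)]      # (p_i, size of true support)
Xs = []
for p, k in shapes:
    w_true = np.zeros(p)
    w_true[:k] = 1.0
    Xs.append(np.outer(u, w_true) + 0.1 * rng.standard_normal((50, p)))
ws = sparse_mcca(Xs, cs=[2.0, 2.5, 3.0])
```

With low noise, the estimated weight vectors place most of their mass on the true supports, as in the simulation described above.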

We demonstrate the performance of sparse mCCA on a simple simulated example. Data were generated according to the model

$${\text{X}}_{i}=\text{u}{\text{w}}_{i}^{T}+{\epsilon}_{i},1\le i\le 3$$

(18)

where **u** ∈ $\mathbb{\text{R}}$^{50}, **w**_{1} ∈ $\mathbb{\text{R}}$^{100}, **w**_{2} ∈ $\mathbb{\text{R}}$^{200}, and **w**_{3} ∈ $\mathbb{\text{R}}$^{300}, and each ε_{i} is a matrix of noise terms. Only the first 20, 40, and 60 elements of **w**_{1}, **w**_{2}, and **w**_{3} were nonzero, respectively. Sparse mCCA was run on this data, and the resulting estimates of **w**_{1}, **w**_{2}, and **w**_{3} are shown in Figure 4.

A permutation algorithm for selecting tuning parameter values and assessing significance of sparse mCCA can be found in Appendix A. In addition, an algorithm for obtaining multiple sparse mCCA factors is given in Appendix B.

If CGH measurements are available on a set of patient samples, then one may wonder whether copy number changes in genomic regions on separate chromosomes are correlated. For instance, certain genomic regions may tend to be coamplified or codeleted.

To answer this question for a single pair of chromosomes, we can perform sparse CCA(FL, FL) with two data sets, **X**_{1} and **X**_{2}, where **X**_{1} contains the CGH measurements on the first chromosome of interest and **X**_{2} contains the CGH measurements on the second chromosome of interest. If copy number change on the first chromosome is independent of copy number change on the second chromosome, then we expect the corresponding p-value obtained using the method of Appendix A not to be small. A small p-value indicates that copy number changes on the two chromosomes are more correlated with each other than one would expect due to chance. However, in general, there are $\binom{24}{2}=276$ pairs of chromosomes that can be tested for correlated patterns of amplification and deletion; this leads to a multiple testing problem and excessive computation. Instead, we take a different approach, using sparse mCCA. We apply sparse mCCA to data sets **X**_{1}, ..., **X**_{24}, where **X**_{i} contains the CGH measurements on chromosome *i*.

This method is applied to the DLBCL data set mentioned previously. We first denoise the samples by applying the fused lasso to each sample individually, as in Tibshirani & Wang (2008). We then perform sparse mCCA on the resulting smoothed CGH data. The canonical vectors that result are shown in Figure 5. From the figure, one can conclude that complex patterns of gain and loss tend to co-occur. It is unlikely that a single sample would display the entire pattern found; however, samples with large values in the canonical variables most likely contain some of the patterns shown in the figure.

In Section 2.4, we determined that on the lymphoma data, many of the canonical variables obtained using sparse CCA are highly associated with tumor subtype (and for some chromosomes, the canonical variables are also associated with survival time). Though an outcome was available, we took an unsupervised approach in performing sparse CCA. In this section, we will develop a framework to directly make use of an outcome in sparse CCA. Our method for *sparse supervised CCA* (sparse sCCA) is closely related to the *supervised principal components analysis* (supervised PCA) method of Bair & Tibshirani (2004) and Bair et al. (2006), and so we begin with an overview of that method.

Principal components regression (PCR; see e.g. Massy 1965) is a method for predicting an outcome **y** ∈ $\mathbb{\text{R}}$^{n} from a data matrix **X** ∈ $\mathbb{\text{R}}$^{n×p}: **y** is regressed onto the first few principal components of **X**. A drawback of PCR is that the leading principal components of **X** need not be associated with the outcome.

Bair & Tibshirani (2004) and Bair et al. (2006) propose the use of supervised PCA, which is a semisupervised approach. Their method can be described simply:

- On training data, the features that are most associated with the outcome **y** are identified.
- PCR is performed using only the features identified in the previous step.

Theoretical results regarding the performance of this method under a latent variable model are presented in Bair et al. (2006).

Suppose that a quantitative outcome is available; that is, we have **y** ∈ $\mathbb{\text{R}}$^{n} in addition to **X**_{1} and **X**_{2}.

We define the criterion for *supervised CCA* as follows:

$$\begin{array}{ll}{\text{maximize}}_{{\text{w}}_{1},{\text{w}}_{2}}& {\text{w}}_{1}^{T}{\text{X}}_{1}^{T}{\text{X}}_{2}{\text{w}}_{2}\ \text{subject to}\ \parallel {\text{w}}_{1}{\parallel}^{2}\le 1,\parallel {\text{w}}_{2}{\parallel}^{2}\le 1,\\ & {w}_{1j}=0\ \forall j\notin {Q}_{1},{w}_{2j}=0\ \forall j\notin {Q}_{2},\end{array}$$

(19)

where *Q*_{1} is the set of features in **X**_{1} that are most correlated with **y**, and *Q*_{2} is the set of features in **X**_{2} that are most correlated with **y**. The number of features in *Q*_{1} and *Q*_{2}, or alternatively the correlation threshold for features to enter *Q*_{1} and *Q*_{2}, can be treated as a tuning parameter or can simply be fixed. If **X**_{1} = **X**_{2}, then the criterion (19) simplifies to supervised PCA; that is, **w**_{1} and **w**_{2} are equal to each other and to the first principal component of the subset of the data containing only the features that are most associated with the outcome.

sCCA can be easily extended to give sparse sCCA,

$$\begin{array}{ll}{\text{maximize}}_{{\text{w}}_{1},{\text{w}}_{2}}& {\text{w}}_{1}^{T}{\text{X}}_{1}^{T}{\text{X}}_{2}{\text{w}}_{2}\ \text{subject to}\ \parallel {\text{w}}_{1}{\parallel}^{2}\le 1,\parallel {\text{w}}_{2}{\parallel}^{2}\le 1,{P}_{1}({\text{w}}_{1})\le {c}_{1},\\ & {P}_{2}({\text{w}}_{2})\le {c}_{2},{w}_{1j}=0\ \forall j\notin {Q}_{1},{w}_{2j}=0\ \forall j\notin {Q}_{2},\end{array}$$

(20)

where as usual, *P*_{1} and *P*_{2} are convex penalty functions.

We have discussed the possibility of **y** being a quantitative outcome (e.g. tumor diameter), but other options exist as well. For instance, **y** could be a time to event (e.g. a possibly censored survival time) or a class label (for instance, DLBCL subtype). Our definition of sparse sCCA must be generalized in order to accommodate other outcome types. If **y** is a survival time, then for each feature, we can compute the score statistic (or Cox score) for the univariate Cox proportional hazards model that uses that feature to predict the outcome. Only features with sufficiently high (absolute) Cox scores will be in the sets *Q*_{1} and *Q*_{2}. In the case of a multiple class outcome, only features with a sufficiently high F-statistic for a one-way ANOVA will be in *Q*_{1} and *Q*_{2}. Other outcome types could be incorporated in an analogous way. The algorithm for sparse sCCA can be written as follows:

- Let **X͂**_{1} and **X͂**_{2} denote the submatrices of **X**_{1} and **X**_{2} consisting of the features in *Q*_{1} and *Q*_{2}, where *Q*_{1} and *Q*_{2} are calculated as follows:
    - In the case of an *L*_{1} penalty on **w**_{i}, *Q*_{i} is the set of indices of the features in **X**_{i} that have the highest univariate association with the outcome.
    - In the case of a fused lasso penalty on **w**_{i}, the vector of univariate associations between the features in **X**_{i} and the outcome is smoothed using the fused lasso. The resulting smoothed vector is thresholded to obtain the desired number of nonzero coefficients; *Q*_{i} contains the indices of the coefficients that are nonzero after thresholding.
- Perform sparse CCA using data **X͂**_{1} and **X͂**_{2}.

Note that the fused lasso case is treated specially because one wishes for the features included in **X͂**_{i} to be contiguous, so that smoothness in the resulting canonical vectors is meaningful.
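As a minimal sketch of Step 1 in the *L*_{1} case for a quantitative outcome: rank the features of each data set by absolute correlation with **y** and keep the top *q*. The function name `screen_features` and its arguments are our own; the authors' implementation is in the PMA R package.

```python
import numpy as np

def screen_features(X, y, q):
    """Step 1 of the sparse sCCA algorithm (L1 case): return the sorted
    indices Q of the q features of X with the largest absolute univariate
    correlation with the quantitative outcome y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # correlation of each column of X with y
    cors = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.sort(np.argsort(-np.abs(cors))[:q])

# Toy check: features 0 and 1 drive y, so they should enter Q.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(200)
Q = screen_features(X, y, q=2)
```

Step 2 then simply runs sparse CCA on the submatrices `X1[:, Q1]` and `X2[:, Q2]`.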

We explore the performance of sparse sCCA with a quantitative outcome on a toy example. Data are generated according to the model

$${\text{X}}_{1}=\text{u}{\text{w}}_{1}^{T}+{\epsilon}_{1},{\text{X}}_{2}=\text{u}{\text{w}}_{2}^{T}+{\epsilon}_{2},\text{y}=\text{u},$$

(21)

with **u** $\in \mathbb{\text{R}}^{50}$, **w**_{1} $\in \mathbb{\text{R}}^{500}$, **w**_{2} $\in \mathbb{\text{R}}^{1000}$, *ε*_{1} $\in \mathbb{\text{R}}^{50\times 500}$, and *ε*_{2} $\in \mathbb{\text{R}}^{50\times 1000}$. Fifty elements of **w**_{1} and 100 elements of **w**_{2} are non-zero. The first canonical vectors of sparse CCA and sparse sCCA (using *L*_{1} penalties) were computed for a range of values of *c*_{1} and *c*_{2}. In Figure 6, the resulting number of true positives (features that are nonzero in **w**_{1} and **w**_{2} and also in the estimated canonical vectors) are shown on the y-axis, as a function of the number of nonzero elements of the canonical vectors. It is clear that greater numbers of true positives are obtained when the outcome is used. In Figure 7, the canonical variables obtained using sparse CCA and sparse sCCA are plotted against the outcome. The canonical variables obtained using sparse sCCA are correlated with the outcome, and those obtained using sparse CCA are not. Note that under the model (21), in the absence of noise, the canonical variables are proportional to **u**; therefore, sparse sCCA more accurately uncovers the true canonical variables.

Sparse CCA and sparse sCCA were performed on a toy example. The canonical variables obtained using sparse sCCA are highly correlated with the outcome; those obtained using sparse CCA are not.
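Data from model (21) can be generated as follows, using the dimensions stated in the text; the noise scale and the placement of the nonzero coefficients are our own assumptions, since the text does not specify them.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p1, p2 = 50, 500, 1000

# Latent variable u and sparse true canonical vectors w1, w2 (model (21)).
u = rng.standard_normal(n)
w1 = np.zeros(p1)
w1[:50] = rng.standard_normal(50)    # 50 nonzero elements of w1
w2 = np.zeros(p2)
w2[:100] = rng.standard_normal(100)  # 100 nonzero elements of w2

# X_k = u w_k^T + noise, y = u; the noise level 0.5 is our assumption.
X1 = np.outer(u, w1) + 0.5 * rng.standard_normal((n, p1))
X2 = np.outer(u, w2) + 0.5 * rng.standard_normal((n, p2))
y = u
```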

In theory, one could choose *Q*_{1} and *Q*_{2} in Step 1 of the sparse sCCA algorithm to contain fewer than *n* features; then, ordinary CCA could be performed instead of sparse CCA in Step 2. However, we recommend using a less stringent cutoff for *Q*_{1} and *Q*_{2} in Step 1, and instead performing further feature selection in Step 2 via sparse CCA.

Given **X**_{1}, **X**_{2}, and a two-class outcome **y**, one could perform sparse mCCA by treating **y** as a third data set. This would yield a different but related method for performing sparse sCCA in the case of a two-class outcome.

Note that the outcome **y** is a matrix in $\mathbb{\text{R}}^{n\times 1}$. We code the two classes (of *n*_{1} and *n*_{2} observations, respectively) as $\frac{\lambda}{{n}_{1}}$ and $-\frac{\lambda}{{n}_{2}}$. Assume that the columns of **X**_{1} and **X**_{2} have mean zero and pooled within-class standard deviation equal to one. Consider the sparse mCCA criterion with *L*_{1} penalties, applied to data sets **X**_{1}, **X**_{2}, and **y**:

$$\begin{array}{ll}{\text{maximize}}_{{\text{w}}_{1},{\text{w}}_{2},{\text{w}}_{3}}& {\text{w}}_{1}^{T}{\text{X}}_{1}^{T}{\text{X}}_{2}{\text{w}}_{2}+{\text{w}}_{1}^{T}{\text{X}}_{1}^{T}\text{y}{\text{w}}_{3}+{\text{w}}_{2}^{T}{\text{X}}_{2}^{T}\text{y}{\text{w}}_{3}\\ & \text{subject to}\;\parallel {\text{w}}_{i}{\parallel}^{2}\le 1,\;\parallel {\text{w}}_{i}{\parallel}_{1}\le {c}_{i}\;\forall i.\end{array}$$

(22)

Note that since **w**_{3} $\in \mathbb{\text{R}}^{1}$, it follows that **w**_{3} = 1. So we can re-write the criterion (22) as

$$\begin{array}{l}{\text{maximize}}_{{\text{w}}_{1},{\text{w}}_{2}}\;{\text{w}}_{1}^{T}{\text{X}}_{1}^{T}{\text{X}}_{2}{\text{w}}_{2}+{\text{w}}_{1}^{T}{\text{X}}_{1}^{T}\text{y}+{\text{w}}_{2}^{T}{\text{X}}_{2}^{T}\text{y}\\ \text{subject to}\;\parallel {\text{w}}_{1}{\parallel}^{2}\le 1,\;\parallel {\text{w}}_{2}{\parallel}^{2}\le 1,\;\parallel {\text{w}}_{1}{\parallel}_{1}\le {c}_{1},\;\parallel {\text{w}}_{2}{\parallel}_{1}\le {c}_{2}.\end{array}$$

(23)

Now, this criterion is biconvex and leads naturally to an iterative algorithm. However, this is not the approach that we take with our sparse sCCA method. Instead, notice that

$${\text{w}}_{1}^{T}{\text{X}}_{1}^{T}\text{y}=\lambda {({\overline{\text{X}}}_{11}-{\overline{\text{X}}}_{12})}^{T}{\text{w}}_{1}=\lambda \sqrt{\frac{1}{{n}_{1}}+\frac{1}{{n}_{2}}}{\text{t}}_{1}^{T}{\text{w}}_{1},$$

(24)

where ${\overline{\mathbf{\text{X}}}}_{1k}\in \mathbb{\text{R}}^{{p}_{1}}$ denotes the mean vector of the observations in class *k* for **X**_{1}, and **t**_{1} $\in \mathbb{\text{R}}^{{p}_{1}}$ is the vector of two-sample t-statistics for the features in **X**_{1}. An analogous identity holds for ${\text{w}}_{2}^{T}{\text{X}}_{2}^{T}\text{y}$ with t-statistics **t**_{2}, so the criterion (23) can be rewritten as

$$\begin{array}{l}{\text{maximize}}_{{\text{w}}_{1},{\text{w}}_{2}}\;{\text{w}}_{1}^{T}{\text{X}}_{1}^{T}{\text{X}}_{2}{\text{w}}_{2}+\lambda \sqrt{\frac{1}{{n}_{1}}+\frac{1}{{n}_{2}}}({\text{t}}_{1}^{T}{\text{w}}_{1}+{\text{t}}_{2}^{T}{\text{w}}_{2})\\ \text{subject to}\;\parallel {\text{w}}_{1}{\parallel}^{2}\le 1,\;\parallel {\text{w}}_{2}{\parallel}^{2}\le 1,\;\parallel {\text{w}}_{1}{\parallel}_{1}\le {c}_{1},\;\parallel {\text{w}}_{2}{\parallel}_{1}\le {c}_{2}.\end{array}$$

(25)

As λ increases, the elements of **w**_{1} and **w**_{2} that correspond to large |**t**_{1}| and |**t**_{2}| values increase in absolute value relative to those that correspond to smaller |**t**_{1}| and |**t**_{2}| values.
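The identity (24) can be verified directly from the class coding of **y**. Writing *x*_{ij} for the (*i, j*) element of **X**_{1},

$${\text{w}}_{1}^{T}{\text{X}}_{1}^{T}\text{y}=\sum_{j}{w}_{1j}\sum_{i}{x}_{ij}{y}_{i}=\sum_{j}{w}_{1j}\left({n}_{1}\cdot \frac{\lambda}{{n}_{1}}{\overline{X}}_{11j}-{n}_{2}\cdot \frac{\lambda}{{n}_{2}}{\overline{X}}_{12j}\right)=\lambda {({\overline{\text{X}}}_{11}-{\overline{\text{X}}}_{12})}^{T}{\text{w}}_{1},$$

and since each column of **X**_{1} has pooled within-class standard deviation one, the two-sample t-statistic for feature *j* is simply ${t}_{1j}=({\overline{X}}_{11j}-{\overline{X}}_{12j})/\sqrt{\frac{1}{{n}_{1}}+\frac{1}{{n}_{2}}}$, which gives the right-most expression in (24).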

Rather than adopting the criterion (25) for sparse sCCA, our sparse sCCA criterion results from assigning nonzero weights only to the elements of **w**_{1} and **w**_{2} corresponding to large |**t**_{1}| and |**t**_{2}|. We prefer our proposed sparse sCCA algorithm because it is simple, reduces to the supervised PCA method when **X**_{1} = **X**_{2}, and extends easily to non-binary outcomes.

We evaluate the performance of sparse sCCA on the DLBCL data set, in terms of the association of the resulting canonical variables with the survival and subtype outcomes. We repeatedly split the observations into training and test sets (75% / 25%). Let (
${\text{X}}_{1}^{\mathit{\text{train}}}$,
${\text{X}}_{2}^{\mathit{\text{train}}}$, **y*** ^{train}*) denote the training data, and let (
${\text{X}}_{1}^{\mathit{\text{test}}}$,
${\text{X}}_{2}^{\mathit{\text{test}}}$, **y*** ^{test}*) denote the test data.

As it becomes more commonplace for biomedical researchers to perform multiple assays on the same set of patient samples, methods for the integrative analysis of two or more high-dimensional data sets will become increasingly important. The sparse CCA methods previously proposed in the literature (Parkhomenko et al. 2007, Waaijenborg et al. 2008, Parkhomenko et al. 2009, Le Cao et al. 2009, Witten et al. 2009) provide an attractive framework for performing an integrative analysis of two data sets. In this paper, we have developed extensions to sparse CCA that can be used to apply the method to the case of more than two data sets, and to incorporate an outcome into the analysis.

The methods proposed in this paper will be available on CRAN as part of the PMA (Penalized Multivariate Analysis) package.

We first present a permutation-based algorithm for selection of tuning parameters and calculation of p-values for sparse CCA. Note that a number of methods have been proposed in the literature for selecting tuning parameters for sparse CCA (see e.g. Waaijenborg et al. 2008, Parkhomenko et al. 2009, Witten et al. 2009). The method proposed here has the advantage over the proposals of Waaijenborg et al. (2008) and Parkhomenko et al. (2009) that it does not require splitting a possibly small set of samples into a training set and a test set. Witten et al. (2009) present a method for tuning parameter selection for their penalized matrix decomposition; however, it does not extend in a straightforward way to the sparse CCA case.

- For each tuning parameter value *T*_{j} being considered (generally this will be a two-dimensional vector):
    - Compute **w**_{1} and **w**_{2}, the canonical vectors, using data **X**_{1} and **X**_{2} and tuning parameter *T*_{j}. Compute *d*_{j} = Cor(**X**_{1}**w**_{1}, **X**_{2}**w**_{2}).
    - For *i* = 1, ..., *N*, where *N* is some large number of permutations:
        - Permute the rows of **X**_{1} to obtain the matrix ${\text{X}}_{1}^{i}$, and compute canonical vectors ${\text{w}}_{1}^{i}$ and ${\text{w}}_{2}^{i}$ using data ${\text{X}}_{1}^{i}$ and **X**_{2} and tuning parameter *T*_{j}.
        - Compute ${d}_{j}^{i}=\text{Cor}({\text{X}}_{1}^{i}{\text{w}}_{1}^{i},{\text{X}}_{2}{\text{w}}_{2}^{i})$.
    - Calculate the p-value ${p}_{j}=\frac{1}{N}{\sum}_{i=1}^{N}{1}_{{d}_{j}^{i}\ge {d}_{j}}$.
- Choose the tuning parameter *T*_{j} corresponding to the smallest *p*_{j}. Alternatively, one can choose the tuning parameter *T*_{j} for which $({d}_{j}-\frac{1}{N}{\sum}_{i}{d}_{j}^{i})/\text{sd}({d}_{j}^{i})$ is largest, where sd(${d}_{j}^{i}$) denotes the standard deviation of ${d}_{j}^{1},\dots ,{d}_{j}^{N}$. The resulting p-value is *p*_{j}.
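The permutation procedure above can be sketched as follows. The single-factor solver here is our own simplified stand-in (a soft-thresholded power iteration, with the threshold levels playing the role of the tuning parameters), not the authors' exact algorithm, which enforces the *L*_{1} bounds (*c*_{1}, *c*_{2}) directly.

```python
import numpy as np

def soft(a, d):
    """Soft-thresholding operator S(a, d) = sign(a) * (|a| - d)_+."""
    return np.sign(a) * np.maximum(np.abs(a) - d, 0.0)

def sparse_cca(X1, X2, d1, d2, n_iter=50):
    """Simplified single-factor sparse CCA: alternating soft-thresholded
    power iterations on the cross-product matrix Y = X1^T X2."""
    Y = X1.T @ X2
    w2 = np.ones(X2.shape[1]) / np.sqrt(X2.shape[1])
    for _ in range(n_iter):
        w1 = soft(Y @ w2, d1)
        w1 /= np.linalg.norm(w1) + 1e-12
        w2 = soft(Y.T @ w1, d2)
        w2 /= np.linalg.norm(w2) + 1e-12
    return w1, w2

def permutation_tune(X1, X2, params, N=25, seed=None):
    """For each tuning parameter T_j = (d1, d2), compute
    d_j = Cor(X1 w1, X2 w2) and a permutation p-value, as in the
    algorithm above."""
    rng = np.random.default_rng(seed)
    results = []
    for d1, d2 in params:
        w1, w2 = sparse_cca(X1, X2, d1, d2)
        dj = np.corrcoef(X1 @ w1, X2 @ w2)[0, 1]
        perm = []
        for _ in range(N):
            # Permute the rows of X1 only, then re-fit.
            X1p = X1[rng.permutation(len(X1))]
            w1p, w2p = sparse_cca(X1p, X2, d1, d2)
            perm.append(np.corrcoef(X1p @ w1p, X2 @ w2p)[0, 1])
        pj = np.mean([dp >= dj for dp in perm])
        results.append((d1, d2, dj, pj))
    return results

# Toy usage: X1 and X2 share a latent factor u, so the observed canonical
# correlation should comfortably exceed most permuted ones.
rng = np.random.default_rng(2)
u = rng.standard_normal(60)
X1 = np.outer(u, rng.standard_normal(30)) + rng.standard_normal((60, 30))
X2 = np.outer(u, rng.standard_normal(40)) + rng.standard_normal((60, 40))
res = permutation_tune(X1, X2, [(0.2, 0.2)], N=10, seed=3)
```

The analogous procedures for sparse sCCA and sparse mCCA below differ only in which matrices are permuted and in the definition of *d*_{j}.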

Since multiple tuning parameters *T*_{j} are considered in the above algorithm, a strict cut-off for the p-value *p*_{j} is not appropriate without a correction for this multiple testing.

Given the above algorithm, the analogous method for selecting tuning parameters and determining significance for sparse sCCA is straightforward. For simplicity, we assume that the threshold for features to enter *Q*_{1} and *Q*_{2} in the sparse sCCA algorithm is fixed (not a tuning parameter).

- For each tuning parameter *T*_{j} being considered (generally this will be a two-dimensional vector):
    - Compute **w**_{1} and **w**_{2}, the supervised canonical vectors, using data **X**_{1}, **X**_{2}, and **y** and tuning parameter *T*_{j}. Compute *d*_{j} = Cor(**X**_{1}**w**_{1}, **X**_{2}**w**_{2}).
    - For *i* = 1, ..., *N*, where *N* is some large number of permutations:
        - Permute the rows of **X**_{1} and **X**_{2} separately to obtain the matrices ${\mathbf{\text{X}}}_{1}^{i}$ and ${\mathbf{\text{X}}}_{2}^{i}$, and compute supervised canonical vectors ${\mathbf{\text{w}}}_{1}^{i}$ and ${\mathbf{\text{w}}}_{2}^{i}$ using data ${\mathbf{\text{X}}}_{1}^{i}$, ${\mathbf{\text{X}}}_{2}^{i}$, **y**, and tuning parameter *T*_{j}.
        - Compute ${d}_{j}^{i}=\text{Cor}({\mathbf{\text{X}}}_{1}^{i}{\mathbf{\text{w}}}_{1}^{i},{\mathbf{\text{X}}}_{2}^{i}{\mathbf{\text{w}}}_{2}^{i})$.
    - Calculate the p-value ${p}_{j}=\frac{1}{N}{\sum}_{i=1}^{N}{1}_{{d}_{j}^{i}\ge {d}_{j}}$.
- Choose the tuning parameter *T*_{j} corresponding to the smallest *p*_{j}. Alternatively, one can choose the tuning parameter *T*_{j} for which $({d}_{j}-\frac{1}{N}{\sum}_{i}{d}_{j}^{i})/\text{sd}({d}_{j}^{i})$ is largest, where sd(${d}_{j}^{i}$) denotes the standard deviation of ${d}_{j}^{1},\dots ,{d}_{j}^{N}$. The resulting p-value is *p*_{j}.

Note that in the permutation step, we permute the rows of **X**_{1} and **X**_{2} without permuting **y**; this means that under the permutation null distribution, **y** is not correlated with the columns of **X**_{1} and **X**_{2}.

We can similarly use the following permutation-based algorithm to assess the significance of the canonical vectors obtained using sparse mCCA:

- For each tuning parameter *T*_{j} being considered (generally this will be a *K*-dimensional vector):
    - Compute **w**_{1}, ..., **w**_{K}, the canonical vectors, using data **X**_{1}, ..., **X**_{K} and tuning parameter *T*_{j}. Compute *d*_{j} = ∑_{s<t} Cor(**X**_{s}**w**_{s}, **X**_{t}**w**_{t}).
    - For *i* = 1, ..., *N*, where *N* is some large number of permutations:
        - Permute the rows of **X**_{1}, ..., **X**_{K} separately to obtain the matrices ${\mathbf{\text{X}}}_{1}^{i},\dots ,{\mathbf{\text{X}}}_{K}^{i}$, and compute canonical vectors ${\mathbf{\text{w}}}_{1}^{i},\dots ,{\mathbf{\text{w}}}_{K}^{i}$ using data ${\mathbf{\text{X}}}_{1}^{i},\dots ,{\mathbf{\text{X}}}_{K}^{i}$ and tuning parameter *T*_{j}.
        - Compute ${d}_{j}^{i}={\sum}_{s<t}\text{Cor}({\mathbf{\text{X}}}_{s}^{i}{\mathbf{\text{w}}}_{s}^{i},{\mathbf{\text{X}}}_{t}^{i}{\mathbf{\text{w}}}_{t}^{i})$.
    - Calculate the p-value ${p}_{j}=\frac{1}{N}{\sum}_{i=1}^{N}{1}_{{d}_{j}^{i}\ge {d}_{j}}$.
- Choose the tuning parameter *T*_{j} corresponding to the smallest *p*_{j}. Alternatively, one can choose the tuning parameter *T*_{j} for which $({d}_{j}-\frac{1}{N}{\sum}_{i}{d}_{j}^{i})/\text{sd}({d}_{j}^{i})$ is largest, where sd(${d}_{j}^{i}$) denotes the standard deviation of ${d}_{j}^{1},\dots ,{d}_{j}^{N}$. The resulting p-value is *p*_{j}.

We first review the method of Witten et al. (2009) for obtaining multiple sparse CCA canonical vectors. Note that the sparse CCA algorithm uses the cross-product matrix
$\mathbf{\text{Y}}={\mathbf{\text{X}}}_{1}^{T}{\mathbf{\text{X}}}_{2}$ and does not require knowledge of **X**_{1} and **X**_{2} individually.

- Let ${\mathbf{\text{Y}}}^{1}\leftarrow {\mathbf{\text{X}}}_{1}^{T}{\mathbf{\text{X}}}_{2}$.
- For *j* = 1, ..., *J*:
    - Compute ${\mathbf{\text{w}}}_{1}^{j}$ and ${\mathbf{\text{w}}}_{2}^{j}$ by applying the single-factor sparse CCA algorithm to data **Y**^{*j*}.
    - Set ${\mathbf{\text{Y}}}^{j+1}\leftarrow {\mathbf{\text{Y}}}^{j}-({\mathbf{\text{w}}}_{1}^{{j}^{T}}{\mathbf{\text{Y}}}^{j}{\mathbf{\text{w}}}_{2}^{j}){\mathbf{\text{w}}}_{1}^{j}{\mathbf{\text{w}}}_{2}^{{j}^{T}}$.
- ${\mathbf{\text{w}}}_{1}^{j}$ and ${\mathbf{\text{w}}}_{2}^{j}$ are the *j*th canonical vectors.
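A minimal sketch of this deflation scheme. The single-factor solver is again our own simplified stand-in (a soft-thresholded power iteration on **Y**) rather than the authors' exact algorithm; the deflation step itself follows the update above exactly. With thresholds of zero, the procedure returns successive singular-vector pairs of **Y**.

```python
import numpy as np

def soft(a, d):
    """Soft-thresholding operator S(a, d) = sign(a) * (|a| - d)_+."""
    return np.sign(a) * np.maximum(np.abs(a) - d, 0.0)

def single_factor(Y, d1, d2, n_iter=50):
    """Stand-in single-factor sparse CCA, operating (like the authors'
    algorithm) only on the cross-product matrix Y."""
    w2 = np.ones(Y.shape[1]) / np.sqrt(Y.shape[1])
    for _ in range(n_iter):
        w1 = soft(Y @ w2, d1)
        w1 /= np.linalg.norm(w1) + 1e-12
        w2 = soft(Y.T @ w1, d2)
        w2 /= np.linalg.norm(w2) + 1e-12
    return w1, w2

def multi_factor(X1, X2, J, d1=0.0, d2=0.0):
    """Obtain J factors by repeatedly deflating
    Y^{j+1} <- Y^j - (w1^T Y^j w2) w1 w2^T, as in the algorithm above."""
    Y = X1.T @ X2
    W1, W2 = [], []
    for _ in range(J):
        w1, w2 = single_factor(Y, d1, d2)
        Y = Y - (w1 @ Y @ w2) * np.outer(w1, w2)
        W1.append(w1)
        W2.append(w2)
    return np.column_stack(W1), np.column_stack(W2)

# Toy usage: extract two factors from random data.
rng = np.random.default_rng(0)
X1 = rng.standard_normal((30, 12))
X2 = rng.standard_normal((30, 15))
W1, W2 = multi_factor(X1, X2, J=2)
```

The sparse mCCA version below applies the same deflation to each of the pairwise cross-product matrices.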

To obtain *J* sparse sCCA factors, submatrices **X͂**_{1} and **X͂**_{2} are formed from the features most associated with the outcome; the algorithm for obtaining *J* sparse CCA factors is then applied to this new data.

To obtain *J* sparse mCCA factors, note that the sparse mCCA algorithm requires knowledge only of the $\left(\begin{array}{c}K\\ 2\end{array}\right)$ cross-product matrices of the form ${\mathbf{\text{X}}}_{s}^{T}{\mathbf{\text{X}}}_{t}$ with *s* < *t*, rather than the raw data **X**_{s} and **X**_{t}:

- For each 1 ≤ *s* < *t* ≤ *K*, let ${\mathbf{\text{Y}}}_{st}^{1}\leftarrow {\mathbf{\text{X}}}_{s}^{T}{\mathbf{\text{X}}}_{t}$.
- For *j* = 1, ..., *J*:
    - Compute ${\mathbf{\text{w}}}_{1}^{j},\dots ,{\mathbf{\text{w}}}_{K}^{j}$ by applying the single-factor sparse mCCA algorithm to data ${\mathbf{\text{Y}}}_{st}^{j}$ for 1 ≤ *s* < *t* ≤ *K*.
    - Set ${\mathbf{\text{Y}}}_{st}^{j+1}\leftarrow {\mathbf{\text{Y}}}_{st}^{j}-({\mathbf{\text{w}}}_{s}^{{j}^{T}}{\mathbf{\text{Y}}}_{st}^{j}{\mathbf{\text{w}}}_{t}^{j}){\mathbf{\text{w}}}_{s}^{j}{\mathbf{\text{w}}}_{t}^{{j}^{T}}$.
- ${\mathbf{\text{w}}}_{1}^{j},\dots ,{\mathbf{\text{w}}}_{K}^{j}$ are the *j*th canonical vectors.

^{*}We thank Andrew Beck, Patrick Brown, Jonathan Pollack, Robert West, and anonymous reviewers for helpful comments. Daniela Witten was supported by a National Defense Science and Engineering Graduate Fellowship. Robert Tibshirani was partially supported by National Science Foundation Grant DMS-9971405 and National Institutes of Health Contract N01-HV-28183.

- Alizadeh A, Eisen M, Davis RE, Ma C, Lossos I, Rosenwald A, Boldrick J, Sabet H, Tran T, Yu X, Powell J, Marti G, Moore T, Hudson J, Lu L, Lewis D, Tibshirani R, Sherlock G, Chan W, Greiner T, Weisenburger D, Armitage K, Warnke R, Levy R, Wilson W, Grever M, Byrd J, Botstein D, Brown P, Staudt L. ‘Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling’ Nature. 2000;403:503–511. doi: 10.1038/35000501. [PubMed] [Cross Ref]
- Bair E, Hastie T, Paul D, Tibshirani R. ‘Prediction by supervised principal components’ J Amer Statist Assoc. 2006;101:119–137. doi: 10.1198/016214505000000628. [Cross Ref]
- Bair E, Tibshirani R. ‘Semi-supervised methods to predict patient survival from gene expression data’ PLOS Biology. 2004;2:511–522. doi: 10.1371/journal.pbio.0020108. [PMC free article] [PubMed] [Cross Ref]
- Gifi A. Nonlinear multivariate analysis. Wiley, Chichester; England: 1990.
- Hotelling H. ‘Relations between two sets of variates’ Biometrika. 1936;28:321–377.
- Hyman E, Kauraniemi P, Hautaniemi S, Wolf M, Mousses S, Rozenblum E, Ringner M, Sauter G, Monni O, Elkahloun A, Kallioniemi O-P, Kallioniemi A. ‘Impact of DNA amplification on gene expression patterns in breast cancer’ Cancer Research. 2002;62:6240–6245. [PubMed]
- Le Cao K, Pascal M, Robert-Granie C, Philippe B. ‘Sparse canonical methods for biological data integration: application to a crossplatform study’ BMC Bioinformatics. 2009;10 doi: 10.1186/1471-2105-10-34. [PMC free article] [PubMed] [Cross Ref]
- Lenz G, Wright G, Emre N, Kohlhammer H, Dave S, Davis R, Carty S, Lam L, Shaffer A, Xiao W, Powell J, Rosenwald A, Ott G, Muller-Hermelink H, Gascoyne R, Connors J, Campo E, Jaffe E, Delabie J, Smeland E, Rimsza L, Fisher R, Weisenburger D, Chan W, Staudt L. ‘Molecular subtypes of diffuse large B-cell lymphoma arise by distinct genetic pathways’ Proc Natl Acad Sci. 2008;105:13520–13525. doi: 10.1073/pnas.0804295105. [PubMed] [Cross Ref]
- Massy W. ‘Principal components regression in exploratory statistical research’ Journal of the American Statistical Association. 1965;60:234–236. doi: 10.2307/2283149. [Cross Ref]
- Morley M, Molony C, Weber T, Devlin J, Ewens K, Spielman R, Cheung V. ‘Genetic analysis of genome-wide variation in human gene expression’ Nature. 2004;430:743–747. doi: 10.1038/nature02797. [PMC free article] [PubMed] [Cross Ref]
- Parkhomenko E, Tritchler D, Beyene J. ‘Genome-wide sparse canonical correlation of gene expression with genotypes’ BMC Proceedings. 2007;1:S119. doi: 10.1186/1753-6561-1-s1-s119. [PMC free article] [PubMed] [Cross Ref]
- Parkhomenko E, Tritchler D, Beyene J. ‘Sparse canonical correlation analysis with application to genomic data integration’ Statistical Applications in Genetics and Molecular Biology. 2009;8:1–34. doi: 10.2202/1544-6115.1406. [PubMed] [Cross Ref]
- Pollack J, Sorlie T, Perou C, Rees C, Jeffrey S, Lonning P, Tibshirani R, Botstein D, Borresen-Dale A, Brown P. ‘Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors’ Proceedings of the National Academy of Sciences. 2002;99:12963–12968. doi: 10.1073/pnas.162471999. [PubMed] [Cross Ref]
- Rosenwald A, Wright G, Chan WC, Connors JM, Campo E, Fisher RI, Gascoyne RD, Muller-Hermelink HK, Smeland EB, Staudt LM. ‘The use of molecular profiling to predict survival after chemotherapy for diffuse large B-cell lymphoma’ The New England Journal of Medicine. 2002;346:1937–1947. doi: 10.1056/NEJMoa012914. [PubMed] [Cross Ref]
- Stranger B, Forrest M, Clark A, Minichiello M, Deutsch S, Lyle R, Hunt S, Kahl B, Antonarakis S, Tavare S, Deloukas P, Dermitzakis E. ‘Genome-wide associations of gene expression variation in humans’ PLOS Genetics. 2005;1(6):e78. doi: 10.1371/journal.pgen.0010078. [PubMed] [Cross Ref]
- Stranger B, Forrest M, Dunning M, Ingle C, Beazley C, Thorne N, Redon R, Bird C, de Grassi A, Lee C, Tyler-Smith C, Carter N, Scherer S, Tavare S, Deloukas P, Hurles M, Dermitzakis E. ‘Relative impact of nucleotide and copy number variation on gene expression phenotypes’ Science. 2007;315:848–853. doi: 10.1126/science.1136678. [PMC free article] [PubMed] [Cross Ref]
- Tibshirani R, Hastie T, Narasimhan B, Chu G. ‘Diagnosis of multiple cancer types by shrunken centroids of gene expression’ Proc Natl Acad Sci. 2002;99:6567–6572. doi: 10.1073/pnas.082099299. [PubMed] [Cross Ref]
- Tibshirani R, Hastie T, Narasimhan B, Chu G. ‘Class prediction by nearest shrunken centroids, with applications to DNA microarrays’ Statistical Science. 2003:104–117. doi: 10.1214/ss/1056397488. [Cross Ref]
- Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K. ‘Sparsity and smoothness via the fused lasso’ J Royal Statist Soc B. 2005;67:91–108. doi: 10.1111/j.1467-9868.2005.00490.x. [Cross Ref]
- Tibshirani R, Wang P. ‘Spatial smoothing and hotspot detection for CGH data using the fused lasso’ Biostatistics. 2008;9:18–29. doi: 10.1093/biostatistics/kxm013. [PubMed] [Cross Ref]
- Waaijenborg S, Verselewel de Witt Hamer P, Zwinderman A. ‘Quantifying the association between gene expressions and DNA-markers by penalized canonical correlation analysis’ Statistical Applications in Genetics and Molecular Biology. 2008;7 doi: 10.2202/1544-6115.1329. [PubMed] [Cross Ref]
- Witten D, Tibshirani R, Hastie T. ‘A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis’ Biostatistics. 2009 doi: 10.1093/biostatistics/kxp008. [PMC free article] [PubMed] [Cross Ref]

Articles from Statistical Applications in Genetics and Molecular Biology are provided here courtesy of **Berkeley Electronic Press**
