Ann Appl Stat. Author manuscript; available in PMC 2017 September 26.


PMCID: PMC5613668

NIHMSID: NIHMS861529

Department of Biostatistics, University of Pittsburgh, Pittsburgh, Pennsylvania 15261, USA

Department of Biostatistics, Human Genetics, and Computational Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15261, USA

Cancer subtype discovery is the first step toward delivering personalized
medicine to cancer patients. With the accumulation of massive multi-level omics
datasets and established biological knowledge databases, omics data integration
that incorporates rich existing biological knowledge is essential for
deciphering the biological mechanisms behind complex diseases. In this
manuscript, we propose an integrative sparse *K*-means
(IS-*K* means) approach to discover disease subtypes with the
guidance of prior biological knowledge via sparse overlapping group lasso. An
algorithm using the alternating direction method of multipliers (ADMM) is
applied for fast optimization. Simulations and three real applications in breast
cancer and leukemia are used to compare IS-*K* means with
existing methods and demonstrate its superior clustering accuracy, feature
selection, functional annotation of detected molecular features and computing
efficiency.

While cancer has long been thought of as a single type of disease, increasing evidence from modern transcriptomic studies has suggested that each specific cancer may consist of multiple subtypes with different disease mechanisms, survival rates and treatment responses. Cancer subtypes have been extensively studied, including in leukemia [Golub et al. (1999)], lymphoma [Rosenwald et al. (2002)], glioblastoma [Parsons et al. (2008); Verhaak et al. (2010)], breast cancer [Lehmann et al. (2011); Parker et al. (2009)], colorectal cancer [Sadanandam et al. (2013)] and ovarian cancer [Tothill et al. (2008)]. These subtypes usually have strong clinical relevance since they show different outcomes and might be responsive to different treatments [Abramson et al. (2015)]. However, single-cohort/single-omics (e.g., transcriptome) analysis suffers from sample size limitations and reproducibility issues [Simon et al. (2003); Simon (2005); Domany (2014)]. Over the years, large amounts of omics data have accumulated in public databases and repositories, for example, The Cancer Genome Atlas (TCGA) http://cancergenome.nih.gov, Gene Expression Omnibus (GEO) http://www.ncbi.nlm.nih.gov/geo/ and Sequence Read Archive (SRA) http://www.ncbi.nlm.nih.gov/sra, just to name a few. These datasets provide unprecedented opportunities to reveal cancer mechanisms by combining multiple cohorts or multiple levels of omics data types (a.k.a. horizontal omics meta-analysis and vertical omics integrative analysis; see below) [Tseng, Ghosh and Feingold (2012)]. Omics integrative analysis has been found successful in many applications (e.g., breast cancer [Koboldt et al. (2012)], stomach cancer [Bass et al. (2014)]). On the other hand, a tremendous amount of biological information has been accumulated in public databases. Proper use of this prior information (e.g., pathway information and miRNA target gene databases) can greatly guide the modeling of omics integrative analysis.

In the literature, researchers have applied various types of clustering
methods for high-throughput experimental data (e.g., microarray) to identify novel
disease subtypes. Popular methods include hierarchical clustering [Eisen et al. (1998)],
*K*-means [Dudoit and
Fridlyand (2002)], mixture model-based approaches [Xie, Pan and Shen (2008); McLachlan, Bean and Peel (2002)] and
nonparametric approaches [Qin
(2006)], for the analysis of a single transcriptomic study. Resampling and
ensemble methods have been used to improve stability of the clustering analysis
[Kim et al. (2009); Swift et al. (2004)] or to pursue tight
clusters by leaving scattered samples that are different from major clusters
[Tseng (2007); Tseng and Wong (2005); Maitra and Ramler (2009)]. Witten and Tibshirani (2010) proposed a sparse *K*-means
algorithm that can effectively select gene features and perform sample clustering
simultaneously. To extend single-study techniques towards integration of multiple
omics data sets, Tseng, Ghosh and Feingold
(2012) categorized omics data integration into two major types: (A)
horizontal omics meta-analysis and (B) vertical omics integrative analysis. For
horizontal meta-analysis, multiple studies of the same omics data type (e.g.,
transcriptome) from different cohorts are combined to increase sample size and
statistical power, a strategy often used in differential expression analysis
[Ramasamy et al. (2008)],
pathway analysis [Shen and Tseng
(2010)] or subtype discovery [Huo et al. (2016)]. In contrast, vertical
integrative analysis aims to integrate multi-level omics data from the same patient
cohort (e.g., gene expression data, genome-wide profiling of somatic mutation, DNA
copy number, DNA methylation or microRNA expression from the same set of biological
samples [Richardson, Tseng and Sun
(2016)]). In this paper, we focus on vertical omics integrative
analysis for disease subtype discovery. Several methods for this purpose have been
proposed in the literature. Lock and Dunson
(2013) fitted a finite Dirichlet mixture model to perform Bayesian
consensus clustering that allows common clustering across omics types as well as
omics-type-specific clustering. The model, however, does not perform proper feature
selection, and thus is not suitable for high-dimensional omics data. Shen, Olshen and Ladanyi (2009) proposed a latent
variable factor model (namely iCluster) to cluster cancer samples by integrating
multi-omics data. The method does not incorporate prior biological knowledge and
requires extensive computing due to EM algorithm with large matrix operation. We
will use the popular iCluster method as the baseline method to compare in this
paper.

The central question we ask in this paper is: “Can we identify cancer
subtypes by simultaneously integrating multi-level omics datasets and/or utilizing
existing biological knowledge to increase accuracy and interpretation?”
Several statistical challenges will arise when we attempt to achieve this goal: (1)
If multi-level omics data are available for a given patient cohort, what kind of
method is effective to achieve robust and accurate disease subtype detection via
integrating multi-omics data? (2) Since only a small subset of intrinsic omics
features are relevant to the disease subtype characterization, how can we perform
effective feature selection in the high-dimensional integrative analysis? (3) With
the rich biological information (e.g., targeted genes of each miRNA or potential
cis-acting regulatory mechanism between copy number variation, methylation and gene
expression), how can we fully utilize the prior information to guide feature
selection and clustering? In this paper, we propose an integrative sparse
*K*-means (IS-*K* means) approach by extending the
sparse *K*-means algorithm with overlapping group lasso technique to
accommodate the three goals described above. The lasso penalty in the sparse
*K*-means method allows effective feature selection for
clustering. In the literature, (nonoverlapping) group lasso [Yuan and Lin (2006)] has been developed in a
regression setting to encourage features of the same group to be selected or
excluded together. The approach, however, has two major drawbacks: (1) it does not
allow sparsity within groups (i.e., a group of features are either all selected or
all excluded), and (2) the penalty function does not allow overlapping groups. For
the first issue, Simon et al. (2013) proposed
a sparse group lasso with both an *l*_{1} lasso penalty and a
group lasso penalty to allow sparsity within groups while the approach does not
allow overlapping groups. For the latter issue, overlapping group information from
biological knowledge is frequently encountered in many applications. In genomic
application, for example, the targeted genes of two miRNAs are often overlapped or
two pathways may contain overlapping genes. Jacob,
Obozinski and Vert (2009) proposed a duplication technique to allow
overlapping groups in regression setting while the approach does not allow sparsity
within groups. In this paper, we attempt to simultaneously overcome both
aforementioned difficulties in a clustering setting, which brings optimization
challenges beyond the duplication technique by Jacob, Obozinski and Vert (2009) and the sparse group lasso optimization
by Simon et al. (2013). In our proposed
IS-*K* means method, we will develop a novel reformulation of
*l*_{1} lasso penalty and overlapping group lasso penalty
so that a fast optimization technique using alternating direction method of
multiplier (ADMM) [Boyd et al.
(2011)] can be applied (see Section 3.4.1).

The rest of the paper is structured as follows. Section 2 gives a
motivating example. Section 3 establishes the method and optimization procedure.
Sections 4.1–4.3 comprehensively compare the proposed method with the
popular iCluster method using simulation and two breast cancer applications on
multi-level omics data. Section 4.4 provides another type of IS-*K*
means application: pathway-guided clustering on a single transcriptomic study.
Section 5 includes the final conclusion and discussion.

Figure 1(A) shows the clustering results of applying single-study
sparse *K*-means (see Section 3.1 for the detailed algorithm) to the mRNA,
methylation and copy number variation (CNV) datasets separately, using 770
samples from TCGA. As expected, they generate very different disease
subtypings without regulatory inference across mRNA, methylation and CNV. In this
example, single-study sparse *K*-means fails to consider that
different omics features belonging to the same genes are likely to contain
cis-acting regulatory mechanisms related to the disease subtypes. Figure 1(B) combines the three datasets to perform
IS-*K* means, which generates a single disease subtyping and takes
into account the prior regulatory knowledge between
mRNA, methylation and CNV. The prior knowledge can also be a pathway database (e.g.,
KEGG, BioCarta and Reactome) or a database of miRNA target predictions
(e.g., PicTar, TargetScan, DIANA-microT, miRanda, rna22 and PITA) [Witkos, Koscianska and Krzyzosiak (2011); Fan and Kurgan (2015)]. Incorporating
such prior information on feature grouping increases statistical power and
interpretability. Figure 1(C) shows a simple
example of such group prior knowledge. Pathway 1 includes mRNA1,
mRNA2, mRNA3 and mRNA6, while pathway 2 includes mRNA3, mRNA4,
mRNA5 and mRNA7. Note that mRNA3 appears in both pathways 1 and
2, which requires our algorithm to allow overlapping groups.
Our goal is to develop a sparse clustering algorithm integrating multi-level omics
datasets and the aforementioned prior regulatory knowledge via overlapping group
lasso. The algorithm is also suitable for a single omics dataset with incorporation of
prior overlapping pathway information (see the leukemia examples in Section
4.4).

Consider *X _{jq}*, the gene expression intensity of
gene *j* (1 ≤ *j* ≤ *J*) in sample *q* (1 ≤ *q* ≤ *n*), and let
*d _{pq,j}* denote the squared Euclidean distance between samples
*p* and *q* for gene *j*. The *K*-means algorithm minimizes the total
within-cluster sum of squares (WCSS):

$$\underset{C}{min}\sum _{j=1}^{J}{\mathit{\text{WCSS}}}_{j}(C)=\underset{C}{min}\sum _{j=1}^{J}\sum _{k=1}^{K}\frac{1}{{n}_{k}}\sum _{p,q\in {C}_{k}}{d}_{pq,j},$$

(3.1)

where *K* is the number of clusters, *J* is
the number of genes (features), *C* =
(*C*_{1}, *C*_{2}, …,
*C _{K}*) denotes the clustering result containing
partitions of all samples into *K* clusters, and *n _{k}* is the
number of samples in cluster *C _{k}*. Since the total sum of
squares of each gene is a constant, minimizing the WCSS is equivalent to
maximizing the between-cluster sum of squares (BCSS):

$$\underset{C}{max}\sum _{j=1}^{J}{\mathit{\text{BCSS}}}_{j}(C)=\underset{C}{max}\sum _{j=1}^{J}[\frac{1}{n}\sum _{p,q}{d}_{pq,j}-\sum _{k=1}^{K}\frac{1}{{n}_{k}}\sum _{p,q\in {C}_{k}}{d}_{pq,j}].$$

(3.2)

The lasso regularization on gene-specific weights in equation (3.2) gives the following sparse
*K*-means objective function:

$$\begin{array}{c}\underset{C,\mathbf{z}}{max}\sum _{j=1}^{J}{z}_{j}{\mathit{\text{BCSS}}}_{j}(C)\\ \text{subject to}\phantom{\rule{0.3em}{0ex}}{\Vert \mathbf{z}\Vert}_{2}\le 1,{\Vert \mathbf{z}\Vert}_{1}\le \mu ,{z}_{j}\ge 0,\forall j,\end{array}$$

(3.3)

where *z _{j}* denotes the weight for gene *j*,
**z** = (*z*_{1}, …, *z _{J}*)^{⊤} and *μ* is a tuning
parameter controlling the sparsity of the selected genes. An equivalent
penalized (Lagrangian) form of equation (3.3) is:
$$\begin{array}{c}\underset{C,\mathbf{z}}{min}-\sum _{j=1}^{J}{z}_{j}{\mathit{\text{BCSS}}}_{j}(C)+\gamma {\Vert \mathbf{z}\Vert}_{1}\\ \text{subject to}\phantom{\rule{0.3em}{0ex}}{\Vert \mathbf{z}\Vert}_{2}\le 1,{z}_{j}\ge 0,\forall j,\end{array}$$

We extend the sparse *K*-means objective function to
group-structured sparse *K*-means. Here, we consider
*J* to be the total number of features combining all levels of
omics datasets. In order to make features of different omics data types
comparable and on the same scale, we normalize *BCSS _{j}* by
the total sum of squares *TSS _{j}*:
$${R}_{j}(C)=\frac{{\mathit{\text{BCSS}}}_{j}(C)}{{\mathit{\text{TSS}}}_{j}}.$$
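As a concrete illustration, the scale-free separation ability $R_j$ can be computed directly from a data matrix and a candidate clustering. The sketch below (function name is ours, not from the paper's software) uses the centroid form of the sums of squares, which is proportional to the pairwise-distance form in equations (3.1)–(3.2), together with the identity $BCSS_j = TSS_j - WCSS_j$:

```python
import numpy as np

def separation_ability(X, labels):
    """Compute R_j = BCSS_j / TSS_j for each feature.

    X      : (n_samples, n_features) data matrix
    labels : length-n array of cluster assignments
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    grand_mean = X.mean(axis=0)
    tss = ((X - grand_mean) ** 2).sum(axis=0)       # total sum of squares per feature
    wcss = np.zeros(X.shape[1])
    for k in np.unique(labels):
        Xk = X[labels == k]
        wcss += ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)
    bcss = tss - wcss                                # BCSS_j = TSS_j - WCSS_j
    return bcss / np.where(tss > 0, tss, 1.0)        # guard against constant features
```

A perfectly separating feature yields $R_j = 1$, while a constant or purely noisy feature yields $R_j$ near 0.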

We put the overlapping group lasso penalty term
Ω(**z**) in the objective function:

$$\begin{array}{c}\underset{C,\mathbf{z}}{min}-\sum _{j=1}^{J}{z}_{j}{\mathit{R}}_{j}(C)+\gamma \alpha {\Vert \mathbf{z}\Vert}_{1}+\gamma (1-\alpha )\mathrm{\Omega}(\mathbf{z})\\ \text{subject to}\phantom{\rule{0.3em}{0ex}}{\Vert \mathbf{z}\Vert}_{2}\le 1,{z}_{j}\ge 0,\end{array}$$

(3.4)

where *γ* is the penalty tuning parameter
controlling the number of nonzero features, and *α*
∈ [0, 1] is a term controlling the balance between the
individual feature penalty and the group feature penalty. If
*α* = 1, there is no group feature penalty
term and the objective function is equivalent to the sparse *K*-means
objective function after standardizing each feature. If
*α* = 0, there is no individual feature
penalty and only the group feature penalty exists. The overlapping group lasso
penalty term is defined as

$$\Omega (\mathbf{z})=\sum _{1\le g\le {\mathcal{G}}_{0}}{w}_{g}{\Vert {\mathbf{m}}_{g}\circ \mathbf{z}\Vert}_{2},$$

where ${\mathcal{G}}_{0}$ is the number of (possibly overlapping)
feature groups from prior biological knowledge, *w _{g}*
is the group weight coefficient for group *g*, **m**_{g} is a design
vector encoding the membership of each feature in group *g* (both
specified in Section 3.3), and ∘ denotes the element-wise (Hadamard)
product.

Remark. Since different types of omics datasets may have different value
ranges and distributions, additional normalization may be needed in the
preprocessing. For example, the commonly-used beta values from methyl-seq
(defined as “methylation counts”/“total counts”)
represent the proportions of methylation and range between 0 and 1. A logit
transformation to the so-called *M*-values is closer to a Gaussian
distribution and more suitable for integration with other omics data. Similarly,
log-transformed expression intensities from microarray, log-transformed
RPKM/TPM (summarized expression values) from RNA-seq and log-ratio values of
CNV from SNP arrays have been shown to be roughly Gaussian distributed
and are proper for multi-omics integration. Another possibility is to replace the
Euclidean distance with an appropriate distance measure (e.g., Gower's
distance for binary, categorical and ordinal data, and Bray–Curtis
dissimilarity for count data). Under this scenario, equation (3.4) remains valid under such
modification, and we only need to incorporate partitioning around medoids (PAM)
[Kaufman and Rousseeuw
(1987)] instead of *K*-means in the optimization
procedure in Section 3.4.1. However, heterogeneity across different distance
measures may require different sparsity penalties, which is beyond the
scope of this paper.
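For the methylation transformation mentioned above, a minimal sketch (the helper name is ours) of the beta-to-*M*-value conversion:

```python
import numpy as np

def beta_to_m(beta, eps=1e-6):
    """Logit transform of methylation beta values to M-values,
    M = log2(beta / (1 - beta)); eps guards the 0 and 1 boundaries."""
    beta = np.clip(np.asarray(beta, dtype=float), eps, 1 - eps)
    return np.log2(beta / (1 - beta))
```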

In this section, we discuss and justify the design of the overlapping group
lasso penalty, that is, the choice of *w _{g}* and **m**_{g}.

Definition 3.1 (“Unbiased Feature Selection” principle).
Suppose equal separation ability in all intrinsic features $\mathcal{I}$ =
{*j* : *R _{j}* = *R* > 0} and *R _{j}* = 0 for all
$j\notin \mathcal{I}$. A penalty design satisfies the “unbiased
feature selection” principle if the optimal solution of **z** in
equation (3.4) assigns equal weights to all intrinsic features and zero
weights to all nonintrinsic features, regardless of the prior group
structure.
The theorem below states an overlapping group lasso penalty design that
satisfies the “unbiased feature selection” principle when all
features are intrinsic features (i.e., $\bar{\mathcal{I}}=\varnothing $).

Theorem 3.1. Consider $\Omega (\mathbf{z})={\sum}_{1\le g\le {\mathcal{G}}_{0}}{w}_{g}{\Vert {\mathbf{m}}_{g}\circ \mathbf{z}\Vert}_{2}$ and **m**_{g}
= (**m**_{g1},
…, **m**_{gj}, …,
**m**_{gJ})^{⊤} in equation (3.4). Suppose equal separation
ability for all features *R*_{1} = …
= *R _{J}* = *R* > 0, and let *h*(*j*) denote the number of
groups that contain feature *j*. If ${\mathbf{m}}_{gj}=\mathbb{I}\{j\in {\mathcal{J}}_{g}\}/\sqrt{h(j)}$ and ${w}_{g}=\sqrt{{\sum}_{j\in {\mathcal{J}}_{g}}1/h(j)}$, where ${\mathcal{J}}_{g}$ is the set of features in group *g*, then the optimal solution of equation (3.4) is ${z}_{j}=1/\sqrt{J}$ for all *j*.

Theorem 3.1 gives a design of overlapping group lasso penalty such that
given equal separation ability for all features, the feature selection is not
biased by the prior group knowledge. When all the groups are nonoverlapping,
*h*(*j*) = 1,
∀*j*, then

$$\mathrm{\Omega}(\mathbf{z})=\sum _{1\le g\le {\mathcal{G}}_{0}}\left(\sqrt{|{\mathcal{J}}_{g}|}\sqrt{\sum _{j\in {\mathcal{J}}_{g}}{z}_{j}^{2}}\right),$$

where $|{\mathcal{J}}_{g}|$ is the number
of features in group *g*, recovering the conventional (nonoverlapping)
group lasso penalty with group weights $\sqrt{|{\mathcal{J}}_{g}|}$. Allowing
overlapping groups and nonintrinsic features, the general design is

$$\begin{array}{c}{\mathbf{m}}_{gj}=\mathbb{I}\{j\in {\mathcal{J}}_{g}\}/\sqrt{h(j)},\\ {w}_{g}=\sqrt{\sum _{j\in ({\mathcal{J}}_{g}\cap \mathcal{I})}1/h(j)}.\end{array}$$

(3.5)

Theorem 3.2. Suppose the intrinsic feature set $\mathcal{I}$ =
{j : R_{j} = R > 0} and the nonintrinsic
feature set $\bar{\mathcal{I}}$ = {j : R_{j} =
0}. We further assume R > γ. The overlapping group lasso
penalty in equation (3.5)
satisfies the “unbiased feature selection” principle such that
the optimum solution of **z** from equation (3.4) is ${z}_{j}=1/\sqrt{|\mathcal{I}|}$ for $j\in \mathcal{I}$ and
z_{j} = 0 for $j\in \bar{\mathcal{I}}$.

Note that we take into account both the nonintrinsic features and the
intrinsic features in the penalty design in equation (3.5). Only intrinsic features
contribute to the group weight coefficient *w _{g}*. The
design vector **m**_{g} downweights features shared by multiple groups
through the factor $1/\sqrt{h(j)}$. For the example in Figure 1(C), where
pathway 1 contains mRNA1, mRNA2, mRNA3 and mRNA6, pathway 2 contains
mRNA3, mRNA4, mRNA5 and mRNA7, and all features are intrinsic, the penalty
becomes

$$\mathrm{\Omega}(z)=\sqrt{1+1+1/2+1}\sqrt{{z}_{1}^{2}+{z}_{2}^{2}+1/2\times {z}_{3}^{2}+{z}_{6}^{2}}+\sqrt{1/2+1+1+1}\sqrt{1/2\times {z}_{3}^{2}+{z}_{4}^{2}+{z}_{5}^{2}+{z}_{7}^{2}}.$$

Note that in our example mRNA3 is shared by pathway groups
1 and 2 (*h*(mRNA3) = 2), illustrating the overlapping
group lasso penalty.
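To make the toy example concrete, the following sketch (a hypothetical standalone computation, not the authors' code) evaluates Ω(**z**) under the design of equation (3.5), assuming all seven mRNAs are intrinsic:

```python
import math

# Toy setup from Figure 1(C): 7 mRNA features, two overlapping pathways (0-indexed).
groups = [{0, 1, 2, 5},   # pathway 1: mRNA1, mRNA2, mRNA3, mRNA6
          {2, 3, 4, 6}]   # pathway 2: mRNA3, mRNA4, mRNA5, mRNA7

def overlap_counts(groups, J):
    """h(j): number of groups containing feature j."""
    h = [0] * J
    for g in groups:
        for j in g:
            h[j] += 1
    return h

def omega(z, groups):
    """Overlapping group lasso penalty of equation (3.5), assuming all
    features are intrinsic so that w_g = sqrt(sum_{j in J_g} 1/h(j))."""
    h = overlap_counts(groups, len(z))
    total = 0.0
    for g in groups:
        w_g = math.sqrt(sum(1.0 / h[j] for j in g))
        norm = math.sqrt(sum((z[j] / math.sqrt(h[j])) ** 2 for j in g))
        total += w_g * norm
    return total
```

With **z** = (1, …, 1), each pathway term equals $\sqrt{3.5}\cdot\sqrt{3.5} = 3.5$, matching the displayed formula.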

In this section, we discuss major issues for optimization of equation (3.4). First, we introduce
transformation of equation (3.4)
such that *l*_{1} norm penalty can be absorbed in
*l*_{2} norm group penalty. Second, we introduce the
optimization procedure for the proposed objective function. Third, we discuss
how to use ADMM to optimize the weight term, which is critical and a difficult
problem since it involves both the *l*_{1} norm penalty
and overlapping group lasso penalty. Last, we discuss the stopping rule for the
optimization.

We use the fact that
*γα*‖**z**‖_{1}
can be rewritten as $\gamma \alpha {\Vert \mathbf{z}\Vert}_{1}=\gamma \alpha {\sum}_{j=1}^{J}{\Vert {\mathbf{z}}_{j}\Vert}_{2}$, where **z**_{j} =
(0, …, 0, *z _{j}*, 0, …, 0)^{⊤} has *z _{j}* in the
*j*th position and zeros elsewhere. In other words, the
*l*_{1} penalty can be treated as *J* additional singleton
groups, so that equation (3.4) becomes

$$\underset{C,\mathbf{z}}{min}-\sum _{j=1}^{J}{z}_{j}{R}_{j}(C)+\sum _{j=1}^{J}{\Vert \gamma \alpha {\mathit{\varphi}}_{j}\circ \mathbf{z}\Vert}_{2}+\sum _{1\le g\le {\mathcal{G}}_{0}}{\Vert \gamma (1-\alpha ){\mathbf{m}}_{g}\circ \mathbf{z}\Vert}_{2}$$

(3.6)

s.t. ‖**z**‖_{2} ≤ 1,
*z _{j}* ≥ 0, where *φ _{j}* = (0, …, 0, 1, 0,
…, 0)^{⊤} is the indicator vector with 1 in the *j*th position.
Letting $\mathcal{G}=J+{\mathcal{G}}_{0}$ be the total number of combined
groups, we define
$${\mathit{\beta}}_{g}=\{\begin{array}{cc}\gamma \alpha {\mathit{\varphi}}_{g}& \text{if}\phantom{\rule{0.3em}{0ex}}1\le g\le J,\\ \gamma (1-\alpha ){\mathbf{m}}_{g-J}& \text{if}\phantom{\rule{0.3em}{0ex}}J+1\le g\le \mathcal{G}.\end{array}$$

Therefore, we can rewrite objective function equation (3.6) as

$$\begin{array}{c}\underset{C,\mathbf{z}}{min}-\mathbf{R}{(C)}^{\top}\mathbf{z}+\sum _{1\le g\le \mathcal{G}}{\Vert {\mathit{\beta}}_{g}\circ \mathbf{z}\Vert}_{2}\\ \text{subject to}\phantom{\rule{0.3em}{0ex}}{\Vert \mathbf{z}\Vert}_{2}\le 1,{z}_{j}\ge 0,\end{array}$$

(3.7)

where **R**(*C*) =
(*R*_{1}(*C*), …,
*R _{J}*(*C*))^{⊤}. The optimization procedure iterates the
following steps:

- Initialize the weight **z** using the original sparse *K*-means method without the group lasso term.
- Given the weight **z**, use weighted *K*-means to update the cluster labels *C* [**R** is the normalized BCSS, so minimizing −**R**(*C*)^{⊤}**z** is essentially weighted *K*-means]. This is a nonconvex problem, so multiple random starts are recommended to alleviate the local minimum problem.
- Given the cluster labels *C*, **R** is fixed, so optimizing the objective function with respect to the weight **z** is a convex problem. We use ADMM in the next subsection to update the weight **z**.
- Iterate Steps 2 and 3 until convergence.

The detailed algorithm for Step 3 is outlined in Section 3.4.2 and the stopping rules of Step 3 and Step 4 are described in Section 3.4.3.
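A minimal sketch of this alternating scheme for the special case *α* = 1, where Step 3 has the closed-form soft-threshold solution of sparse *K*-means and no ADMM is needed, might look as follows (all function names are illustrative, not from the paper's software):

```python
import numpy as np

def weighted_kmeans(X, z, K, n_starts=10, n_iter=50, seed=0):
    """Step 2: Lloyd's algorithm on features scaled by sqrt(z); minimizing
    the weighted WCSS is equivalent to maximizing sum_j z_j * BCSS_j(C).
    Multiple random starts alleviate the local-minimum problem."""
    rng = np.random.default_rng(seed)
    Xw = X * np.sqrt(z)
    best_labels, best_obj = None, np.inf
    for _ in range(n_starts):
        centers = Xw[rng.choice(len(Xw), K, replace=False)]
        for _ in range(n_iter):
            d = ((Xw[:, None, :] - centers[None]) ** 2).sum(-1)
            labels = d.argmin(1)
            centers = np.array([Xw[labels == k].mean(0) if (labels == k).any()
                                else centers[k] for k in range(K)])
        obj = ((Xw - centers[labels]) ** 2).sum()
        if obj < best_obj:
            best_obj, best_labels = obj, labels
    return best_labels

def separation(X, labels):
    """R_j = BCSS_j / TSS_j per feature, given cluster labels."""
    tss = ((X - X.mean(0)) ** 2).sum(0)
    wcss = sum(((X[labels == k] - X[labels == k].mean(0)) ** 2).sum(0)
               for k in np.unique(labels))
    return (tss - wcss) / np.where(tss > 0, tss, 1.0)

def update_weights_lasso(R, gamma):
    """Step 3 for alpha = 1: soft-threshold R at gamma, then scale to the
    l2 unit sphere (closed form; no group penalty involved)."""
    z = np.maximum(R - gamma, 0.0)
    nrm = np.linalg.norm(z)
    return z / nrm if nrm > 0 else z

def sparse_kmeans(X, K, gamma, n_outer=10):
    """Alternate Steps 2 and 3 until the weights converge (Step 4)."""
    X = np.asarray(X, dtype=float)
    z = np.ones(X.shape[1]) / np.sqrt(X.shape[1])   # Step 1: uniform weights
    labels = weighted_kmeans(X, z, K)
    for _ in range(n_outer):
        z_new = update_weights_lasso(separation(X, labels), gamma)
        if np.allclose(z_new, z, atol=1e-6):
            break
        z = z_new
        labels = weighted_kmeans(X, z, K)
    return labels, z
```

In the full IS-*K* means algorithm, only `update_weights_lasso` changes: the weight update becomes the ADMM routine of Section 3.4.2.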

The alternating direction method of multipliers (ADMM) [Boyd et al. (2011)] is ideal for
solving the optimization in equation (3.7). We introduce auxiliary variables
**x**_{g} and write down the augmented
Lagrangian:

$$min-\mathbf{R}{(C)}^{\top}\mathbf{z}+\sum _{1\le g\le \mathcal{G}}{\Vert {\mathbf{x}}_{g}\Vert}_{2}+\sum _{1\le g\le \mathcal{G}}\{{\mathbf{y}}_{g}^{\top}({\mathbf{x}}_{g}-{\mathit{\beta}}_{g}\circ \mathbf{z})+\frac{\rho}{2}{\Vert {\mathbf{x}}_{g}-{\mathit{\beta}}_{g}\circ \mathbf{z}\Vert}_{2}^{2}\}$$

(3.8)

s.t. ‖**z**‖_{2} ≤ 1,
*z _{j}* ≥ 0, where **y**_{g} is the dual variable for group
*g* and *ρ* > 0 is the augmented Lagrangian parameter. The ADMM
algorithm iteratively performs the following updates:

$$\{\begin{array}{ll}{\mathbf{x}}_{g}^{+}=arg\underset{{\mathbf{x}}_{g}}{min}{\Vert {\mathbf{x}}_{g}\Vert}_{2}+{\mathbf{y}}_{g}^{\top}{\mathbf{x}}_{g}+\frac{\rho}{2}{\Vert {\mathbf{x}}_{g}-{\mathit{\beta}}_{g}\circ \mathbf{z}\Vert}_{2}^{2},& \\ {\mathbf{z}}^{+}=arg\underset{\mathbf{z}}{min}-\sum _{j}{z}_{j}{R}_{j}-\sum _{1\le g\le \mathcal{G}}{\mathbf{y}}_{g}^{\top}({\mathit{\beta}}_{g}\circ \mathbf{z})+\frac{\rho}{2}\sum _{1\le g\le \mathcal{G}}{\Vert {\mathbf{x}}_{g}^{+}-{\mathit{\beta}}_{g}\circ \mathbf{z}\Vert}_{2}^{2}& \text{subject to}\phantom{\rule{0.3em}{0ex}}{\Vert \mathbf{z}\Vert}_{2}\le 1,{z}_{j}\ge 0,\\ {\mathbf{y}}_{g}^{+}={\mathbf{y}}_{g}+\rho ({\mathbf{x}}_{g}^{+}-{\mathit{\beta}}_{g}\circ {\mathbf{z}}^{+}),& \end{array}$$

where the updating equations of ${\mathbf{x}}_{g}^{+}$ and **z**^{+} are
derived from equation (3.8)
and the updating equation of ${\mathbf{y}}_{g}^{+}$ is embedded in the ADMM procedure
[Boyd et al.
(2011)]. We can derive closed-form solutions for both the
**x**_{g} part and the **z** part:

- Define ${\mathbf{a}}_{g}={\mathit{\beta}}_{g}\circ \mathbf{z}-\frac{{\mathbf{y}}_{g}}{\rho}$. We have ${\mathbf{x}}_{g}^{+}={(1-\frac{1}{\rho {\Vert {\mathbf{a}}_{g}\Vert}_{2}})}_{+}{\mathbf{a}}_{g}$, where (·)_{+} = max(0, ·).
- Define ${b}_{j}={\sum}_{1\le g\le \mathcal{G}}\rho {\mathit{\beta}}_{gj}^{2}$ and ${c}_{j}={\sum}_{1\le g\le \mathcal{G}}(\rho {\mathbf{x}}_{gj}^{+}+{\mathbf{y}}_{gj}){\mathit{\beta}}_{gj}$, where *β*_{g} = (*β*_{g1}, *β*_{g2}, …, *β*_{gJ})^{⊤}, **x**_{g} = (**x**_{g1}, **x**_{g2}, …, **x**_{gJ})^{⊤} and **y**_{g} = (**y**_{g1}, **y**_{g2}, …, **y**_{gJ})^{⊤}. The solution is given as follows: define ${f}_{j}(u)={(\frac{{R}_{j}+{c}_{j}}{{b}_{j}+2u})}_{+}$. If ${\sum}_{j}{f}_{j}{(0)}^{2}\le 1$, then ${z}_{j}^{+}={f}_{j}(0),\forall j$. Otherwise, ${z}_{j}^{+}={f}_{j}(u),\forall j$, where *u* is selected such that ‖**z**^{+}‖_{2} = 1.
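Putting the two closed-form updates together, a sketch of the ADMM weight update follows (our own illustrative implementation, not the authors' code); `betas` stacks all $\mathcal{G}$ design vectors $\beta_g$ as rows, and we assume every feature is covered by some group so that $b_j > 0$, which holds after the reformulation of Section 3.4.1 since each feature has its own singleton *l*_{1} group:

```python
import numpy as np

def admm_weight_update(R, betas, rho=1.0, n_iter=500, tol=1e-8):
    """ADMM sketch for:  min -R^T z + sum_g ||beta_g o z||_2
                         subject to ||z||_2 <= 1, z >= 0.
    betas : (G, J) array; row g is beta_g (zero outside group g)."""
    G, J = betas.shape
    z = np.ones(J) / np.sqrt(J)
    x = betas * z                          # auxiliary variables x_g
    y = np.zeros((G, J))                   # dual variables y_g
    b = rho * (betas ** 2).sum(0)          # b_j, fixed across iterations

    def f(u, c):                           # f_j(u) = ((R_j + c_j)/(b_j + 2u))_+
        return np.maximum((R + c) / (b + 2 * u), 0.0)

    for _ in range(n_iter):
        # x-update: group-wise shrinkage, x_g = (1 - 1/(rho*||a_g||_2))_+ a_g
        a = betas * z - y / rho
        norms = np.maximum(np.linalg.norm(a, axis=1, keepdims=True), 1e-12)
        x_new = np.maximum(1 - 1 / (rho * norms), 0.0) * a
        # z-update: coordinate-wise closed form plus the l2-ball constraint
        c = ((rho * x_new + y) * betas).sum(0)
        z_new = f(0.0, c)
        if np.linalg.norm(z_new) > 1:      # bisect on u until ||z||_2 = 1
            lo, hi = 0.0, 1.0
            while np.linalg.norm(f(hi, c)) > 1:
                hi *= 2
            for _ in range(60):
                mid = (lo + hi) / 2
                lo, hi = (mid, hi) if np.linalg.norm(f(mid, c)) > 1 else (lo, mid)
            z_new = f(hi, c)
        y = y + rho * (x_new - betas * z_new)   # dual update
        if np.linalg.norm(z_new - z) < tol and np.linalg.norm(x_new - x) < tol:
            z, x = z_new, x_new
            break
        z, x = z_new, x_new
    return z
```

In the full algorithm, `betas` would stack the *J* singleton rows $\gamma\alpha\varphi_j$ with the $\mathcal{G}_0$ group rows $\gamma(1-\alpha)\mathbf{m}_g$.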

We have two algorithms which require stopping rules. For ADMM in the
optimization of Step 3, the primal residual of group *g* in
ADMM iteration *t* is ${\mathbf{r}}_{g}^{t}={\mathbf{x}}_{g}^{t}-{\mathit{\beta}}_{g}\circ {\mathbf{z}}^{t}$
and the dual residual is ${\mathit{\nu}}_{g}^{t}=\rho {\mathit{\beta}}_{g}\circ ({\mathbf{z}}^{t}-{\mathbf{z}}^{t-1})$;
the iterations stop when the norms of both residuals fall below small tolerances.

The augmented Lagrangian parameter *ρ* controls
the convergence of ADMM. In fact, a large value of *ρ*
will lead to a small primal residual by placing a large penalty on violations
of primal feasibility. Conversely, a small value of
*ρ* tends to produce a small dual residual, but it
will result in a large primal residual by reducing the penalty on primal
feasibility [Boyd et al.
(2011)]. An adaptive scheme of varying
*ρ* to balance the primal and dual residuals has
been proposed [He, Yang and Wang
(2000); Wang and Liao
(2001)], which greatly accelerates ADMM convergence in practice:

$${\rho}^{t+1}=\{\begin{array}{cc}{\tau}^{\text{incr}}{\rho}^{t}& \text{if}\phantom{\rule{0.3em}{0ex}}{\Vert {r}^{t}\Vert}_{2}>\eta {\Vert {\nu}^{t}\Vert}_{2},\\ {\rho}^{t}/{\tau}^{\text{decr}}& \text{if}\phantom{\rule{0.3em}{0ex}}{\Vert {\nu}^{t}\Vert}_{2}>\eta {\Vert {r}^{t}\Vert}_{2},\\ {\rho}^{t}& \text{otherwise}.\end{array}$$

We set *η* = 10 and
*τ*^{incr} =
*τ*^{decr} = 2. The intuition
behind this scheme is to control both primal and dual residuals for
converging to zero simultaneously.
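The adaptive scheme can be transcribed directly (function name is ours):

```python
def update_rho(rho, r_norm, nu_norm, eta=10.0, tau_incr=2.0, tau_decr=2.0):
    """One step of the adaptive scheme: inflate rho when the primal residual
    dominates, deflate it when the dual residual dominates."""
    if r_norm > eta * nu_norm:
        return rho * tau_incr
    if nu_norm > eta * r_norm:
        return rho / tau_decr
    return rho
```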

In the objective function of IS-*K* means, the number of
clusters *K* is pre-specified. The issue of estimating
*K* has been widely discussed in the literature and has been
well recognized as a difficult and data-dependent problem [Milligan and Cooper (1985); Kaufman and Rousseeuw (1990)]. Here, we
suggest the number of clusters to be estimated in each study separately using
conventional methods such as prediction strength [Tibshirani and Walther (2005)] or gap
statistics [Tibshirani, Walther and
Hastie (2001)] and jointly compared across studies (such that
the numbers of clusters are roughly the same for all studies) for a final
decision before applying integrative sparse *K*-means. Below we
assume that a common *K* is pre-estimated for all omics
datasets.

Another important parameter to be determined is
*α*, which controls the balance between the individual
feature penalty and the overlapping group penalty. According to equation (3.4), *α*
= 1 means we only emphasize the individual feature penalty and ignore the
overlapping group penalty. In this case, IS-*K* means is
equivalent to sparse *K*-means. *α*
= 0 means we only emphasize the overlapping group penalty and ignore the
individual feature penalty. Simon et al.
(2013) argued that there is no theoretically optimal selection of
*α* because the selection of *α*
relates to multiple factors, such as the accuracy of the prior group information and
the sparsity within groups. In general, a large *α* (e.g.,
*α* = 0.95) is suitable when the prior group
information may not be accurate or features within selected groups may be
sparse. On the other hand, if we expect mild sparsity within groups and high
accuracy of the prior group information, a small *α* (e.g.,
*α* = 0.05) helps select features by groups.
In Section 4.1.2, we performed simulations at different levels of prior group
information accuracy (*θ* = 1 and
*θ* = 0.2) and found that
*α* = 0.5 generates robust, high-performance
results in the sensitivity analysis. As a result, we apply
*α* = 0.5 throughout the paper unless
otherwise indicated.

The last tuning parameter is *γ*, which is the
penalty coefficient. When *γ* is large, we place large
penalty on the objective function and end up with less selected features. When
*γ* is small, we put a small penalty and will include
more features. We follow and extend the gap statistic procedure [Tibshirani, Walther and Hastie
(2001)] to estimate *γ*:

- For each feature in each omics type, randomly permute the gene expression values across samples. This creates a permuted dataset *X*^{(1)}. Repeat *B* times to generate *X*^{(1)}, *X*^{(2)}, …, *X*^{(B)}.
- For each potential tuning parameter *γ*, compute the gap statistic as below: $$\text{Gap}(\gamma )=O(\gamma )-\frac{1}{B}\sum _{b=1}^{B}{O}_{b}(\gamma ),$$ (3.9) where $O(\gamma )=-{\sum}_{j=1}^{J}{z}_{j}^{\ast}{R}_{j}({C}^{\ast})$ is computed from the observed data, with **z***, *C** being the minimizers of the objective function in equation (3.4) given *γ*; *O _{b}*(*γ*) is defined analogously but computed from the permuted data *X*^{(b)}.
- Over a range of candidate values of *γ*, select *γ** such that the gap statistic in equation (3.9) is minimized.
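The permutation procedure above can be sketched as follows, treating the IS-*K* means fit as a black-box `fit_objective(X, gamma)` that returns *O*(*γ*) (a hypothetical hook; the actual fit solves equation (3.4)):

```python
import numpy as np

def gap_statistic(X, gammas, fit_objective, B=10, seed=0):
    """Permutation-based gap statistic for selecting gamma.

    Each feature is permuted independently across samples to form the
    null datasets X^(1), ..., X^(B); the gap is the observed objective
    minus the average null objective, and the minimizing gamma is chosen.
    """
    rng = np.random.default_rng(seed)
    gaps = []
    for gamma in gammas:
        observed = fit_objective(X, gamma)
        null = []
        for _ in range(B):
            Xb = np.column_stack([rng.permutation(X[:, j])
                                  for j in range(X.shape[1])])
            null.append(fit_objective(Xb, gamma))
        gaps.append(observed - np.mean(null))
    best = gammas[int(np.argmin(gaps))]
    return best, gaps
```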

Figure 2 shows an example of a
simulated dataset that will be discussed in Section 4.1. In this example, we
used *α* = 0.5 for IS-*K* means
and the minimum gap statistics corresponded to 1778 genes, which is very close
to the underlying truth of 1800. The gap statistics for *α*
= 0.05, 0.95 and 1 are plotted in supplementary materials
[Huo and Tseng (2017), Figure S1] and
they all provided adequate *γ* estimation. In practice,
calculating gap statistics from a chain of *γ* can be
performed efficiently by adopting a warm start for adjacent
*γ*'s. For example, after calculating
*O*(*γ*_{1}), the resulting
weights can be used as an initial value for the next nearby
*γ*_{2} =
*γ*_{1} + Δ to calculate
*O*(*γ*_{2}) in the
optimization iteration for fast convergence.
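The warm-start strategy along a chain of *γ* values can be sketched as follows (the `fit_with_init` hook is hypothetical, returning the objective and fitted weights given an initial weight vector):

```python
def fit_gamma_path(gammas, fit_with_init):
    """Warm-start sketch: fit along an increasing chain of gammas, passing
    the previous solution's weights as the initial value of the next fit.
    fit_with_init(gamma, z_init) -> (objective O(gamma), fitted weights z);
    z_init=None means a cold start."""
    z = None
    path = []
    for gamma in sorted(gammas):
        objective, z = fit_with_init(gamma, z)
        path.append((gamma, objective, z))
    return path
```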

We evaluated integrative sparse *K*-means
(IS-*K* means) on simulated datasets in Section 4.1, on
multi-level omics applications using the breast cancer TCGA (combining mRNA
expression, DNA methylation and copy number variation) and METABRIC (combining mRNA
expression and copy number variation) examples in Sections 4.2 and 4.3, and on a
pathway-guided single transcriptomic application in leukemia in Section 4.4. In the
simulation, the underlying sample clusters and intrinsic feature set were known, and
we demonstrated the superior performance of IS-*K* means compared to
iCluster and sparse *K*-means in terms of clustering accuracy, feature selection
and computing time. For the TCGA and METABRIC applications, the underlying true
clustering and intrinsic feature set were not known. We evaluated the performance by
clustering similarity, using the adjusted Rand index (ARI) [Hubert and Arabie (1985)] with the subtype definition
by PAM50 [Parker et al.
(2009)], cis-regulatory groups, survival difference between clusters
and computing time. In the leukemia examples, the disease subtypes were defined by
observable fusion gene aberrations. We evaluated the performance by clustering
accuracy (ARI) and pathway enrichment analysis on selected genes.

To assess the performance of integrative sparse
*K*-means with different choices of
*α* and compare it to the original sparse
*K*-means and iCluster, we simulated *K*
= 3 subtypes characterized by several groups of subtype predictive
genes in each of *S* = 2 omics datasets, with 1
≤ *s* ≤ *S* as the omics
dataset index (e.g., *s* = 1 represents gene
expression and *s* = 2 represents DNA methylation).
The prior group information was imposed between groups of subtype predictive
genes across omics datasets. This prior group information represents the
possibility that a group of genes and DNA methylation features might be
co-regulated. To best preserve the data nature of genomic studies, we also
simulated confounding variables, correlated gene structure and
noninformative genes. Below is the generative process:

- Subtype predictive genes (intrinsic feature set).
- Denote by *N _{k}* the number of subjects in subtype *k* (1 ≤ *k* ≤ 3). We simulate *N*_{1} ~ POI(40), *N*_{2} ~ POI(40), *N*_{3} ~ POI(30), and the total number of subjects is *N* = Σ_{k} *N _{k}*. Simulate *S* = 2 omics datasets, which share the samples and subtypes. Specifically, we denote *s* = 1 to be the gene expression dataset and *s* = 2 to be the DNA methylation dataset.
- Simulate *M* = 30 feature modules (1 ≤ *m* ≤ *M*) for each omics dataset. Denote *n _{sm}* to be the number of features in omics dataset *s* and module *m*. For each module in *s* = 1, sample *n*_{1m} = 30 genes. For each module in *s* = 2, sample *n*_{2m} = 30 methylation features. Therefore, there will be a total of 1800 subtype predictive features across the two omics datasets.
- Denote by *μ _{skm}* the template gene expression (on log scale) of omics dataset *s* (1 ≤ *s* ≤ *S*), subtype *k* (1 ≤ *k* ≤ 3) and module *m* (1 ≤ *m* ≤ *M*). Simulate the template gene expression *μ _{skm}* ~ N(9, 2^{2}) with the constraint max_{p,q} |*μ _{spm}* − *μ _{sqm}*| ≥ 1, where *p*, *q* denote two subtypes. This part defines the subtype mean intensity for each module in all omics datasets. Note that since in equation (3.4) we used ${R}_{j}=\frac{{\text{BCSS}}_{j}}{{\text{TSS}}_{j}}$ for standardization, performance of the algorithm is robust to the gene expression distribution (e.g., the Gaussian assumption here).
- In order to tune the signal of the template gene expression, we introduce a relative effect size *f* > 0, such that ${\mu}_{\mathit{\text{skm}}}^{\prime}=({\mu}_{\mathit{\text{skm}}}-{min}_{k}{\mu}_{\mathit{\text{skm}}})\times f+{min}_{k}{\mu}_{\mathit{\text{skm}}}$. If *f* = 1, we do not tune the signal. If *f* < 1, we decrease the signal, and if *f* > 1, we amplify the signal.
- Add biological variation ${\sigma}_{1}^{2}=1$ to the template gene expression and simulate ${X}_{\mathit{\text{skmi}}}^{\prime}~N({\mu}_{\mathit{\text{skm}}}^{\prime},{\sigma}_{1}^{2})$ for each module *m* and subject *i* (1 ≤ *i* ≤ *N _{k}*) of subtype *k* and omics dataset *s*.
- Simulate the covariance matrix Σ_{mks} for genes in module *m*, subtype *k* and omics dataset *s*, where 1 ≤ *m* ≤ *M*, 1 ≤ *k* ≤ 3 and 1 ≤ *s* ≤ *S*. First simulate ${\sum}_{\mathit{\text{mks}}}^{\prime}~{\mathrm{W}}^{-1}(\mathrm{\Phi},100)$, where Φ = 0.5*I*_{nsm×nsm} + 0.5*J*_{nsm×nsm}, W^{−1} denotes the inverse Wishart distribution, *I* is the identity matrix and *J* is the matrix with all elements equal to 1. Then Σ_{mks} is calculated by standardizing ${\sum}_{\mathit{\text{mks}}}^{\prime}$ such that the diagonal elements are all 1's.
- Simulate gene expression levels of genes in module *m* as ${({X}_{1\mathit{\text{skmi}}},\dots ,{X}_{{n}_{\mathit{\text{sm}}}\mathit{\text{skmi}}})}^{\top}~\text{MVN}({X}_{\mathit{\text{skmi}}}^{\prime},{\sum}_{\mathit{\text{mks}}})$, where 1 ≤ *i* ≤ *N _{k}*, 1 ≤ *m* ≤ *M*, 1 ≤ *k* ≤ 3 and 1 ≤ *s* ≤ *S*.

- Noninformative genes.
- Simulate 5000 noninformative genes, indexed by *g* (1 ≤ *g* ≤ 5000), in each omics dataset. First, we generate the mean template gene expression *μ*_{sg} ~ N(9, 2^{2}). Then we add biological variance ${\sigma}_{2}^{2}=1$ to generate ${X}_{\mathit{\text{sgi}}}\sim N({\mu}_{\mathit{\text{sg}}},{\sigma}_{2}^{2})$, 1 ≤ *i* ≤ *N*_{s}.

- Confounder impacted genes.
- Simulate *C* = 2 confounding variables. In practice, confounding variables can be gender, race, other demographic factors, disease stage, etc. These add heterogeneity to each study and complicate disease subtype discovery. For each confounding variable *c* (1 ≤ *c* ≤ *C*), we simulate *R* = 10 modules in each omics dataset. For each of these modules *r*_{c} (1 ≤ *r*_{c} ≤ *R*), sample *n*_{rc} = 30 genes. Therefore, a total of 600 confounder-impacted genes are generated in each omics dataset. This procedure is repeated in all *S* omics datasets.
- For each omics dataset *s* (1 ≤ *s* ≤ *S*) and each confounding variable *c*, sample the number of confounder subclasses *h*_{sc} = *k*. The *N* samples in omics dataset *s* will be randomly divided into *h*_{sc} subclasses.
- Simulate the confounding template gene expression *μ*_{slrc} ~ N(9, 2^{2}) for confounder *c*, gene module *r*, subclass *l* (1 ≤ *l* ≤ *h*_{sc}) and omics dataset *s*. Similar to Step a5, we add biological variation ${\sigma}_{1}^{2}=1$ to the confounding template gene expression: ${X}_{\mathit{\text{scrli}}}^{\prime}\sim N({\mu}_{\mathit{\text{slrc}}},{\sigma}_{1}^{2})$. Similar to Steps a6 and a7, we simulate the gene correlation structure within modules of confounder-impacted genes.

- Gene grouping information.
- We assume omics datasets *s* = 1 and *s* = 2 have prior group information on subtype-predictive gene modules. There are *M* = 30 modules in each omics dataset.
- Suppose subtype-predictive genes in the *m*th module of the first omics dataset are grouped with methylation features in the second omics dataset (in total *n*_{1m} + *n*_{2m} = 30 + 30 = 60 features are in the same group). With probability 1 − *θ* (0 ≤ *θ* ≤ 1), each of the 60 features will be randomly replaced by a confounder-impacted gene or a noninformative gene. Note that the same replaced feature can appear in multiple subtype-predictive gene groups. We set *θ* = 1 and 0.2 to reflect 100% and 20% accuracy of the prior group information.
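
Putting Steps a3–a7 above together, the generative process for one subtype-predictive module can be sketched as follows. This is a simplified illustration with reduced dimensions (NumPy/SciPy assumed), not the authors' actual simulation code:

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(0)
K, n_genes, n_per_subtype, f = 3, 30, 20, 0.8  # subtypes, genes/module, samples/subtype, effect size

# Step a3: subtype template means, redrawn until two subtypes differ by >= 1
while True:
    mu = rng.normal(9, 2, size=K)
    if np.ptp(mu) >= 1:  # max_{p,q} |mu_p - mu_q| >= 1
        break

# Step a4: tune the signal by the relative effect size f
mu_f = (mu - mu.min()) * f + mu.min()

# Step a6: within-module correlation from a standardized inverse Wishart draw
Phi = 0.5 * np.eye(n_genes) + 0.5 * np.ones((n_genes, n_genes))
W = invwishart.rvs(df=100, scale=Phi, random_state=0)
d = np.sqrt(np.diag(W))
Sigma = W / np.outer(d, d)  # standardized so the diagonal is all 1's

# Steps a5 and a7: subject-level template mean, then correlated expression
X = np.vstack([
    rng.multivariate_normal(np.full(n_genes, rng.normal(mu_f[k], 1.0)), Sigma)
    for k in range(K) for _ in range(n_per_subtype)
])
```

Each row of `X` is one subject's expression vector for this module; stacking over modules and omics datasets yields the full simulated data matrix.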

For IS-*K* means, the tuning parameter
*γ* was selected by gap statistics introduced in
Section 3.5. Table 1 shows the result
of gap statistics to select the best *γ* in the
simulation of *α* = 0.5,
*θ* = 1. The smallest gap statistic was attained at *γ* = 0.21, corresponding to 1778 selected features, which was close to the underlying truth. Similarly,
gap statistics result for *α* = 1, 0.95, 0.05
are in the supplementary
materials [Huo and Tseng
(2017), Figure
S1]. For simulation, we generated two scenarios with
relative effect size *f* = 0.6 and *f*
= 0.8. The complete simulation result of *f*
= 0.6 is shown in Table 1 and
the result for *f* = 0.8 is in the supplementary materials
[Huo and Tseng (2017),
Table
S1]. For iCluster and sparse *K*-means, we
allowed them to choose their own optimal tuning parameters. Note that sparse *K*-means was applied to each individual omics data type.
We used ARI [Hubert and Arabie
(1985)] and Jaccard index [Jaccard (1901)] to evaluate the
clustering and feature selection performance. ARI measures the similarity of a clustering result to the underlying true clustering in the simulation (it ranges from −1 to 1, where 1 indicates exactly the same partition as the underlying truth). The Jaccard index compares the similarity of two feature sets and is defined as the size of their intersection divided by the size of their union (it ranges from 0 to 1, where 1 indicates a feature set identical to the underlying truth). Clearly, IS-*K* means outperformed
iCluster and individual study sparse *K*-means in terms of
ARI and Jaccard index. IS-*K* means and sparse
*K*-means outperformed iCluster in terms of computing
time. Within IS-*K* means, we compared feature selection in
terms of area under the curve (AUC) of ROC curve, which avoids the issue of
tuning parameter selection. When *θ* = 1
(representing accurate grouping information), a smaller *α* (representing larger emphasis on the grouping information) yielded better feature selection performance in terms of AUC, as
expected. However, when *θ* = 0.2
(representing many errors in the grouping information), smaller
*α* yielded worse performance in terms of AUC.
Note that *α* = 0.5 is robust and performs well at the two extremes of *θ* = 1
and *θ* = 0.2. In all applications below, we
will apply *α* = 0.5 unless otherwise
noted.
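
Both evaluation metrics can be computed from scratch; the sketch below (plain Python, with hypothetical labels and gene names in the usage lines) follows the definitions above, with the ARI in Hubert and Arabie's permutation-adjusted form:

```python
from collections import Counter
from math import comb

def ari(labels_a, labels_b):
    """Adjusted Rand index (Hubert and Arabie, 1985) between two partitions."""
    n = len(labels_a)
    pair = Counter(zip(labels_a, labels_b))  # contingency table counts n_ij
    sum_ij = sum(comb(c, 2) for c in pair.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

def jaccard(set_a, set_b):
    """Jaccard index: |intersection| / |union| of two feature sets."""
    set_a, set_b = set(set_a), set(set_b)
    return len(set_a & set_b) / len(set_a | set_b)

print(ari([0, 0, 1, 1], [1, 1, 0, 0]))               # identical partition up to relabeling
print(jaccard({"TP53", "BRCA1"}, {"TP53", "ESR1"}))  # overlap 1 of 3
```

Note that the ARI is invariant to label permutation, which is why the first call returns 1 even though the label codes are swapped.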

We also evaluated the stability of the algorithm against data
perturbation. Instead of the Gaussian distribution in the data generative process, we utilized a heavy-tailed t-distribution to generate the expression values. In simulation Step a3, the template gene expression is simulated from a t-distribution with 3 degrees of freedom, location parameter 9 and scale parameter 2. In Step a4, we set the relative effect size *f* = 0.6 and *f* = 0.8, respectively. In Step a5, ${X}_{\mathit{\text{skmi}}}^{\prime}$ is simulated from a t-distribution with 3 degrees of freedom, location parameter ${\mu}_{\mathit{\text{skm}}}^{\prime}$ and scale parameter ${\sigma}_{1}^{2}$. The result for data perturbation is in the
supplementary
materials [Huo and Tseng
(2017), Tables
S5 and S6]. The conclusions remain essentially the same as in Section 4.1.2. Therefore, our proposed algorithm is robust against non-Gaussian or heavy-tailed distributions.
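
The location–scale t replacement used in this perturbation study can be written directly; a minimal sketch with df = 3, location 9 and scale 2 as in the text:

```python
import numpy as np

rng = np.random.default_rng(1)

# Location-scale t with 3 degrees of freedom: same center as N(9, 2^2),
# but with much heavier tails
mu_t = 9 + 2 * rng.standard_t(df=3, size=100_000)

# Heavy tails: deviations beyond 4 scale units are rare under the Gaussian
# (~6e-5 two-sided) but common under t(3)
tail_frac = np.mean(np.abs(mu_t - 9) > 8)
```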

We downloaded TCGA breast cancer (BRCA) multi-level omics datasets from
TCGA NIH official website. TCGA BRCA gene expression (IlluminaHiSeq RNAseqV2)
was downloaded on 04/03/2015 with 20,531 genes and 1095 subjects. TCGA BRCA DNA
methylation (Methylation450) was downloaded on 09/12/2015 with 485,577 probes
and 894 subjects. TCGA BRCA copy number variation (BI GISTIC2) was downloaded
on 09/12/2015 with 24,776 genes and 1079 subjects. There were 770 subjects with
all these three omics data types. Features (probes/genes) with any missing value
were removed. For gene expression, we transformed the FPKM value by
log_{2}(· + 1), where 1 is a pseudo-count to avoid
undefined log_{2}(0), such that the transformed value was on a continuous scale. For methylation, the Methylation450 platform provided beta values with
range 0 < *β* < 1, where 0 represents
unmethylated and 1 represents methylated. We transformed the beta value to
*M* value, which is defined by a logit transformation $(M={log}_{2}[\frac{\beta}{1-\beta}])$. Therefore, methylation characterized by
*M* value is on a continuous scale, similar to mRNA and CNV.
If multiple methylation probes matched to the same gene symbol, we selected one
methylation probe as a representative, which had the largest average correlation
with other methylation probes of the same gene symbol. We ended up with 20,147
methylation probes with unique gene symbols.
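
The preprocessing transforms described above can be sketched as follows (NumPy assumed; the `eps` clipping guard and the toy matrix in the last example are additions for numerical safety and illustration, not from the paper):

```python
import numpy as np

def log2_pseudo(expr):
    """log2(x + 1) transform; the pseudo-count 1 avoids undefined log2(0)."""
    return np.log2(np.asarray(expr, dtype=float) + 1)

def beta_to_m(beta, eps=1e-6):
    """Logit transform of methylation beta values: M = log2(beta / (1 - beta)).
    Clipping to (eps, 1 - eps) guards against beta exactly 0 or 1."""
    b = np.clip(np.asarray(beta, dtype=float), eps, 1 - eps)
    return np.log2(b / (1 - b))

def representative_probe(probe_matrix):
    """Pick the probe (row) with the largest average correlation with the
    other probes mapped to the same gene symbol."""
    if probe_matrix.shape[0] == 1:
        return 0
    corr = np.corrcoef(probe_matrix)
    np.fill_diagonal(corr, np.nan)  # exclude self-correlation
    return int(np.nanargmax(np.nanmean(corr, axis=1)))

print(beta_to_m([0.5]))  # beta = 0.5 maps to M = 0 (the hemi-methylated midpoint)
```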

We filtered out 50% low expression genes (unexpressed genes) and
then 50% low variance genes (noninformative genes). 50% low
expression genes are genes with the lowest 50% mean of gene expression
across samples and 10,250 genes remained after this filtering step. 50%
low variance genes are genes with the lowest 50% variance of gene
expression across samples and 5125 genes remained after this filtering step. We
obtained 4815 CNV features and 5035 methylation features by matching to the 5125
gene symbols. The features from three different omics datasets that shared the
same cis-regulatory annotation (same gene symbol) were grouped together to form
5125 feature groups. In this case, each group had one mRNA gene expression, one
CNV gene and/or one methylation probe. Each group contained candidate
multi-omics regulatory information because CNV and methylation could potentially
regulate mRNA expression. We applied IS-*K* means with
*α* = 0.5, sparse *K*-means by
directly merging the three omics datasets together, as well as iCluster. The number of clusters *K* was set to 5 since it is well established that
breast cancer has 5 subtypes by PAM50 definition [Parker et al. (2009)]. For a fair comparison,
we selected the tuning parameter for each method such that the number of selected features is close to 2000.
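
The two-step mean/variance filter described earlier can be sketched as below (a NumPy illustration; the 20,500-gene toy matrix is hypothetical, with dimensions chosen so the retained counts mirror the 10,250 and 5125 reported in the text):

```python
import numpy as np

def filter_genes(expr, mean_frac=0.5, var_frac=0.5):
    """Two-step filter: drop the lowest `mean_frac` of genes by mean expression,
    then the lowest `var_frac` of the remainder by variance.
    Genes are rows, samples are columns; returns indices into the original order."""
    expr = np.asarray(expr, dtype=float)
    keep = np.argsort(expr.mean(axis=1))[int(expr.shape[0] * mean_frac):]
    kept = expr[keep]
    keep2 = np.argsort(kept.var(axis=1))[int(kept.shape[0] * var_frac):]
    return keep[keep2]

rng = np.random.default_rng(0)
idx = filter_genes(rng.normal(9, 2, size=(20_500, 100)))  # 20,500 -> 10,250 -> 5125 genes
```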

For evaluation purposes, we investigated three categories of groups
among selected features: G1, G2 and G3. G3 represents feature groups (gene
symbol) where all three types (mRNA, CNV and methylation) of features are
selected. Similarly, G2 represents feature groups (gene symbol) where only two
types of features are selected; G1 represents feature groups (gene symbol) where
only one type of feature is selected. We also compared the clustering result
with PAM50 subtype definition in terms of ARI. The result is shown in Table 2. Clearly, IS-*K*
means obtained more G2 and G3 features than sparse *K*-means and
iCluster. This is biologically more interpretable but not surprising since
IS-*K* means incorporated the multi-omics regulatory
information, and features of the same group were encouraged to be selected together. Besides, IS-*K* means has a higher ARI compared to
sparse *K*-means and iCluster, indicating the clustering result
of IS-*K* means is closer to PAM50 definition than sparse
*K*-means and iCluster. The 5-by-5 confusion table of
IS-*K* means clustering result and PAM50 subtypes is shown in
supplementary
materials [Huo and Tseng
(2017), Table
S3]. One should note that the ARI for all three methods
are not very high. This could be because PAM50 was defined by gene expression
only, while in our scenario we integrated multi-omics information. The heatmap of the IS-*K* means result is shown in Figure 1(B). In terms of computing time, IS-*K* means
is nearly 20 times faster than iCluster.
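
Tallying the G1/G2/G3 categories from a list of selected (gene symbol, omics type) pairs is straightforward; a sketch with hypothetical selections:

```python
from collections import Counter

def group_categories(selected):
    """Count gene-symbol groups by the number of distinct omics types selected.
    `selected` is an iterable of (gene_symbol, omics_type) pairs."""
    types_per_gene = {}
    for gene, omics in selected:
        types_per_gene.setdefault(gene, set()).add(omics)
    return Counter(f"G{len(types)}" for types in types_per_gene.values())

# Hypothetical selected features: one G3 gene, one G2 gene, one G1 gene
sel = [("ESR1", "mRNA"), ("ESR1", "CNV"), ("ESR1", "meth"),
       ("GATA3", "mRNA"), ("GATA3", "meth"),
       ("FOXA1", "mRNA")]
counts = group_categories(sel)
```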

We tested the performance of IS-*K* means in another
large breast cancer multi-omics (sample size *n* = 1981)
dataset METABRIC [Curtis et al.
(2012)] with mRNA expression (Illumina HumanHT-12 v3), CNV
(Affymetrix SNP 6.0 chip) and survival information. The datasets are available
at https://www.synapse.org/#Synapse:syn1688369/wiki/27311.
There were originally 49,576 probes in gene expression. If multiple probes
matched to the same gene symbol, we selected the probe with the largest IQR
(interquartile range) to represent the gene. After mapping the probes to gene
symbols, we obtained 19,489 mRNA expression features and 18,538 CNV features,
which shared 1981 samples. After filtering out 30% low expression mRNA
based on mean gene expression across samples and then 30% low variance
mRNA based on variance of gene expression across samples, we ended up with 9504
mRNA features. We obtained 8696 CNV feature symbols by matching with mRNA
feature symbols. Therefore, we had totally 18,200 features and 9504 feature
groups (share the same gene symbol) among 1981 samples.
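
The IQR-based probe collapse used for both METABRIC and the leukemia studies can be sketched as follows (NumPy assumed; the toy probe matrix and symbols are illustrative):

```python
import numpy as np

def collapse_by_iqr(probes, symbols):
    """For each gene symbol, keep the probe (row) with the largest IQR.
    Returns a dict: symbol -> row index of the representative probe."""
    iqr = np.subtract(*np.percentile(probes, [75, 25], axis=1))  # 75th minus 25th
    best = {}
    for i, sym in enumerate(symbols):
        if sym not in best or iqr[i] > iqr[best[sym]]:
            best[sym] = i
    return best

X = np.array([[1., 2., 3., 4.],    # gene A, probe 0: IQR 1.5
              [0., 5., 5., 10.],   # gene A, probe 1: IQR 2.5 -> representative
              [2., 2., 2., 2.]])   # gene B, only probe
reps = collapse_by_iqr(X, ["A", "A", "B"])
```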

We applied IS-*K* means with *α*
= 0.5, sparse *K*-means by directly merging the two omics datasets together, as well as iCluster. The number of clusters *K*
was set to be 5 (same reason in TCGA). For a fair comparison, we selected the
tuning parameter for each method such that the number of selected features is close to 2000. For evaluation purposes, we similarly defined two categories of groups
among selected features. G2 represents feature groups (gene symbol) where both
types of features are selected and G1 represents feature groups (gene symbol)
where only one type of feature is selected. We also compared the clustering
result with PAM50 subtype definition in terms of ARI. The result is shown in
Table 3. Similar to the TCGA example
in Section 4.2, IS-*K* means obtained more G2 features than
sparse *K*-means and iCluster. The log-rank test of survival
difference for the clustering result defined by IS-*K* means is
more significant than for sparse *K*-means and iCluster.
Furthermore, IS-*K* means has higher ARI compared to sparse
*K*-means and iCluster, indicating the clustering result of
IS-*K* means is closer to PAM50 definition than sparse
*K*-means and iCluster. The 5-by-5 confusion table of
IS-*K* means clustering result and PAM50 subtypes are in the
supplementary
materials [Huo and Tseng
(2017), Table
S4]. In terms of computing time, IS-*K* means
and sparse *K*-means are much faster than iCluster.

In the simulations and applications so far (Sections 4.1–4.3),
we have focused on using the cis-regulatory mechanism as grouping information
for integrating multi-level omics data for sample clustering. In this
subsection, we present a different but commonly encountered application of
pathway-guided clustering in single transcriptomic study. Specifically, we use
pathway information from databases to provide prior overlapping group
information (i.e., a pathway is a group containing tens to hundreds of genes and
two pathways may contain overlapping genes). A transcriptomic study is used for
sample clustering with the overlapping group information. We apply
IS-*K* means to three leukemia transcriptomic datasets
[Verhaak et al. (2009); Balgobind et al. (2010) and Kohlmann et al. (2008)] separately and using
three pathway databases (KEGG, BioCarta and Reactome) independently, generating
nine IS-*K* means clustering results (see Table 4). The supplementary materials
[Huo and Tseng (2017), Table S2] show a
summary description of the three leukemia transcriptomic studies.

We only considered samples from acute myeloid leukemia (AML) with three fusion-gene subtypes: inv(16) (inversion in chromosome 16), t(15;17) (translocation between chromosomes 15 and 17) and t(8;21) (translocation between chromosomes 8 and 21). These three gene-translocation AML subtypes have been well studied, with different survival, treatment response and prognosis outcomes. Since the three subtypes are observable under the microscope, we treated these class labels as the underlying truth to evaluate clustering performance. The expression data for Verhaak and Balgobind ranged over approximately [3.169, 15.132], while Kohlmann ranged over [0, 1]. All the datasets were downloaded directly from the NCBI GEO website. Originally, there were 54,613 probe sets in each study. For each study, we removed genes with any missing values. If multiple microarray probes matched to the same gene symbol, we selected the probe with the largest interquartile range (IQR) to represent the gene. We ended up with 20,154 unique genes in Verhaak and 20,155 unique genes in Balgobind and Kohlmann. We further filtered out 30% low expression genes in each study, defined as the 30% of genes with the lowest mean expression. We ended up with 14,108 unique genes in each study.

We obtained the three pathway databases (BioCarta, KEGG and Reactome)
from MSigDB (http://www.broadinstitute.org/gsea/msigdb/collections.jsp#C2)
as the prior group information to guide feature selection in
IS-*K* means. The original pathway sizes were 217, 186 and
674 for BioCarta, KEGG and Reactome. We only kept pathways with size (number of genes inside the pathway) greater than or equal to 15 and less than or equal to 200 after
intersecting with 14,108 unique genes. After gene size restriction, we ended up
with 114, 160 and 428 pathways for BioCarta, KEGG and Reactome. Note that these
pathway groups have large overlaps (i.e., many genes appear in multiple
pathways).
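
The size restriction after intersecting each pathway with the analyzed genes can be sketched as below (pathway names and gene ids are hypothetical):

```python
def filter_pathways(pathways, background, lo=15, hi=200):
    """Keep pathways whose overlap with the background gene list has
    between `lo` and `hi` genes (inclusive), after intersection."""
    background = set(background)
    kept = {}
    for name, genes in pathways.items():
        overlap = set(genes) & background
        if lo <= len(overlap) <= hi:
            kept[name] = overlap
    return kept

# Toy example: 300 analyzed genes, three hypothetical pathways
background = {f"g{i}" for i in range(300)}
pathways = {
    "tiny":  {f"g{i}" for i in range(5)},                                  # too small -> dropped
    "ok":    {f"g{i}" for i in range(40)},                                 # size 40 -> kept
    "mixed": {f"g{i}" for i in range(10)} | {f"h{i}" for i in range(50)},  # overlap 10 -> dropped
}
kept = filter_pathways(pathways, background)
```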

For each of the three studies, we applied IS-*K* means
(with BioCarta, KEGG and Reactome as prior group information, respectively),
sparse *K*-means and iCluster. Note that in this example,
IS-*K* means dealt with single omics dataset with prior
knowledge. For a fair comparison, we tuned the parameters so that the number of selected features is close to 1000. The result is shown in Table 4. For Verhaak and Kohlmann,
IS-*K* means and sparse *K*-means almost
recovered the underlying true clustering labels (ARI =
0.901–0.932), while iCluster had relatively smaller ARI (ARI =
0.733). We investigated the heatmap of the clustering result of Verhaak using
iCluster (supplementary
materials [Huo and Tseng
(2017), Figure
S2]) to understand the reasons for its worse performance (lower ARI) and found that its solution converged to a stable clustering configuration with clear cluster separation. Thus, the worse clustering performance of iCluster likely comes from a local optimum. For Balgobind, the
clustering results from IS-*K* means and sparse
*K*-means had smaller ARI (ARI = 0.792) but iCluster
performed even worse (ARI = 0.214).

To further evaluate functional annotation of the selected intrinsic
genes via each method, we explored pathway enrichment analysis (Figure 3) using BioCarta database via Fisher exact
test. Five methods [iCluster, IS-*K* means (BioCarta),
IS-*K* means (KEGG), IS-*K* means (Reactome),
sparse *K*-means] were compared. The jittered plot of
−log_{10}
*p*-values is shown in Figure
3. IS-*K* means (BioCarta) shows the most significant
pathways consistently across three studies; this is somewhat expected since we
used the BioCarta pathway as prior knowledge to guide our feature selection.
IS-*K* means (KEGG) and IS-*K* means
(Reactome) also showed more significant pathways than sparse
*K*-means and iCluster, indicating that incorporating prior knowledge indeed improved feature selection (in the sense that the selected features are more biologically meaningful). Note that IS-*K* means (KEGG) and
IS-*K* means (Reactome) did not have an overfitting issue
since the test pathway database (BioCarta) was different from the prior
knowledge we utilized. Similarly, the results using KEGG and Reactome as a
testing pathway are in supplementary materials [Huo
and Tseng (2017), Figure S3].
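
The per-pathway enrichment test described above reduces to a one-sided 2×2 Fisher exact test; a sketch assuming SciPy, with hypothetical gene counts:

```python
from scipy.stats import fisher_exact

def enrichment_p(selected, pathway, background):
    """One-sided Fisher exact test for over-representation of pathway genes
    among the selected genes, relative to the background gene list."""
    background = set(background)
    selected = set(selected) & background
    pathway = set(pathway) & background
    a = len(selected & pathway)      # selected and in pathway
    b = len(selected - pathway)      # selected, not in pathway
    c = len(pathway - selected)      # in pathway, not selected
    d = len(background) - a - b - c  # neither
    _, p = fisher_exact([[a, b], [c, d]], alternative="greater")
    return p

# Hypothetical example: 30 of 100 selected genes fall in a 50-gene pathway
# (background of 1000 genes; expected overlap by chance is only 5)
p = enrichment_p(list(range(30)) + list(range(500, 570)), range(50), range(1000))
```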

Cancer subtype discovery is a critical step for personalized treatment of
the disease. In the era of massive omics datasets and biological knowledge, how to
effectively integrate omics datasets and/or incorporate existing biological evidence
brings new statistical and computational challenges. In this paper, we proposed an
integrative sparse *K*-means (IS-*K* means) approach
for this purpose. The existing biological information is incorporated in the model
and the resulting sparse features can be further used to characterize the cancer
subtype properties in clinical application.

Our proposed IS-*K* means has the following advantages.
First, integrative analysis increases clustering accuracy and statistical power, and reveals interpretable regulatory flow between different omics data types. The existing biological information is taken into account via the overlapping group lasso.
Fully utilizing the inter-omics regulatory information and external biological
information will increase the accuracy and interpretation of the cancer subtype
findings. Second, we reformulated the complex objective function into a simplified
form where weighted *K*-means and ADMM can be iteratively applied to
optimize the convex sub-problems with closed-form solutions. Due to the nature of
classification EM algorithm in *K-*means and closed-form iteration
updates of ADMM, implementation of the IS-*K* means framework is
computationally efficient. IS-*K* means takes only 10–15 minutes for
15,000 omics features and more than 700 subjects on a standard desktop with single
computing thread while iCluster takes almost 4 hours. Third, the resulting sparse
features from IS-*K* means have better interpretation than features
selected from iCluster.

IS-*K* means potentially has the following limitations. The
existing biological information is prone to errors and can be updated frequently.
Incorporating false biological information may dilute information contained in the
data and even lead to biased findings. Therefore, we suggest not over-weighting the overlapping group lasso term, choosing *α* = 0.5 to balance the information from existing biological knowledge against the
information from the omics datasets. The users can, however, tune this parameter
depending on the strength of their prior belief of the biological knowledge. Another
limitation is that IS-*K* means can only deal with one cohort with
multiple types of omics data. How to effectively combine multiple cohorts with multi-level omics data is left as future work. The R package “ISKmeans”
incorporates C++ for fast computing and it is publicly available on
GitHub https://github.com/Caleb-Huo/IS-Kmeans as well as the
authors’ websites. All the data and code presented in this paper are also
available on the authors’ websites.

**Supplement to “Integrative Sparse K-means
with overlapping group lasso in genomic applications for disease subtype
discovery”** (DOI: 10.1214/17-AOAS1033SUPP; .pdf). The supplementary materials contain 3 figures and 6 tables, regarding results for
tuning parameter selection, simulation, leukemia dataset description, comparison
of IS-


The authors appreciate the many insightful critiques from the reviewers and Associate Editor.

Proof of Theorem 3.1. Given equal separation ability for each
feature, *R*_{1} = ⋯ = *R*_{j} = ⋯ = *R*_{J} = *R*, the optimization problem becomes

$$\begin{array}{cc}\underset{C,\mathbf{z}}{min}-\sum _{j=1}^{J}{\mathbf{z}}_{j}R+\gamma \alpha {\Vert \mathbf{z}\Vert}_{1}+\gamma (1-\alpha )\sum _{1\le g\le {\mathcal{G}}_{0}}\left(\sqrt{\sum _{j\in {\mathcal{J}}_{g}}1/h(j)}\sqrt{\sum _{j\in {\mathcal{J}}_{g}}1/h(j)\times {z}_{j}^{2}}\right)& \text{subject to}\phantom{\rule{0.3em}{0ex}}{\Vert \mathbf{z}\Vert}_{2}\le 1,{z}_{j}\ge 0,\forall j.\end{array}$$

First, we can take away the constraint
*z*_{j} ≥ 0, ∀*j*, and write the Lagrange function

$$L(\mathbf{z},\lambda )=-\sum _{j=1}^{J}{z}_{j}R+\gamma \alpha {\Vert \mathbf{z}\Vert}_{1}+\gamma (1-\alpha )\sum _{1\le g\le {\mathcal{G}}_{0}}\left(\sqrt{\sum _{j\in {\mathcal{J}}_{g}}1/h(j)}\sqrt{\sum _{j\in {\mathcal{J}}_{g}}1/h(j)\times {z}_{j}^{2}}\right)+\lambda ({\Vert \mathbf{z}\Vert}_{2}^{2}-1).$$

The partial derivative of the Lagrange function is

$$\frac{\partial L(\mathbf{z})}{\partial {z}_{j}}=-R+\gamma \alpha \frac{\partial |{z}_{j}|}{\partial {z}_{j}}+\gamma (1-\alpha )\sum _{1\le g\le {\mathcal{G}}_{0}}\left(\sqrt{\sum _{{j}^{\prime}\in {\mathcal{J}}_{g}}1/h({j}^{\prime})}\frac{\mathbb{I}\{j\in {\mathcal{J}}_{g}\}\times 1/h(j)\times {z}_{j}}{\sqrt{{\sum}_{{j}^{\prime}\in {\mathcal{J}}_{g}}1/h({j}^{\prime})\times {z}_{{j}^{\prime}}^{2}}}\right)+2\lambda {z}_{j}.$$

It is easy to verify that ${z}_{1}={z}_{2}=\cdots ={z}_{J}=1/\sqrt{J},\lambda =\frac{\sqrt{J}(R-\gamma )}{2}$ will make $\frac{\partial L(\mathbf{z})}{\partial {z}_{j}}=0,\forall j$. Since the objective function is convex, by sufficiency of the KKT conditions, the proposed penalty design leads to a solution satisfying the “unbiased feature selection” principle.

Proof of Theorem 3.2. For the intrinsic gene set $\mathcal{I}$, we have ${R}_{j}=R\,\mathbb{I}(j\in \mathcal{I})$, and the optimization problem becomes

$$\begin{array}{cc}\underset{C,\mathbf{z}}{min}-\sum _{j=1}^{J}{z}_{j}R\mathbb{I}(j\in \mathcal{I})+\gamma \alpha {\Vert \mathbf{z}\Vert}_{1}+\gamma (1-\alpha )\sum _{1\le g\le {\mathcal{G}}_{0}}\left(\sqrt{\sum _{j\in ({\mathcal{J}}_{g}\cap \mathcal{I})}1/h(j)}\sqrt{\sum _{j\in {\mathcal{J}}_{g}}1/h(j)\times {z}_{j}^{2}}\right)& \text{subject to}\phantom{\rule{0.3em}{0ex}}{\Vert \mathbf{z}\Vert}_{2}\le 1,{z}_{j}\ge 0,\forall j.\end{array}$$

First, we can similarly take away the constraint
*z*_{j} ≥ 0, ∀*j*, and write the Lagrange function


$$L(\mathbf{z},\lambda )=-\sum _{j=1}^{J}{z}_{j}R\mathbb{I}(j\in \mathcal{I})+\gamma \alpha {\Vert \mathbf{z}\Vert}_{1}+\gamma (1-\alpha )\sum _{1\le g\le {\mathcal{G}}_{0}}\left(\sqrt{\sum _{j\in ({\mathcal{J}}_{g}\cap \mathcal{I})}1/h(j)}\sqrt{\sum _{j\in {\mathcal{J}}_{g}}1/h(j)\times {z}_{j}^{2}}\right)+\lambda ({\Vert \mathbf{z}\Vert}_{2}^{2}-1).$$

The partial derivative of the Lagrange is

$$\frac{\partial L(\mathbf{z})}{\partial {z}_{j}}=-R\mathbb{I}(j\in \mathcal{I})+\gamma \alpha \frac{\partial |{z}_{j}|}{\partial {z}_{j}}+\gamma (1-\alpha )\sum _{1\le g\le {\mathcal{G}}_{0}}\left(\sqrt{\sum _{{j}^{\prime}\in ({\mathcal{J}}_{g}\cap \mathcal{I})}1/h({j}^{\prime})}\frac{\mathbb{I}\{j\in {\mathcal{J}}_{g}\}\times 1/h(j)\times {z}_{j}}{\sqrt{{\sum}_{{j}^{\prime}\in {\mathcal{J}}_{g}}1/h({j}^{\prime})\times {z}_{{j}^{\prime}}^{2}}}\right)+2\lambda {z}_{j}.$$

It is easy to verify that ${z}_{j}=1/\sqrt{|\mathcal{I}|}$ for $j\in \mathcal{I}$ and ${z}_{j}=0$ for $j\notin \mathcal{I}$ satisfies the KKT conditions, so the proposed penalty design selects exactly the intrinsic gene set.

There are two optimization problems:

$$\{\begin{array}{l}{\mathbf{x}}_{g}^{+}=arg\underset{{\mathbf{x}}_{g}}{min}{\Vert {\mathbf{x}}_{g}\Vert}_{2}+{\mathbf{y}}_{g}^{\top}{\mathbf{x}}_{g}+\frac{\rho}{2}{\Vert {\mathbf{x}}_{g}-{\beta}_{g}\circ \mathbf{z}\Vert}_{2}^{2},\hfill \\ {\mathbf{z}}^{+}=arg\underset{\mathbf{z}}{min}-\sum _{j}{z}_{j}{R}_{j}-\sum _{1\le g\le \mathcal{G}}{\mathbf{y}}_{g}^{\top}({\beta}_{g}\circ \mathbf{z})+\sum _{1\le g\le \mathcal{G}}\frac{\rho}{2}{\Vert {\mathbf{x}}_{g}^{+}-{\beta}_{g}\circ \mathbf{z}\Vert}_{2}^{2}\hfill & \text{subject to}\phantom{\rule{0.3em}{0ex}}{\Vert \mathbf{z}\Vert}_{2}\le 1,\phantom{\rule{0.3em}{0ex}}{z}_{j}\ge 0.\end{array}$$

It is a convex optimization problem for ${\mathbf{x}}_{g}^{+}$ with no constraint. The stationarity condition states that the sub-gradient of the objective function will be 0 at the optimum solution. Therefore, we have

$$S({\mathbf{x}}_{g}^{+})+{\mathbf{y}}_{g}+\rho ({\mathbf{x}}_{g}^{+}-{\beta}_{g}\circ \mathbf{z})=0,$$

where *S*(**v**) is the sub-gradient of
‖**v**‖_{2} and

$$S(\mathbf{v})=\frac{\mathbf{v}}{{\Vert \mathbf{v}\Vert}_{2}}\phantom{\rule{0.3em}{0ex}}\text{if}\phantom{\rule{0.3em}{0ex}}\mathbf{v}\ne 0,\phantom{\rule{1em}{0ex}}S(\mathbf{v})\in \{\mathbf{u}:{\Vert \mathbf{u}\Vert}_{2}\le 1\}\phantom{\rule{0.3em}{0ex}}\text{otherwise}.$$

If we define ${\mathbf{a}}_{g}={\beta}_{g}\circ \mathbf{z}-\frac{{\mathbf{y}}_{g}}{\rho}$, it can be derived that ${\mathbf{x}}_{g}^{+}={(1-\frac{1}{\rho {\Vert {\mathbf{a}}_{g}\Vert}_{2}})}_{+}{\mathbf{a}}_{g}$, where (·)_{+} = max(0, ·).
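
This closed form is the proximal operator of the ℓ₂ norm (group soft-thresholding) and can be checked numerically against a generic solver; a sketch assuming SciPy, with arbitrary illustrative vectors:

```python
import numpy as np
from scipy.optimize import minimize

rho = 2.0
bz = np.array([1.0, -2.0, 0.5, 3.0, -1.0])  # beta_g ∘ z, held fixed during the x-update
y = np.array([0.5, 1.0, -1.0, 0.0, 2.0])    # dual variable y_g

def objective(x):
    # ||x||_2 + y^T x + (rho/2) ||x - beta_g∘z||_2^2
    return np.linalg.norm(x) + y @ x + rho / 2 * np.sum((x - bz) ** 2)

# Closed form: group soft-thresholding of a_g = beta_g∘z - y_g / rho
a = bz - y / rho
x_closed = max(0.0, 1.0 - 1.0 / (rho * np.linalg.norm(a))) * a

# A generic numerical minimizer agrees with the closed-form update
x_num = minimize(objective, a).x
```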

The optimization problem for **z**^{+} is
a convex optimization problem with two constraints. We first write down the
Lagrange function and convert the constrained optimization problem into an
un-constrained optimization problem:

$$\begin{array}{c}arg\underset{\mathbf{z}}{min}-\sum _{j}{z}_{j}{R}_{j}-\sum _{1\le g\le \mathcal{G}}{\mathbf{y}}_{g}^{\top}({\beta}_{g}\circ \mathbf{z})+\sum _{1\le g\le \mathcal{G}}\frac{\rho}{2}{\Vert {\mathbf{x}}_{g}^{+}-{\beta}_{g}\circ \mathbf{z}\Vert}_{2}^{2}\\ +u({\Vert \mathbf{z}\Vert}_{2}-1)-\sum _{j}{\nu}_{j}{z}_{j}\end{array}$$

such that *u* ≥ 0 and *ν*_{j} ≥ 0, ∀*j*, are the Lagrange multipliers.

- Abramson VG, Lehmann BD, Ballinger TJ, Pietenpol JA. Subtyping of triple-negative breast cancer: Implications for therapy. Cancer. 2015;121:8–16.
- Balgobind BV, Van den Heuvel-Eibrink MM, De Menezes RX, Reinhardt D, Hollink IHIM, Arentsen-Peters STJCM, van Wering ER, Kaspers GJL, Cloos J, de Bont ESJM, Cayuela JM, Baruchel A, Meyer C, Marschalek R, Trka J, Stary J, Beverloo HB, Pieters R, Zwaan CM, den Boer ML. Evaluation of gene expression signatures predictive of cytogenetic and molecular subtypes of pediatric acute myeloid leukemia. Haematologica. 2010;96:221–230.
- Bass A, Thorsson V, Shmulevich I, Reynolds S, Miller M, Bernard B, Hinoue T, Laird P, Curtis C, Shen H, et al. Comprehensive molecular characterization of gastric adenocarcinoma. Nature. 2014;513:202–209.
- Boyd S, Parikh N, Chu E, Peleato B, Eckstein J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found Trends Mach Learn. 2011;3:1–122.
- Curtis C, Shah SP, Chin SF, Turashvili G, Rueda OM, Dunning MJ, Speed D, Lynch AG, Samarajiwa S, Yuan Y, et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature. 2012;486:346–352.
- Domany E. Using high-throughput transcriptomic data for prognosis: A critical overview and perspectives. Cancer Res. 2014;74:4612–4621.
- Dudoit S, Fridlyand J. A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol. 2002;3:1–21.
- Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998;95:14863–14868.
- Fan X, Kurgan L. Comprehensive overview and assessment of computational prediction of microRNA targets in animals. Brief Bioinform. 2015;16:780–794.
- Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science. 1999;286:531–537.
- He BS, Yang H, Wang SL. Alternating direction method with self-adaptive penalty parameters for monotone variational inequalities. J Optim Theory Appl. 2000;106:337–356. MR1788928.
- Huang J, Horowitz JL, Wei F. Variable selection in nonparametric additive models. Ann Statist. 2010;38:2282–2313. MR2676890.
- Hubert L, Arabie P. Comparing partitions. J Classification. 1985;2:193–218.
- Huo Z, Tseng G. Supplement to “Integrative sparse *K*-means with overlapping group lasso in genomic applications for disease subtype discovery.” 2017. DOI: 10.1214/17-AOAS1033SUPP.
- Huo Z, Ding Y, Liu S, Oesterreich S, Tseng G. Meta-analytic framework for sparse *K*-means to identify disease subtypes in multiple transcriptomic studies. J Amer Statist Assoc. 2016;111:27–42. MR3494636.
- Jaccard P. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bull Soc Vaud Sci Nat. 1901;37:547–579.
- Jacob L, Obozinski G, Vert JP. Group lasso with overlap and graph lasso. In: ICML ’09: Proceedings of the 26th Annual International Conference on Machine Learning. ACM; New York: 2009. pp. 433–440.
- Kaufman L, Rousseeuw P. Clustering by Means of Medoids. North-Holland; Amsterdam: 1987.
- Kaufman L, Rousseeuw PJ. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley; New York: 1990. MR1044997.
- Kim EY, Kim SY, Ashlock D, Nam D. MULTI-K: Accurate classification of microarray subtypes using ensemble *k*-means clustering. BMC Bioinform. 2009;10:260.
- Koboldt DC, Fulton RS, McLellan MD, Schmidt H, Kalicki-Veizer J, McMichael JF, Fulton LL, Dooling DJ, Ding L, Mardis ER, et al. Comprehensive molecular portraits of human breast tumours. Nature. 2012;490:61–70.
- Kohlmann A, Kipps TJ, Rassenti LZ, Downing JR, Shurtleff SA, Mills KI, Gilkes AF, Hofmann WK, Basso G, Dell’Orto MC, Foà R, Chiaretti S, Vos JD, Rauhut S, Papenhausen PR, Hernández JM, Lumbreras E, Yeoh AE, Koay ES, Li R, Liu W, Williams PM, Wieczorek L, Haferlach T. An international standardization programme towards the application of gene expression profiling in routine leukaemia diagnostics: The microarray innovations in LEukemia study prephase. Br J Haematol. 2008;142:802–807.
- Lehmann BD, Bauer JA, Chen X, Sanders ME, Chakravarthy AB, Shyr Y, Pietenpol JA. Identification of human triple-negative breast cancer subtypes and preclinical models for selection of targeted therapies. J Clin Invest. 2011;121:2750.
- Lock EF, Dunson DB. Bayesian consensus clustering. Bioinformatics. 2013;29:2610–2616.
- MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proc Fifth Berkeley Sympos Math Statist and Probability (Berkeley, Calif, 1965/66). Univ California Press; Berkeley, CA: 1967. pp. 281–297. MR0214227.
- Maitra R, Ramler IP. Clustering in the presence of scatter. Biometrics. 2009;65:341–352. MR2751457.
- McLachlan GJ, Bean R, Peel D. A mixture model-based approach to the clustering of microarray expression data. Bioinformatics. 2002;18:413–422.
- Milligan GW, Cooper MC. An examination of procedures for determining the number of clusters in a data set. Psychometrika. 1985;50:159–179.
- Parker JS, Mullins M, Cheang MC, Leung S, Voduc D, Vickery T, Davies S, Fauron C, He X, Hu Z, et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol. 2009;27:1160–1167.
- Parsons DW, Jones S, Zhang X, Lin JCH, Leary RJ, Angenendt P, Mankoo P, Carter H, Siu IM, Gallia GL, et al. An integrated genomic analysis of human glioblastoma multiforme. Science. 2008;321:1807–1812.
- Qin ZS. Clustering microarray gene expression data using weighted Chinese restaurant process. Bioinformatics. 2006;22:1988–1997.
- Ramasamy A, Mondry A, Holmes CC, Altman DG. Key issues in conducting a meta-analysis of gene expression microarray datasets. PLoS Med. 2008;5:1320–1333.
- Richardson S, Tseng GC, Sun W. Statistical methods in integrative genomics. Annu Rev Statist Appl. 2016;3:181–209.
- Rosenwald A, Wright G, Chan WC, Connors JM, Campo E, Fisher RI, Gascoyne RD, Muller-Hermelink HK, Smeland EB, Giltnane JM, et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N Engl J Med. 2002;346:1937–1947.
- Sadanandam A, Lyssiotis CA, Homicsko K, Collisson EA, Gibb WJ, Wullschleger S, Ostos LCG, Lannon WA, Grotzinger C, Del Rio M, et al. A colorectal cancer classification system that associates cellular phenotype and responses to therapy. Nat Med. 2013;19:619–625.
- Shen R, Olshen AB, Ladanyi M. Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics. 2009;25:2906–2912.
- Shen K, Tseng GC. Meta-analysis for pathway enrichment analysis when combining multiple genomic studies. Bioinformatics. 2010;26:1316–1323.
- Simon R. Development and validation of therapeutically relevant multi-gene biomarker classifiers. J Natl Cancer Inst. 2005;97:866–867.
- Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst. 2003;95:14–18. [PubMed]
- Simon N, Friedman J, Hastie T, Tibshirani R. A sparse-group lasso. J Comput Graph Statist. 2013;22:231–245. MR3173712.
- Swift S, Tucker A, Vinciotti V, Martin N, Orengo C, Liu X, Kellam P. Consensus clustering and functional interpretation of gene-expression data. Genome Biol. 2004;5:R94. [PMC free article] [PubMed]
- Tibshirani R, Walther G. Cluster validation by prediction strength. J Comput Graph Statist. 2005;14:511–528. MR2170199.
- Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc Ser B Stat Methodol. 2001;63:411–423. MR1841503.
- Tothill RW, Tinker AV, George J, Brown R, Fox SB, Lade S, Johnson DS, Trivett MK, Etemadmoghadam D, Locandro B, et al. Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome. Clin Cancer Res. 2008;14:5198–5208. [PubMed]
- Tseng GC. Penalized and weighted K-means for clustering with scattered objects and prior information in high-throughput biological data. Bioinformatics. 2007;23:2247–2255. [PubMed]
- Tseng G, Ghosh D, Feingold E. Comprehensive literature review and statistical considerations for microarray meta-analysis. Nucleic Acids Res. 2012;40:3785–3799. [PMC free article] [PubMed]
- Tseng GC, Wong WH. Tight clustering: A resampling-based approach for identifying stable and tight patterns in data. Biometrics. 2005;61:10–16. MR2129196. [PubMed]
- Verhaak RG, Wouters BJ, Erpelinck CA, Abbas S, Beverloo HB, Lugthart S, Löwenberg B, Delwel R, Valk PJ. Prediction of molecular subtypes in acute myeloid leukemia based on gene expression profiling. Haematologica. 2009;94:131–134. [PubMed]
- Verhaak RG, Hoadley KA, Purdom E, Wang V, Qi Y, Wilkerson MD, Miller CR, Ding L, Golub T, Mesirov JP, et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in *PDGFRA, IDH1, EGFR*, and *NF1*. Cancer Cell. 2010;17:98–110. [PMC free article] [PubMed]
- Wang SL, Liao LZ. Decomposition method with a variable parameter for a class of monotone variational inequality problems. J Optim Theory Appl. 2001;109:415–429. MR1834183.
- Witkos TM, Koscianska E, Krzyzosiak WJ. Practical aspects of microRNA target prediction. Curr Mol Med. 2011;11:93–109. [PMC free article] [PubMed]
- Witten DM, Tibshirani R. A framework for feature selection in clustering. J Amer Statist Assoc. 2010;105:713–726. MR2724855. [PMC free article] [PubMed]
- Xie B, Pan W, Shen X. Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables. Electron J Stat. 2008;2:168–212. MR2386092. [PMC free article] [PubMed]
- Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B Stat Methodol. 2006;68:49–67. MR2212574.
- Zou H. The adaptive lasso and its oracle properties. J Amer Statist Assoc. 2006;101:1418–1429. MR2279469.