Home | About | Journals | Submit | Contact Us | Français |

**|**Europe PMC Author Manuscripts**|**PMC5590725

Formats

Article sections

- Abstract
- 1. Introduction
- 2. Materials and methods
- 3. Results
- 4. Discussion and conclusions
- Supplementary Material
- References

Authors

Related links

Bioinformatics. Author manuscript; available in PMC 2017 September 15.

Published in final edited form as:

Published online 2017 May 23. doi: 10.1093/bioinformatics/btx322

PMCID: PMC5590725

EMSID: EMS73542

Nicolas Städler,^{1,}^{*}^{†} Frank Dondelinger,^{2,}^{†} Steven M. Hill,^{3} Rehan Akbani,^{4} Yiling Lu,^{5} Gordon B. Mills,^{5} and Sach Mukherjee^{6,}^{*}

This is an Open Access article distributed under the terms of the
Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits
unrestricted reuse, distribution, and reproduction in any medium, provided
the original work is properly cited.

The publisher's final edited version of this article is available at Bioinformatics

Molecular pathways and networks play a key role in basic and disease biology. An emerging notion is that networks encoding patterns of molecular interplay may themselves differ between contexts, such as cell type, tissue or disease (sub)type. However, while statistical testing of differences in mean expression levels has been extensively studied, testing of network differences remains challenging. Furthermore, since network differences could provide important and biologically interpretable information to identify molecular subgroups, there is a need to consider the unsupervised task of learning subgroups and networks that define them. This is a nontrivial clustering problem, with neither subgroups nor subgroup-specific networks known at the outset.

We leverage recent ideas from high-dimensional statistics for testing and clustering in the network biology setting. The methods we describe can be applied directly to most continuous molecular measurements and networks do not need to be specified beforehand. We illustrate the ideas and methods in a case study using protein data from The Cancer Genome Atlas (TCGA). This provides evidence that patterns of interplay between signalling proteins differ significantly between cancer types. Furthermore, we show how the proposed approaches can be used to learn subtypes and the molecular networks that define them.

As the Bioconductor package nethet.

Molecular interplay plays a fundamental role in biology and its dysregulation
is a feature of many diseases. It is thought that networks encoding molecular
interplay may depend on biological context such as cell type, tissue type, or
disease subtype. An increasing number of studies, including, among others, ENCODE
(Andersson *et al*., 2014),
BLUEPRINT (Martens and Stunnenberg, 2013) and
TCGA (The Cancer Genome Atlas Network, 2012),
span multiple biological contexts and such studies offer an opportunity to better
understand molecular heterogeneity.

The study of molecular networks involves relatively complex statistical models, since it is not only the mean levels of molecular variables but also measures of interplay between them that are relevant. For this reason, despite the importance of networks in biology, several core bioinformatics tasks remain challenging in the network setting. In this paper we address two of these, two-sample testing and clustering. In network models, the number of statistical parameters may grow rapidly with the number of molecular variables and for this reason network-based analyses typically lead to so-called high-dimensional statistical problems, where the number of parameters is large in relation to the sample size.

To fix ideas and clarify the specific questions we address, first consider
molecular data **X**_{1}, **X**_{2} from two groups, each with the same set
of variables measured, but with potentially different sample sizes. Using these
data, we would like to test significance of differences between the groups, not only
in terms of average molecular abundance (as tested in standard differential
expression analyses and multiple testing extensions thereof), but at the level of
networks describing interplay between the variables. In the general case
group-specific networks are not known in advance but must be estimated from the data
and the resulting variability in estimation must be properly accounted for. This is
the testing problem that we address.

Next consider the unknown groups case, where starting with a dataset **X** (with
no group labels) we seek to identify subsets of samples (i.e. clusters) with their
associated networks. In general neither the cluster assignments nor cluster-specific
networks are known at the outset. This is the clustering problem that we
address.

The testing and clustering problems are different, but share the need to
model underlying networks. We use sparse Gaussian graphical models (GGMs) for this
purpose. The approaches we discuss are likelihood-based and should be extensible to
other classes of model. Estimation for GGMs has been widely studied, including in
bioinformatics (e.g. Schäfer and Strimmer,
2005). There has been progress in high-dimensional methods relevant to
testing and clustering using GGMs (including Chen and
Qin, 2010; Städler and Mukherjee,
2013, 2017; Zhou *et al*., 2009). We note that GGMs are not
causal models and we do not consider causality *per se* in this
paper, although extensions in a causal direction could be possible.

We address the testing problem using a framework proposed in Städler and Mukherjee (2017) that
extends the likelihood ratio test to the high-dimensional setting. Specifically, we
use an application of their methodology to testing networks called
*Differential Network* or *DiffNet*. This is a
formal statistical test that retains validity in the high-dimensional setting. For a
review of related methods and a discussion of methodological differences with
respect to the DiffNet test, we refer the interested reader to the reference.

For the clustering problem, we pursue a model-based approach. In particular,
we use mixture models where the different mixture components (or clusters) are
represented by cluster-specific means and GGMs. Gaussian mixture models have been
well-studied in the low-dimensional setting but estimation remains challenging in
the high-dimensional setting. In the present context, we are interested in interplay
between molecular variables and furthermore we expect that clusters may have
different underlying GGMs. This requires formulations for which classical maximum
likelihood is ill-suited (even when the number of molecular variables
*p* itself is not very large). A computationally and
statistically attractive approach is to use _{1}
-penalization within a mixture model framework and this is the route we pursue.
Specifically, we develop a latent variable extension of the graphical lasso (Friedman *et al*., 2008) that we
call *Mixture Graphical Lasso*, or *MixGlasso*.
MixGlasso renders estimation tractable by assuming sparsity of the cluster-specific
networks, i.e. for each cluster only relatively few edges are important (but these
edges are not pre-specified and can differ between clusters).

The features of MixGlasso that make it practically applicable for bioinformatics are: (i) MixGlasso allows clusters to have different means and networks (i.e. GGMs). (ii) The penalty is designed to automatically adapt to the number of clusters and to the sample size and scale of the clusters (these are unknown at the outset and cannot be dealt with by pre-processing). (iii) MixGlasso uses theoretical results from high-dimensional regression to set the level of penalization automatically, following recent work in the context of hidden Markov models (Städler and Mukherjee, 2013). This means that the procedure is essentially free of tuning parameters and efficient enough to be run on a single core.

Many clustering methods that are widely used in bioinformatics, including
K-means, PAM and hierarchical clustering, differ fundamentally from MixGlasso in
that they are driven by mean levels of variables and do not account for network or
covariance structure. The mclust multivariate clustering tool (Fraley *et al*., 2012) is very popular. The key
difference between mclust and MixGlasso is that the former is not geared towards
high-dimensional problems. Mukherjee and Hill
(2011) discuss penalized, network-based clustering, but using a heuristic
algorithm that unlike MixGlasso is not a principled mixture model. The iCluster
methodology (Shen *et al*.,
2009) performs high-dimensional clustering via a low-dimensional
representation; this differs in intent from MixGlasso, which emphasizes the network
setting in which the original molecular variables and their network connections are
of direct interest.

Pan and Shen (2007) used penalization
for variable selection in clustering, using a mixture model with common diagonal
covariance matrices. In contrast, MixGlasso focuses on the non-identical,
non-diagonal case that is relevant to discovery of clusters that may have different
underlying patterns of molecular interplay. Our approach is similar to Zhou *et al*. (2009) but differs
in the form of the penalty: the MixGlasso penalty is designed to automatically adapt
to the sample size and scale of clusters and the level of penalization is set
automatically.

In summary, the specific contributions of this paper are: (1) We discuss how the DiffNet test can be used for network-related testing in bioinformatics. (2) We propose a penalized mixture model MixGlasso that can be used to cluster data that is likely to be heterogeneous with respect to underlying networks and that automatically takes care of several practical issues; and (3) We illustrate the properties and use of the two approaches by way of simulations and a TCGA case study.

We illustrate the approaches in an analysis of protein data from
*n* = 3467 TCGA samples (data from Akbani *et al*., 2014). Using DiffNet we show that
patterns of protein-protein interplay differ significantly between cancer types,
both at the ‘global’ level of all assayed proteins and at the
‘local’ level of pre-defined signaling pathways. This offers evidence,
over thousands of samples, supporting the notion that signaling depends on disease
lineage and context, i.e. that pathways and networks are contextual. Furthermore,
using MixGlasso, we identify clusters (that can span more than one cancer type or
classical lineage) each having a cluster-specific network. This analysis supports,
from a network perspective, the emerging notion that there may be molecular
commonalities between seemingly distinct cancer types (see e.g. The Cancer Genome Atlas Network, 2012).

We first introduce some notation. Let the number of samples be
*n*, the number of variables be *p* and the
*n × p* data matrix be **X** = [*X*_{1}
… *X _{n}*]

To test whether known groups differ with respect to molecular networks, a starting point is to learn a network model for each group and to then compare the models. Although many procedures are available for learning networks (see e.g. De Smet and Marchal, 2010), the models are inherently complex and typically subject to high statistical variability. This means that observed differences between fitted models may simply be due to such variability. This motivates a need for uncertainty quantification.

The DiffNet test that we use is based on a framework that extends the likelihood ratio test (LRT) to high-dimensions (Städler and Mukherjee, 2017). DiffNet assumes that the data are generated from GGMs and tests the null hypothesis that both groups share the same underlying model, i.e. the null hypotheses

$${\mathbf{\text{H}}}_{0}^{(k,{k}^{\prime})}:{\Omega}_{k}={\Omega}_{{k}^{\prime}},\phantom{\rule{0.8em}{0ex}}k,{k}^{\prime}\in \{1,\dots ,K\}\phantom{\rule{0.5em}{0ex}}\text{and}\phantom{\rule{0.5em}{0ex}}k\ne {k}^{\prime}.$$

(2.1)

The key idea in DiffNet is to exploit estimated sparsity patterns in the
construction of the test statistic and in *P*-value calculation.
The use of sparse structure renders the test effective in high-dimensions but
raises technical questions that are addressed via theory that extends the LRT to
the high-dimensional setting. This gives an asymptotic *P*-value
that remains valid in high-dimensional problems.

DiffNet uses randomized data-splitting: sparsity structure is estimated
using the first half of the data, and *P*-value calculation
carried out using the second half. We consider two variants of DiffNet, one
using a single data split (‘DiffNet(SS)’) and one using multiple
data splits (50 data splits) followed by *P*-value aggregation
(‘DiffNet(MS)’). For full technical details we refer the
interested reader to Städler and
Mukherjee (2017). The overall analysis is computationally tractable:
on the TCGA protein data discussed below a typical (single-split) run of DiffNet
required 2.5 minutes and 2.5 GB of memory (on one core of an Intel Nehalem
processor).

MixGlasso is a penalized mixture of Gaussian graphical models. As above,
let *k* index groups and *K* denote the number of
groups (in the clustering setting both *K* and cluster
assignments are unknown at the outset). Let *S _{i}*
{1, … ,

$$\ell ({\Theta}_{K};\mathbf{\text{X}})={\displaystyle \sum _{i}\text{log}}\left({\displaystyle \sum _{k}{\pi}_{k}N({X}_{i}|{\mu}_{k},{\Omega}_{k}^{-1})}\right).$$

(2.2)

Estimation comprises two (coupled) tasks. The first task is to estimate
Θ_{K}, given the number of clusters
*K*. In MixGlasso this is done by minimizing the negative
penalized log-likelihood to get

$${\widehat{\Theta}}_{K,\lambda}=\underset{{\Theta}_{K}}{\text{arg}\hspace{0.17em}\mathrm{min}}-\ell ({\Theta}_{K};\mathbf{\text{X}})+\lambda \phantom{\rule{0.2em}{0ex}}\text{pen}({\Theta}_{K}),$$

(2.3)

where

$$\text{pen}({\Theta}_{K})={\displaystyle \sum _{k=1}^{K}{\pi}_{k}^{1/2}}{\displaystyle \sum _{j\ne {j}^{\prime}}\left|{\Omega}_{k;j{j}^{\prime}}\right|}/\sqrt{{\Omega}_{k;jj}{\Omega}_{k;{j}^{\prime}{j}^{\prime}}}$$

is the penalty function and *λ* is a regularization
parameter. This specific form of penalty, originally introduced in Städler and Mukherjee (2013) for
hidden Markov models, adapts to the sample size and scale of individual
clusters. Optimization of (2.3) is performed using expectation-maximization (EM)
as outlined in SI.

We set *λ* to a ‘universal’ value
${\lambda}_{\text{uni}}=\sqrt{2n\hspace{0.17em}\text{log}p}/2.$
This penalty is based on well-known theoretical results for high-dimensional
regression and the connection between GGMs and regression, and in the present
setting is valid for the specific penalty given above (for details see Städler and Mukherjee, 2013). The
use of *λ*_{uni} coupled with the penalty above in
effect allows automatic adaptation to cluster size and scale during the EM. As
we show below, this allows MixGlasso to give good results across a range of
settings, while controlling the computational burden; running MixGlasso on the
TCGA dataset (see below) with *K* = 9 clusters required 2 minutes
and 5.3 GB of memory (on one core of an Intel Nehalem processor).

The second task involves determining an appropriate number of clusters
*K**. This is done by minimizing the BIC score
(*λ* is set to
*λ*_{uni}):

$$K*=\underset{K}{\text{arg}\hspace{0.17em}\text{min}}\phantom{\rule{0.2em}{0ex}}\text{BIC}\left({\widehat{\Theta}}_{K,{\lambda}_{\text{uni}}}\right),$$

with

$$\begin{array}{l}\text{BIC}\left({\widehat{\Theta}}_{K,{\lambda}_{\text{uni}}}\right)=-\ell \left({\widehat{\Theta}}_{K,{\lambda}_{\text{uni}}};\mathbf{\text{X}}\right)+\frac{1}{2}\text{log}(n)(K-1)\\ \phantom{\rule{6em}{0ex}}+\frac{1}{2}\text{log}(n){\displaystyle \sum _{k=1}^{K}\text{Df}}(k,{\lambda}_{\text{uni}}),\end{array}$$

where degrees of freedom are set as $\text{Df}(k,{\lambda}_{\text{uni}})=p+{{\displaystyle {\sum}_{{\ell}^{,}\ge l}1}}_{{\left({\widehat{\Omega}}_{k,{\lambda}_{\text{uni}}}\right)}_{{u}^{\prime}}\ne 0}.$

Figure 1A presents results of a simulation study, based on the characteristics of the TCGA data (including sample size and dimensionality; see SI for details), that compares differential network (‘DiffNet’) against standard multivariate tests. The methods we compare against are:

*Likelihood ratio test (‘LRT (Asym)’)*. This is a classical LRT, based on the Gaussian models and with an asymptotic*P*-value.*Permutation LRT (‘LRT(Perm)’)*. This uses the same test statistic as the classical LRT, but obtains a*P*-value by permutation of group labels.*Test based on Fisher’s Z-transform (‘Mult.FisherZ’)*. Here, equality of all partial correlations is tested using*P*-values obtained by transforming each partial correlation using Fisher’s Z-transform (see SI Section 2.2).

(**A**) Differential network (DiffNet), simulation study. Left-to-right
(all as a function of sample size): Type-I error (false positive rate); power
(true positive rate); and area under the ROC curve. Average performance over 100
simulation runs shown. The **...**

In line with theoretical results, we find that DiffNet controls type
I error. Both DiffNet variants, but especially the multi-split variant,
outperform the other approaches in an ROC sense and in terms of power. The
permutation LRT is considerably less powerful than DiffNet, but as expected
catches up at larger sample sizes. The poor performance of the classical LRT
is expected (due to the high-dimensional nature of the problem) and it is
interesting to note that LRT(Asym) does not control Type I error even with
50% more samples than available in TCGA (i.e. more than 300 samples per
group; with number of variables *p* = 181). Additional
simulations using a non-Gaussian model appear in Supplementary Figure
S1A.

Next, we considered TCGA protein data (spanning *p* =
181 proteins listed in Supplementary Table S1) from Akbani *et al*. (2014). Normalization for batch
effects using control samples is discussed in Akbani *et al*. (2014); here, we use the
normalized data presented there and refer the interested reader to the
reference for details.

Akbani *et al*.
(2014) inferred networks specific to each of the 11 cancer types
in the data using the graphical lasso (Friedman *et al*., 2008). Without using
biological prior knowledge, the networks captured many known links but also
showed many differences between cancer types. We used DiffNet to assess the
significance of differences between cancer types (Fig. 1B). These results account for uncertainty in
network estimation and adjust for multiple comparisons due to testing across
cancer-type pairs. Some pairs of cancers do not show significant network
differences at the 1% FDR level, including READ/COAD, LUAD/LUSC, LUSC/HNSC
and UCEC/BLCA (the first three pairs are known to be closely related). Our
results show that with these exceptions, most cancer types indeed appear to
be significantly different at the network level.

We also carried out a similar but ‘local’ analysis using only proteins belonging to specific pre-defined pathways (listed in Supplementary Table S2) rather than all proteins together as above. This analysis can be thought of as similar to a gene-set test, but one that captures differences in partial correlation patterns between gene set members (see Städler and Mukherjee, 2015). This broadly confirmed the global view, but revealed a number of pathway-specific insights (Supplementary Fig. S2).

We tested MixGlasso using a simulation strategy that, as above, aimed to mimic the TCGA protein data. In brief this was done by defining clusters corresponding to the cancer types in the TCGA data, with cluster-specific parameters given by estimates from the TCGA data. Data were then generated from the resulting model. This allowed us to mimic some features of the real data (including sample size and dimensionality) while allowing access to ground-truth cluster assignments. We compared MixGlasso to the following approaches:

*K-means clustering (‘K-means’)*.*Hierarchical clustering*(hclust).*Conventional Gaussian mixture model (‘Gaussian Mixture’)*. This is a classical Gaussian mixture model with unconstrained covariance matrices, fitted using maximum likelihood.*Gaussian mixture model with model selection*(mclust). This is a model-based approach due to Fraley*et al*. (2012) that is one of the most widely used tools for mixture modelling.*Penalized Gaussian mixture model (‘Gaussian Mixture (penalised)’*). This is a Gaussian mixture model with_{1}-penalized inverse covariance matrices. The penalty is as described in Zhou*et al*. (2009), but without penalization of the means. The key difference to MixGlasso is that the penalty does not adapt to the cluster-specific sample size and scale.

We first investigated the estimation of the number
*K* of clusters. There is a vast literature on this topic
and a wide range of heuristics used in practice. We are interested in
addressing the high-dimensional issues inherent to network-based clustering.
For a focused comparison, we therefore compared MixGlasso against mclust and
the penalized Gaussian mixture model as these methods make similar modeling
assumptions to MixGlasso but differ in how they handle the high-dimensional
aspect. This allows for a direct comparison, using the same model selection
approach to select the number of clusters in each case. All the methods are
likelihood-based and we used the Bayesian Information Criterion or BIC to
set the number of clusters. The penalized Gaussian mixture model used the
same regularization parameter *λ*_{uni} as
MixGlasso (see Section 2.2). Figure 2A shows the results of the
analysis. MixGlasso comes closest to determining the correct number of
clusters (*K* = 9), and also agrees well with true cluster
assignments.

MixGlasso, simulation study. (**A**) Boxplot showing adjusted Rand index
of estimated cluster assignments compared to true cluster assignments, over 10
simulated datasets, each with total sample size *n* = 3, 467.
Data were generated from a Gaussian mixture **...**

Next, we considered accuracy of cluster assignments as a function of sample size (Fig. 2B). We included K-means and hclust in the comparison. To avoid confounding by choice of model selection heuristic, we treated the true number of clusters as known and focused on comparing cluster assignments. As expected, performance improves with sample size; however, at smaller sample sizes, MixGlasso tends to outperform the other methods, and only at large sample sizes do the classical mixture models catch up.

Several of the methods, including MixGlasso, make Gaussianity assumptions. We therefore performed additional simulations using a non-Gaussian model (Supplementary Fig. S1B, see also SI Section 4); MixGlasso appears reasonably robust to departures from Gaussianity.

First we used BIC to select the number of clusters
*K*; this showed an optimum at *K* = 8
(Fig. 3A). We tested stability by
iterative subsampling; at each iteration, we removed 1/4 of the data samples
at random, and then scored between-iteration concordance between assignments
(using the adjusted Rand index). MixGlasso is highly stable with an adjusted
Rand index of 0.94 ± 0.03 over subsamples. To assess the stability of
individual clusters we additionally used a resampling approach due to Hennig (2007) that quantifies individual
cluster quality (Supplementary Fig. S3A).

MixGlasso analysis of TCGA protein data. (**A**) Model selection using
the Bayesian Information Criterion (BIC). (**B**) Mosaic plot of
cluster membership in the *K* = 8 novel clusters with respect to
the 11 TCGA cancer types. Dashed lines represent absent cancer **...**

Most of the clusters show a dominant cancer or lineage membership; accordingly we named the clusters after the dominant cancer(s). An exception is a cluster that spans 10/11 cancers and has no dominant cancer; we named this cluster ‘M’ (for mixed) and discuss it below. Figure 3C shows networks across all clusters, indicating cluster-specific as well as shared edges. Note that the partial correlations shown here are obtained by first estimating the cluster membership using MixGlasso, then merging proteins whose expression profiles are highly correlated with those of their phosphorylated forms, and finally re-estimating the networks using the graphical lasso. The merging step is necessary as otherwise the network would be dominated by (trivial) partial correlations between proteins and their phosphorylated forms. We also created a mosaic plot comparing the proposed clustering with the known cancer types (Fig. 3B, see also Supplementary Fig. S3B for the corresponding crosstabulation of the sample numbers). Comparison between cluster assignments and known cancer types shows an adjusted Rand index of 0.73.

We further compared our approach to results reported in Hoadley *et al*. (2014)
obtained using a consensus clustering approach (COCA) applied to six
different TCGA data types (whole-exome DNA sequencing, DNA copy-number
variation, DNA methylation, genome-wide mRNA gene expression, and RPPA
protein expression for 131 proteins). Supplementary Fig. S5
shows a comparison of the MixGlasso clustering with COCA, as well as with
the Pearson-Ward clustering forming part of the input to COCA (PW-131), and
the clustering in Akbani *et
al*. (2014) (PW-181). The comparison focuses on 2,809
TCGA samples in common across the analyses and a number of differences are
apparent. Interestingly, of the three clusterings that use protein data
only, MixGlasso most closely agrees with the integrative consensus
clustering (adjusted Rand Index 0.76, vs 0.65 for PW-131 and 0.6 for
PW-181).

The results below serve to illustrate the nature of the output that can be immediately obtained from MixGlasso. We note that further work will be needed to better understand these results and to more comprehensively test the robustness of the reported patterns.

*A pan-cancer, protein-defined cluster*.

Cluster M spans 10/11 of the TCGA cancer types and there is no clearly dominant cancer type. Cluster M has a distinct protein abundance profile (Supplementary Fig. S4), with (among other proteins) high abundance of Yap and Taz (which share the same target and overlapping functions) and low levels of ATM, RBM15 and MTOR, as well as low levels of some phospho-proteins (with the exception of pp27 and pCHK2) which could be a signal of DNA damage and cell cycle arrest.

*A subset of ‘super basal’ breast
cancers*.

From *n* = 747 BRCA samples in our study, 70 are
assigned to the cluster that contains the majority of the OVCA samples
(‘OVCA-like’), rather than to the cluster that contains the
majority of BRCA samples (‘BRCA-like’). Most of these samples
(67/70) belong to the basal subtype of breast cancer. Genomic similarities
between basal and high-grade serous ovarian samples have been previously
noted (The Cancer Genome Atlas Network,
2012). However, we find that only a subset of basal breast
samples (67 out of a total of 120 basal samples in the study) appear in the
OVCA-like cluster. We compared these samples (‘basal-OVCA’) to
the basal samples that remain clustered with BRCA samples
(‘basal-BRCA’; Fig. 4A,
B). We find that in basal-OVCA, a number of proteins show
significantly different abundance patterns that are typically associated
with the basal subtype; this includes lower abundance of the three markers
for ‘triple-negative’ breast cancer
(ER*α*, PR and pHER2; total HER2 is lower among
basal-OVCA, but not significantly so), as well as higher abundance of
several cyclins, including CyclinE1 and CyclinB1, and higher activation of
several PI3K pathway proteins, including total S6 and two forms of p4EBP1
(Fig. 4B) as well as FoxM1. In each
of these cases, the characteristics that define the basal breast cancer
subtype seem to be amplified in the OVCA-basal samples.

Large datasets are often heterogeneous and often such data are in a sense collections of smaller datasets with non-identical underlying models. Furthermore, models capable of capturing even a very approximate view of molecular interplay tend to be relatively complex. We think that these two factors mean that high-dimensional statistical ideas will play a central role in the emerging areas of precision, stratified and systems medicine, even when sample sizes become much larger than is currently the case. The methods we described should be applicable to most continuous molecular data types and can be run directly with minimal tuning.

We did not consider the case of truly large *p* in this
paper. The methods and ideas we discussed should be relevant to future work aimed at
that setting. At the same time, the challenges involved should not be
underestimated. As we showed, analyses involving moderate numbers of variables but
capturing patterns of interplay are already challenging and require care.

Heterogeneity at one level of biology does not necessarily imply heterogeneity at another. For example, differential expression may or may not be accompanied by differences in patterns of interplay, and differences in gene expression may or may not be accompanied by changes in functional protein levels. Thus, although the intertumoral genomic heterogeneity of cancers is now well established, the question of whether such heterogeneity appears also at the level of signaling proteins has remained open. DiffNet applied to high-quality protein data over thousands of patient samples supported the notion that cancers differ at the protein network level. Furthermore, we showed how MixGlasso could be used to reveal new examples of commonalities across subtypes, where a subset of samples from a certain type is closer to samples from a second type than it is to other samples of the first type.

It is important to note that the graphical models in our analyses are
*not* causal models *per se* and links may be
driven by non-causal associations, e.g. co-regulation by an unobserved confounder
(despite this caveat, many high-scoring links in the networks appeared consistent
with the biochemical literature). However, our approaches are rooted in a
principled, likelihood-based framework and it should be possible to extend the
methods towards causal models and interventional data.

**Supplementary information**: Supplementary data are available at
*Bioinformatics* online.

Click here to view.^{(5.0M, pdf)}

Click here to view.^{(249K, pdf)}

This work was supported in part by the National Institutes of Health (NCI P30 CA016672, U54HG008100, UO1CA168394 and U24CA199461 to G.B.M.) and the UK Medical Research Council (MC_UP_1302/1 and MC_UP_1302/3).

*Conflict of Interest*: none declared.

- Akbani R, et al. A pan-cancer proteomic perspective on The Cancer Genome Atlas. Nat Commun. 2014;5:3887. [PMC free article] [PubMed]
- Andersson R, et al. An atlas of active enhancers across human cell types and tissues. Nature. 2014;507:455–461. [PMC free article] [PubMed]
- Chen SX, Qin Y-L. A two-sample test for high-dimensional data with applications to gene-set testing. Ann Stat. 2010;38:808–835.
- De Smet R, Marchal K. Advantages and limitations of current network inference methods. Nat Rev Microbiol. 2010;8:717–729. [PubMed]
- Fraley C, et al. Technical Report Technical Report No. 597. Department of Statistics, University of Washington; 2012. mclust version 4 for R: Normal mixture modeling for model-based clustering, classification, and density estimation.
- Friedman J, et al. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9:432–441. [PMC free article] [PubMed]
- Hennig C. Cluster-wise assessment of cluster stability. Comput Stat Data Anal. 2007;52:258–271.
- Hoadley KA, et al. Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell. 2014;158:929–944. [PMC free article] [PubMed]
- Martens JH, Stunnenberg HG. BLUEPRINT: mapping human blood cell epigenomes. Haematologica. 2013;98:1487–1489. [PubMed]
- Mukherjee S, Hill SM. Network clustering: probing biological heterogeneity by sparse graphical models. Bioinformatics. 2011;27:994–1000. [PMC free article] [PubMed]
- Pan W, Shen X. Penalized model-based clustering with application to variable selection. J Mach Learn Res. 2007;8:1145–1164.
- Schäfer J, Strimmer K. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Stat Appl Genet Mol Biol. 2005;4 [PubMed]
- Shen R, et al. Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics. 2009;25:2906–2912. [PMC free article] [PubMed]
- Städler N, Mukherjee S. Penalized estimation in high-dimensional hidden Markov models with state-specific graphical models. Ann Appl Stat. 2013;7:2157–2179.
- Städler N, Mukherjee S. Multivariate gene-set testing based on graphical models. Biostatistics. 2015;16:47–59. [PubMed]
- Städler N, Mukherjee S. Two-sample testing in high-dimensional models. J R Stat Soc Ser B. 2017;79:225–246.
- The Cancer Genome Atlas. Comprehensive molecular portraits of human breast tumours. Nature. 2012;490:61–70. [PMC free article] [PubMed]
- Zhou H, et al. Penalized model-based clustering with unconstrained covariance matrices. Electronic J Stat. 2009;3:1473. [PMC free article] [PubMed]

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |