J Mach Learn Res. Author manuscript; available in PMC 2010 September 29.

Published in final edited form as:

J Mach Learn Res. 2010 May 1; 11: 1771–1798.

PMCID: PMC2947451

NIHMSID: NIHMS232200

Ryo Yoshida and Mike West

Ryo Yoshida, Department of Statistical Modeling, Institute of Statistical Mathematics, Minato-ku, Tokyo 106-8569, Japan;

Ryo Yoshida: yoshidar@ism.ac.jp; Mike West: mw@stat.duke.edu


We describe a class of sparse latent factor models, called graphical factor models (GFMs), and relevant sparse learning algorithms for posterior mode estimation. Linear, Gaussian GFMs have *sparse, orthogonal* factor loadings matrices that, in addition to sparsity of the implied covariance matrices, also induce conditional independence structures via zeros in the implied precision matrices. We describe the models and their use for robust estimation of sparse latent factor structure and data/signal reconstruction. We develop computational algorithms for model exploration and posterior mode search, addressing the hard combinatorial optimization involved in the search over a huge space of potential sparse configurations. A mean-field variational technique coupled with annealing is developed to successively generate “artificial” posterior distributions that, at the limiting temperature in the annealing schedule, define required posterior modes in the GFM parameter space. Several detailed empirical studies and comparisons to related approaches are discussed, including analyses of handwritten digit image and cancer gene expression data.

Bayesian sparse modelling in multivariate analysis is of increasing interest in applications as diverse as life science, economics and information science, and is driving a need for effective computational methods for learning model structure, i.e. sparse configurations. Parallel developments of sparse latent factor models (e.g. West, 2003; Griffiths and Ghahramani, 2006; Lucas et al., 2006; Wang et al., 2007; Archambeau and Bach, 2009; Carvalho et al., 2008; Guan and Dy, 2009; Rai and Daumé, 2009) and inherently sparsely structured graphical models (e.g. Jordan, 1999, 2004; Dobra et al., 2004; Jones et al., 2005; Carvalho and West, 2007) have explored Bayesian computations using a range of stochastic and deterministic search methods. With a view to scaling to higher dimensions and identification of regions of interest in model structure space, efficient and effective computation remains a challenge. We describe a previously undeveloped class of sparse graphical factor models (GFMs)—a subclass of linear, Gaussian latent factor models with sparse factor loadings that also induce sparse conditional independencies. In this context, we develop a computational technique for posterior mode evaluation using a hybrid of variational mean-field method (Attias, 1999; Wainwright and Jordan, 2008) and annealing-based optimization.

As a previously unexplored class of sparse (linear, Gaussian) factor models, the intrinsic graphical structure of the GFM arises from use of an orthogonal factor loadings matrix and appropriate scaling of its columns, together with the usual diagonal covariance matrix for latent factors (with no loss of generality). We show that this generally induces zero elements in the precision matrix of the GFM, as well as the covariance matrix. In particular, the zero entries in the covariance matrix have corresponding zeros in the precision matrix. We also show that covariance matrices of fitted values (i.e., “data reconstructions”) from such a model have the same sparse structure, and demonstrate aspects of robustness of the model in inferring variable-latent factor relationships in the presence of outliers. These properties are not shared in general by sparse factor models that lack the graphical structure on variables, nor of course by non-sparse approaches. These intrinsic properties of the GFM, along with relationships with earlier studies on sparse factor analyses, are discussed in Section 2.

Our *variational mean-field annealing algorithm (VMA2)* addresses the combinatorial optimization involved in aiming to compute approximate posterior modes for GFM parameters in the context of the huge space of zero/non-zero potential patterns in factor loadings. Using a prescribed schedule of decreasing temperatures, VMA2 successively generates tempered “artificial” posteriors that, at the limiting zero temperature, yield posterior modes for both GFM parameters and the 0/1 loadings indicators. Defined via an artificial, dynamic regularization on the posterior entropy of configured sparse structures, VMA2 is developed in Section 3.

Section 4 provides additional algorithmic details, including prior modelling for evaluating degree of sparseness, and a stochastic variant of VMA2 for higher-dimensional problems is described in Section 5. Performance and comparisons on artificial data appear in Section 6. Section 7 summarizes extensive, detailed empirical comparisons with related approaches in analyses of hand-written digit images and cancer gene expression data. Section 8 concludes with brief additional comments. A range of detailed supplementary materials, including extended discussion of the gene expression studies and R code, is accessible from http://daweb.ism.ac.jp/~yoshidar/anneals/.

Observed sample vectors **x**_{i} ∈ ℝ^{p}, *i* = 1 : *n*, are modelled as follows.

With **Z** as the *p* × *k* binary matrix whose elements *z*_{gj} configure the zero/non-zero pattern of factor loadings, the GFM has the form

$${\mathit{x}}_{i}={\mathbf{\Psi}}^{1/2}{\mathbf{\Phi}}_{Z}{\mathit{\lambda}}_{i}+{\mathit{\nu}}_{i}\phantom{\rule{0.5em}{0ex}}\mathrm{with}\phantom{\rule{0.5em}{0ex}}{\mathit{\lambda}}_{i}~N({\mathit{\lambda}}_{i}\mid \mathbf{0},\mathbf{\Delta})\phantom{\rule{0.5em}{0ex}}\mathrm{and}\phantom{\rule{0.5em}{0ex}}{\mathit{\nu}}_{i}~N({\mathit{\nu}}_{i}\mid \mathbf{0},\mathbf{\Psi})$$

(1)

where: *(a)* the factor loading matrix **Ψ**^{1/2}**Φ**_{Z} has **Φ**_{Z} sparse with orthonormal columns, the sparsity pattern being configured by **Z**; *(b)* **Δ** = diag(*δ*_{1},…, *δ*_{k}) is the diagonal covariance matrix of the latent factors; and *(c)* **Ψ** = diag(*ψ*_{1},…, *ψ*_{p}) is the diagonal noise covariance matrix. The implied covariance and precision matrices of **x**_{i} are

$$\mathbf{\sum}={\mathbf{\Psi}}^{1/2}\left\{\mathit{I}+{\mathbf{\Phi}}_{Z}\mathbf{\Delta}{\mathbf{\Phi}}_{Z}^{\prime}\right\}{\mathbf{\Psi}}^{1/2}\phantom{\rule{0.5em}{0ex}}\mathrm{and}\phantom{\rule{0.5em}{0ex}}{\mathbf{\sum}}^{-1}={\mathbf{\Psi}}^{-1/2}\left\{\mathit{I}-{\mathbf{\Phi}}_{Z}\mathit{T}{\mathbf{\Phi}}_{Z}^{\prime}\right\}{\mathbf{\Psi}}^{-1/2}$$

(2)

where **T** = diag(*τ*_{1},…, *τ*_{k}) with *τ*_{j} = *δ*_{j}/(1 + *δ*_{j}) for *j* = 1 : *k*. Zeros in **Φ**_{Z} therefore induce zeros in both **Σ** and **Σ**^{−1}, the latter defining the conditional independence graph of the GFM.
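As a quick numerical check of the identities in equation (2), the following Python sketch (ours, not from the paper; all dimensions, loadings and variances are hypothetical illustrations) verifies the closed-form precision and the shared zero pattern:

```python
import numpy as np

# Quick numerical check of the identities in equation (2); all dimensions,
# loadings and variances here are hypothetical illustrations.
p, k = 4, 2
Phi = np.zeros((p, k))              # sparse Phi_Z with disjoint column supports
Phi[:2, 0] = 1 / np.sqrt(2)
Phi[2:, 1] = 1 / np.sqrt(2)
assert np.allclose(Phi.T @ Phi, np.eye(k))   # orthonormal columns

delta = np.array([2.0, 0.5])                 # latent factor variances (Delta)
psi = np.array([0.1, 0.2, 0.3, 0.4])         # idiosyncratic noise variances (Psi)
tau = delta / (1 + delta)                    # T = diag(tau_1, ..., tau_k)

h = np.sqrt(psi)[:, None]                    # Psi^{1/2} as row/column scaling
Sigma = h * (np.eye(p) + Phi @ np.diag(delta) @ Phi.T) * h.T
Prec = (1 / h) * (np.eye(p) - Phi @ np.diag(tau) @ Phi.T) * (1 / h.T)

assert np.allclose(Prec, np.linalg.inv(Sigma))   # closed-form precision holds
# Zeros induced by Phi_Z appear in both the covariance and the precision:
assert abs(Sigma[0, 2]) < 1e-12 and abs(Prec[0, 2]) < 1e-12
```

The final assertions confirm that variable pairs loading on disjoint factors are both marginally and conditionally uncorrelated under the GFM.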

In general, a non-orthogonal factor model with the sparse loading matrix ** W**—a sparse extension of probabilistic PCA (Bishop, 1999, 2006)—has the form

$${\mathit{x}}_{i}=\mathit{W}{\mathit{\lambda}}_{i}+{\mathit{\nu}}_{i}\phantom{\rule{0.5em}{0ex}}\mathrm{with}\phantom{\rule{0.5em}{0ex}}{\mathit{\lambda}}_{i}~N(\mathbf{0},\mathit{I})\phantom{\rule{0.5em}{0ex}}\mathrm{and}\phantom{\rule{0.5em}{0ex}}{\mathit{\nu}}_{i}~N(\mathbf{0},\mathbf{\Psi}).$$

The GFM arises when a singular value decomposition is applied to the scaled factor loading matrix **Ψ**^{−1/2}**W** = **Φ****Δ**^{1/2}**R**′, with orthonormal **Φ**, diagonal **Δ**^{1/2} and orthogonal **R**; rotating the factors by **R**′ recovers the form of equation (1). Under the non-orthogonal model, the standard data reconstruction is

$$\widehat{\mathit{x}}({\mathit{x}}_{i}):=\mathit{W}\mathrm{E}[{\mathit{\lambda}}_{i}\mid {\mathit{x}}_{i}]=\mathit{W}{\mathit{W}}^{\prime}{(\mathit{W}{\mathit{W}}^{\prime}+\mathbf{\Psi})}^{-1}{\mathit{x}}_{i}.$$

Then, asymptotically,

$$\frac{1}{n}\sum _{i=1}^{n}\widehat{\mathit{x}}({\mathit{x}}_{i})\widehat{\mathit{x}}{({\mathit{x}}_{i})}^{\prime}\stackrel{p}{\to}\u2102\mathit{ov}[\widehat{\mathit{x}}({\mathit{x}}_{i})]=\mathit{W}{\mathit{W}}^{\prime}{(\mathit{W}{\mathit{W}}^{\prime}+\mathbf{\Psi})}^{-1}\mathit{W}{\mathit{W}}^{\prime}$$

(3)

and this is generally a non-sparse matrix (no zero entries) even though **W** is sparse. This is an inconsistency in the sense that data reconstructions should be expected to share the dominant patterns of covariance sparsity evident in the original covariance matrix **WW**′ + **Ψ**. The GFM does not suffer this inconsistency: the covariance matrix of its reconstructions inherits the sparse structure of **Σ**.
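The fill-in effect of equation (3) is easy to demonstrate numerically; in this sketch (with a hypothetical sparse **W** and **Ψ**), a zero of **WW**′ is filled in by the reconstruction covariance:

```python
import numpy as np

# Illustration of equation (3): with a sparse, non-orthogonal W, the
# reconstruction covariance W W'(W W' + Psi)^{-1} W W' is generally dense.
# W and Psi below are hypothetical.
W = np.array([[1.0, 0.0],
              [0.8, 0.0],
              [0.0, 1.0],
              [0.3, 0.9]])
Psi = 0.1 * np.eye(4)

WWt = W @ W.T
C = WWt @ np.linalg.inv(WWt + Psi) @ WWt   # covariance of reconstructions

assert np.isclose(WWt[0, 2], 0.0)   # rows 0 and 2 load on disjoint factors...
assert abs(C[0, 2]) > 1e-6          # ...yet the reconstruction fills it in
```

The non-zero entry arises because the shared fourth variable couples the two factors through the matrix inverse.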

Another feature of the GFM is a form of robustness conferred by the implied graphical structure. Consider an example of 4 variables
${x}_{i}^{\prime}=({x}_{i1},{x}_{i2},{x}_{i3},{x}_{i4})$ and 2 factors
${\mathit{\lambda}}_{i}^{\prime}=({\mathit{\lambda}}_{i1},{\mathit{\lambda}}_{i2})$ with two cliques in the conditional independence graph, each factor loading only on the variables of one clique. Outlying values contaminating variables in one clique then perturb inference only on the factor associated with that clique, leaving the other variable-latent factor relationships intact; a non-graphical sparse factor model offers no such localization.

Denote by **Θ** the full set of parameters **Θ** = {**Φ, Δ, Ψ**}. Our computations aim to explore model structures **Z** and corresponding posterior modes of parameters **Θ** given the data **X** = {**x**_{1},…, **x**_{n}}.

The likelihood function is

$$p(\mathit{X}\mid \mathit{Z},\mathbf{\Theta})\propto \mid \mathbf{\Psi}{\mid}^{-n/2}\mid \mathit{I}-\mathit{T}{\mid}^{n/2}\mathrm{etr}(-\mathit{S}{\mathbf{\Psi}}^{-1}/2+{\mathbf{\Psi}}^{-1/2}\mathit{S}{\mathbf{\Psi}}^{-1/2}{\mathbf{\Phi}}_{Z}\mathit{T}{\mathbf{\Phi}}_{Z}^{\prime}/2)$$

(4)

where etr(**A**) = exp(trace(**A**)) and **S** = ∑_{i=1:n} **x**_{i}**x**_{i}′ is the sample sum-of-squares matrix. The key term in the exponent of (4) can be written as

$$\mathrm{trace}({\mathbf{\Psi}}^{-1/2}\mathit{S}{\mathbf{\Psi}}^{-1/2}{\mathbf{\Phi}}_{Z}\mathit{T}{\mathbf{\Phi}}_{Z}^{\prime})=\sum _{j=1}^{k}{\tau}_{j}{\mathit{\varphi}}_{\mathit{zj}}^{\prime}{\mathbf{\Psi}}^{-1/2}\mathit{S}{\mathbf{\Psi}}^{-1/2}{\mathit{\varphi}}_{\mathit{zj}}$$

(5)

where **ϕ**_{zj} is column *j* of **Φ**_{Z}, that is, **ϕ**_{j} with zeros imposed according to *z*_{j}.

Priors over non-zero factor loadings may reflect substantive *a priori* knowledge if available, and will then be inherently context specific. For examples here, however, we use uniform priors *p*(**Θ** | **Z**) for exposition. Note that, on the critical factor loadings elements, the uniform prior is proper, since the columns of **Φ** are constrained to the compact set of orthonormal frames.

Concerning the sparse structure **Z**, we adopt independent Bernoulli priors on the binary variates *z*_{gj}, parametrized through hyperparameters *ζ*_{gj} introduced below so that larger values of *ζ*_{gj} favour *z*_{gj} = 0, i.e., greater sparsity.

Conditional on the *p* × *k* matrix of sparsity control hyperparameters **ζ**, whose elements are the *ζ*_{gj}, the log posterior over (**Z**, **Θ**) is, up to an additive constant,

$$2\phantom{\rule{0.2em}{0ex}}\mathrm{log}\phantom{\rule{0.2em}{0ex}}p(\mathit{Z},\mathbf{\Theta}\mid \mathit{X},\mathit{\zeta})=2\phantom{\rule{0.2em}{0ex}}\mathrm{log}\phantom{\rule{0.2em}{0ex}}p(\mathbf{\Theta}\mid \mathit{Z})-\sum _{g=1}^{p}\sum _{j=1}^{k}{z}_{\mathit{gj}}{\zeta}_{\mathit{gj}}-\sum _{g=1}^{p}(n\phantom{\rule{0.2em}{0ex}}\mathrm{log}\phantom{\rule{0.2em}{0ex}}{\psi}_{g}+{s}_{\mathit{gg}}{\psi}_{g}^{-1})+\sum _{j=1}^{k}(n\phantom{\rule{0.2em}{0ex}}\mathrm{log}\phantom{\rule{0.2em}{0ex}}(1-{\tau}_{j})+{\tau}_{j}{\mathit{\varphi}}_{\mathit{zj}}^{\prime}{\mathbf{\Psi}}^{-1/2}\mathit{S}{\mathbf{\Psi}}^{-1/2}{\mathit{\varphi}}_{\mathit{zj}}).$$

(6)

The first two terms in (6) arise from the specified priors for **Θ** and **Z**, respectively. The quadratic form in the last term is
${\mathit{\varphi}}_{\mathit{zj}}^{\prime}{\mathbf{\Psi}}^{-1/2}\mathit{S}{\mathbf{\Psi}}^{-1/2}{\mathit{\varphi}}_{\mathit{zj}}={\mathit{\varphi}}_{j}^{\prime}\mathit{S}({z}_{j},\mathrm{\Psi}){\mathit{\varphi}}_{j}$ for each *j* = 1 : *k*, where the elements of *S*(*z*_{j}, **Ψ**) are

$${(\mathit{S}({z}_{j},\mathrm{\Psi}))}_{\mathit{gh}}={z}_{\mathit{gj}}{z}_{\mathit{hj}}{s}_{\mathit{gh}}{({\psi}_{g}{\psi}_{h})}^{-1/2},\phantom{\rule{0.4em}{0ex}}\mathrm{for}\phantom{\rule{0.2em}{0ex}}g,h=1:p.$$

(7)

The (relative) signal-to-noise ratios *τ*_{j} = *δ*_{j}/(1 + *δ*_{j}) ∈ [0, 1) weight the contribution of each factor in (6); *τ*_{j} = 0 corresponds to removal of factor *j* from the model.

Optimizing (6) over **Θ** and **Z** involves many discrete variables and a consequent combinatorial computational challenge. Greedy hill-climbing approaches often become trapped, quickly, in poor local solutions. The VMA2 method of Section 3 addresses this.

In the MAP estimation defined by (6), there are evident connections with traditional sparse principal component analyses (sparse PCA; Jolliffe et al., 2003; Zou et al., 2006; d’Aspremont et al., 2007). If **Ψ** = **I** and all *z*_{gj} = 1, maximization of (6) over **Φ** recovers the leading principal component directions of *S*; sparse configurations **Z** thus link (6) directly to sparse PCA formulations.

The direct sparse PCA of d’Aspremont et al. (2007) imposes an upper bound *d* > 0 on the cardinality of *z*_{j} (the number of non-zero elements), with a resulting semidefinite program of computational complexity
$O({p}^{4}\sqrt{\mathit{log}(p)})$. The applicability of that approach is therefore limited to problems with modest numbers of variables.

The SCoTLASS algorithm of Jolliffe et al. (2003) uses ℓ_{1}-regularization on loading vectors, later extended to SPCA using elastic nets by Zou et al. (2006). Recently, Mairal et al. (2009) presented an ℓ_{1}-based dictionary learning method for sparse coding that explores sparsity in the factor-sample mapping rather than in the factor-variable relations. Setting Laplace-like prior distributions on factor loadings is the Bayesian counterpart of ℓ_{1}-based penalization (Jolliffe et al., 2003; Zou et al., 2006). However, our model-based perspective aims for a more fully probabilistic analysis, with advantages in probabilistic assessment of the appropriate dimension of the latent factor space as well as flexibility in determining effective degrees of sparseness via the additional parameters **ζ**. Beyond these optimization-based approaches, there is a growing Bayesian literature on sparse factor analysis, discussed next.

Key advances in Bayesian sparse factor analysis build on non-parametric Bayesian modelling in Griffiths and Ghahramani (2006) and Rai and Daumé (2009), and developments in Carvalho et al. (2008) stemming from the original sparse Bayesian models in West (2003). Carvalho et al. develop MCMC and stochastic search methods for posterior exploration. MCMC posterior sampling can be effective but is hugely challenged as the dimensions of data and factor variables increase. Our focus here is MAP evaluation with a view to scaling to increasingly large dimensions, and we leave open the opportunities for future work on MCMC methods in GFMs.

Most importantly, as remarked in Section 2.2, the GFM differs from some of the foregoing models in the conditional independence graphical structures induced. This characteristic contributes to preserving sparse structure in the data compression/reconstruction process and also to the outlier robustness issue. We leave further comparative discussion to Section 7.1, where we evaluate some of the foregoing methods relative to the sparse GFM analysis in an image processing study.

Relative to (6), consider the class of extended objective functions

$${G}_{T}(\mathbf{\Theta},\mathit{\omega})=\sum _{Z\in Z}\omega (\mathit{Z})\phantom{\rule{0.2em}{0ex}}\mathrm{log}\phantom{\rule{0.2em}{0ex}}p(\mathit{X},\mathit{Z},\mathbf{\Theta}\mid \mathit{\zeta})-T\sum _{Z\in Z}\omega (\mathit{Z})\phantom{\rule{0.2em}{0ex}}\mathrm{log}\phantom{\rule{0.2em}{0ex}}\omega (\mathit{Z})$$

(8)

where *ω*(**Z**), the *artificial configuration distribution*, is a probability distribution over the space of sparse structures **Z**, and *T* ≥ 0 is a temperature parameter; the second term is *T* times the entropy of *ω*.

Now, view (8) as a criterion to maximize over (**Θ, ω**) jointly for any given temperature *T*. A key property is the following.

*For any given parameters* **Θ** *and temperature T, (*8*) is maximized with respect to ω at*

$${\mathit{\omega}}_{T}(\mathit{Z})\propto p{(\mathit{Z}\mid \mathit{X},\mathbf{\Theta},\mathit{\zeta})}^{1/T}.$$

(9)

See the Appendix.

For any given **Θ**, a large *T* leads to *ω*_{T}(**Z**) being diffuse, close to uniform over configurations, while decreasing *T* concentrates *ω*_{T}(**Z**) around the conditional mode of *p*(**Z** | **X**, **Θ**, **ζ**). In particular:

- As *T* → 0, *ω*_{T}(*Z*) converges to a distribution degenerate at the conditional mode *Z*\*(**Θ**, *ζ*) of *p*(*Z* | *X*, **Θ**, *ζ*), so that
- joint maximization of *G*_{T}(**Θ**, *ω*) approaches the *global maximum of the exact posterior p*(*Z*, **Θ** | *X*, *ζ*) as *T* → 0.

The notion of the annealing operation is to realize a gradual move of successively-generated solutions for **Θ** and *ω*_{T}(**Z**) towards this limit, tracking maximizers of *G*_{T} as *T* decreases to zero along the cooling schedule.

To define and implement a specific algorithm, we constrain the otherwise arbitrary “*artificial configuration probabilities*” ** ω**, and do so using a construction that induces analytic tractability. We specify the simplest, factorized form

$$\omega (\mathit{Z})=\prod _{g=1}^{p}\prod _{j=1}^{k}\omega ({z}_{\mathit{gj}}):=\prod _{g=1}^{p}\prod _{j=1}^{k}{\omega}_{\mathit{gj}}^{{z}_{\mathit{gj}}}{(1-{\omega}_{\mathit{gj}})}^{1-{z}_{\mathit{gj}}}$$

in the same way as conventional Variational Bayes (VB) procedures. In this GFM context, this independence relaxation eases the resulting optimization by making the conditional expectation in the first term of (8) tractable.

If *T* = 1 then, given the factorized **ω**, the objective function *G*_{1}(**Θ**, **ω**) lower-bounds the log marginal posterior:

$$\mathrm{log}\sum _{Z\in Z}p(\mathit{X},\mathbf{\Theta},\mathit{Z}\mid \mathit{\zeta})\ge {G}_{1}(\mathbf{\Theta},\mathit{\omega}).$$

The lower-bound *G*_{1} is the criterion that the conventional VB methods aim to maximize (Wainwright and Jordan, 2008). This indicates that any solutions corresponding to the VB inference can be obtained by stopping the cooling schedule at *T* = 1 in our method. Similar ideas have, of course, been applied in deterministic annealing EM and annealed VB algorithms (e.g. Ueda and Nakano, 1998). These methods exploit annealing schemes to escape from local traps during coordinate-basis updates in aiming to define variational approximations of posteriors.

Even with this relaxation, maximization over *ω*(**Z**) cannot be performed for all elements simultaneously; we therefore maximize over each *ω*_{gj} conditional on all others, yielding

$${\omega}_{\mathit{gj}}(T)\propto \phantom{\rule{0.2em}{0ex}}\mathrm{exp}\phantom{\rule{0.2em}{0ex}}\left\{\frac{1}{T}\sum _{{Z}_{C\backslash \left\{g,j\right\}}}\prod _{h\ne g}\prod _{l\ne j}\omega ({z}_{\mathit{hl}})\phantom{\rule{0.2em}{0ex}}\mathrm{log}\phantom{\rule{0.2em}{0ex}}p({z}_{\mathit{gj}}=1\mid \mathit{X},{\mathit{Z}}_{C\backslash \left\{g,j\right\}},\mathbf{\Theta},\mathit{\zeta})\right\}$$

(10)

where *C* denotes the collection of all indices (*g, j*) for the *p* features and *k* factor variables, *C* \ {*g, j*} is the set of the paired indices (*h, l*) such that (*h, l*) ≠ (*g, j*), and *z*_{C\{g,j}} stands for the set of *z*_{hl}s other than *z*_{gj}.

Starting with *ω*_{gj} = 1/2 at an initial large value of *T*, the successively tempered updates drive each *ω*_{gj} towards 0 or 1 as *T* ↓ 0, with limiting value

$${\widehat{z}}_{\mathit{gj}}:=\underset{T\downarrow 0}{\mathrm{lim}}{\omega}_{\mathit{gj}}(T)=\{\begin{array}{cc}1,& \mathrm{if}\sum _{{Z}_{C\backslash \left\{g,j\right\}}}\prod _{h\ne g}\prod _{l\ne j}\omega ({z}_{hl})\phantom{\rule{0.2em}{0ex}}\mathrm{log}\phantom{\rule{0.2em}{0ex}}\frac{p({z}_{\mathit{gj}}=1,\mathit{X},{\mathit{Z}}_{C\backslash \left\{g,j\right\}},\mathbf{\Theta},\mathit{\zeta})}{p({z}_{\mathit{gj}}=0,\mathit{X},{\mathit{Z}}_{C\backslash \left\{g,j\right\}},\mathbf{\Theta},\mathit{\zeta})}>0,\\ 0,& \mathrm{otherwise}.\end{array}$$

It remains true that, at the limiting zero temperature, the global maximum of *G*_{T}(**Θ**, **ω**) coincides with that of the exact log posterior:

$$\underset{Z}{\mathrm{sup}\phantom{\rule{0.2em}{0ex}}}\mathrm{log}\phantom{\rule{0.2em}{0ex}}p(\mathit{X},\mathbf{\Theta},\mathit{Z}\mid \mathit{\zeta})=\underset{\omega}{\mathrm{sup}\phantom{\rule{0.2em}{0ex}}}{G}_{0}(\mathbf{\Theta},\mathit{\omega})$$

(11)

with the supremum over **ω** attained at the point mass *ω*(**Z**) = 1 at the maximizing configuration.

It is stressed that the coordinate-basis updates (10) cannot, of course, guarantee convergence to the *global* optimum even with prescribed annealing. Nevertheless, VMA2 represents a substantial advance in its ability to move more freely and escape local mode traps. We also note the generality of the idea, beyond factor models and also potentially using penalty functions other than entropy.

Computations alternate between conditional maximization steps for **ω** and for the components of **Θ** = {**Φ, Δ, Ψ**}, together with updates for **ζ**, along the cooling schedule:

1. Set a cooling schedule *τ* = {*T*_{1},…, *T*_{d}} of length *d* where *T*_{d} = 0;
2. Set *ζ*;
3. Initialize **Θ**;
4. Initialize *ω*(*Z*);
5. *i* ← 0;
6. while ({the loop is not converged} ∧ {*i* ≤ *d*})
7. *i* ← *i* + 1;
8. Compute configuration probabilities *ω*_{gj}(*T*_{i});
9. Optimize with respect to each column **ϕ**_{j} (*j* = 1 : *k*) of **Φ** in turn under full-conditioning;
10. Optimize with respect to **Δ** under full-conditioning;
11. Optimize with respect to **Ψ** under full-conditioning;
12. Optimize with respect to *ζ* under full-conditioning;
13. end while
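The alternating loop above can be sketched in code. The following Python skeleton is our illustration only: the `updates` callables are placeholders standing in for the full-conditional optimizations detailed in the subsections that follow.

```python
import numpy as np

def vma2(Theta, omega, schedule, updates):
    """Sketch of the annealed alternating loop above.

    `updates` maps each block name to a callable (Theta, omega, T) returning
    the updated quantity; these stand in for the full-conditional
    optimizations of the following subsections.
    """
    for T in schedule:                                  # cooling schedule
        omega = updates['omega'](Theta, omega, T)       # configuration probs
        for key in ('Phi', 'Delta', 'Psi', 'zeta'):     # conditional blocks
            Theta[key] = updates[key](Theta, omega, T)
    return Theta, omega

# Trivial demonstration with no-op placeholder updates:
noop = lambda key: (lambda Theta, omega, T: Theta[key])
updates = {'omega': (lambda Theta, omega, T: omega),
           'Phi': noop('Phi'), 'Delta': noop('Delta'),
           'Psi': noop('Psi'), 'zeta': noop('zeta')}
Theta, omega = vma2({'Phi': 0, 'Delta': 0, 'Psi': 0, 'zeta': 0}, 0.5,
                    np.linspace(3.0, 0.0, 5), updates)
```

Convergence checking within the loop is omitted here; a practical implementation would exit early when successive updates change negligibly.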

We now summarize key components in the iterative, annealed computation defined above.

First consider maximization with respect to each sparse configuration probability *ω*_{gj} conditional on all others. We note that the first term in (8) involves the expectation over **Z** under *ω* of the quadratic forms in (6); in particular,

$${\mathrm{E}}_{\mathit{\omega}}[\mathit{S}({z}_{j,}\mathrm{\Psi})]={\mathbf{\Omega}}_{j}\circ ({\mathbf{\Psi}}^{-1/2}\mathit{S}{\mathbf{\Psi}}^{-1/2})\phantom{\rule{0.2em}{0ex}}\mathrm{with}\phantom{\rule{0.2em}{0ex}}{({\mathbf{\Omega}}_{j})}_{\mathit{gh}}=\{\begin{array}{cc}{\omega}_{\mathit{gj},}& \mathrm{if}\phantom{\rule{0.2em}{0ex}}g=h,\\ {\omega}_{\mathit{gj}}{\omega}_{\mathit{hj},}& \mathrm{otherwise}.\end{array}$$

(12)

Introduce the notation **Ψ**^{−1/2}*S***Ψ**^{−1/2} = (*s*_{1}(**Ψ**),…, *s*_{p}(**Ψ**)) for the columns of the scaled sum-of-squares matrix, and define

$${\stackrel{~}{\omega}}_{\mathit{gj}}={({\omega}_{1j},\dots ,{\omega}_{g-1,j},1,{\omega}_{g+1,j},\dots ,{\omega}_{\mathit{pj}})}^{\prime}.$$

Then, setting to zero the partial derivative of (8) with respect to *ω*_{gj}, conditional on all other elements, yields

$$\mathrm{logit}({\omega}_{\mathit{gj}}(T))={H}_{\mathit{gj}}({\zeta}_{\mathit{gj}})/T\phantom{\rule{0.2em}{0ex}}\mathrm{where}\phantom{\rule{0.2em}{0ex}}{H}_{\mathit{gj}}({\zeta}_{\mathit{gj}}):={\tau}_{j}{\varphi}_{\mathit{gj}}{({\mathit{\varphi}}_{j}\circ {\stackrel{~}{\mathit{\omega}}}_{\mathit{gj}})}^{\prime}{\mathit{s}}_{g}(\mathbf{\Psi})-{\zeta}_{\mathit{gj}}.$$

(13)

This directly yields the conditional maximizer for *ω*_{gj} in terms of the tempered negative energy: *ω*_{gj}(*T*) = 1/(1 + exp(−*H*_{gj}(*ζ*_{gj})/*T*)).
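Equation (13) states logit(*ω*_{gj}(*T*)) = *H*_{gj}(*ζ*_{gj})/*T*, a tempered sigmoid. The fragment below (with hypothetical *H* values) illustrates how the probabilities flatten towards 1/2 at high temperature and harden to 0/1 indicators as *T* ↓ 0:

```python
import numpy as np

def omega_update(H, T):
    # Equation (13): logit(omega_gj(T)) = H_gj(zeta_gj)/T, a tempered sigmoid.
    return 1.0 / (1.0 + np.exp(-H / T))

H = np.array([2.0, -1.5, 0.3])    # hypothetical negative-energy values H_gj
hot = omega_update(H, T=10.0)     # high temperature: probabilities near 1/2
cold = omega_update(H, T=0.01)    # near-zero temperature: hard 0/1 indicators

assert np.all(np.abs(hot - 0.5) < 0.1)
assert np.allclose(cold, (H > 0).astype(float))
```

The cold limit reproduces the hard threshold on the sign of *H*_{gj} given in the display above Section 3.3.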

The terms in (8) that involve **Φ** are simply the expectations of the quadratic forms in the last term of (6), with the term for each column **ϕ**_{j} involving the key matrix E_{ω}[*S*(*z*_{j}, **Ψ**)] of (12).

Conditional optimization then reduces to the following: for each *j* = 1 : *k*, sequence through each column **ϕ**_{j} in turn and at each step solve

$$\begin{array}{cc}\hfill \underset{{\varphi}_{j}}{\mathrm{maximize}}& {\mathit{\varphi}}_{j}^{\prime}{\mathrm{E}}_{\omega}[\mathit{S}({z}_{j},\mathbf{\Psi})]{\mathit{\varphi}}_{j}\hfill \\ \hfill \mathrm{subject\; to}& {\mathit{\varphi}}_{j}^{\prime}{\mathit{\varphi}}_{j}=1\phantom{\rule{0.5em}{0ex}}\mathrm{and}\phantom{\rule{0.5em}{0ex}}{\mathit{\varphi}}_{m}^{\prime}{\mathit{\varphi}}_{j}=0\phantom{\rule{0.2em}{0ex}}\mathrm{for}\phantom{\rule{0.2em}{0ex}}m\ne j,m=1:k.\hfill \end{array}$$

(14)

The optimization conditions on the most recently updated values of all other columns *m* ≠ *j* at each step, and is performed as one sweep at line 9 of the algorithm of Section 4.1. Column order can be chosen randomly or systematically each time while still maintaining convergence. In this step, we stress that the original orthogonality condition is relaxed from
${\mathbf{\Phi}}_{Z}^{\prime}{\mathbf{\Phi}}_{Z}=\mathit{I}$ to ${\mathbf{\Phi}}^{\prime}\mathbf{\Phi}=\mathit{I}$ in (14). It remains the case that iteratively refined estimates obtained from (14) satisfy the original condition at the limiting zero temperature, since the E_{ω}[*S*(*z*_{j}, **Ψ**)] then become exactly sparse.

The specific computations required for the conditional optimization in (14) are as follows (with supporting details in the Appendix). Note that the central matrices E_{ω}[*S*(*z*_{j}, **Ψ**)] are held fixed throughout the sweep over columns. For each *j* = 1 : *k*:

- Compute the *p* × (*k* − 1) matrix **Φ**_{(−j)} = {**ϕ**_{m}}_{m≠j} by simply deleting column *j* from **Φ**;
- Compute the *p* × *p* projection matrix ${\mathit{N}}_{j}={\mathit{I}}_{p}-{\mathbf{\Phi}}_{(-j)}{\mathbf{\Phi}}_{(-j)}^{\prime}$;
- Compute the eigenvector **u**_{j} corresponding to the most dominant eigenvalue of **N**_{j}E_{ω}[*S*(*z*_{j}, **Ψ**)]**N**_{j};
- Compute the required optimal vector **ϕ**_{j} = **N**_{j}**u**_{j}/‖**N**_{j}**u**_{j}‖.
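The projection-and-eigenvector steps for one column can be sketched as follows; the matrix `M` stands in for E_{ω}[*S*(*z*_{j}, **Ψ**)] and all sizes are hypothetical:

```python
import numpy as np

# Sketch of the projected-eigenvector update for one column phi_j; the
# matrix M stands in for E_omega[S(z_j, Psi)] and all sizes are hypothetical.
rng = np.random.default_rng(0)
p, k, j = 6, 3, 1
Phi = np.linalg.qr(rng.standard_normal((p, k)))[0]   # orthonormal columns
M = rng.standard_normal((p, p))
M = M @ M.T                                          # symmetric PSD stand-in

Phi_minus = np.delete(Phi, j, axis=1)                # drop column j
N = np.eye(p) - Phi_minus @ Phi_minus.T              # projector N_j
vals, vecs = np.linalg.eigh(N @ M @ N)               # eigh: ascending order
u = vecs[:, -1]                                      # dominant eigenvector
phi_j = N @ u / np.linalg.norm(N @ u)                # optimal unit column

assert np.isclose(phi_j @ phi_j, 1.0)                # unit length
assert np.allclose(Phi_minus.T @ phi_j, 0.0)         # orthogonal to others
```

Because the projector annihilates the other columns exactly, the updated column satisfies both constraints of (14) by construction.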

This procedure solves (14) by optimizing over an eigenvector already constrained by the orthogonality conditions. Here **N**_{j} projects onto the null space of the current **Φ**_{(−j)}, so the updated **ϕ**_{j} is automatically orthogonal to all other columns.

The variances *δ*_{j} of the latent factors appear in equations (6) and (8) only through the sum over *j* = 1 : *k*; the term for factor *j*, as a function of *δ*_{j}, is

$$-n\phantom{\rule{0.2em}{0ex}}\mathrm{log}(1+{\delta}_{j})+{\delta}_{j}{(1+{\delta}_{j})}^{-1}{\mathit{\varphi}}_{j}^{\prime}{\mathrm{E}}_{\omega}[\mathit{S}({z}_{j},\mathrm{\Psi})]{\mathit{\varphi}}_{j}.$$

(15)

This is unimodal in *δ _{j}* with maximizing value

$${\widehat{\delta}}_{j}=\mathrm{max}\{0,{n}^{-1}{\mathit{\varphi}}_{j}^{\prime}{\mathrm{E}}_{\omega}[\mathit{S}({z}_{j},\mathrm{\Psi})]{\mathit{\varphi}}_{j}-1\},$$

(16)

and so the update at line 10 of the MAP algorithm of Section 4.1 computes these values in parallel for each factor *j* = 1 : *k*. Note that this may generate zero values, indicating removal of the corresponding factors from the model, and so inducing an intrinsic ability to prune redundant factors in a model specified initially with a larger, encompassing value of *k*. The configured sparse structure drives this pruning; any factor *j* that is inherently very sparse generates a smaller value of the projected “variance explained”
${\mathit{\varphi}}_{j}^{\prime}{\mathrm{E}}_{\omega}[\mathit{S}({z}_{j},\mathrm{\Psi})]{\mathit{\varphi}}_{j}$, and so can lead to ${\widehat{\delta}}_{j}=0$ as a result.
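The update (16) is a one-liner; in this sketch, `q` stands for the projected variance explained and the numerical values are hypothetical:

```python
def delta_update(q, n):
    # Equation (16): delta_j = max{0, q/n - 1}, with q the projected
    # "variance explained" phi_j' E_omega[S(z_j, Psi)] phi_j.
    return max(0.0, q / n - 1.0)

# Hypothetical values: an informative factor is retained, while a very
# sparse factor is pruned (delta_j = 0, hence tau_j = 0).
assert delta_update(q=350.0, n=100) == 2.5
assert delta_update(q=60.0, n=100) == 0.0
```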

The diagonal noise covariance matrix **Ψ** appears in the objective function of equation (8) in terms that can be re-expressed as

$$-n\phantom{\rule{0.2em}{0ex}}\mathrm{log}\phantom{\rule{0.2em}{0ex}}\mid \mathbf{\Psi}\mid -\phantom{\rule{0.2em}{0ex}}\mathrm{trace}(\mathit{S}{\mathbf{\Psi}}^{-1})+\sum _{j=1}^{k}{\tau}_{j}\mathrm{trace}({\mathit{\varphi}}_{j}{\mathit{\varphi}}_{j}^{\prime}{\mathbf{\Psi}}^{-1/2}({\mathbf{\Omega}}_{j}\circ \mathit{S}){\mathbf{\Psi}}^{-1/2})$$

where *τ*_{j} = *δ*_{j}/(1 + *δ*_{j}) as before. Setting to zero the gradient with respect to the diagonal elements of **Ψ** gives

$$n{\mathrm{diag}}^{-1}({\mathbf{\Psi}}^{1/2})-{\mathrm{diag}}^{-1}(\mathit{S}{\mathbf{\Psi}}^{-1/2})+\mathrm{\sum}_{j=1}^{k}{\tau}_{j}{\mathrm{diag}}^{-1}({\mathit{\varphi}}_{j}{\mathit{\varphi}}_{j}^{\prime}{\mathbf{\Psi}}^{-1/2}({\mathbf{\Omega}}_{j}\circ \mathit{S}))=\mathbf{0},$$

(17)

where diag^{−1}(**A**) denotes the vector of diagonal elements of **A**. Solving yields the fixed-point update

$${\mathrm{diag}}^{-1}(\mathbf{\Psi})={n}^{-1}{\mathrm{diag}}^{-1}(\{{\mathit{I}}_{p}-\sum _{j=1}^{k}{\tau}_{j}({\mathit{\varphi}}_{j}{\mathit{\varphi}}_{j}^{\prime})\circ ({\mathbf{\Psi}}^{-1/2}{\mathbf{\Omega}}_{j}{\mathbf{\Psi}}^{1/2})\}\mathit{S}).$$

(18)

The prior over the logistic hyperparameters **ζ** = {*ζ*_{gj}} comprises independent truncated normal distributions, *ζ*_{gj} ~ *N*(*μ*, *σ*) restricted to *ζ*_{gj} > 0. Given the current *ω*_{gj}, the stationarity condition for each *ζ*_{gj} is

$${\omega}_{\mathit{gj}}=\frac{\mathrm{exp}(-{\zeta}_{\mathit{gj}}/2)}{1+\mathrm{exp}(-{\zeta}_{\mathit{gj}}/2)}-\frac{{\zeta}_{\mathit{gj}}-\mu}{2\sigma}.$$

(19)

Solutions to (19) are trivially, iteratively computed. Evidently, as *ω*_{gj} approaches 0 or 1, the solution for *ζ*_{gj} moves to correspondingly larger or smaller values, adapting the implied prior sparsity to the configured structure.
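Equation (19) can be solved for *ζ*_{gj} by any one-dimensional root finder, since its left-hand side minus its right-hand side is monotone decreasing in *ζ*_{gj}. The bisection sketch below uses hypothetical prior values for *μ* and *σ*:

```python
import numpy as np

def solve_zeta(omega, mu=2.0, sigma=4.0, lo=-50.0, hi=50.0, iters=100):
    """Solve equation (19) for zeta_gj given omega_gj by bisection.

    The residual g below is monotone decreasing in zeta, so bisection
    converges; mu and sigma are hypothetical prior hyperparameter values.
    """
    def g(z):
        # sigmoid(-z/2) - (z - mu)/(2 sigma) - omega, the residual of (19)
        return 1.0 / (1.0 + np.exp(z / 2.0)) - (z - mu) / (2.0 * sigma) - omega
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

# Smaller omega_gj (a sparser configuration) yields a larger zeta_gj:
assert solve_zeta(omega=0.1) > solve_zeta(omega=0.9)
```

Any standard scalar solver would serve equally; bisection is shown only because monotonicity makes its convergence immediate.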

As mentioned earlier, the choice of this logit/truncated normal prior is a subjective preference and could be replaced by others, such as beta priors. Again, we expect that this would typically lead to modest differences, if any of practical relevance, in many cases.

In problems with larger numbers of variables, the computations quickly become challenging, especially in view of the repeated eigen-decompositions required for updating the factor loading matrix. In our examples and experiments, analysis with dimensions *p* ~ 500 is feasible using our own R code (vma2gfm(), available from the supplementary web site), but computation time then increases rapidly with *p*. More efficient low-level coding will speed this, but it is nevertheless of interest to explore additional opportunities for increasing the efficiency of the MAP search.

To reduce computational time, we explore a stochastic variant of the original deterministic VMA2 that uses realized **Z** matrices, drawn from the current conditional configuration probabilities *ω*_{gj}(*T*), to restrict each optimization to a small set of active variables.

The modified search procedure over * ϕ_{j}* in equation (14) is:

- Draw a set of binary values *ẑ*_{gj}, *g* = 1,…, *p*, according to the current configuration probabilities *ω*_{gj}(*T*);
- Define the set of *active variables* by *A*_{j} = {*g* ∈ 1 : *p*, *ẑ*_{gj} = 1}; denote by *ϕ*_{j,{A_j}} the sub-vector of **ϕ**_{j} for only the active variables, and *S*_{{A_j}}(*ẑ*_{j}, **Ψ**) the submatrix of *S*(*ẑ*_{j}, **Ψ**) whose rows and columns correspond to only the active variables;
- Solve the reduced optimization conditional on the *A*_{j}, via:
$$\begin{array}{cc}\hfill \underset{{\varphi}_{j,\{{A}_{j}\}}}{\mathrm{maximize}}& {\mathit{\varphi}}_{j,\{{A}_{j}\}}^{\prime}{\mathit{S}}_{\{{A}_{j}\}}({\widehat{z}}_{j},\mathrm{\Psi}){\mathit{\varphi}}_{j,\{{A}_{j}\}}\hfill \\ \hfill \mathrm{subject\; to}& \parallel {\mathit{\varphi}}_{j,\{{A}_{j}\}}{\parallel}^{2}=1\phantom{\rule{0.2em}{0ex}}\mathrm{and}\phantom{\rule{0.2em}{0ex}}{\mathit{\varphi}}_{m,\{{A}_{j}\}}^{\prime}{\mathit{\varphi}}_{j,\{{A}_{j}\}}=0\phantom{\rule{0.2em}{0ex}}\mathrm{for}\phantom{\rule{0.2em}{0ex}}m\ne j.\hfill \end{array}$$
- Update the full *p*-vector **ϕ**_{j} with elements *ϕ*_{j,{A_j}} for the active variables and all other elements zero.
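The sampling-and-restriction step can be sketched as follows (sizes are hypothetical, and the cross-column orthogonality constraints are omitted for brevity):

```python
import numpy as np

# Sketch of the stochastic active-set step; sizes are hypothetical and the
# cross-column orthogonality constraints are omitted for brevity.
rng = np.random.default_rng(1)
p = 12
omega_j = np.full(p, 0.2)            # current configuration probabilities
A_full = rng.standard_normal((p, p))
S_scaled = A_full @ A_full.T         # stands in for S(z_j, Psi) on all p vars

z = rng.random(p) < omega_j          # draw z_gj ~ Bernoulli(omega_gj(T))
if not z.any():
    z[0] = True                      # guard: keep at least one active variable
A = np.flatnonzero(z)                # active variable set A_j

vals, vecs = np.linalg.eigh(S_scaled[np.ix_(A, A)])  # reduced |A_j| x |A_j|
phi_j = np.zeros(p)
phi_j[A] = vecs[:, -1]               # dominant eigenvector on active entries

assert np.isclose(phi_j @ phi_j, 1.0)   # unit norm on the active sub-vector
assert np.all(phi_j[~z] == 0.0)         # inactive entries remain exactly zero
```

The eigen-decomposition cost now scales with the sampled active-set size rather than with *p*.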

For example, in a problem with *p* = 5000 but sparseness of the order of 5%, the *A*_{j} will involve a few hundred active variables, and eigenvalue decomposition will then be performed on matrices of that order rather than 5000 × 5000. We note also that this strategy requires a modification to the update of the configuration probabilities: the expectations in (10) are evaluated at the realized configuration **Ẑ** rather than averaged under *ω*.

The first experiment shows how the VMA2 method can solve the combinatorial optimization. Consider 3 variables and 1 factor, so that **x**_{i} = (*x*_{i1}, *x*_{i2}, *x*_{i3})′ and the sparse structure is the binary vector *z*_{1} = (*z*_{11}, *z*_{21}, *z*_{31})′.

We can map the surface *G*_{T}(**Θ**, **ω**) over the configuration probabilities as the temperature decreases, visualizing how multimodality emerges late in the cooling schedule.

Display of evolving *G*_{T}(**Θ**, *ω*) in the annealing process (from *T* = 2 to *T* = 0) with contour plots. The black circle in each panel indicates the maximum point; the panel at the bottom-right corresponds to *T* = 0.

Figure 2 also shows a tracking result of the VMA2 search process starting from *T* = 2 and stopping at *T* = 0. The trajectory of *G*_{T}(**Θ**, **ω**) shows the solutions moving towards the global maximum during the early, smooth phase of the schedule, before multimodality appears.

This simple illustrative example highlights the key to success in the search: moving the trajectory of solutions closer to the global maximum in earlier phases of the cooling schedule, before the tempered criterion function exhibits substantial multimodality. Looking ahead, we may be able to raise the power of the annealing search by, for example, using dynamic control of the cooling schedule or more general penalty functions for *ω*.

We now present simulation studies to provide insights and comparisons. The data sets have *n* = 100 data points drawn from the GFM with *p* = 30 and *k*_{true} = 4, with **Ψ** = 0.05**I** and a sparse, orthogonal loading matrix generated at random.

To explore sensitivity to the chosen temperature schedule for annealing, experiments were run using three settings:

- (Log-inverse decay) *T*_{i} = 3/log_{2}(*i* + 1) for *i* = 1,…, 6999, and *T*_{7000} = 0
- (Linear decay) *T*_{i} = 3 − 6 × 10^{−3} × (*i* − 1) for *i* = 1,…, 1999, and *T*_{2000} = 0
- (Power decay) *T*_{i} = 3 × 0.99^{i−1} for *i* = 1,…, 1999, and *T*_{2000} = 0
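As a small sanity check on such schedules, the log-inverse schedule above can be generated and verified to cool monotonically to zero:

```python
import numpy as np

# The log-inverse schedule: T_i = 3/log2(i + 1) for i = 1..6999, T_7000 = 0.
i = np.arange(1, 7000)
T = np.concatenate([3.0 / np.log2(i + 1), [0.0]])

assert np.isclose(T[0], 3.0)     # log2(2) = 1, so T_1 = 3
assert np.all(np.diff(T) <= 0)   # monotone non-increasing cooling
assert T[-1] == 0.0              # terminal zero temperature
```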

For each, we evaluated the resulting MAP search in terms of comparison with the true model and computational efficiency, in each case using a model with redundant factor dimension *k* = 8.

First analyses fixed the *ζ*_{gj} at a common, prespecified constant value.

A second analysis used the truncated normal sparsity prior *p*(*ζ*_{gj}) of Section 4, so that degrees of sparseness are inferred from the data.

Result of the VMA2 estimation using the log-inverse rate cooling in analysis of synthetic data. (Left) Precision and factor loadings matrix used for generating the synthetic data (*p* = 30, *k*_{true} = 4). Non-zero elements are colored black. (Right) Estimated counterparts.

Convergence trajectories of the *ζ*_{gj} (upper) and *ω*_{gj} (lower) in analysis of synthetic data over 2000 steps of annealed MAP estimation.

We further evaluated sensitivity to the choice of cooling schedules; in addition to the previous three cooling schedules, we compared with:

- (Log-inverse decay) *T*_{i} = 0.7/log_{2}(*i* + 1) for *i* = 1,…, 6999, and *T*_{7000} = 0
- (Linear decay) *T*_{i} = 0.7 − 6 × 10^{−3} × (*i* − 1) for *i* = 1,…, 1999, and *T*_{2000} = 0
- (Power decay) *T*_{i} = 0.7 × 0.99^{i−1} for *i* = 1,…, 1999, and *T*_{2000} = 0

The initial temperatures are reduced from 3 to 0.7. Figure 6 shows the variations of TPR and FPR under the six cooling schedules, evaluated in 20 analyses with replicated synthetic models and data sets. The left and center panels indicate clear dominance of annealing runs started from the higher initial temperatures. Performance in identifying model structure degrades seriously when the temperature schedule starts too low, while sensitivity to the schedule is very limited when beginning from reasonably high initial temperatures.

Performance tests on 20 synthetic data sets for different cooling schedules and comparison between the GFM and a sparse PCA (SPCA). For each panel, TPR (black) and FPR (blue) are plotted (vertical axis) against the 20 replicate simulations of artificial data.

The right panel in Figure 6 shows TPR and FPR for the sparse PCA (SPCA) proposed by Zou et al. (2006), evaluated on the same 20 data sets using the R code spca() available at CRAN (http://cran.r-project.org/). With spca(), we can specify the number of nonzero elements (cardinality) in each column of the factor loading matrix. We executed spca() after assigning the true cardinality as well as the known factor dimension *k*_{true} = 4. The figure indicates better performance of the GFM annealed from a high initial temperature than the sparse PCA, which is particularly notable in that the GFM analysis uses *k* = 8 and involves no *a priori* knowledge of the degree of sparseness. Note, however, that this comparison is biased in favour of the GFM, since the data were drawn from a GFM with an orthogonal loading matrix whereas SPCA makes no orthogonality assumption. In Section 7.1, we provide deeper comparisons among several existing sparse factor analyses based on image processing in hand-written digit recognition.

Figure 7 shows the CPU times required for the GFM analyses as above, repeated with increasing dimension *p* ∈ {100, 200, 300, 500, 700, 1000}. The data sets were again generated from GFMs with 4 factors and roughly 70% sparsity. We then ran the VMA2 with a linear decay cooling schedule of length 2000, using both deterministic and stochastic annealing in a model with *k* = 8. The deterministic algorithm was not used for *p* ≥ 500 due to the substantial increase in CPU time; this burden was eased by use of the stochastic search algorithm.

We evaluate the GFM in pattern recognition analyses of hand-written digit images, making comparisons to three existing methods: (i) SPCA (Zou et al., 2006), (ii) sparse probabilistic PCA with an ARD prior (Archambeau and Bach, 2009), and (iii) MCMC-driven sparse factor analysis (West, 2003; Carvalho et al., 2008). These three methods are all based on models with non-orthogonal sparse loadings matrices. The training data set comprised 16 × 16 gray-scale images of 100 instances of the digit ‘3’ (i.e., *n* = 100, *p* = 256), randomly drawn from the postal zip code data available at http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (Hastie et al., 2001). To evaluate the robustness of the four approaches, we added artificial outliers to 15 pixels (features) in about 5% of the 100 original images. Some of the contaminated images are shown in the top-left panels of Figure 8.
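The contamination scheme can be mimicked as follows. This is a hedged sketch: the outlier magnitudes, random seed, and the use of synthetic stand-in images (rather than the actual zip code data) are our assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# X: n x p matrix of vectorized 16x16 gray-scale images. The paper uses
# 100 images of the digit '3', so n = 100, p = 256; here we use stand-ins.
n, p = 100, 256
X = rng.normal(size=(n, p))

X_noisy = X.copy()
n_contaminated = int(0.05 * n)                     # ~5% of images get outliers
rows = rng.choice(n, size=n_contaminated, replace=False)
for i in rows:
    cols = rng.choice(p, size=15, replace=False)   # 15 corrupted pixels per image
    X_noisy[i, cols] += rng.uniform(5.0, 10.0, size=15)  # large positive shifts
```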

Figure 8: Comparison between the GFM and the three alternative methods (i)–(iii) in the data reconstruction of outlier hand-written digit images. For the implementation of the sparse probabilistic PCA with ARD prior (ARD-PPCA), we prepared our own R function. …

For the non-probabilistic method, (i) SPCA, we performed data reconstruction in the standard manner, projecting each sample onto the subspace spanned by the estimated sparse loadings.

A set of 100 test samples was created from the 100 samples above by adding outliers, drawn from a uniform distribution, to randomly-chosen pixels with probability 0.2. Performance of the four approaches to data compression/reconstruction was assessed via the mean squared error (MSE) between each original sample and its reconstruction.
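The evaluation metric is the standard per-entry MSE between originals and reconstructions; a minimal sketch:

```python
def reconstruction_mse(X_true, X_hat):
    """Mean squared error between original samples and their reconstructions,
    averaged over all n*p entries; inputs are lists of equal-length rows."""
    n = len(X_true)
    p = len(X_true[0])
    total = sum((a - b) ** 2
                for xt, xh in zip(X_true, X_hat)
                for a, b in zip(xt, xh))
    return total / (n * p)

mse = reconstruction_mse([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.5], [2.0, 4.0]])
```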

Latent factor models are increasingly used in microarray-based gene expression studies, in both basic biology and clinical contexts such as cancer genomics. An example from a breast cancer gene expression study further illustrates the practical relevance of GFM structure and adds to the comparisons with other approaches. In addition to the summary details below, a much extended discussion of both statistical and biological aspects is available in supporting material at the first author’s web site (see link below).

Among the goals of most such studies are the identification of multiple factors that may represent underlying activity of biological pathways, and the use of estimated latent factors in improved predictive models of physiological and clinical outcomes (e.g. Carvalho et al., 2008; Chang et al., 2009; Hirose et al., 2008; Lucas et al., 2006, 2009; West, 2003; Yoshida et al., 2004, 2006). Here we discuss an example application of our sparse GFM in the analysis of data from previously published breast cancer studies (Cheng et al., 2006; Huang et al., 2003; Pittman et al., 2004; West et al., 2001).

The GFM approach was applied to a sample of gene expression microarray data collected in the CODeX breast cancer genomics study (Carvalho et al., 2008; Pittman et al., 2004; West et al., 2001) at the Sun Yat-Sen Cancer Center, Taipei, during 2001-2004. In addition to summary expression indices from Affymetrix Human Genome U95 arrays, the data set includes immunohistochemistry (IHC) tests for key hormonal receptor proteins used in clinical prognostics: ERBB2 (Her2) and estrogen receptor (ER). The IHC measures are discrete: ER negative (ER=0), ER positive with low/high-level expression (ER=1 and ER=2), Her2 negative (Her2=0), and Her2 positive with low/high-level expression (Her2=1 and Her2=2). We analyzed *p* = 996 genes whose expression levels, on a log2 (fold change) scale, exceed a median level of 7 and show a range of at least 3-fold change across the tumors. The data set, including the expression data and the IHC hormonal measures, is available on-line as supplementary material.
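The gene filter can be expressed compactly; a hedged numpy sketch, assuming the 3-fold-change criterion is applied as a range on the log2 scale (so at least log2 3 ≈ 1.58), which is our reading of the text:

```python
import numpy as np

def filter_genes(log2_expr, median_min=7.0, fold_change=3.0):
    """Select genes (rows) whose median log2 expression exceeds `median_min`
    and whose range across samples (columns) is at least log2(fold_change)."""
    med = np.median(log2_expr, axis=1)
    spread = log2_expr.max(axis=1) - log2_expr.min(axis=1)
    return (med > median_min) & (spread >= np.log2(fold_change))

# Toy matrix: only the first gene passes both the median and range criteria.
keep = filter_genes(np.array([[8.0, 8.0, 10.0],
                              [8.0, 8.0, 8.5],
                              [5.0, 6.0, 9.0]]))
```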

The annealed estimation of the GFM was run with *k* = 25, *μ* = 7 and *σ* = 10. The cooling schedule was a linearly-decreasing sequence of 2000 temperatures with decay rate 0.006 and initial temperature 3. The GFM analysis identified 19 factors, pruning from the specified maximum of *k* = 25. Heatmaps of gene expression for the genes identified in each of the factors appear in Figure 9, together with the identified sparse pattern of the loadings matrix.

To investigate potential biological connections of the factors, we evaluated enrichment of the functional annotations shared by genes in each factor through the Gene Ontology (GO). This exploration revealed associations between some factors and GO biological processes; the complete and detailed results, including tables of the GO enrichment analyses for each factor and detailed biological descriptions, are available from the web site of supporting information.

Figure 10 displays boxplots of fitted values of the factor scores for each sample, plotted across all 19 factors and stratified by levels of each of the clinical ER and Her2 (0/1/2) categories. For each sample *i*, the posterior mean of the factor vector, namely
${\widehat{\mathit{\lambda}}}_{i}={({\mathit{I}}_{k}+\mathbf{\Delta})}^{-1}\mathbf{\Delta}{\mathbf{\Phi}}_{Z}^{\prime}{\mathbf{\Psi}}^{-1/2}{\mathit{x}}_{i}$, is evaluated at the estimated model, providing the fitted values displayed. We note strong association of ER status with factors 8 (GO: hormone metabolic process), 9 (GO: glucose metabolic process, negative regulation of MAPK activity), 12 (GO: C21-steroid hormone metabolic process), 14 (GO: apoptotic program, positive regulation of caspase activity), 18 (GO: M phase of meiotic cell cycle) and 19 (GO: regulation of Rab protein signal transduction). These clear relationships of ER status to multiple factors with intersecting but also distinct biological pathway annotations are consistent with the known complexity of the broader ER network, as estrogen receptor-induced signaling impacts multiple cellular growth and developmentally related downstream target genes and strongly defines expression factors linked to breast cancer progression.
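Since Δ and Ψ are diagonal, the fitted factor scores reduce to an elementwise shrinkage of the data projected onto the loadings. A numpy sketch, with hypothetical stand-ins for the estimated model quantities:

```python
import numpy as np

def factor_scores(X, Phi_Z, Delta, Psi):
    """Posterior mean factors lambda_hat_i = (I_k + Delta)^-1 Delta Phi_Z' Psi^(-1/2) x_i.

    X: n x p data; Phi_Z: p x k sparse orthogonal loadings;
    Delta: length-k vector of factor variances (diagonal of Delta);
    Psi: length-p vector of residual variances (diagonal of Psi)."""
    shrink = Delta / (1.0 + Delta)      # diagonal of (I_k + Delta)^-1 Delta
    Z = X / np.sqrt(Psi)                # rows are Psi^(-1/2) x_i
    return (Z @ Phi_Z) * shrink         # n x k matrix of fitted factor scores
```

With a single unit loading on the first variable, Δ = 1, and Ψ = I, a data value of 2 is shrunk to a factor score of 1, as the 1/(1+Δ) shrinkage factor dictates.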

Figure 10 indicates factor 16 as strongly associated with Her2 status (0, 1) versus 2. Factor 16 loads significantly on only 7 genes, which include STARD3, GRB7 and two probe sets on the locus of ERBB2 (which encodes Her2). This is consistent with earlier gene expression studies that have consistently identified a single expression pattern related to Her2 and a very small number of additional genes, and that have found the “low Her2 positive” level (Her2=1) to be generally comparable to negatives. Interestingly, STARD3, GRB7 and ERBB2 are all located on the same chromosomal locus, 17q12, known as the PPP1R1B-STARD3-TCAP-PNMT-PERLD1-ERBB2-MGC14832-GRB7 locus. This locus has been reported in many studies (e.g. Katoh and Katoh, 2004) as an oncogenomic recombination hotspot that is frequently amplified in breast tumor cells, and the purely exploratory (i.e., unsupervised) GFM analysis clearly identifies the “Her2 factor” as strongly reflective of increased expression of genes in this hotspot, consistent with the amplification inducing Her2 positivity.

Finally, we show a comparison to traditional, non-sparse PCA. Supplementary Figures 1 and 2 show the estimated factors (principal components) corresponding to the 19 most dominant eigenvalues, stratified by the levels of ER and Her2. PCA failed to capture the factor relevant to Her2-specific phenotypes in the analysed data. Note that the foregoing sparse analysis identified the Her2-relevant factor through only 7 non-zero loadings. Indeed, our post-analysis found that the data set contains very few genes exhibiting significant fold change across the Her2 phenotypes. A non-sparse analysis captures many irrelevant features through redundant nonzero loadings. The failure of PCA underscores the importance of sparse modelling in handling high-dimensional data with inherently sparse structure.

The novel graphical property of GFMs provides a nice reconciliation of sparse covariance models with sparse precision models—sparse latent factor analysis and graphical models, respectively. Some of the practical benefits of this arise from the ability of GFM to define data reconstructions exhibiting the same patterns of covariances as the model/data predict, and the potential to induce robustness to outliers relative to non-graphical factor models, whether sparse or not. Some theoretical questions remain about precise conditions under which the sparsity patterns of covariance and precision matrices are guaranteed to agree in general sparse Gaussian factor models other than the GFM form. Additionally, extensions to integrate non-parametric Bayesian model components for factors, following Carvalho et al. (2008), are of clear future interest.

The ability of the VMA2 to aid in the identification of model structure in sparse GFM, and to provide an additional computational strategy and tools to address the inherently challenging combinatorial optimization problem, has been demonstrated in our examples. Scaling to higher dimensional models is enabled by relaxation of the direct deterministic optimization viewpoint, with stochastic search components that promote greater exploration of model space and can speed up search substantially. Nevertheless, moving to higher dimensions will require new, creative computational implementations, such as using distributed computing, that will themselves require novel methodological concepts.

The annealed search methodology will evidently apply in other contexts beyond factor models. At one level, sparse factor models are an instance of variable selection in multivariate regression, in which the regression predictors (feature variables) are themselves unknown (i.e., are the factors). The annealed entropy approach is therefore in principle applicable to problems involving regression model search and uncertainty in general classes of linear or nonlinear multivariate regression with potentially many predictor variables. Beyond this, the same can be said of potential uses in other areas of graphical modelling involving structural inference of directed or undirected graphical models, and also in multivariate time series problems where some of the sparse structure may relate to relationships among variables over time.

We also remark on generalizations of the basic form of VMA2 that might utilize penalty functions other than the Shannon entropy used here. The central idea of the VMA2 is the design of a temperature-controlled iterative optimization that converges to the joint posterior distribution of model parameters and sparse structure indicators. The entropy formulation used in our GFM context was inspired by the form of the posterior itself, but similar algorithms—with the same convergence property—could be designed using other forms. This, along with computational efficiency questions and applications in models beyond the sparse GFM framework, and also potential extensions to heavy-tailed or Bayesian nonparametric distributions for latent factors and/or residuals (e.g. Carvalho et al., 2008), are open areas for future research.

Supplementary material: supporting document (160K, pdf); R code (1.6K, R); R code (3.3K, r); data file (55K, txt).

The authors are grateful to the Action Editor and three anonymous referees for their detailed and most constructive comments on the original version of this paper. Elements of the research reported here were developed while Ryo Yoshida was visiting SAMSI and Duke University during 2008-09. Aspects of the research of Mike West were partially supported by grants from the U.S. National Science Foundation (DMS-0342172) and National Institutes of Health (NCI U54-CA-112952). The research of Ryo Yoshida was partially supported by the Japan Science and Technology Agency (JST) under the Core Research for Evolutional Science and Technology (CREST) program. Any opinions, findings and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the NSF or NIH.

Multiply the objective function of (8) by the inverse temperature 1/*T*:

$$\frac{1}{T}{G}_{T}(\mathbf{\Theta},\mathit{\omega})=\sum _{Z\in Z}\omega (\mathit{Z})\phantom{\rule{0.2em}{0ex}}\mathrm{log}\phantom{\rule{0.2em}{0ex}}p{(\mathit{X},\mathit{Z},\mathbf{\Theta}\mid \mathit{\zeta})}^{1/T}-\sum _{Z\in Z}\omega (\mathit{Z})\phantom{\rule{0.2em}{0ex}}\mathrm{log}\phantom{\rule{0.2em}{0ex}}\omega (\mathit{Z}).$$

An upper bound on this modified criterion is derived as follows:

$$\begin{array}{lll}\frac{1}{T}{G}_{T}(\mathbf{\Theta},\mathit{\omega})& =& \sum _{Z\in Z}\omega (\mathit{Z})\phantom{\rule{0.2em}{0ex}}\mathrm{log}\phantom{\rule{0.2em}{0ex}}\frac{p{(\mathit{Z}\mid \mathit{X},\mathbf{\Theta},\mathit{\zeta})}^{1/T}p{(\mathit{X},\mathbf{\Theta}\mid \mathit{\zeta})}^{1/T}}{\omega (\mathit{Z})}\\ & =& \sum _{Z\in Z}\omega (\mathit{Z})\phantom{\rule{0.2em}{0ex}}\mathrm{log}\phantom{\rule{0.2em}{0ex}}\frac{p{(\mathit{Z}\mid \mathit{X},\mathbf{\Theta},\mathit{\zeta})}^{1/T}}{\omega (\mathit{Z})\mathrm{\sum}_{{Z}^{\prime}\in Z}p{({\mathit{Z}}^{\prime}\mid \mathit{X},\mathbf{\Theta},\mathit{\zeta})}^{1/T}}+{K}_{0}\\ & \le & {K}_{0}.\end{array}$$

In the second equality, the terms irrelevant to *ω*(**Z**) are collected into the constant *K*_{0}. The upper bound *K*_{0} is attained if and only if

$$\omega (\mathit{Z})=\frac{p{(\mathit{Z}\mid \mathit{X},\mathbf{\Theta},\mathit{\zeta})}^{1/T}}{{\sum}_{{Z}^{\prime}\in \mathrm{Z}}p{({\mathit{Z}}^{\prime}\mid \mathit{X},\mathbf{\Theta},\mathit{\zeta})}^{1/T}},$$

as required.
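The fixed-point update for ω is thus a temperature-scaled softmax over the unnormalized log posterior values of the candidate configurations. A numerically stable sketch (enumerating a small candidate set, which is feasible only for illustration; function name is ours):

```python
import math

def tempered_weights(log_post, T):
    """omega(Z) proportional to p(Z | X, Theta, zeta)^(1/T), computed from
    unnormalized log posterior values. As T -> 0 the weights concentrate on
    the modal configuration (MAP); large T flattens omega toward uniform."""
    if T == 0.0:
        m = max(log_post)
        w = [1.0 if lp == m else 0.0 for lp in log_post]
    else:
        scaled = [lp / T for lp in log_post]
        m = max(scaled)                       # subtract max for stability
        w = [math.exp(s - m) for s in scaled]
    total = sum(w)
    return [x / total for x in w]
```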

Let *ρ*_{j}, *j* ∈ {1,…,*k*}, be Lagrange multipliers, with *ρ*_{j} enforcing the unit-norm constraint on *φ*_{j} and *ρ*_{m}, *m* ≠ *j*, enforcing orthogonality to the other columns of **Φ**. The Lagrangian for the maximization with respect to *φ*_{j} is

$${\mathit{\varphi}}_{j}^{\prime}{\mathrm{E}}_{\omega}[\mathit{S}({z}_{j},\mathrm{\Psi})]{\mathit{\varphi}}_{j}-{\rho}_{j}(\parallel {\mathit{\varphi}}_{j}{\parallel}^{2}-1)-\sum _{m\ne j}{\rho}_{m}{\varphi}_{m}^{\prime}{\varphi}_{j}.$$

(20)

Differentiation of (20) with respect to *φ*_{j} yields

$${\mathrm{E}}_{\omega}[\mathit{S}({z}_{j},\mathrm{\Psi})]{\mathit{\varphi}}_{j}-{\rho}_{j}{\mathit{\varphi}}_{j}-\sum _{m\ne j}{\rho}_{m}{\mathit{\varphi}}_{m}=\mathbf{0}.$$

(21)

To solve this equation, the first step is to obtain a closed-form solution for the vector of Lagrange multipliers, *ρ*_{(−j)} = {*ρ*_{m}}_{m≠j}. Premultiplying (21) by *φ*′_{m} for each *m* ≠ *j*, where the term in *ρ*_{j} vanishes under the orthogonality constraint *φ*′_{m}*φ*_{j} = 0, gives

$${\mathit{\varphi}}_{m}^{\prime}{\mathrm{E}}_{\omega}[\mathit{S}({z}_{j},\mathrm{\Psi})]{\mathit{\varphi}}_{j}-\sum _{{m}^{\prime}\ne j}{\rho}_{{m}^{\prime}}{\mathit{\varphi}}_{m}^{\prime}{\mathit{\varphi}}_{{m}^{\prime}}=0\phantom{\rule{0.5em}{0ex}}\mathrm{for}\phantom{\rule{0.2em}{0ex}}m\ne j.$$

This yields the matrix representation

$${\mathbf{\Phi}}_{(-j)}^{\prime}{\mathrm{E}}_{\omega}[\mathit{S}({z}_{j},\mathrm{\Psi})]{\mathit{\varphi}}_{j}-{\mathbf{\Phi}}_{(-j)}^{\prime}{\mathbf{\Phi}}_{(-j)}{\mathit{\rho}}_{(-j)}=0,$$

which in turn leads to the solution for *ρ*_{(−j)} as

$${\mathit{\rho}}_{(-j)}={({\mathbf{\Phi}}_{(-j)}^{\prime}{\mathbf{\Phi}}_{(-j)})}^{-1}{\mathbf{\Phi}}_{(-j)}^{\prime}{\mathrm{E}}_{\omega}[\mathit{S}({z}_{j},\mathrm{\Psi})]{\mathit{\varphi}}_{j}.$$

Substituting this into the original equation (21) yields the eigenvalue equation

$${\mathit{N}}_{j}{\mathrm{E}}_{\omega}[\mathit{S}({z}_{j},\mathrm{\Psi})]{\mathit{\varphi}}_{j}-{\rho}_{j}{\mathit{\varphi}}_{j}=\mathbf{0}\phantom{\rule{0.5em}{0ex}}\mathrm{with}\phantom{\rule{0.2em}{0ex}}{\mathit{N}}_{j}=\mathit{I}-{\mathbf{\Phi}}_{(-j)}{\mathbf{\Phi}}_{(-j)}^{\prime}.$$

(22)

Now consider the alternative, symmetrized eigenvalue equation

$${\mathit{N}}_{j}{\mathrm{E}}_{\omega}[\mathit{S}({z}_{j},\mathrm{\Psi})]{\mathit{N}}_{j}{\mathit{\varphi}}_{j}-{\rho}_{j}{\mathit{\varphi}}_{j}=\mathbf{0}.$$

(23)
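Equation (23) is a symmetric eigenproblem, so each column update can be computed with a standard eigensolver. A numpy sketch (the function name and interface are ours; `S_hat` stands for E_ω[S(z_j, Ψ)]):

```python
import numpy as np

def update_column(S_hat, Phi_minus_j):
    """Solve N_j S_hat N_j phi = rho phi for the leading eigenpair, where
    N_j = I - Phi_(-j) Phi_(-j)' projects onto the orthogonal complement
    of the remaining columns of Phi."""
    p = S_hat.shape[0]
    N_j = np.eye(p) - Phi_minus_j @ Phi_minus_j.T
    M = N_j @ S_hat @ N_j                 # symmetric, so eigh applies
    vals, vecs = np.linalg.eigh(M)        # eigenvalues in ascending order
    return vecs[:, -1], vals[-1]          # eigenvector of the largest rho
```

Because the returned vector lies in the range of N_j (for nonzero ρ), it is automatically orthogonal to the other columns, as the derivation requires.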

Ryo Yoshida, Department of Statistical Modeling, Institute of Statistical Mathematics, Minato-ku, Tokyo 106-8569, Japan.

Mike West, Department of Statistical Science, Duke University, Durham, NC 27708-0251, USA.

- Anderson TW. An Introduction to Multivariate Statistical Analysis. 3. Wiley-Interscience; New Jersey: 2003.
- Archambeau C, Bach F. Sparse probabilistic projections. In: Koller D, Schuurmans D, Bengio Y, Bottou L, editors. Advances in Neural Information Processing Systems. Vol. 21. Cambridge, MA: MIT Press; 2009. pp. 73–80.
- Attias H. Inferring parameters and structure of latent variable models by variational Bayes. Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (UAI99) 1999:21–30.
- Bishop CM. Probabilistic principal component analysis. Journal of the Royal Statistical Society (Series B) 1999;61:611–622.
- Bishop CM. Pattern Recognition and Machine Learning. 1. Springer; Singapore: 2006.
- Carvalho CM, West M. Dynamic matrix-variate graphical models. Bayesian Analysis. 2007;2:69–98.
- Carvalho CM, Lucas JE, Wang Q, Chang JT, Nevins JR, West M. High-dimensional sparse factor modelling: Applications in gene expression genomics. Journal of the American Statistical Association. 2008;103:1438–1456. [PMC free article] [PubMed]
- Chang J, Carvalho CM, Mori S, Bild A, Wang Q, West M, Nevins JR. Decomposing cellular signaling pathways into functional units: A genomic strategy. Molecular Cell. 2009;34:104–114. [PMC free article] [PubMed]
- Cheng SH, West M, Horng CF, Huang E, Pittman J, Dressman H, Tsou MH, Chen CM, Tsail SY, Jian JJ, Nevins JR, Liu MC, Huang AT. Genomic prediction of loco-regional recurrence following mastectomy in breast cancer. Journal of Clinical Oncology. 2006;24:4594–4602. [PubMed]
- d’Aspremont A, El Ghaoui L, Jordan MI, Lanckriet GRG. A direct formulation for sparse PCA using semidefinite programming. SIAM Review. 2007;49(3):434–448.
- Dobra A, Jones B, Hans C, Nevins JR, West M. Sparse graphical models for exploring gene expression data. Journal of Multivariate Analysis. 2004;90:196–212.
- Griffiths TL, Ghahramani Z. Infinite latent feature models and the indian buffet process. In: Weiss Y, Schölkopf B, Platt J, editors. Advances in Neural Information Processing Systems. Vol. 18. Cambridge, MA: MIT Press; 2006. pp. 475–482.
- Guan Y, Dy J. Sparse probabilistic principal component analysis. Proceedings of AISTATS 2009. 2009;5:185–192.
- Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer; 2001.
- Hirose O, Yoshida R, Imoto S, Yamaguchi R, Higuchi T, Charnock-Jones DS, Print C, Miyano S. Statistical inference of transcriptional module-based gene networks from time course gene expression profiles by using state space models. Bioinformatics. 2008;24(7):932–942. [PubMed]
- Huang E, Chen S, Dressman HK, Pittman J, Tsou MH, Horng CF, Bild A, Iversen ES, Liao M, Chen CM, West M, Nevins JR, Huang AT. Gene expression predictors of breast cancer outcomes. The Lancet. 2003;361:1590–1596. [PubMed]
- Jolliffe IT, Trendafilov NT, Uddin M. A modified principal component technique based on the lasso. Journal of Computational and Graphical Statistics. 2003;112(3):531–547.
- Jones B, Dobra A, Carvalho CM, Hans C, Carter C, West M. Experiments in stochastic computation for high-dimensional graphical models. Statistical Science. 2005;20:388–400.
- Jordan MI, editor. Learning in Graphical Models. Cambridge MA: MIT Press; 1999.
- Jordan MI. Graphical models. Statistical Science. 2004;19:140–155.
- Katoh M, Katoh M. Evolutionary recombination hotspot around GSDML-GSDM locus is closely linked to the oncogenomic recombination hotspot around the PPP1R1B-ERBB2-GRB7 amplicon. International Journal of Oncology. 2004;24:757–763. [PubMed]
- Lucas JE, Carvalho CM, Wang Q, Bild AH, Nevins JR, West M. Sparse statistical modelling in gene expression genomics. In: Müller P, Do KA, Vannucci M, editors. Bayesian Inference for Gene Expression and Proteomics. Cambridge University Press; 2006. pp. 155–176.
- Lucas JE, Carvalho CM, Chi J-TA, West M. Cross-study projections of genomic biomarkers: An evaluation in cancer genomics. PLoS ONE. 2009;4(2):e4523. [PMC free article] [PubMed]
- Mackay DJC. Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems. 1995;6:469–505.
- Mairal J, Bach F, Ponce J, Sapiro G. Online dictionary learning for sparse coding. Proceedings of the 26th Annual International Conference on Machine Learning (ICML2009); New York, NY, USA. 2009. pp. 689–696. ACM.
- Pittman J, Huang E, Dressman H, Horng CF, Cheng SH, Tsou MH, Chen CM, Bild A, Iversen ES, Huang AT, Nevins JR, West M. Integrated modeling of clinical and gene expression information for personalized prediction of disease outcomes. Proceedings of the National Academy of Sciences. 2004;101:8431–8436. [PubMed]
- Rai P, Daumé H. The infinite hierarchical factor regression model. In: Koller D, Schuurmans D, Bengio Y, Bottou L, editors. Advances in Neural Information Processing Systems. Vol. 21. Cambridge, MA: MIT Press; 2009. pp. 1321–1328.
- Ueda N, Nakano R. Deterministic annealing EM algorithm. Neural Networks. 1998;11:271–282. [PubMed]
- Wainwright MJ, Jordan MI. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning. 2008;1:1–305.
- Wang Q, Carvalho CM, Lucas JE, West M. BFRM: Bayesian factor regression modelling. Bulletin of the International Society for Bayesian Analysis. 2007;14(2):4–5.
- West M. Bayesian factor regression models in the “large p, small n” paradigm. In: Bernardo JM, Bayarri MJ, Berger JO, Dawid AP, Heckerman D, Smith AFM, West M, editors. Bayesian Statistics. Vol. 7. Oxford University Press; 2003. pp. 723–732.
- West M, Blanchette C, Dressman H, Huang E, Ishida S, Zuzan H, Spang R, Marks JR, Nevins JR. Predicting the clinical status of human breast cancer utilizing gene expression profiles. Proceedings of the National Academy of Sciences. 2001;98:11462–11467. [PubMed]
- Yoshida R, Higuchi T, Imoto S. A mixed factors model for dimension reduction and extraction of a group structure in gene expression data. Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference. 2004:161–172. [PubMed]
- Yoshida R, Higuchi T, Imoto S, Miyano S. Arraycluster: an analytic tool for clustering, data visualization and module finder on gene expression profiles. Bioinformatics. 2006;22(12):1538–1539. [PubMed]
- Zou H, Hastie T, Tibshirani R. Sparse principal component analysis. Journal of Computational and Graphical Statistics. 2006;15(2):265–286.
