Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC2943854

Formats

Article sections

- Abstract
- 1. Introduction
- 2. Related Work
- 3. Topic Modeling with Dirichlet Forest
- 4. Inference for Dirichlet Forest
- 5. Experiments
- References

Authors

Related links

Proc Int Conf Mach Learn. Author manuscript; available in PMC 2010 September 22.

Published in final edited form as:

Proc Int Conf Mach Learn. 2009; 382(26): 25–32.

PMCID: PMC2943854

NIHMSID: NIHMS154934

David Andrzejewski: ude.csiw.sc@ejezrdna; Xiaojin Zhu: ude.csiw.sc@uhzyrrej; Mark Craven: ude.csiw.tatsoib@nevarc

See other articles in PMC that cite the published article.

Users of topic modeling methods often have knowledge about the composition of words that should have high or low probability in various topics. We incorporate such domain knowledge using a novel Dirichlet Forest prior in a Latent Dirichlet Allocation framework. The prior is a mixture of Dirichlet tree distributions with special structures. We present its construction, and inference via collapsed Gibbs sampling. Experiments on synthetic and real datasets demonstrate our model’s ability to follow and generalize beyond user-specified domain knowledge.

Topic modeling, using approaches such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003), has enjoyed popularity as a way to model hidden topics in data. However, in many applications, a user may have additional knowledge about the composition of words that should have high probability in various topics. For example, in a biological application, one may prefer that the words “termination”, “disassembly” and “release” appear with high probability in the same topic, because they all describe the same phase of biological processes. Furthermore, a biologist could automatically extract these preferences from an existing biomedical ontology, such as the Gene Ontology (GO) (The Gene Ontology Consortium, 2000). As another example, an analyst may run topic modeling on a corpus of people’s wishes, inspect the resulting topics, and notice that “into, college” and “cure, cancer” all appear with high probability in the same topic. The analyst may want to interactively express the preference that the two sets of words should not appear together, re-run topic modeling, and incorporate additional preferences based on the new results. In both cases, we would like these preferences to guide the recovery of latent topics. Standard LDA lacks a mechanism for incorporating such domain knowledge.

In this paper, we propose a principled approach to the incorporation of such domain knowledge into LDA. We show that many types of knowledge can be expressed with two primitives on word pairs. Borrowing names from the constrained clustering literature (Basu et al., 2008), we call the two primitives Must-Links and Cannot-Links, although there are important differences. We then encode the set of Must-Links and Cannot-Links associated with the domain knowledge using a *Dirichlet Forest prior*, replacing the Dirichlet prior over the topic-word multinomial *p*(word|topic). The Dirichlet Forest prior is a mixture of Dirichlet tree distributions with very specific tree structures. Our approach has several advantages: (i) A Dirichlet Forest can encode Must-Links and Cannot-Links, something impossible with Dirichlet distributions. (ii) The user can control the strength of the domain knowledge by setting a parameter η, allowing domain knowledge to be overridden if the data strongly suggest otherwise. (iii) The Dirichlet Forest lends itself to efficient inference via collapsed Gibbs sampling, a property inherited from the conjugacy of Dirichlet trees. We present experiments on several synthetic datasets and two real domains, demonstrating that the resulting topics not only successfully incorporate the specified domain knowledge, but also generalize beyond it by including/excluding other related words not explicitly mentioned in the Must-Links and Cannot-Links.

We review LDA using the notation of Griffiths and Steyvers (2004). Let there be *T* topics. Let **w** = *w*_{1} … *w _{n}* represent a corpus of

$$\theta \phantom{\rule{thinmathspace}{0ex}}~\phantom{\rule{thinmathspace}{0ex}}\text{Dirichlet}(\alpha )$$

(1)

$${z}_{i}|{\theta}^{({d}_{i})}\phantom{\rule{thinmathspace}{0ex}}~\phantom{\rule{thinmathspace}{0ex}}\text{Multinominal}({\theta}^{({d}_{i})})$$

(2)

$$\varphi \phantom{\rule{thinmathspace}{0ex}}~\phantom{\rule{thinmathspace}{0ex}}\text{Dirichlet}(\beta )$$

(3)

$${w}_{i}|{z}_{i},\varphi \phantom{\rule{thinmathspace}{0ex}}~\phantom{\rule{thinmathspace}{0ex}}\text{Multinomial}({\varphi}_{{z}_{i}})$$

(4)

where α and β are hyperparameters for the document-topic and topic-word Dirichlet distributions, respectively. For simplicity we will assume symmetric α and β but asymmetric hyperparameters are also possible.

Previous work has modeled correlations in the LDA document-topic mixtures using the logistic Normal distribution (Blei & Lafferty, 2006), DAG (Pachinko) structures (Li & McCallum, 2006), or the Dirichlet Tree distribution (Tam & Schultz, 2007). In addition, the concept-topic model (Chemudugunta et al., 2008) employs domain knowledge through special “concept” topics, in which only a particular set of words can be present. Our work complements the previous work by encoding complex domain knowledge on *words* (especially arbitrary Cannot-Links) into a flexible and computationally efficient prior.

Our proposed model differs from LDA in the way ϕ is generated. Instead of (3), we have

$$\begin{array}{cc}\mathbf{q}\text{}~\hfill & \text{DirichletForest}(\beta ,\eta )\hfill \\ \varphi \text{}~\hfill & \text{DirichletTree}(\mathbf{q})\hfill \end{array}$$

where **q** specifies a Dirichlet tree distribution, β plays a role analogous to the topic-word hyperparameter in standard LDA, and η ≥ 1 is the “strength parameter” of the domain knowledge. Before discussing DirichletForest(β, η) and Dirichlet Tree(**q**), we first explain how knowledge can be expressed using Must-Link and Cannot-Link primitives.

Must-Links and Cannot-Links were originally proposed for constrained clustering to encourage two instances to fall into the same cluster or into separate clusters, respectively. We borrow the notion for topic modeling. Informally, the Must-Link primitive prefers that two words tend to be generated by the same topic, while the Cannot-Link primitive prefers that two words tend to be generated by separate topics. However, since any topic ϕ is a multinomial over words, any two words (in general) always have some probability of being generated by the topic. We therefore propose the following definition:

Two words *u*, υ have similar probability within any topic, i.e.,
${\varphi}_{j}^{(u)}\approx {\varphi}_{j}^{(\upsilon )}$ for *j* = 1 … *T*. It is important to note that the probabilities can be both large or both small, as long as they are similar. For example, for the earlier biology example we could say Must-Link (termination, disassembly).

Two words *u*, υ should not both have large probability within any topic. It is permissible for one to have a large probability and the other small, or both small. For example, one primitive for the wish example can be Cannot-Link (college, cure).

Many types of domain knowledge can be decomposed into a set of Must-Links and Cannot-Links. We demonstrate three types in our experiments: we can **Split** two or more sets of words from a single topic into different topics by placing Must-Links within the sets and Cannot-Links between them. We can **Merge** two or more sets of words from different topics into one topic by placing Must-Links among the sets. Given a common set of words which appear in multiple topics (such as stopwords in English, which tend to appear in all LDA topics), we can **Isolate** them by placing Must-Links within the common set, and then placing Cannot-Link between the common set and the other high-probability words from all topics. It is important to note that our Must-Links and Cannot-Links are *preferences* instead of hard constraints.

It is well-known that the Dirichlet distribution is limited in that all words share a common variance parameter, and are mutually independent except the normalization constraint (Minka, 1999). However, for Must-Link (*u*, υ) it is crucial to control the two words *u*, υ differently than other words.

The Dirichlet tree distribution (Dennis III, 1991) is a generalization of the Dirichlet distribution that allows such control. It is a tree with the words as leaf nodes; see Figure 1(a) for an example. Let γ^{(k)} be the Dirichlet tree edge weight *leading into node k*. Let *C*(*k*) be the immediate children of node *k* in the tree, *L* the leaves of the tree, *I* the internal nodes, and *L*(*k*) the leaves in the subtree under *k*. To generate a sample ϕ ~ DirichletTree(γ), one first draws a multinomial at each internal node *s* *I* from Dirichlet(γ^{C(s)}), i.e., using the weights from *s* to its children as the Dirichlet parameters. One can think of it as re-distributing the probability mass reaching *s* by this multinomial (initially, the mass is 1 at the root). The probability ϕ^{(k)} of a word *k* *L* is then simply the product of the multinomial parameters on the edges from *k* to the root, as shown in Figure 1(b). It can be shown (Dennis III, 1991) that this procedure gives

$$\text{DirichletTree}(\mathrm{\gamma})\equiv p(\varphi |\mathrm{\gamma})=\left({\displaystyle \prod _{k}^{L}{\varphi}^{{(k)}^{{\mathrm{\gamma}}^{(k)}-1}}}\right)\phantom{\rule{thinmathspace}{0ex}}\left({\displaystyle \prod _{s}^{I}\frac{\mathrm{\Gamma}\phantom{\rule{thinmathspace}{0ex}}\left({\displaystyle {\sum}_{k}^{C(s)}{\mathrm{\gamma}}^{(k)}}\right)}{{\displaystyle {\prod}_{k}^{C(s)}\mathrm{\Gamma}\phantom{\rule{thinmathspace}{0ex}}\left({\mathrm{\gamma}}^{(k)}\right)}}{\left({\displaystyle \sum _{k}^{L(s)}{\varphi}^{(k)}}\right)}^{\mathrm{\Delta}(s)}}\right)$$

where Γ(·) is the standard gamma function, and the notation
${\prod}_{k}^{L}$ means _{kL}. The function Δ(*s*) γ^{(s)} − ∑_{kC(s)} γ^{(k)} is the difference between the in-degree and out-degree of internal node *s*. When this difference Δ(*s*) = 0 for all internal nodes *s* *I*, the Dirichlet tree reduces to a Dirichlet distribution.

Encoding Must-Links and Cannot-Links with a Dirichlet Forest. (a) A Dirichlet tree encoding Must-Link (A,B) with β = 1, η = 50 on vocabulary {A,B,C}. (b) A sample ϕ from this Dirichlet tree. (c) A large set of samples from the **...**

Like the Dirichlet, the Dirichlet tree is conjugate to the multinomial. It is possible to integrate out ϕ to get a distribution over word counts directly, similar to the multivariate Pólya distribution:

$$p(\mathbf{w}|\mathrm{\gamma})={\displaystyle \prod _{s}^{I}\phantom{\rule{thinmathspace}{0ex}}\left(\frac{\mathrm{\Gamma}\phantom{\rule{thinmathspace}{0ex}}\left({\displaystyle {\sum}_{k}^{C(s)}{\mathrm{\gamma}}^{(k)}}\right)}{\mathrm{\Gamma}\phantom{\rule{thinmathspace}{0ex}}\left({\displaystyle {\sum}_{k}^{C(s)}\left({\mathrm{\gamma}}^{(k)}+{n}^{(k)}\right)}\right)}{\displaystyle \prod _{k}^{C(s)}\frac{\mathrm{\Gamma}\phantom{\rule{thinmathspace}{0ex}}\left({\mathrm{\gamma}}^{(k)}+{n}^{(k)}\right)}{\mathrm{\Gamma}\phantom{\rule{thinmathspace}{0ex}}\left({\mathrm{\gamma}}^{(k)}\right)}}\right)}$$

(5)

Here *n*^{(k)} is the number of word tokens in **w** that appear in *L*(*k*).

We encode Must-Links using a Dirichlet tree. Note that our definition of Must-Link is transitive: Must-Link (*u*, υ) and Must-Link (υ, *w*) imply Must-Link (*u*, *w*). We thus first compute the transitive closures of expressed Must-Links. Our Dirichlet tree for Must-Links has a very simple structure: each transitive closure is a sub-tree, with one internal node and the words in the closure as its leaves. The weights from the internal node to its leaves are ηβ. The root connects to these internal nodes *s* with weight |*L*(*s*)|β, where | · | represents the set size. In addition, the root directly connects to other words not in any closure, with weight β. For example, the transitive closure for a Must-Link (A,B) on vocabulary {A,B,C} is simply {A,B}, corresponding to the Dirichlet tree in Figure 1(a).

To understand this encoding of Must-Links, consider first the case when the domain knowledge strength parameter is at its weakest η = 1. Then in-degree equals out-degree for any internal node *s* (both are |*L*(*s*)|β), and the tree reduces to a Dirichlet distribution with symmetric prior β: the Must-Links are turned off in this case. As we increase η, the re-distribution of probability mass at *s* (governed by a Dirichlet under *s*) has increasing *concentration* |*L*(*s*)|ηβ but the same uniform base-measure. This tends to redistribute the mass evenly in the transitive closure represented by *s*. Therefore, the Must-Links are turned on when η > 1. Furthermore, the mass *reaching s* is independent of η, and can still have a large variance. This properly encodes the fact that we want Must-Linked words to have similar, but not always large, probabilities. Otherwise, Must-Linked words would be forced to appear with large probability in *all* topics, which is clearly undesirable. This is impossible to represent with Dirichlet distributions. For example, the blue dots in Figure 1(c) are ϕ samples from the Dirichlet tree in Figure 1(a), plotted on the probability simplex of dimension three. While it is always true that *p*(*A*) ≈ *p*(*B*), their total probability mass can be anywhere from 0 to 1. The most similar Dirichlet distribution is perhaps the one with parameters (50,50,1), which generates samples close to (0.5, 0.5, 0) (Figure 1(d).)

Cannot-Links are considerably harder to handle. We first transform them into an alternative form that is amenable to Dirichlet trees. Note that Cannot-Links are not transitive: Cannot-Link (A,B) and Cannot-Link (B,C) does not entail Cannot-Link (A,C). We define a Cannot-Link-graph where the nodes are words^{1}, and the edges correspond to the Cannot-Links. Then the *connected components* of this graph are independent of each other when encoding Cannot-Links. We will use this property to factor a Dirichlet-tree selection probability later. For example, the two Cannot-Links (A,B) and (B,C) form the graph in Figure 1(e) with a single connected component {A,B,C}.

Consider the subgraph on connected component *r*. We define its *complement graph* by flipping the edges (on to off, off to on), as shown in Figure 1(f). Let there be *Q*^{(r)} *maximal cliques* *M*_{r1} … *M*_{rQ(r)} in this complement graph. In the following, we simply call them “cliques”, but it is important to remember that they are maximal cliques of the complement graph, not the original Cannot-Link-graph. In our example, *Q*^{(r)} = 2 and *M*_{r1} = {*A*, *C*}, *M*_{r2} = {*B*}. These cliques have the following interpretation: each clique (e.g., *M*_{r1} = {*A*, *C*}) is the maximal subset of words in the connected component that can “occur together”. That is, these words are allowed to simultaneously have large probabilities in a given topic without violating any Cannot-Link preferences. By the maximality of these cliques, allowing any word outside the clique (e.g., “B”) to also have a large probability will violate at least 1 Cannot-Link (in this example 2).

We discuss the encoding for this single connected component *r* now, deferring discussion of the complete encoding to section 3.4. We create a mixture model of *Q*^{(r)} Dirichlet subtrees, one for each clique. Each topic selects exactly one subtree according to probability

$$p(q)\phantom{\rule{thinmathspace}{0ex}}\propto \phantom{\rule{thinmathspace}{0ex}}|{M}_{rq}|,\text{}q=1\dots {Q}^{(r)}.$$

(6)

Conceptually, the selected subtree indexed by *q* tends to redistribute nearly all probability mass to the words within *M _{rq}*. Since there is no mass left for other cliques, it is impossible for a word outside clique

Finally, we mention that although in the worst case the number of maximal cliques *Q*^{(r)} in a connected component of size |*r*| can grow exponentially as *O*(3^{|r|/3}) (Griggs et al., 1988), in our experiments *Q*^{(r)} is no larger than 3, due in part to Must-Linked words “collapsing” to single nodes in the Cannot-Link graph.

In general, our domain knowledge is expressed by a set of Must-Links and Cannot-Links. We first compute the transitive closure of Must-Links. We then form a Cannot-Link-graph, where a node is either a Must-Link closure or a word not present in any Must-Link. Note that the domain knowledge must be “consistent” in that no pairs of words are simultaneously Cannot-Linked and Must-Linked (either explicitly or implicitly through Must-Link transitive closure.) Let *R* be the number of connected components in the Cannot-Link-graph. Our Dirichlet Forest consists of
$\prod}_{r=1}^{R}{Q}^{(r)$ Dirichlet trees, represented by the template in Figure 2. Each Dirichlet tree has *R* branches beneath the root, one for each connected component. The trees differ in which subtrees they include under these branches. For the *r*-th branch, there are *Q*^{(r)} possible Dirichlet subtrees, corresponding to cliques *M*_{r1} … *M*_{rQ(r)}. Therefore, a tree in the forest is uniquely identified by an index vector **q** = (*q*^{(1)} … *q*^{(R)}), where *q*^{(r)} {1 … *Q*^{(r)}}.

To draw a Dirichlet tree **q** from the prior DirichletForest(β, η), we select the subtrees independently because the *R* connected components are independent with respect to Cannot-Links:
$p(\mathbf{q})={\displaystyle {\prod}_{r=1}^{R}p({q}^{(r)})}$. Each *q*^{(r)} is sampled according to (6), and corresponds to choosing a solid box for the *r*-th branch in Figure 2. The structure of the subtree within the solid box has been defined in Section 3.3. The black nodes may be a single word, or a Must-Link transitive closure having the subtree structure shown in the dotted box. The edge weight leading to most nodes *k* is γ^{(k)} = |*L*(*k*)|β, where *L*(*k*) is the set of leaves under *k*. However, for edges coming out of a Must-Link internal node or going into a Cannot-Link internal node, their weights are multiplied by the strength parameter η. These edges are marked by “η*” in Figure 2.

We now define the complete Dirichlet Forest model, integrating out (“collapsing”) θ and ϕ. Let
${n}_{j}^{(d)}$ be the number of word tokens in document *d* that are assigned to topic *j*. **z** is generated the same as in LDA:

$$p(\mathbf{z}|\alpha )={\left(\frac{\mathrm{\Gamma}(T\alpha )}{\mathrm{\Gamma}{(\alpha )}^{T}}\right)}^{D}{\displaystyle \prod _{d=1}^{D}\frac{{\displaystyle {\prod}_{j=1}^{T}\mathrm{\Gamma}({n}_{j}^{(d)}+\alpha )}}{\mathrm{\Gamma}({n}_{.}^{(d)}+T\alpha )}.}$$

There is one Dirichlet tree **q**_{j} per topic *j* = 1 … *T*, sampled from the Dirichlet Forest prior
$p({\mathbf{q}}_{j})={\displaystyle {\prod}_{r=1}^{R}p({q}_{j}^{(r)})}$. Each Dirichlet tree **q**_{j} implicitly defines its tree edge weights
${\mathrm{\gamma}}_{j}^{(\xb7)}$ using β, η, and its tree structure *L _{j}*,

$$p(\mathbf{w}|{\mathbf{q}}_{1:T},\mathbf{z},\beta ,\eta )={\displaystyle \prod _{j=1}^{T}{\displaystyle \prod _{s}^{{I}_{j}}\phantom{\rule{thinmathspace}{0ex}}\left(\frac{\mathrm{\Gamma}\phantom{\rule{thinmathspace}{0ex}}\left({\displaystyle {\sum}_{k}^{{C}_{j}(s)}{\mathrm{\gamma}}_{j}^{(k)}}\right)}{\mathrm{\Gamma}\phantom{\rule{thinmathspace}{0ex}}\left({\displaystyle {\sum}_{k}^{{C}_{j}(s)}({\mathrm{\gamma}}_{j}^{(k)}+{n}_{j}^{(k)})}\right)}{\displaystyle \prod _{k}^{{C}_{j}(s)}\frac{\mathrm{\Gamma}\phantom{\rule{thinmathspace}{0ex}}({\mathrm{\gamma}}_{j}^{(k)}+{n}_{j}^{(k)})}{\mathrm{\Gamma}\phantom{\rule{thinmathspace}{0ex}}({\mathrm{\gamma}}_{j}^{(k)})}}\right)\phantom{\rule{thinmathspace}{0ex}}.}}$$

Finally, the complete generative model is

$$p(\mathbf{w},\mathbf{z},{\mathbf{q}}_{1:T}|\alpha ,\beta ,\eta )=p(\mathbf{w}|{\mathbf{q}}_{1:T},\mathbf{z},\beta ,\eta )p(\mathbf{z}|\alpha ){\displaystyle \prod _{j=1}^{T}p({\mathbf{q}}_{j}).}$$

Because a Dirichlet Forest is a mixture of Dirichlet trees, which are conjugate to multinomials, we can efficiently perform inference by Markov Chain Monte Carlo (MCMC). Specifically, we use collapsed Gibbs sampling similar to Griffiths and Steyvers (2004). However, in our case the MCMC state is defined by both the topic labels **z** and the tree indices **q**_{1:T}. An MCMC iteration in our case consists of a sweep through both **z** and **q**_{1:T}. We present the conditional probabilities for collapsed Gibbs sampling below.

(**Sampling** *z _{i}*

$$p({z}_{i}=\upsilon |{\mathbf{z}}_{-i},{\mathbf{q}}_{1:T},\mathbf{w})\propto ({n}_{-i,\upsilon}^{(d)}+\alpha ){\displaystyle \prod _{s}^{{I}_{\upsilon}(\uparrow i)}\frac{{\mathrm{\gamma}}_{\upsilon}^{({C}_{\upsilon}(s\downarrow i))}+{n}_{-i,\upsilon}^{({C}_{\upsilon}(s\downarrow i))}}{{\displaystyle {\sum}_{k}^{{C}_{\upsilon}(s)}\left({\mathrm{\gamma}}_{\upsilon}^{(k)}+{n}_{-i,\upsilon}^{(k)}\right)}},}$$

where *I*_{υ}(↑ *i*) denotes the subset of internal nodes in topic υ’s Dirichlet tree that are ancestors of leaf *w _{i}*, and

(**Sampling**
${q}_{j}^{(r)}$**):** Since the connected components are independent, sampling the tree **q**_{j} factors into sampling the cliques for each connected component
${q}_{j}^{(r)}$. For candidate cliques *q*′ = 1 … *Q*(*r*), we have

$$p({q}_{j}^{(r)}=q\prime |\mathbf{z},{\mathbf{q}}_{-j},{\mathbf{q}}_{j}^{(-r)},\mathbf{w})\propto \left({\displaystyle \sum _{k}^{{M}_{rq\prime}}{\beta}_{k}}\right)\phantom{\rule{thinmathspace}{0ex}}\times \phantom{\rule{thinmathspace}{0ex}}{\displaystyle \prod _{s}^{{I}_{j,r=q\prime}}\left(\frac{\mathrm{\Gamma}\phantom{\rule{thinmathspace}{0ex}}\left({\displaystyle {\sum}_{k}^{{C}_{j}(s)}{\mathrm{\gamma}}_{j}^{(k)}}\right)}{\mathrm{\Gamma}\phantom{\rule{thinmathspace}{0ex}}\left({\displaystyle {\sum}_{k}^{{C}_{j}(s)}({\mathrm{\gamma}}_{j}^{(k)}}+{n}_{j}^{(k)})\right)}{\displaystyle \prod _{k}^{{C}_{j}(s)}\frac{\mathrm{\Gamma}\phantom{\rule{thinmathspace}{0ex}}({\mathrm{\gamma}}_{j}^{(k)}+{n}_{j}^{(k)})}{\mathrm{\Gamma}\phantom{\rule{thinmathspace}{0ex}}({\mathrm{\gamma}}_{j}^{(k)})}}\right)}$$

where *I*_{j,r=q′} denotes the internal nodes below the *r*-th branch of tree **q**_{j}, when clique *M*_{rq′} is selected.

(**Estimating** ϕ **and** θ**):** After running MCMC for sufficient iterations, we follow standard practice (e.g. (Griffiths & Steyvers, 2004)) and use the last sample (**z**, **q**_{1:T}) to estimate ϕ and θ. Because a Dirichlet tree is a conjugate distribution, its posterior is a Dirichlet tree with the same structure and updated edge weights. The posterior for the Dirichlet tree of the *j*-th topic is
${{\mathrm{\gamma}}^{\mathit{\text{post}}}}_{j}^{(k)}={\mathrm{\gamma}}_{j}^{(k)}+{n}_{j}^{(k)}$, where the counts
${n}_{j}^{(k)}$ are collected from **z**, **q**_{1:T}, **w**. We estimate ϕ_{j} by the first moment under this posterior (Minka, 1999):

$${\widehat{\varphi}}_{j}^{(w)}={\displaystyle \prod _{s}^{{I}_{j}(\uparrow w)}{{\mathrm{\gamma}}^{\mathit{\text{post}}}}_{j}^{({C}_{j}(s\downarrow w))}\phantom{\rule{thinmathspace}{0ex}}{\left({\displaystyle \sum _{s\prime}^{{C}_{j}(s)}{{\mathrm{\gamma}}^{\mathit{\text{post}}}}_{j}^{(s\prime )}}\right)}^{-1}.}$$

(7)

The parameter θ is estimated the same way as in standard LDA: ${\widehat{\theta}}_{j}^{(d)}=({n}_{j}^{(d)}+\alpha )/({n}_{.}^{(d)}+T\alpha )$.

We present results on synthetic datasets to show how the Dirichlet Forest (DF) incorporates different types of knowledge. Recall that DF with η = 1 is equivalent to standard LDA (verified with the code of (Griffiths & Steyvers, 2004)).

Previous studies often take the last MCMC sample (**z** and **q**_{1:T}), and discuss the topics ϕ_{1:T} derived from that sample. Because of the stochastic nature of MCMC, we argue that more insight can be gained if multiple independent MCMC samples are considered. For each dataset, and each DF with a different η, we run a long MCMC chain with 200,000 iterations of burn-in, and take out a sample every 10,000 iterations afterward, for a total of 200 samples. We have some indication that our chain is well-mixed, as we observe all expected modes, and that samples with “label switching” (i.e., equivalent up to label permutation) occur with near equal frequency. For each sample, we derive its topics ϕ_{1:T} with (7) and then greedily align the ϕ’s from different samples, permuting the *T* topic labels to remove the label switching effect. Within a dataset, we perform PCA on the baseline (η = 1) ϕ and project all samples into the resulting space to obtain a common visualization (each row in Figure 3. Points are dithered to show overlap.).

The corpus consists of six documents over a vocabulary of five “words.” The documents are: ABAB, CDCD, and EEEE, each represented twice. We let *T* = 2, α = 0.5, β = 0.01. LDA produces three kinds of ϕ_{1:T}: roughly a third of the time the topics are around
$[\frac{A}{2}\frac{B}{2}|\frac{C}{4}\frac{D}{4}\frac{E}{2}]$, which is shorthand for
${\varphi}_{1}=(\frac{1}{2},\frac{1}{2},0,0,0)\phantom{\rule{thinmathspace}{0ex}}{\varphi}_{2}=(0,0,\frac{1}{4},\frac{1}{4},\frac{1}{2})$ on the vocabulary ABCDE. Another third are around
$[\frac{A}{4}\frac{B}{4}\frac{E}{2}|\frac{C}{2}\frac{D}{2}]$, and the final third around
$[\frac{A}{4}\frac{B}{4}\frac{C}{4}\frac{D}{4}|E]$. They correspond to clusters 1,2 and 3 respectively in the upper-left panel of Figure 3. We add a single Must-Link (B,C). When η = 10, the data still override our Must-Link somewhat because clusters 1 and 2 do not disappear completely. As η increases to 50, Must-Link overrides the data and clusters 1 and 2 vanish, leaving only cluster 3. That is, running DF and taking the last sample is very likely to obtain the
$[\frac{A}{4}\frac{B}{4}\frac{C}{4}\frac{D}{4}|E]$ topics. This is what we want: B and C are present or absent *together* in the topics and they also “pull” A, D along, even though A, D are not in the knowledge we added.

The corpus has four documents: ABCCABCC, ABDDABDD, twice each; *T* = 3, α = 1, β = 0.01. LDA produces six kinds of ϕ_{1:T} evenly:
$[\frac{B}{2}\frac{D}{2}|A|C],[\frac{A}{2}\frac{B}{2}|C|D],[\frac{A}{2}\frac{D}{2}|B|C],[\frac{B}{2}\frac{C}{2}|A|D],[\frac{A}{2}\frac{C}{2}|B|D],[\frac{C}{2}\frac{D}{2}|A|B]$, corresponding to clusters 1–5 and the “lines”. We add a single Cannot-Link (A,B). As DF η increases, cluster 2
$[\frac{A}{2}\frac{B}{2}|C|D]$ disappears, because it involves a topic
$\frac{A}{2}\frac{B}{2}$ that violates the Cannot-Link. Other clusters become uniformly more likely.

The corpus has four documents, all of which are ABC; *T* = 2, α = 1, β = 0.01. LDA produces three clusters evenly:
$[\frac{A}{2}\frac{C}{2}|B],[\frac{A}{2}\frac{B}{2}|C],[\frac{B}{2}\frac{C}{2}|A]$. We add **Isolate**(B), which is compiled into Cannot-Link (B,A) and Cannot-Link (B,C). The DF’s samples concentrate to cluster 1:
$[\frac{A}{2}\frac{C}{2}|B]$, which indeed isolates B into its own topic.

The corpus has six documents: ABCDEEEE, ABCDFFFF, each present three times; α = 0.5, β = 0.01. LDA with *T* = 3 produces a large portion of topics around
$[\frac{A}{4}\frac{B}{4}\frac{C}{4}\frac{D}{4}|E|F]$ (not shown). We add **Split**(AB,CD), which is compiled into Must-Link (A,B), Must-Link (C,D), Cannot-Link (B,C), and increase *T* = 4. However, DF with η = 1 (i.e., LDA with *T* = 4) produces a large variety of topics: e.g., cluster 1 is
$[\frac{A}{4}\frac{3B}{8}\frac{3D}{8}|\frac{A}{8}\frac{7F}{8}|C|E]$, cluster 2 is
$[\frac{C}{8}\frac{7D}{8}|\frac{3A}{8}\frac{3B}{8}\frac{C}{4}|E|F]$, and cluster 7 is
$[\frac{A}{2}\frac{B}{2}|\frac{C}{2}\frac{D}{2}|E|F]$. That is, simply adding one more topic does not clearly separate AB and CD. On the other hand, with η increasing, DF eventually concentrates on cluster 7, which satisfies the Split operation.

We now consider *interactive topic modeling* with DF. The corpus we use is a collection of 89,574 New Year’s wishes submitted to The Times Square Alliance (Goldberg et al., 2009). Each wish is treated as a document, downcased but without stop-word removal. For each step in our interactive example, we set α = 0.5, β = 0.1, η = 1000, and run MCMC for 2000 iterations before estimating the topics from the final sample. The domain knowledge in DF is accumulative along the steps.

**Step 1:** We run LDA with *T* = 15. Many of the most probable words in the topics are conventional (“to, and”) or corpus-specific (“wish, 2008”) stop-words, which obscure the meaning of the topics.

**Step 2:** We manually create a 50-word stopword list, and issue an **Isolate** preference. This is compiled into Must-Links among this set and Cannot-Links between this set and all other words in the top 50 for all topics. *T* is increased to 16. After running DF, we end up with two stop-word topics. Importantly, with the stop-words explained by these two topics, the top words for the other topics become much more meaningful.

**Step 3:** We notice that one topic conflates two concepts: enter college and cure disease (top 8 words: “go school cancer into well free cure college”). We issue **Split**(“go,school,into,college”, “cancer,free,cure,well”) to separate the concepts. This is compiled into Must-Links within each quadruple, and a Cannot-Link between them. *T* is increased to 18. After running DF, one of the topics clearly takes on the “college” concept, picking up related words which we did not explicitly encode in our prior. Another topic does likewise for the “cure” concept (many wishes are like “mom stays cancer free”). Other topics have minor changes.

**Step 4:** We then notice that two topics correspond to romance concepts. We apply **Merge**(“love, forever, marry, together, loves”, “meet, boyfriend, married, girlfriend, wedding”), which is compiled into Must-Links between these words. *T* is decreased to 17. After running DF, one of the romance topics disappears, and the remaining one corresponds to the merged romance topic (“lose”, “weight” were in one of them, and remain so). Other previous topics survive with only minor changes. Table 1 shows the wish topics after these four steps, where we place the DF operations next to the most affected topics, and color-code the words explicitly specified in the domain knowledge.

Whereas the previous experiment illustrates the utility of our approach in an interactive setting, we now consider a case in which we use background knowledge from an ontology to guide topic modeling. Our prior knowledge is based on six concepts. The concepts `transcription`, `translation` and `replication` characterize three important *processes* that are carried out at the molecular level. The concepts `initiation`, `elongation` and `termination` describe *phases* of the three aforementioned processes. Combinations of concepts from these two sets correspond to concepts in the Gene Ontology (e.g., GO:0006414 is `translational elongation`, and GO:0006352 is `transcription initiation`). We guide our topic modeling using Must-Links among a small set of words for each concept. Moreover, we use Cannot-Links among words to specify that we prefer (i) `transcription`, `translation` and `replication` to be represented in separate topics, and (ii) `initiation`, `elongation` and `termination` to be represented in separate topics. We do not set any preferences between the “process” topics and the “phase” topics, however.

The corpus that we use for our experiments consists of 18,193 abstracts selected from the MEDLINE database for their relevance to yeast genes. We induce topic models using DF to encode the Must-Links and Cannot-Links described above, and use standard LDA as a control. We set *T* = 100, α = 0.5, β = 0.1, η = 5000. For each word that we use to seed a concept, Table 2 shows the topics that include it among their 50 most probable words. We make several observations about the DF-induced topics. First, each concept is represented by a small number of topics and the Must-Link words for each topic all occur as highly probable words in these topics. Second, the Cannot-Link preferences are obeyed in the final topics. Third, the topics use the process and phase topics in a compositionally. For example, DF Topic 4 represents `transcription initiation` and DF Topic 8 represents `replication initiation`. Moreover, the topics that are significantly influenced by the prior typically include highly relevant terms among their most probable words. For example, the top words in DF Topic 4 include “TATA”, “TFIID”, “promoter”, and “recruitment” which are all specifically germane to the composite concept of `transcription initiation`. In the case of standard LDA, the seed concept words are dispersed across a greater number of topics, and highly related words, such as “cycle” and “division” often do not fall into the same topic. Many of the topics induced by ordinary LDA are semantically coherent, but the specific concepts suggested by our prior do not naturally emerge without using DF.

This work was supported by NIH/NLM grants T15 LM07359 and R01 LM07050, and the Wisconsin Alumni Research Foundation.

Appearing in *Proceedings of the 26 ^{th} International Conference on Machine Learning*, Montreal, Canada, 2009.

^{1}When there are Must-Links, all words in a Must-Link transitive closure form a single node in this graph.

^{2}Dirichlet distributions with very small concentration do have some selection effect. For example, Beta(0.1,0.1) tends to concentrate probability mass on one of the two variables. However, such priors are weak – the “pseudo counts” in them are too small because of the small concentration. The posterior will be dominated by the data, and we would lose any encoded domain knowledge.

- Basu S, Davidson I, Wagstaff K, editors. Constrained clustering: Advances in algorithms, theory, and applications. Chapman & Hall/CRC; 2008.
- Blei D, Lafferty J. Advances in neural information processing systems. Vol. 18. Cambridge, MA: MIT Press; 2006. Correlated topic models; pp. 147–154.
- Blei D, Ng A, Jordan M. Latent Dirichlet allocation. Journal of Machine Learning Research. 2003;3:993–1022.
- Chemudugunta C, Holloway A, Smyth P, Steyvers M. Modeling documents by combining semantic concepts with unsupervised statistical learning; Intl. Semantic Web Conf; Springer; 2008. pp. 229–244.
- Dennis SY., III On the hyper-Dirichlet type 1 and hyper-Liouville distributions. Communications in Statistics – Theory and Methods. 1991;20:4069–4081.
- Goldberg A, Fillmore N, Andrzejewski D, Xu Z, Gibson B, Zhu X. May all your wishes come true: A study of wishes and how to recognize them; Human Language Technologies: Proc. of the Annual Conf. of the North American Chapter of the Assoc. for Computational Linguistics; ACL Press; 2009.
- Griffths TL, Steyvers M. Finding scientific topics. Proc. of the Nat. Academy of Sciences of the United States of America. 2004;101:5228–5235. [PubMed]
- Griggs JR, Grinstead CM, Guichard DR. The number of maximal independent sets in a connected graph. Discrete Math. 1988;68:211–220.
- Li W, McCallum A. Pachinko allocation: DAG-structured mixture models of topic correlations; Proc. of the 23rd Intl. Conf. on Machine Learning; ACM Press; 2006. pp. 577–584.
- Minka TP. The Dirichlet-tree distribution (Technical Report) 1999. http://research.microsoft.com/~minka/papers/dirichlet/minka-dirtree.pdf.
- Tam Y-C, Schultz T. Correlated latent semantic model for unsupervised LM adaptation; IEEE Intl. Conf. on Acoustics, Speech and Signal Processing; 2007. pp. 41–44.
- The Gene Ontology Consortium. Gene Ontology: Tool for the unification of biology. Nature Genetics. 2000;25:25–29. [PMC free article] [PubMed]

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |