J Comput Graph Stat. Author manuscript; available in PMC 2017 March 9.

Published in final edited form as:

J Comput Graph Stat. 2016; 25(1): 301–320.

Published online 2016 March 9. doi: 10.1080/10618600.2014.1002927. PMCID: PMC4874344

NIHMSID: NIHMS785143

Hui Jiang, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109;

John Chong Mu: johnmu@johnmu.com; Kun Yang: kunyang@stanford.edu; Chao Du: cd2wb@virginia.edu; Luo Lu: llu@twitter.com

John Chong Mu, Present Address: Department of Bioinformatics, Bina Technologies, Redwood City, CA 94065

**Abstract**

Optional Pólya tree (OPT) is a flexible nonparametric Bayesian prior for density estimation. Despite its merits, the computation for OPT inference is challenging. In this paper we present a time complexity analysis for OPT inference and propose two algorithmic improvements. The first improvement, named limited-lookahead optional Pólya tree (LL-OPT), aims at accelerating the computation for OPT inference. The second improvement modifies the output of OPT or LL-OPT and produces a continuous piecewise linear density estimate. We demonstrate the performance of these two improvements using simulated and real data examples.

**1 Introduction**

Given independent random samples, characterizing their distribution is one of the most fundamental problems in statistics. Density estimation is a key approach to such problems and is closely related to clustering, classification and regression [Silverman, 1986]. Kernel density estimation [Rosenblatt, 1956, Parzen, 1962] and data-driven histograms [Scott, 1979, Barron et al., 1992] are popular methods for density estimation. Among the many approaches developed for density estimation, nonparametric Bayesian models have the merits of being highly flexible and computationally feasible. In a seminal paper, Ferguson [1973] proposed a class of prior distributions named Dirichlet process priors, whose flexible structures and analytically tractable posterior distributions have made them very popular for nonparametric Bayesian modeling. However, a random probability measure sampled from a Dirichlet process prior is discrete with probability one, which makes it unsuitable for density estimation. Using the Dirichlet process to model an infinite mixture of distributions gives rise to absolutely continuous distributions [Lo, 1984]. In practice, the Dirichlet process Gaussian mixture model (DP-GMM) is commonly used for density estimation [Escobar and West, 1995]. Detailed discussions of the Dirichlet process and its extensions, as well as other Bayesian nonparametric models, can be found in Ghosh and Ramamoorthi [2003].

Proposed by Ferguson [1974] and studied in detail by Lavine [1992], Pólya tree priors are extensions of Dirichlet processes [Ferguson, 1974] and special cases of tailfree processes [Freedman, 1963]. A Pólya tree defines a prior distribution which gives rise to random probability measures on a separable measurable space (Ω, ℬ(Ω)) with respect to a base measure *μ*. In this paper we assume that Ω is a bounded hyperrectangle in ℝ^{*p*} and ℬ(Ω) is the Borel σ-field on Ω.

The Pólya tree prior [Lavine, 1992] works as follows: starting from the root region *A* = Ω, it defines a partition of *A* as *A* = *A*_{0} ∪ *A*_{1}, where *A*_{0} and *A*_{1} are disjoint. It then defines partitions for *A*_{0} and *A*_{1} as *A*_{0} = *A*_{00} ∪ *A*_{01} and *A*_{1} = *A*_{10} ∪ *A*_{11}, respectively. This partition process recursively goes on forever: for any binary sequence *s* = *s*_{1} … *s_{k}* with *s_{j}* ∈ {0, 1}, the region *A_{s}* is further partitioned as *A_{s}* = *A*_{*s*0} ∪ *A*_{*s*1}. The prior then assigns random probability mass by drawing independent variables *Y_{s}* ~ Beta(*α*_{*s*0}, *α*_{*s*1}) and setting

$$Q({A}_{s})=\left(\prod _{j=1;{s}_{j}=0}^{k}{Y}_{{s}_{1}\dots {s}_{j-1}}\right)\left(\prod _{j=1;{s}_{j}=1}^{k}\left(1-{Y}_{{s}_{1}\dots {s}_{j-1}}\right)\right)$$

for any subregion *A_{s}*, where *Y_{s}* is interpreted as the fraction of the probability mass of *A_{s}* allocated to *A*_{*s*0}.

In order for the Pólya tree prior to produce absolutely continuous measures, the Beta parameters *α_{s}* have to increase rapidly as the depth of partitioning increases [Kraft, 1964, Mauldin et al., 1992]. Furthermore, because the partitioning process goes on forever, with probability one the density sampled from a Pólya tree prior will have discontinuities almost everywhere. To solve these problems, Wong and Ma [2010] proposed the optional Pólya tree (OPT), an extension of the Pólya tree prior that allows for optional stopping and random selection of partitioning variables. It has three attractive properties: 1) It fits the data adaptively; 2) It produces absolutely continuous distributions; and 3) The computation for exact OPT inference can be done in finitely many steps, provided there are no replicates in the data. OPT has been successfully applied in both one-sample [Wong and Ma, 2010] and two-sample [Ma and Wong, 2011] settings.

Despite all these merits, there are issues which may hinder the application of OPT. Two major ones are: 1) the running time for exact OPT inference increases rapidly with the sample size and the dimension of the data, and becomes prohibitive when the dimension is high; 2) the posterior mean of OPT is usually computationally prohibitive to estimate [Wong and Ma, 2010], and although the hierarchical maximum a posteriori (hMAP) estimate of OPT proposed by Wong and Ma [2010] is computationally friendlier, it gives rise to piecewise constant density estimates with discontinuities at the boundaries between pieces, which is undesirable in applications where continuous densities are sought.

In this paper we attempt to tackle the above two problems. In particular, we demonstrate that the computation for OPT inference can be greatly accelerated by performing approximate inference. We also present an approach which modifies the output of OPT and produces a continuous piecewise linear density estimate. This paper is organized as follows: In Section 2 we review the construction of OPT and the algorithm for its exact inference. In Section 3 we present an approach for reducing the computation named limited-lookahead optional Pólya tree (LL-OPT). Section 4 gives the time complexity analysis for OPT and LL-OPT. In Section 5 we present an approach for the estimation of smooth densities. Simulated and real data examples are given in Section 6. Further discussions and software availability are given in Section 7.

All computations reported in this paper were performed on a single core of an Intel Xeon X7560 2.27GHz processor.

**2 Optional Pólya Tree and its inference**

Similar to Pólya trees, OPT operates through two simultaneous processes: the first recursively and randomly partitions the space Ω, and the second recursively assigns a random probability measure to each region in the partition generated by the first.

The recursive partitioning process works as follows: starting from the root region *A* = Ω, a random binary variable *S* is drawn according to Bernoulli(*ρ*(*A*)), where *ρ*(*A*) is a parameter of the OPT. If *S* = 1, we stop further partitioning *A* and mark it as a terminal region. If *S* = 0, a random integer *J* ∈ {1, …, *J_{A}*} is drawn according to a probability vector (*λ*_{1}(*A*), …, *λ*_{*J_{A}*}(*A*)). If *J* = *j*, the *j*-th way of partitioning is selected and *A* is divided into disjoint subregions

$$\begin{array}{c}A=\underset{i=1}{\overset{{I}_{A}^{j}}{\cup}}{A}_{i}^{j},\\ \forall i,k\in \{1,\dots ,{I}_{A}^{j}\},\text{if}\phantom{\rule{0.2em}{0ex}}i\ne k,\text{then}\phantom{\rule{0.2em}{0ex}}{A}_{i}^{j}\cap {A}_{k}^{j}=\varnothing .\end{array}$$

The partitioning process is then recursively applied to each subregion
${A}_{i}^{j}$ of *A*, and consequently, a partition tree is built by the process recursively. The partitioning process terminates when all the regions in the partition are marked as terminal regions.
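The recursive partitioning process can be sketched in code. The sketch below is a minimal illustration, not the authors' implementation: it uses binary midpoint cuts with a uniform choice of partitioning variable (λ_{j} = 1/*p*), a constant stopping probability ρ, and an artificial depth cap standing in for the almost-sure termination of the real process.

```python
import random

random.seed(0)  # reproducible draw

def sample_partition(region, rho=0.5, max_depth=8):
    """Draw one random partition from the OPT prior (sketch).

    region: tuple of (low, high) intervals, one per coordinate.
    With probability rho the region stops and becomes terminal;
    otherwise a partitioning variable J is chosen uniformly and the
    region is cut at the midpoint of that coordinate.  max_depth is
    an artificial cap, not part of the prior.
    """
    if max_depth == 0 or random.random() < rho:
        return [region]                       # terminal region
    j = random.randrange(len(region))         # J ~ uniform on {1, ..., p}
    lo, hi = region[j]
    mid = (lo + hi) / 2.0
    left = region[:j] + ((lo, mid),) + region[j + 1:]
    right = region[:j] + ((mid, hi),) + region[j + 1:]
    return (sample_partition(left, rho, max_depth - 1)
            + sample_partition(right, rho, max_depth - 1))

def volume(region):
    v = 1.0
    for lo, hi in region:
        v *= hi - lo
    return v

# one draw on the unit square: terminal regions tile [0,1]^2
terminal = sample_partition(((0.0, 1.0), (0.0, 1.0)))
total = sum(volume(r) for r in terminal)
```

The terminal regions are pairwise disjoint and their volumes sum to the volume of Ω, mirroring the disjoint-union condition above.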

The process for random assignment of probability measures works along with the partitioning process. It begins with a probability measure *Q*^{(0)} which is uniform within the root region *A* = Ω. Whenever a region *A* is partitioned into subregions
${\{{A}_{i}^{j}\}}_{i=1}^{{I}_{A}^{j}}$, a random vector
${\mathit{\theta}}_{A}^{j}=({\theta}_{A,1}^{j},\dots ,{\theta}_{A,{I}_{A}^{j}}^{j})$ is drawn from a Dirichlet distribution with parameter
${\mathit{\alpha}}^{j}(A)=({\alpha}_{1}^{j}(A),\dots ,{\alpha}_{{I}_{A}^{j}}^{j}(A))$ and then a new probability measure *Q*^{(k+1)} is built based on *Q*^{(k)} by reassigning the probability mass within *A* according to
${Q}^{(k+1)}({A}_{i}^{j})={\theta}_{A,i}^{j}{Q}^{(k)}(A)$, where *k* = *k*(*A*) is the level of region *A* in the partition tree with *k*(Ω) = 0. Within each subregion
${A}_{i}^{j}$, the probability mass is again uniformly distributed.

For the OPT procedure described above, there exists a limiting probability measure for *Q*^{(k)} [Wong and Ma, 2010, Theorem 1]: If *ρ*(*A*) is uniformly bounded away from 0 and 1 for all *A*, then *Q*^{(k)} converges almost surely in variational distance to a probability measure *Q* which is absolutely continuous with respect to *μ*. The density of *Q* with respect to *μ* on Ω is piecewise constant with countably many pieces.

As a prior distribution on the space of probability measures, one merit of OPT is that the computation for exact OPT inference is analytically manageable. Given independent random samples **x** = (*x*_{1}, …, *x_{n}*) from a distribution drawn from an OPT prior, the posterior is again an OPT, with parameters updated as follows [Wong and Ma, 2010]:

- Stopping probabilities: *ρ*(*A*|**x**) = *ρ*(*A*)Φ_{0}(*A*)/Φ(*A*),
- Selection probabilities:$${\lambda}_{j}(A|\mathit{x})\propto {\lambda}_{j}(A)\frac{D({\mathit{n}}_{A}^{j}+{\mathit{\alpha}}^{j}(A))}{D({\mathit{\alpha}}^{j}(A))}\prod _{i=1}^{{I}_{A}^{j}}\mathrm{\Phi}({A}_{i}^{j}),\phantom{\rule{0.5em}{0ex}}j=1,\dots ,{J}_{A},$$
- Probability mass allocation:$${\mathit{\alpha}}^{j}(A|\mathit{x})={\mathit{\alpha}}^{j}(A)+{\mathit{n}}_{A}^{j},\phantom{\rule{0.5em}{0ex}}j=1,\dots ,{J}_{A},$$

where
${\mathit{n}}_{A}^{j}=({n}_{A,1}^{j},\dots ,{n}_{A,{I}_{A}^{j}}^{j})$ are the numbers of samples in **x** falling into each region of the *j*-th partition of *A*, Φ_{0}(*A*) is the value of Φ(*A*) obtained when all the samples in *A* are assumed to be uniformly distributed, and Φ(*A*) is defined recursively by

$$\mathrm{\Phi}(A)=\rho (A){\mathrm{\Phi}}_{0}(A)+(1-\rho (A))\sum _{j=1}^{{J}_{A}}{\lambda}_{j}(A)\frac{D({\mathit{n}}_{A}^{j}+{\mathit{\alpha}}^{j}(A))}{D({\mathit{\alpha}}^{j}(A))}\prod _{i=1}^{{I}_{A}^{j}}\mathrm{\Phi}({A}_{i}^{j}).$$

(2.1)

There are many different schemes for partitioning the regions. In this paper we consider a simple binary partitioning scheme proposed in Wong and Ma [2010]: for *A* = {(*t*_{1}, …, *t_{p}*) : *l_{j}* ≤ *t_{j}* < *r_{j}*, *j* = 1, …, *p*}, there are *J_{A}* = *p* possible ways to partition, where the *j*-th way divides *A* into two halves of equal volume at the midpoint of its *j*-th coordinate.

According to (2.1), exact OPT inference requires computing Φ(*A*) recursively, which is computationally challenging since the running time increases rapidly as the dimension of the data increases. This is because the value of Φ(*A*) has to be computed for every region *A* with at least one data point, and the number of such regions is enormous when the dimension is high. Figure 1 shows the running time for OPT inference for samples of various sizes and dimensions, using a fast algorithm for exact OPT inference which will be introduced in Section 4. The running time increases nearly exponentially with the dimension of the data and becomes prohibitive when the dimension is 8 or higher. The analysis of the time complexity of OPT inference is given in Section 4.

Figure 1. The running time for exact OPT inference using the cached implementation (see Section 4). Running times (in seconds) are plotted against the dimension of the data with the sample size fixed at 1,000 (left) and against the sample size (right).

**3 Acceleration of computation**

The computation for OPT inference can be greatly accelerated by performing the inference approximately. In a straightforward approach, named naive inexact optional Pólya tree (NI-OPT), we stop the recursion in (2.1) when the number of samples in a region *A* is small or when the volume of *A* is small, and compute Φ(*A*) by assuming that all the samples in *A* are uniformly distributed, so that no further partitioning of *A* is necessary. The rationale for these heuristics is that when the volume of *A* is small, error in Φ(*A*) is only likely to affect the estimated density locally, and when the number of samples in *A* is small, Φ(*A*) will not be too different from Φ′(*A*), the value computed by redistributing the samples in *A* uniformly.

In practice, we found that the running time of the NI-OPT approach, although significantly reduced compared with exact OPT, is still prohibitive when the sample size is large and the dimension is high. In this paper we propose a more efficient algorithm for approximate OPT inference, named limited-lookahead optional Pólya tree (LL-OPT), which works as follows: starting from the root region *A* = Ω, we perform the recursive computation in (2.1) for up to *h* levels of the partition tree, where *h* is a tuning parameter which provides a trade-off between running time and estimation accuracy. For a region *B* at level *k*(*B*) = *k*(*A*) + *h*, we stop the recursion and approximate Φ(*B*) by Φ′(*B*) using the uniform approximation, as in NI-OPT. Based on the computed Φ(*A*) and Φ(*A_{s}*), where *A_{s}* ranges over the subregions within the *h* lookahead levels, the posterior stopping and selection probabilities at *A* are computed and the partitioning decision at *A* is made; the procedure is then applied recursively to the resulting subregions, each with a fresh *h*-level lookahead.
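The depth-limited evaluation at the heart of LL-OPT can be sketched under strong simplifying assumptions: one dimension, binary midpoint cuts, a constant ρ, a symmetric Beta prior, and the uniform approximation Φ′(*B*) = Φ_{0}(*B*) = *μ*(*B*)^{−*n*} at the lookahead horizon. The base case Φ(*A*) = Φ_{0}(*A*) for regions holding at most one sample is what makes the exact recursion finite when there are no replicates (cf. Section 2). This is an illustrative toy, not the authors' implementation.

```python
import math

def D(a, b):
    # Beta (two-component Dirichlet) normalizing constant D(t1, t2)
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))

def phi(lo, hi, xs, h=None, rho=0.5, alpha=0.5):
    """Evaluate Phi(A) for A = [lo, hi) under 1-D binary midpoint cuts.

    h=None runs the exact recursion (2.1); an integer h stops after h
    levels and substitutes the uniform approximation Phi0(B), as
    LL-OPT does at its lookahead horizon.  rho and alpha are
    illustrative prior choices.
    """
    n = len(xs)
    phi0 = (hi - lo) ** (-n)            # all n samples uniform on A
    if n <= 1:                          # base case: Phi(A) = Phi0(A)
        return phi0
    if h is not None and h == 0:        # lookahead horizon reached
        return phi0
    mid = (lo + hi) / 2.0
    left = [x for x in xs if x < mid]
    right = [x for x in xs if x >= mid]
    nxt = None if h is None else h - 1
    grow = (D(len(left) + alpha, len(right) + alpha) / D(alpha, alpha)
            * phi(lo, mid, left, nxt) * phi(mid, hi, right, nxt))
    return rho * phi0 + (1 - rho) * grow

xs = [0.12, 0.35, 0.36, 0.78]
exact = phi(0.0, 1.0, xs)                               # full recursion
approx = {h: phi(0.0, 1.0, xs, h=h) for h in (0, 2, 4, 8)}
```

Once *h* exceeds the depth at which every region holds at most one sample, the limited-lookahead value coincides with the exact one, since each branch hits the base case before the horizon.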

There are connections to existing methods for Pólya tree inference. Lavine [1992] studied scenarios where exact inference for Pólya trees can be achieved, and Mauldin et al. [1992] and Lavine [1994] studied scenarios where the inference for Pólya trees is only performed up to a given level, under which errors of approximation can be estimated or bounded. The mixture of Pólya trees (MPT) is another approach for estimating smooth densities [Lavine, 1992]. The idea of early stopping was first proposed by Hutter [2009]. Bayesian recursive partitioning has also been proposed previously for classification problems [Denison et al., 1998].

**4 Time complexity analysis for OPT and LL-OPT**

In this section we study the time complexity of the exact OPT and the LL-OPT algorithms. For simplicity we consider only the simple binary partitioning scheme described in Section 2; however, the analysis technique may be adapted to more complicated partitioning schemes. For each region *A*, the exact OPT algorithm according to (2.1) has to undertake the following three computational tasks:

- For each subregion ${A}_{i}^{j}$ of *A*, *j* = 1, …, *p*, *i* = 1, 2, determine ${n}_{A,i}^{j}$, the count of samples within ${A}_{i}^{j}$.
- For each subregion ${A}_{i}^{j}$ of *A*, compute $\mathrm{\Phi}({A}_{i}^{j})$ recursively using (2.1).
- Calculate Φ(*A*) according to (2.1), based on the counts found in task 1 and the values $\mathrm{\Phi}({A}_{i}^{j})$ computed in task 2.

For task 1, the straightforward way of counting the number of samples in each of the 2*p* subregions of *A* takes *O*(2*pN*) operations, where *N* is the total number of samples in Ω. This cost can be reduced by trading space for time: if we store the set of samples in *A*, counting the number of samples in each of the 2*p* subregions of *A* takes only *O*(2*pn*) operations, where *n* is the number of samples in *A*. Let *f*(*n*) denote the time for computing Φ(*A*) for a region *A* with *n* samples. The time complexity for task 2 is therefore
${\sum}_{j=1}^{p}\{f({n}_{A,1}^{j})+f({n}_{A,2}^{j})\}$. For task 3, the time complexity is *O*(2*p*) if we assume that function *D*(**t**) can be computed in *O*(1) time. The time complexity for computing Φ(*A*) is therefore

$$\begin{array}{c}f(n)=O(2pn)+{\sum}_{j=1}^{p}\{f({n}_{A,1}^{j})+f({n}_{A,2}^{j})\}+O(2p)\\ =O(pn)+{\sum}_{j=1}^{p}\{f({n}_{A,1}^{j})+f({n}_{A,2}^{j})\}\end{array}$$

(4.1)

It is shown in (4.1) that the value of *f*(*n*) depends on the values of
${n}_{A,1}^{j}$ and
${n}_{A,2}^{j}$, which are determined by the distribution of samples within *A*. The more uniformly the samples are distributed in *A*, the less computation time is required. Here we estimate *f*(*n*) for a range of scenarios.

**Theorem 4.1.** For the following three cases, the time complexities of the exact OPT algorithm are *f*(*n*) = *O*(*pn*^{log_{1/*t*} 2*p*}) (case 1), *f*(*n*) = *O*(*n*^{2*p*−1}) (case 2) and *f*(*n*) = *O*(*p*^{*n*/*c*}) (case 3), respectively, where for every region *A* with *n* samples and every *j*:

- Case 1: ${n}_{A,1}^{j}=tn$ and ${n}_{A,2}^{j}=(1-t)n$, where 0.5 ≤ *t* < 1 is a constant.
- Case 2: ${n}_{A,1}^{j}=Tn$ and ${n}_{A,2}^{j}=(1-T)n$, where *T* is uniformly distributed on (0, 1).
- Case 3: ${n}_{A,1}^{j}=c$ and ${n}_{A,2}^{j}=n-c$, where *c* > 0 is a constant.

We see that case 2 represents a more realistic picture between the two extremes in case 1, namely *t* = 0.5 and *t* close to 1, and case 3 is the limit of case 1 as 1 − *t* goes to *c*/*n*. In effect, case 1 indicates that the number of levels to be explored is log_{1/*t*} *n*, and case 3 indicates that the number of levels is on the order of *n*/*c*.
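The case-1 recurrence can be checked numerically. With *t* = 0.5 it reduces to *f*(*n*) = *pn* + 2*p·f*(*n*/2), which should grow like *n*^{log₂ 2*p*}: doubling *n* multiplies *f* by roughly 2*p*. The dimension *p* = 3 and the unit base cost below are arbitrary illustrative choices.

```python
from functools import lru_cache

P = 3  # illustrative dimension

@lru_cache(maxsize=None)
def f(k):
    """Cost f(n) for n = 2**k under case 1 with t = 0.5:
    f(n) = P*n + 2*P*f(n/2), with f(1) = 1."""
    if k == 0:
        return 1.0
    return P * 2 ** k + 2 * P * f(k - 1)

# successive doubling ratios f(2n)/f(n); they approach 2*P from above
ratios = [f(k + 1) / f(k) for k in range(10, 20)]
```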

The simplest implementation of the OPT algorithm is to use depth-first search, which we call DF-OPT. We find DF-OPT to be very inefficient in practice, partly because regions that can be reached through many different paths of binary partitions have their Φ(·) values computed repeatedly, which is unnecessary. A more efficient implementation is to cache all the computed Φ(·) values so that for each region *A*, Φ(*A*) needs to be computed only once. In practice we find that this cached implementation accelerates the computation substantially, and we therefore use it as the default implementation in all simulations (denoted as OPT in the experiments). Comparisons between DF-OPT and cached OPT are given in Section 6. It is worth pointing out that although cached OPT is much faster than DF-OPT, it only works for limited sample sizes and dimensions because it uses a much larger amount of memory; DF-OPT, in contrast, has minimal memory usage. Furthermore, if a partitioning scheme other than binary partitioning is used, caching may become less effective or even ineffective.
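The benefit of caching can be illustrated by counting region visits rather than computing actual Φ(·) values. The toy below (illustrative *p* = 2, exploring all regions down to a fixed level) encodes a region as a tuple of per-coordinate dyadic intervals, an implementation choice not taken from the paper; the same region reached through different cut orders is revisited by the depth-first count but evaluated once by the cached count.

```python
P, MAX_LEVEL = 2, 6   # illustrative dimension and exploration depth

def children(region):
    """All 2p subregions from one binary midpoint cut; a region is a
    tuple of per-coordinate (depth, index) dyadic intervals."""
    out = []
    for j in range(P):
        d, i = region[j]
        for half in (0, 1):
            child = list(region)
            child[j] = (d + 1, 2 * i + half)
            out.append(tuple(child))
    return out

def level(region):
    return sum(d for d, _ in region)

def visits_df(region):
    """Size of the depth-first recursion tree (DF-OPT style):
    regions reachable via several cut orders are recomputed."""
    if level(region) == MAX_LEVEL:
        return 1
    return 1 + sum(visits_df(c) for c in children(region))

def visits_cached(region, seen):
    """Number of evaluations when each distinct region is computed once."""
    if region in seen:
        return 0
    seen.add(region)
    if level(region) == MAX_LEVEL:
        return 1
    return 1 + sum(visits_cached(c, seen) for c in children(region))

root = ((0, 0),) * P
df_count = visits_df(root)               # full 2p-ary recursion tree
cached_count = visits_cached(root, set())
```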

We found that in practice, the cached implementation of OPT is more likely to be limited by memory size rather than by running time. Therefore, we also study the space complexity of the cached OPT here.

**Proposition 4.2.** Under the binary partitioning scheme, the total number of regions at level *k* is $\binom{k+p-1}{k}{2}^{k}$, where *p* is the dimension of the sample space. The total number of regions up to level *k* is therefore ${\sum}_{i=0}^{k}\binom{i+p-1}{i}{2}^{i}$.

If we assume that the depth of the partition tree is on the order of log_{2} *n*, where *n* is the sample size, which is the case when the samples in Ω are distributed roughly uniformly, the space required by the cached OPT is $O\{{\sum}_{i=0}^{{\mathrm{log}}_{2}n}\binom{i+p-1}{i}{2}^{i}\}=O\{n\binom{{\mathrm{log}}_{2}n+p-1}{{\mathrm{log}}_{2}n}\}$.
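The count in Proposition 4.2 can be verified by brute-force enumeration for small *p* and *k*; the region encoding below (a tuple of per-coordinate dyadic intervals) is an implementation choice for the check, not from the paper.

```python
from math import comb

def regions_at_level(p, k):
    """Enumerate the distinct regions produced by k binary midpoint
    cuts of [0, 1]^p; a region is a tuple of (depth, index) dyadic
    intervals, one per coordinate."""
    regs = set()
    def expand(region, cuts_left):
        if cuts_left == 0:
            regs.add(region)
            return
        for j in range(p):
            d, i = region[j]
            for half in (0, 1):
                child = list(region)
                child[j] = (d + 1, 2 * i + half)
                expand(tuple(child), cuts_left - 1)
    expand(((0, 0),) * p, k)
    return regs

# enumerated counts vs. the closed form C(k+p-1, k) * 2^k
counts = {(p, k): len(regions_at_level(p, k))
          for p in (1, 2, 3) for k in (2, 3)}
formula = {(p, k): comb(k + p - 1, k) * 2 ** k
           for p in (1, 2, 3) for k in (2, 3)}
```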

The LL-OPT algorithm is friendlier in both CPU and memory requirements because for each cut it only looks ahead *h* levels; as a result, the time and space complexity for a single cut is bounded.

**Theorem 4.3.** For the LL-OPT algorithm with parameter *h*, the time complexity is *f*(*n*) = *O*{*n*(*n* + 2* ^{h}p*)

Theorems 4.1 and 4.3 show that the reduction on running time using the LL-OPT algorithm can be significant, especially when *n* and *p* are large. The amount of reduction is controlled by the tuning parameter *h*. The smaller the *h*, the faster the LL-OPT algorithm. However, a smaller *h* also leads to less accurate estimation. The following adaptive approach can be used for selecting *h*: run the LL-OPT algorithm many times starting with *h* = 1 and increase *h* by 1 each time until a stopping decision is made (either based on a cost constraint or a criterion evaluating the gain between two successive estimates). Theorem 4.3 shows that the total time for running the LL-OPT algorithm for *h* = 1, …, *k* – 1 is less than the run time for a single *h* = *k*.
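The adaptive rule for choosing *h* can be sketched as a simple loop. Here `estimate` and `distance` are caller-supplied stand-ins for running LL-OPT with a given *h* and for comparing two successive density estimates; the geometric toy passed in at the end is purely illustrative.

```python
def select_h(estimate, distance, tol=1e-3, h_max=10):
    """Increase the lookahead h until two successive LL-OPT estimates
    differ by less than tol (or h_max is reached).  Returns the chosen
    h and the corresponding estimate."""
    prev = estimate(1)
    for h in range(2, h_max + 1):
        cur = estimate(h)
        if distance(prev, cur) < tol:
            return h, cur
        prev = cur
    return h_max, prev

# toy stand-ins: "estimates" 1 - 0.5**h converge geometrically, and the
# discrepancy is plain absolute difference
h, est = select_h(lambda k: 1.0 - 0.5 ** k, lambda a, b: abs(a - b))
```

By Theorem 4.3, the cumulative cost of the runs for *h* = 1, …, *k* − 1 stays below that of the single run at *h* = *k*, so this loop costs at most about twice the final run.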

**5 Estimation of smooth density**

The densities estimated by OPT are piecewise constant, with discontinuities at the boundaries between pieces, which is undesirable in applications where continuous densities are sought. In this section we introduce a procedure which constructs a continuous piecewise linear approximation to the piecewise constant density estimated by OPT via quadratic programming. We call the resulting estimator the finite element estimator (FEE), because it was inspired by the idea of domain partitioning in the finite element method [Allaire, 2007].

We define a proper space for continuous piecewise linear densities as follows. We start with the definition of a triangulation of the domain Ω (without loss of generality, we assume Ω = [0, 1]^{*p*}) by simplices.

**Definition 5.1.** Let Ω = [0, 1]^{*p*}. A triangulation of Ω is a set Γ of *p*-simplices {Δ_{1}, …, Δ_{*m*}} such that:

- Δ_{*i*} ⊂ Ω (*i* = 1, …, *m*) and $\mathrm{\Omega}={\cup}_{i=1}^{m}{\mathrm{\Delta}}_{i}$.
- For any two distinct simplices Δ_{*i*} and Δ_{*j*}, if they share *k* (1 ≤ *k* ≤ *p*) vertices, then Δ_{*i*} ∩ Δ_{*j*} is the (*k* − 1)-simplex formed by these *k* vertices, and is the empty set otherwise. This condition guarantees that no vertex of Δ_{*i*} lies on the faces of Δ_{*j*} and vice versa, which excludes the undesired situations shown in Figure 2.

**Definition 5.2.** For an OPT partition
$\mathrm{\Omega}={\cup}_{i=1}^{l}{A}_{i}$, a triangulation of the OPT partition is a triangulation Γ of Ω where each region *A _{i}* is also triangulated by a subset of Γ. See Figure 3 for an example.

**Definition 5.3.** Given a triangulation Γ of an OPT partition of Ω, a function *f* : Ω → ℝ is called continuous piecewise linear over Γ if *f* is continuous and is linear within each simplex in Γ. The set of continuous piecewise linear functions over Γ is denoted 𝒫_{1}(Γ).

Let {*a*_{1}, …, *a_{k}*} be the set of all vertices in Γ. We define basis functions *φ*_{1}, …, *φ_{k}* ∈ 𝒫_{1}(Γ) by *φ_{i}*(*a_{j}*) = 1 if *i* = *j* and 0 otherwise, so that every *f* ∈ 𝒫_{1}(Γ) can be written as a linear combination of the *φ_{i}*.

It has been shown that the piecewise constant density estimated by OPT is weakly consistent [Wong and Ma, 2010, Theorem 4]. As the sample size grows, all the probability will be asymptotically concentrated in any weak neighborhood of the true distribution with bounded density. Here, we aim at achieving the following two properties with the continuous piecewise linear FEE.

- (fidelity property) The difference between the piecewise constant density and the continuous piecewise linear density must be small.
- (smoothness property) The variation of the continuous piecewise linear density within each region of the OPT partition must be small.

Unfortunately, in practice, these two properties often conflict with each other, so we use penalized optimization to achieve a trade-off. Let *f* ∈ 𝒫_{1}(Γ) be a density function which is continuous piecewise linear over Γ. The fidelity property can be enforced by the following penalty *p*(·):

$$p(f)=\sum _{j=1}^{m}{w}_{j}{[{\int}_{{\mathrm{\Delta}}_{j}}f(x)\mathit{\text{dx}}-Q({\mathrm{\Delta}}_{j})]}^{2},$$

where Δ_{*j*} is the *j*-th simplex in Γ, *w_{j}* ≥ 0 is a weight, and *Q*(Δ_{*j*}) is the probability mass assigned to Δ_{*j*} by the piecewise constant density estimated by OPT.

The smoothness property implies that the density function within any given simplex should be as flat as possible. Therefore, in a density plot (see Figure 5 for an example), the volume of the simplex on the top, denoted *μ*^{∗}(Δ_{*j*}), should be approximately equal to the volume of the simplex on the bottom, which is *μ*(Δ_{*j*}). This motivates the following smoothness penalty *q*(·):

$$q(f)=\sum _{j=1}^{m}{\left[\frac{{\mu}^{\ast}({\mathrm{\Delta}}_{j})}{\mu ({\mathrm{\Delta}}_{j})}\right]}^{2}.$$

To enforce both properties simultaneously, we minimize *h*(*f*) = *p*(*f*) + *λq*(*f*), where *λ* is a tuning parameter. Therefore, the continuous piecewise linear density estimate *f* can be obtained by solving the following optimization problem.

$$f=\underset{f\in {\mathcal{P}}_{1}(\mathrm{\Gamma})}{arg\; min}\sum _{j=1}^{m}{w}_{j}{[{\int}_{{\mathrm{\Delta}}_{j}}f(x)\mathit{\text{dx}}-Q({\mathrm{\Delta}}_{j})]}^{2}+\lambda \sum _{j=1}^{m}{\left[\frac{{\mu}^{\ast}({\mathrm{\Delta}}_{j})}{\mu ({\mathrm{\Delta}}_{j})}\right]}^{2},\text{subject to}\phantom{\rule{0.2em}{0ex}}f(x)\ge 0\phantom{\rule{0.2em}{0ex}}\text{for all}\phantom{\rule{0.2em}{0ex}}x\in \mathrm{\Omega}\phantom{\rule{0.2em}{0ex}}\text{and}\sum _{j=1}^{m}{\int}_{{\mathrm{\Delta}}_{j}}f(x)\mathit{\text{dx}}={\int}_{\mathrm{\Omega}}f(x)\mathit{\text{dx}}=1.$$

(5.1)

Within a given simplex Δ_{*j*} with vertices {*a*_{1}, …, *a*_{*p*+1}}, where *a_{i}* = (*a*_{*i*,1}, …, *a*_{*i*,*p*}), *f* can be written as

$$f(x)=\sum _{i=1}^{p+1}{c}_{i}{\varphi}_{i}(x),$$

where *c _{i}* ≥ 0. We have

$$\mu ({\mathrm{\Delta}}_{j})=\frac{1}{p!}\left|det\left(\begin{array}{ccccc}{a}_{1,1}& {a}_{1,2}& \cdots & {a}_{1,p}& 1\\ {a}_{2,1}& {a}_{2,2}& \cdots & {a}_{2,p}& 1\\ & & \cdots & & \\ {a}_{p+1,1}& {a}_{p+1,2}& \cdots & {a}_{p+1,p}& 1\end{array}\right)\right|,\phantom{\rule{1em}{0ex}}{\int}_{{\mathrm{\Delta}}_{j}}f(x)\mathit{\text{dx}}=\frac{\mu ({\mathrm{\Delta}}_{j})}{p+1}({c}_{1}+\cdots +{c}_{p+1}),$$

and

$${\mu}^{\ast}({\mathrm{\Delta}}_{j})=\frac{1}{p!}{\Vert det\left(\begin{array}{cccc}{\mathbf{i}}_{1}& \cdots & {\mathbf{i}}_{p}& {\mathbf{i}}_{p+1}\\ {a}_{2,1}-{a}_{1,1}& \cdots & {a}_{2,p}-{a}_{1,p}& {c}_{2}-{c}_{1}\\ & & \cdots & \\ {a}_{p+1,1}-{a}_{1,1}& \cdots & {a}_{p+1,p}-{a}_{1,p}& {c}_{p+1}-{c}_{1}\end{array}\right)\Vert}_{2}$$

where ‖ · ‖_{2} is the Euclidean norm and **i**_{*j*} (*j* = 1, …, *p* + 1) are the standard unit vectors in ℝ^{*p*+1}, so that *μ*^{∗}(Δ_{*j*}) is the volume of the simplex spanned by the values of *f* over Δ_{*j*}.
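The volume and integral formulas above can be checked in exact arithmetic. The sketch below implements the determinant formula for *μ*(Δ_{*j*}) and the identity ∫_{Δ_{*j*}} *f* d*x* = *μ*(Δ_{*j*})(*c*_{1} + ⋯ + *c*_{*p*+1})/(*p* + 1); the fraction-exact Gaussian-elimination determinant is an implementation choice, not part of the paper.

```python
from fractions import Fraction
from math import factorial

def det(m):
    """Determinant via fraction-exact Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in m]
    n, sign, d = len(m), 1, Fraction(1)
    for col in range(n):
        piv = next((r for r in range(col, n) if m[r][col] != 0), None)
        if piv is None:
            return Fraction(0)
        if piv != col:
            m[col], m[piv] = m[piv], m[col]
            sign = -sign
        d *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return sign * d

def simplex_volume(vertices):
    """mu(Delta) = |det[a_i | 1]| / p!  for p+1 vertices in R^p."""
    p = len(vertices) - 1
    return abs(det([list(v) + [1] for v in vertices])) / factorial(p)

def integral_linear(vertices, c):
    """Integral over Delta of f = sum c_i * phi_i:  mu/(p+1) * sum(c)."""
    return simplex_volume(vertices) * Fraction(sum(c), len(c))

# unit right triangle in the plane: volume 1/2; f == 2 integrates to 1
tri = [(0, 0), (1, 0), (0, 1)]
vol = simplex_volume(tri)
integ = integral_linear(tri, (2, 2, 2))
```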

**6 Examples**

We compare the running times and the estimation accuracies of the exact OPT and the LL-OPT algorithms. In our simulations, the root region Ω is the unit cube [0, 1]^{*p*} in ℝ^{*p*}.

**Example 6.1.** We consider a mixture distribution of two independent components over the unit square [0, 1] × [0, 1] [Wong and Ma, 2010, Example 8]. The first component is a uniform distribution over [0.78, 0.80] × [0.2, 0.8]. The second component has support [0.25, 0.4] × [0, 1] with *X* being uniform over [0.25, 0.4] and *Y* being Beta(100, 120). The density function is therefore

$$\frac{0.35}{0.012}\times {\mathbf{1}}_{[0.78,0.80]\times [0.2,0.8]}+\frac{0.65}{0.15}\times \frac{\mathrm{\Gamma}(220)}{\mathrm{\Gamma}(120)\mathrm{\Gamma}(100)}{y}^{99}{(1-y)}^{119}{\mathbf{1}}_{[0.25,0.4]\times [0,1]}.$$

(6.1)
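As a sanity check, (6.1) is a proper density: the uniform strip carries mass 0.35 and the Beta(100, 120) strip (exponents 99 and 119) carries mass 0.65. A midpoint-rule sketch over the unit square confirms this numerically; the grid size is an arbitrary choice, and `math.lgamma` keeps the Beta normalizer stable.

```python
import math

def density(x, y):
    """The two-component mixture density (6.1)."""
    val = 0.0
    if 0.78 <= x <= 0.80 and 0.2 <= y <= 0.8:
        val += 0.35 / 0.012                    # uniform component
    if 0.25 <= x <= 0.4:                       # Beta(100, 120) strip
        log_beta = (math.lgamma(220) - math.lgamma(120) - math.lgamma(100)
                    + 99 * math.log(y) + 119 * math.log(1 - y))
        val += 0.65 / 0.15 * math.exp(log_beta)
    return val

n = 400                                        # grid resolution (arbitrary)
h = 1.0 / n
total = h * h * sum(density((i + 0.5) * h, (j + 0.5) * h)
                    for i in range(n) for j in range(n))
```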

Figure 6 shows the partition plots based on the exact OPT and the LL-OPT algorithms with *h* = 1, …, 5, respectively, for a sample size of 1000. The plots show how the root region Ω is partitioned according to the estimated tree topologies. A good partition plot should visually resemble a contour plot of the distribution to be estimated. On these partition plots, the two components of the mixture are clearly marked out by the two vertical strips located on the left-half and the right-half of the square, respectively. Furthermore, the non-uniformity of the Beta distribution in the left component is also visible through the further division of the corresponding strip.

Comparing the partition plots between the exact OPT and the LL-OPT algorithms, the partition estimated by the LL-OPT algorithm, with *h* as small as 2, resembles closely the partition estimated by the exact OPT algorithm with only minor differences. As *h* increases, the resemblance becomes even stronger, which is expected as the two algorithms should lead to exactly the same result when *h* is greater than the maximum depth.

The average running times used by the exact OPT algorithm and the LL-OPT algorithm with different *h* are summarized in Table 1, and the average Hellinger distances (computed based on 100,000 importance samples) between the estimated and the true densities are summarized in Table 2. The Dirichlet process Gaussian mixture model (DP-GMM) [Escobar and West, 1995] is also included in the comparison. All the summary statistics are based on 5 replications and the corresponding standard deviations are included in parentheses.

Table 1. Running times (in seconds) for Example 6.1. Average of 5 replicates, standard deviation reported in parentheses. The DF-OPT and DP-GMM algorithms could not finish some of the large-sample simulations within reasonable time.

Table 2. Estimation errors (in Hellinger distance) for Example 6.1. Average of 5 replicates, standard deviation reported in parentheses. The DF-OPT and DP-GMM algorithms could not finish some of the large-sample simulations within reasonable time.

Based on Table 1, we find that the LL-OPT algorithm achieves substantial speedup compared with the exact OPT algorithm, especially when the sample size is large. The other algorithm for exact OPT inference, the DF-OPT algorithm, is computationally prohibitive for large sample size.

Based on Table 2, the accuracy of estimates of the LL-OPT algorithm is comparable to that of the exact OPT algorithm, with the value of *h* as low as 2, which is encouraging since the improvement on efficiency is sizable. For example, when *h* = 2, the LL-OPT algorithm is 15 times faster than the exact OPT algorithm for sample size of 10^{5}. This makes the LL-OPT algorithm appealing for large samples. Furthermore, the estimates of the LL-OPT algorithms with *h* ≥ 2 are almost identical, so the adaptive approach for selecting *h* would work well.

**Example 6.2.** We now consider a distribution on [0, 1]^{4} with a bivariate Normal distribution in the first two dimensions and uniform in the other two dimensions. In this scenario, *X* and *Y* components are independent of each other and follow Norm(0.6, 0.1^{2}) and Norm(0.4, 0.1^{2}) respectively [Wong and Ma, 2010, adapted from Example 9], and *Z* and *W* components are independent and uniform in [0, 1]. As the probability mass of this distribution is essentially all located in [0, 1] × [0, 1], the density function can be approximated by

$$\frac{1}{2\pi \times {0.1}^{2}}{e}^{-\frac{{(x-0.6)}^{2}+{(y-0.4)}^{2}}{2\times {0.1}^{2}}}.$$

(6.2)

The average running times and average Hellinger distances are summarized in Tables 3 and 4, respectively. All the summary statistics are based on 5 replications and the corresponding standard deviations are included in parentheses.

Table 3. Running times (in seconds) for Example 6.2. Average of 5 replicates, standard deviation reported in parentheses. The exact OPT algorithm could not finish the simulation with 10^{5} samples in under 128GB of memory.

Table 4. Estimation errors (in Hellinger distance) for Example 6.2. Average of 5 replicates, standard deviation reported in parentheses. The exact OPT algorithm could not finish the simulation with 10^{5} samples in under 128GB of memory.

Similar to the previous example, the estimation accuracy of the LL-OPT algorithm approaches that of the exact OPT algorithm as the value of *h* increases. Patterns of running times found in Table 3 are also similar to that of the previous example. According to Table 4, the estimation accuracy of the LL-OPT algorithm with *h* as low as 3 is comparable to that of the exact OPT algorithm. The improvements on running times are still sizable. When *h* = 3, the LL-OPT algorithm is over 40 times faster than the exact OPT algorithm for sample size of 10^{3}, and over 190 times faster for sample size of 10^{4}. Finally, estimation based on the LL-OPT algorithm stabilizes for *h* ≥ 3.

In both simulation examples, when the sample size is large (e.g., 10^{5}), we can see that the running time of the LL-OPT algorithm increases roughly by a factor of *p* when *h* increases by 1, which is consistent with the time complexity analysis provided in Theorem 4.3.

We compare the exact OPT, the LL-OPT and the FEE algorithms to several existing density estimation methods, including kernel density estimation (KDE) [Silverman, 1986], Dirichlet process Gaussian mixture models (DP-GMM) [Escobar and West, 1995] and mixtures of Pólya trees (MPT) [Lavine, 1992].

**Example 6.3.** We consider a strongly skewed distribution

$$x\sim K\,\mathrm{\Gamma}(2,0.1)\,{\mathbf{1}}_{[0,1]}$$

(6.3)

where $\mathrm{\Gamma}(a,b)=\frac{1}{{b}^{a}\mathrm{\Gamma}(a)}{x}^{a-1}{e}^{-x/b}$ denotes the Gamma(*a*, *b*) density, and $K=1/{\int}_{0}^{1}\mathrm{\Gamma}(2,0.1)\mathit{\text{dx}}$ is the normalizing constant.
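The normalizing constant *K* can be evaluated numerically; for shape *a* = 2 the truncated mass also has the closed form 1 − *e*^{−1/*b*}(1 + 1/*b*), which the quadrature should reproduce. The grid resolution below is an arbitrary choice.

```python
import math

def gamma_pdf(x, a=2.0, b=0.1):
    """Gamma(a, b) density (shape a, scale b), as defined in the text."""
    return x ** (a - 1) * math.exp(-x / b) / (b ** a * math.gamma(a))

# midpoint rule for the mass of Gamma(2, 0.1) on [0, 1]
n = 100000
h = 1.0 / n
mass = h * sum(gamma_pdf((i + 0.5) * h) for i in range(n))
K = 1.0 / mass

# closed form of the truncated mass for a = 2: 1 - e^(-1/b) * (1 + 1/b)
closed = 1.0 - math.exp(-10.0) * 11.0
```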

Figure 7 shows the estimated densities by the exact OPT, the LL-OPT and the FEE algorithms for sample size *N* = 10^{4}. Table 5 shows the estimation errors of the three methods as well as KDE, DP-GMM and MPT measured by Hellinger distance. FEE gives the best results among all the methods.

**Example 6.4.** We consider a mixture of a uniform distribution and Beta distributions.

$${X}_{1}\sim U(0,1);\quad {X}_{2}\sim \frac{4}{5}\beta (2,10)+\frac{1}{5}\beta (7,2)$$

(6.4)

Figure 8 (a) shows the densities estimated by the LL-OPT and the FEE algorithms. As a comparison, the kernel density estimate is shown in Figure 8 (b). We see that the uniformity in the first dimension of the distribution is well captured by the OPT-based approaches, but not by KDE.
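Drawing from the distribution in (6.4) amounts to sampling the mixture component first and then the corresponding Beta variate. A minimal sketch (names and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# X1 ~ U(0,1); X2 ~ (4/5) Beta(2,10) + (1/5) Beta(7,2), as in (6.4)
x1 = rng.uniform(size=n)
comp = rng.uniform(size=n) < 0.8                      # pick mixture component
x2 = np.where(comp, rng.beta(2, 10, n), rng.beta(7, 2, n))
X = np.column_stack([x1, x2])                          # n-by-2 sample
```

The first coordinate is exactly uniform, which is what the OPT-based estimates recover (a flat marginal) while KDE smooths it away near the boundaries.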

**Example 6.5.** We consider a 5-dimensional random vector

$$\mathbf{Y}=({X}_{1},\frac{1}{3}{X}_{1}+\frac{2}{3}{X}_{2},{X}_{3},{X}_{4},\frac{1}{5}{X}_{3}+\frac{4}{5}{X}_{5}),$$

where *X*_{1}, …, *X*_{5} are independent and sampled from

$${X}_{1}\sim \beta (2,8),\quad {X}_{2}\sim \beta (8,2),\quad {X}_{3}\sim U(0,1)$$

(6.5)

$${X}_{4}\sim {K}_{1}\mathrm{\Gamma}(2,1){\mathbf{1}}_{[0,1]},\quad {X}_{5}\sim {K}_{2}\mathrm{\Gamma}(1,2){\mathbf{1}}_{[0,1]}$$

(6.6)

where $\mathrm{\Gamma}(a,b)$ denotes the Gamma density $\mathrm{\Gamma}(a,b)(x)=\frac{1}{{b}^{a}\mathrm{\Gamma}(a)}{x}^{a-1}{e}^{-x/b}$, and ${K}_{1}=1/{\int}_{0}^{1}\mathrm{\Gamma}(2,1)(x)\,\mathit{\text{dx}}$ and ${K}_{2}=1/{\int}_{0}^{1}\mathrm{\Gamma}(1,2)(x)\,\mathit{\text{dx}}$ are the normalizing constants.

With a sample size of 6 × 10^{3}, the Hellinger distances for the LL-OPT and the FEE algorithms are 0.31 and 0.27, respectively. Figure 9 shows marginal histograms based on 10^{4} random samples drawn from the fitted FEE density and from the true density. We can see that the fitted density closely resembles the true density in every dimension.
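Generating data from this 5-dimensional example is straightforward once the truncated Gamma coordinates are sampled; the linear maps then induce dependence between dimensions 1–2 and 3–5. The sketch below is illustrative (helper names are ours, not the paper's):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 6_000

def trunc_gamma(a, b, size):
    """Inverse-CDF sample from Gamma(a, scale=b) truncated to [0, 1]."""
    g = stats.gamma(a, scale=b)
    u = rng.uniform(size=size)
    return g.ppf(u * g.cdf(1.0))

x1 = rng.beta(2, 8, n)
x2 = rng.beta(8, 2, n)
x3 = rng.uniform(size=n)
x4 = trunc_gamma(2.0, 1.0, n)   # K1 * Gamma(2,1) on [0,1]
x5 = trunc_gamma(1.0, 2.0, n)   # K2 * Gamma(1,2) on [0,1]

# Y = (X1, X1/3 + 2X2/3, X3, X4, X3/5 + 4X5/5): convex combinations of
# [0,1]-valued coordinates, so every dimension of Y stays in [0,1]
Y = np.column_stack([x1,
                     x1 / 3 + 2 * x2 / 3,
                     x3,
                     x4,
                     x3 / 5 + 4 * x5 / 5])
```

Because each mixed coordinate is a convex combination of independent [0, 1] variables, the support of **Y** remains the unit hypercube, which keeps the partition-based OPT machinery directly applicable.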

Classification is a classical application of density estimation and covers a wide range of real-world problems. We apply the LL-OPT approach to the MiniBooNE particle identification data set [Bache and Lichman, 2013], where a classifier is desired to distinguish electron neutrinos from muon neutrinos. There are a total of 130,065 data points (36,499 electron neutrinos and 93,565 muon neutrinos) in the data set, and each data point has 50 dimensions. We divide the dataset into two equal parts, using one part for training and the other for testing. Using the training data set, we apply LL-OPT to estimate the density for each class *C _{i}*, and classify test points according to the estimated class-conditional densities.
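The classification rule behind this setup is the standard Bayes rule: assign a point to the class maximizing the prior times the estimated class-conditional density. The sketch below illustrates it on synthetic two-class data, with a Gaussian KDE standing in for the per-class LL-OPT density estimate (the data, class setup, and estimator here are all illustrative assumptions, not the MiniBooNE experiment itself).

```python
import numpy as np
from scipy.stats import gaussian_kde

# Two synthetic 2-d classes; in the paper's experiment these would be the
# electron- and muon-neutrino training sets and LL-OPT density fits
rng = np.random.default_rng(0)
train0 = rng.normal(0.0, 1.0, (500, 2))      # class 0 training sample
train1 = rng.normal(2.5, 1.0, (500, 2))      # class 1 training sample

f0 = gaussian_kde(train0.T)                   # density estimate per class
f1 = gaussian_kde(train1.T)
prior0 = prior1 = 0.5                         # class priors (training fractions)

def classify(points):
    # Bayes rule: pick the class with the larger prior_i * f_i(x)
    return (prior1 * f1(points.T) > prior0 * f0(points.T)).astype(int)

test = np.vstack([rng.normal(0.0, 1.0, (200, 2)),
                  rng.normal(2.5, 1.0, (200, 2))])
labels = np.repeat([0, 1], 200)
acc = (classify(test) == labels).mean()       # held-out accuracy
```

Swapping the KDE for an LL-OPT fit changes only the density-evaluation step; the decision rule is unchanged.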

We provide two software packages implementing the algorithms described in this paper, available at http://www.stanford.edu/group/wonglab/opt_comp/. The **fast-opt** package implements both the exact OPT and the LL-OPT algorithms described in Sections 2 and 3. The **smooth-opt** package implements the smoothing algorithm described in Section 5. Both packages are implemented in C++.

The **smooth-opt** package requires the user to specify the value of the tuning parameter *λ*. An alternative is to choose *λ* by cross-validation, which requires much more computation. In our simulations we found that the result is not sensitive to the value of *λ*, and we use *λ* = 10^{−3} or 10^{−4} throughout.
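For readers who do want to choose the tuning parameter by cross-validation, the generic K-fold recipe can be sketched as follows. Here a KDE bandwidth stands in for *λ* and the score is held-out log-likelihood; the grid, data, and estimator are illustrative assumptions, since running this against **smooth-opt** itself would require calling the C++ package.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Generic K-fold cross-validation for a smoothing parameter (KDE bandwidth
# factor as a stand-in for lambda); data and grid are illustrative
rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, 400)
grid = [0.05, 0.1, 0.2, 0.5, 1.0]            # candidate tuning values
K = 5
folds = np.array_split(rng.permutation(len(data)), K)

def cv_score(bw):
    # Mean held-out log-likelihood across folds (higher is better)
    scores = []
    for k in range(K):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        kde = gaussian_kde(data[train_idx], bw_method=bw)
        scores.append(np.log(kde(data[test_idx])).mean())
    return float(np.mean(scores))

best = max(grid, key=cv_score)               # cross-validated choice
```

The insensitivity to *λ* reported above suggests a coarse grid like this one is enough in practice.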

In our experiments, KDE was based on the R package **ks** (function **kde**), DP-GMM on the R package **bayesm** (function **rDPGibbs**), MPT on the R package **DPpackage** (function **PTdensity**) and SVM on the R package **e1071** (function **svm**). For all parameters we used default values whenever available, or otherwise the values suggested by the user manuals.

High dimensional density estimation is a challenging problem. OPT is an adaptive nonparametric density estimation approach that fits the global landscape of the data adaptively, rather than locally as in kernel density estimation [Silverman, 1986], as demonstrated in our simulations. The simulations also suggest that our continuous piecewise linear FEE further improves upon its piecewise constant counterpart. The proposed FEE is a penalized regression approach. Based on a fast quadratic programming solver, it achieves decent efficiency even when the sample size is reasonably large (10^{5} data points). The resulting estimated density function is continuous piecewise linear, which is usually sufficient for exploring how the probability mass is distributed or for visualizing the estimated density. When necessary, the FEE can be extended to use higher order basis functions, at the cost of heavier computation.

This work was supported by NSF grants DMS-0906044, DMS-1407557, DMS-1330132 and NIH grant R01-HG006018. We thank IBM^{®} for providing an academic license for CPLEX^{®}.

Supplementary Materials: Appendix. Proofs of theorems and propositions. (supplementary.pdf)

Hui Jiang, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109.

John Chong Mu, Department of Electrical Engineering, Stanford University, Stanford, CA 94305.

Kun Yang, Institute for Computational and Mathematical Engineering, Stanford University, Stanford, CA 94305.

Chao Du, Department of Statistics, University of Virginia, Charlottesville, VA 22904.

Luo Lu, Twitter Inc., San Francisco, CA 94103.

Wing Hung Wong, Department of Statistics and Health Research and Policy, Stanford University, Stanford, CA 94305.

- Allaire G. Numerical Analysis and Optimization. Oxford Science Publications; 2007.
- Bache K, Lichman M. UCI Machine Learning Repository. 2013. URL http://archive.ics.uci.edu/ml.
- Barron AR, Gyorfi L, van der Meulen EC. Distribution estimation consistent in total variation and in two types of information divergence. IEEE Trans Inf Theor. 1992;38(5):1437–1454.
- Denison DG, Mallick BK, Smith AF. A Bayesian CART algorithm. Biometrika. 1998;85(2):363–377.
- Escobar MD, West M. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association. 1995;90(430):577–588.
- Ferguson TS. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics. 1973;1(2):209–230.
- Ferguson TS. Prior Distributions on Spaces of Probability Measures. The Annals of Statistics. 1974;2(4):615–629.
- Freedman DA. On the asymptotic behavior of Bayes' estimates in the discrete case. The Annals of Mathematical Statistics. 1963:1386–1403.
- Ghosh JK, Ramamoorthi R. Bayesian nonparametrics. Vol. 1. Springer; 2003.
- Hutter M. Exact non-parametric Bayesian inference on infinite trees. arXiv preprint arXiv:0903.5342. 2009.
- Kraft CH. A class of distribution function processes which have derivatives. Journal of Applied Probability. 1964;1(2):385–388.
- Lavine M. Some aspects of Polya tree distributions for statistical modelling. The Annals of Statistics. 1992:1222–1235.
- Lavine M. More aspects of Polya tree distributions for statistical modelling. The Annals of Statistics. 1994:1161–1176.
- Lo AY. On a Class of Bayesian Nonparametric Estimates: I. Density Estimates. The Annals of Statistics. 1984;12(1):351–357.
- Ma L, Wong WH. Coupling optional pólya trees and the two sample problem. Journal of the American Statistical Association. 2011;106(496):1553–1565.
- Mauldin RD, Sudderth WD, Williams S. Polya trees and random distributions. The Annals of Statistics. 1992:1203–1221.
- Parzen E. On Estimation of a Probability Density Function and Mode. The Annals of Mathematical Statistics. 1962;33(3):1065–1076.
- Rosenblatt M. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics. 1956;27(3):832–837.
- Scott DW. On optimal and data-based histograms. Biometrika. 1979;66(3):605–610.
- Silverman B. Density Estimation for Statistics and Data Analysis. Monographs on Statistics and Applied Probability. Chapman and Hall; 1986.
- Wong WH, Ma L. Optional Pólya tree and Bayesian inference. Ann Stat. 2010;38(3):1433–1459.
