
Mach Learn. Author manuscript; available in PMC 2017 September 14.


PMCID: PMC5599182

NIHMSID: NIHMS902500

Hemant Ishwaran, University of Miami.

The effect of a splitting rule on random forests (RF) is systematically studied for regression and classification problems. A class of weighted splitting rules, which includes as special cases CART weighted variance splitting and Gini index splitting, is studied in detail and shown to possess a unique adaptive property to signal and noise. We show that for noisy variables, weighted splitting favors end-cut splits. While end-cut splits have traditionally been viewed as undesirable for single trees, we argue that for deeply grown trees (a trademark of RF) end-cut splitting is useful because: (a) it maximizes the tree node sample size, making it possible for a tree to recover from a bad split, and (b) if a branch repeatedly splits on noise, the tree minimal node size will be reached, which promotes termination of the bad branch. For strong variables, weighted variance splitting is shown to possess the desirable property of splitting at points of curvature of the underlying target function. This adaptivity to both noise and signal does not hold for unweighted and heavy weighted splitting rules. These latter rules are either too greedy, making them poor at recognizing noisy scenarios, or they have too aggressive an end-cut preference, making them poor at recognizing signal. These results also shed light on pure random splitting and show that such rules are the least effective. On the other hand, because randomized rules are computationally efficient, we introduce a hybrid method employing random split-point selection which retains the adaptive property of weighted splitting rules while remaining computationally efficient.

One of the most successful ensemble learners is random forests (RF), a method introduced by Leo Breiman (Breiman, 2001). In RF, the base learner is a binary tree constructed using the methodology of CART (Classification and Regression Trees) (Breiman et al., 1984), a procedure in which binary splits recursively partition the tree into homogeneous or near-homogeneous terminal nodes (the ends of the tree). A good binary split partitions data from the parent tree-node into two daughter nodes so that the homogeneity of the daughter nodes is improved relative to the parent node. A collection of *ntree* > 1 trees is grown, with each tree grown independently on a bootstrap sample of the original data. The terminal nodes of the tree contain the predicted values, which are tree-aggregated to obtain the forest predictor. For example, in classification, each tree casts a vote for the class and the majority vote determines the predicted class label.
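The vote-aggregation step described above is simple to sketch. The following fragment is illustrative only; the function name `forest_classify` and the vote-matrix layout are our own conventions, not from the paper:

```python
import numpy as np

def forest_classify(votes):
    """Aggregate per-tree class votes by majority rule.

    votes: (ntree, n) integer array; votes[b, i] is tree b's
    predicted class label for observation i.
    """
    votes = np.asarray(votes)
    ntree, n = votes.shape
    pred = np.empty(n, dtype=int)
    for i in range(n):
        # Count the votes each class received for observation i
        counts = np.bincount(votes[:, i])
        pred[i] = counts.argmax()
    return pred
```

For regression, the analogous aggregation replaces the majority vote with the average of the per-tree predictions.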

RF trees differ from CART trees in that they are grown nondeterministically using a two-stage randomization procedure. In addition to the randomization introduced by growing the tree on a bootstrap sample, a second layer of randomization is introduced through random feature selection. Rather than splitting a tree node using all *p* variables (features), RF selects, at each node of each tree, a random subset of 1 ≤ *mtry* ≤ *p* variables used to split the node, where typically *mtry* is substantially smaller than *p*. The purpose of this two-step randomization is to decorrelate trees and reduce variance. RF trees are grown deeply, which reduces bias. Indeed, Breiman's original proposal called for splitting to purity in classification problems. In general, an RF tree is grown as deeply as possible under the constraint that each terminal node must contain no fewer than *nodesize* ≥ 1 cases.

The splitting rule is a central component of CART methodology and crucial to the performance of a tree. However, it is widely believed that ensembles such as RF, which aggregate trees, are far more robust to the splitting rule used. Unlike for single trees, it is also generally believed that randomizing the splitting rule can improve performance for ensembles. These views are reflected by the large literature on hybrid splitting rules employing random split-point selection. For example, Dietterich (2000) considered bagged trees where the split-point for a variable is randomly selected from the top 20 split-points based on CART splitting. Perfect random trees for ensemble classification (Cutler and Zhao, 2001) randomly choose a variable and then choose its split-point by randomly selecting a value between the observed values of two randomly chosen points from different classes. Ishwaran et al. (2008, 2010) considered a partially randomized splitting rule for survival forests: a fixed number of randomly selected split-points are chosen for each variable and the top split-point based on a survival splitting rule is selected. Related work includes Geurts et al. (2006), who investigated extremely randomized trees, in which a single random split-point is chosen for each variable and the top split-point is selected.

The most extreme case of randomization is pure random splitting, in which both the variable and the split-point for the node are selected entirely at random. Large sample consistency results provide some rationale for this approach. Biau, Devroye, and Lugosi (2008) proved Bayes-risk consistency for RF classification under pure random splitting. These results make use of the fact that partitioning classifiers such as trees approximate the true classification rule if the partition regions (terminal nodes) accumulate enough data. Sufficient accumulation of data is possible even when partition regions are constructed independently of the observed class label. Under random splitting, it is sufficient if the number of splits *k _{n}* used to grow the tree satisfies *k _{n}* → ∞ and *k _{n}*/*n* → 0.

At the same time, forests grown under CART splitting rules have been shown to have excellent performance in a wide variety of applied settings, suggesting that adaptive splitting must have benefits. Theoretical results support these findings. Lin and Jeon (2006) considered mean-squared error rates of estimation in nonparametric regression for forests constructed under pure random splitting. It was shown that the rate of convergence cannot be faster than *M*^{−1}(log *n*)^{−(*p*−1)} (*M* equals *nodesize*), which is substantially slower than the optimal rate *n*^{−2*q*/(2*q*+*p*)} [*q* is a measure of smoothness of the underlying regression function; Stone (1980)]. Additionally, while Biau (2012) proved consistency for non-adaptive RF models, it was shown that successful forest applications in high-dimensional sparse settings require data adaptive splitting. When the variable used to split a node is selected adaptively, with strong variables (true signal) having a higher likelihood of selection than noisy variables (no signal), the rate of convergence can be made to depend only on the number of strong variables, and not on the dimension *p*. The following definition of strong and noisy variables will be used throughout the manuscript [the definition is related to the concept of a “relevant” variable discussed in Kohavi and John (1997)].

*If* **X** *is the p-dimensional feature and Y is the outcome, we call a variable X noisy if the conditional distribution of Y given* **X** *does not depend upon X. Otherwise, X is called strong. Thus, strong variables are distributionally related to the outcome but noisy variables are not.*

In this paper we formally study the effect of splitting rules on RF in regression and classification problems (Sections 2 and 3). We study a class of weighted splitting rules which includes as special cases CART weighted variance splitting and Gini index splitting. Such splitting rules possess an *end-cut preference* (ECP) splitting property (Morgan and Messenger, 1973; Breiman et al., 1984) which is the property of favoring splits near the edge for noisy variables (see Theorem 4 for a formal statement). The ECP property has generally been considered an undesirable property for a splitting rule. For example, according to Breiman et al. (Chapter 11.8; 1984), the delta splitting rule used by THAID (Morgan and Messenger, 1973) was introduced primarily to suppress ECP splitting.

Our results, however, suggest that ECP splitting is very desirable for RF. The ECP property ensures that if the ensuing split is on a noisy variable, the split will be near the edge, thus maximizing the tree node sample size and making it possible for the tree to recover from the split downstream. Even for a split on a strong variable, it is possible to be in a region of the space where there is near zero signal, and thus an ECP split is of benefit in this case as well. Such benefits are realized only if the tree is grown deep enough—but deep trees are a trademark of RF. Another aspect of RF making it compatible with the ECP property is random feature selection. When *p* is large, or if *mtry* is small relative to *p*, it is often the case that many or all of the candidate variables will be noisy, thus making splits on noisy variables very likely and ECP splits useful. Another benefit occurs when a tree branch repeatedly splits on noise variables, for example if the node corresponds to a region in the feature space where the target function is flat. When this happens, ECP splits encourage the tree minimal node size to be reached rapidly and the branch terminates as desired.

While the ECP property is important for handling noisy variables, a splitting rule should also be adaptive to signal. We show that weighted splitting exhibits such adaptivity. We derive the optimal split-point for weighted variance splitting (Theorem 1) and Gini index splitting (Theorem 8) under an infinite sample paradigm. We prove the population split-point is the limit of the empirical split-point (Theorem 2) which provides a powerful theoretical tool for understanding the split-rule [this technique of studying splits under the true split function has been used elsewhere; for example Buhlmann and Yu (2002) looked at splitting for stumpy decision trees in the context of subagging]. Our analysis reveals that weighted variance splitting encourages splits at points of curvature of the underlying target function (Theorem 3) corresponding to singularity points of the population optimizing function. Weighted variance splitting is therefore adaptive to both signal and noise. This appears to be a unique property. To show this, we contrast the behavior of weighted splitting to the class of unweighted and heavy weighted splitting rules and show that the latter do not possess the same adaptivity. They are either too greedy and lack the ECP property (Theorem 7), making them poor at recognizing noisy variables, or they have too strong an ECP property, making them poor at identifying strong variables. These results also shed light on pure random splitting and show that such rules are the least desirable. Randomized adaptive splitting rules are investigated in Section 4. We show that certain forms of randomization (Theorem 10) are able to preserve the useful properties of a splitting rule while significantly reducing computational effort.

As a motivating example, *n* = 1000 observations were simulated from a two-class problem in which the decision boundary was oriented obliquely to the coordinate axes of the features. In total *p* = 5 variables were simulated: the first two were defined to be the strong variables defining the decision boundary; the remaining three were noise variables. All variables were simulated independently from a standard normal distribution. The first row of panels in Figure 1 displays the decision boundary for the data under different splitting rules for a classification tree grown to purity. The boundary is shown as a function of the two strong variables. The first panel was grown under pure random splitting (i.e., the split-point and variable used to split a node were selected entirely at random), the remaining panels used unweighted, heavy weighted and weighted Gini index splitting, respectively (to be defined later). We observe random splitting leads to a heavily fragmented decision boundary, and that while unweighted and heavy weighted splitting perform better, unweighted splitting is still fragmented along horizontal and vertical directions, while heavy weighted splitting is fragmented along its boundary.

Synthetic two-class problem where the true decision boundary is oriented obliquely to the coordinate axes for the first two features (*p* = 5). Top panel is the decision boundary for a single tree with nodesize = 1 grown under pure random splitting, unweighted, **...**

The latter boundaries occur because (as will be demonstrated) unweighted splitting possesses the strongest ECP property, which yields deep trees, but its relative insensitivity to signal yields a noisy boundary. Heavy weighted splitting does not possess the ECP property, and this reduces overfitting because its trees are shallower, but its boundary is imprecise because it also has a limited ability to identify strong variables. The best performing tree is the one grown under weighted splitting. However, all decision boundaries, including that of weighted splitting, suffer from high variability—a well known deficiency of deep trees. In contrast, consider the lower row, which displays the decision boundary for a forest of 1000 trees grown using the same splitting rule as the panel above it. There is a noticeable improvement in each case; however, notice how forest boundaries mirror those found with single trees: pure random split forests yield the most fragmented decision boundary, unweighted and heavy weighted forests are better, while the weighted splitting forest performs best.

This demonstrates, among other things, that while forests are superior to single trees, they share the common property that their decision boundaries depend strongly on the splitting rule. Notable is the superior performance of weighted splitting, and in light of this we suggest two reasons why its ECP property has been under-appreciated in the CART literature. One explanation is the potential benefit of end-cut splits requires deep trees applied to complex decision boundaries—but deep trees are rarely used in CART analyses due to their instability. A related explanation is that ECP splits can prematurely terminate tree splitting when *nodesize* is large: a typical setting used by CART. Thus, we believe the practice of using shallow trees to mitigate excess variance explains the lack of appreciation for the ECP property. See Torgo (2001) who discussed benefits of ECP splits and studied ECP performance in regression trees.

We begin by considering the effect of splitting in regression settings. We assume the learning (training) data is ℒ = {(**X**_{1}, *Y*_{1}), …, (**X**_{n}, *Y*_{n})}, where the responses follow the regression model

$${Y}_{i}=f({\mathbf{X}}_{i})+{\epsilon}_{i},\quad \text{for}\phantom{\rule{0.16667em}{0ex}}i=1,\ldots,n,$$

(1)

where *f* : ℝ^{p} → ℝ is an unknown function and (*ε*_{i}) are independent errors with mean zero.

In CART methodology a tree is grown by recursively reducing impurity. To accomplish this, each parent node is split into daughter nodes using the variable and split-point yielding the greatest decrease in impurity. The optimal split-point is obtained by optimizing the CART splitting rule. But how does the optimized split-point depend on the underlying regression function *f*? What are its properties when *f* is flat, linear, or wiggly? Understanding how the split-point depends on *f* will give insight into how splitting affects RF.

Consider splitting a regression tree *T* at a node *t*. Let *s* be a proposed split for a variable *X* that splits *t* into left and right daughter nodes *t _{L}* and *t _{R}*, according to whether *X* ≤ *s* or *X* > *s*. Define the impurity of *t* as the sample variance

$$\widehat{\mathrm{\Delta}}(t)=\frac{1}{N}\sum _{{\mathbf{X}}_{i}\in t}{({Y}_{i}-{\overline{Y}}_{t})}^{2},$$

where *Ȳ*_{t} is the sample mean for *t* and *N* is the number of cases in *t*. Similarly, define the impurities of the daughter nodes as

$$\widehat{\mathrm{\Delta}}({t}_{L})=\frac{1}{{N}_{L}}\sum _{i\in {t}_{L}}{({Y}_{i}-{\overline{Y}}_{{t}_{L}})}^{2},\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\widehat{\mathrm{\Delta}}({t}_{R})=\frac{1}{{N}_{R}}\sum _{i\in {t}_{R}}{({Y}_{i}-{\overline{Y}}_{{t}_{R}})}^{2},$$

where *Ȳ*_{tL} and *Ȳ*_{tR} are the sample means for *t _{L}* and *t _{R}*, and *N _{L}* and *N _{R}* are their sample sizes. The decrease in impurity resulting from the split is

$$\widehat{\mathrm{\Delta}}(s,t)=\widehat{\mathrm{\Delta}}(t)-\left[\widehat{p}({t}_{L})\widehat{\mathrm{\Delta}}({t}_{L})+\widehat{p}({t}_{R})\widehat{\mathrm{\Delta}}({t}_{R})\right],$$

where *p̂*(*t _{L}*) = *N _{L}*/*N* and *p̂*(*t _{R}*) = *N _{R}*/*N* are the proportions of cases in the left and right daughter nodes.

Throughout we will define left and right daughter nodes in terms of splits of the form *X* ≤ *s* and *X* > *s*, which assumes a continuous *X* variable. In general, splits can be defined for categorical *X* by moving data points left and right using the complementary pairings of the factor levels of *X* (if there are *L* distinct labels, there are 2^{L−1} − 1 distinct complementary pairs). However, for notational convenience we will always talk about splits for continuous *X*, but our results naturally extend to factors.
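The 2^{L−1} − 1 count can be checked by enumerating the complementary pairings directly. The helper `complementary_splits` below is a hypothetical illustration, not part of any RF package:

```python
from itertools import combinations

def complementary_splits(levels):
    """Enumerate the 2^(L-1) - 1 distinct left/right pairings of factor levels."""
    levels = list(levels)
    anchor = levels[0]            # fix one label to avoid mirror-image duplicates
    rest = levels[1:]
    splits = []
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {anchor, *combo}
            right = set(levels) - left
            if right:             # both daughter nodes must be nonempty
                splits.append((left, right))
    return splits
```

For example, three labels yield 2² − 1 = 3 splits and four labels yield 2³ − 1 = 7.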

The tree *T* is grown by finding the split-point *s* that maximizes Δ̂(*s, t*) (Chapter 8.4; Breiman et al., 1984). We denote the optimized split-point by *ŝ _{N}*. Maximizing Δ̂(*s, t*) is equivalent to minimizing

$$\widehat{D}(s,t)=\widehat{p}({t}_{L})\widehat{\mathrm{\Delta}}({t}_{L})+\widehat{p}({t}_{R})\widehat{\mathrm{\Delta}}({t}_{R}).$$

(2)

In other words, CART seeks the split-point *ŝ _{N}* that minimizes the weighted sample variance. We refer to (2) as the weighted variance splitting rule.
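A minimal sketch of optimizing (2) by exhaustive scan over the order statistics of *X* follows. The function name `best_split_weighted` is our own, and real implementations maintain running sums so each candidate split is evaluated in constant time; this version recomputes the variances for clarity:

```python
import numpy as np

def best_split_weighted(x, y):
    """Find the split point minimizing the weighted variance criterion (2):
    p(tL) * var(tL) + p(tR) * var(tR), with 1/N-normalized (biased) variances
    as in the CART definition. Returns s_hat, placed midway between the
    consecutive order statistics of x that straddle the best split."""
    order = np.argsort(x)
    x, y = np.asarray(x)[order], np.asarray(y)[order]
    n = len(y)
    best_s, best_val = None, np.inf
    for m in range(1, n):               # left daughter takes y[:m]
        if x[m] == x[m - 1]:
            continue                    # tied x values: not a valid split point
        yl, yr = y[:m], y[m:]
        val = m / n * yl.var() + (n - m) / n * yr.var()
        if val < best_val:
            best_val = val
            best_s = (x[m - 1] + x[m]) / 2
    return best_s
```

For instance, with responses that jump between the second and third order statistics, the scan places the split exactly between them.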

To theoretically study *ŝ _{N}*, we replace Δ̂(*s, t*) with its population analog

$$\mathrm{\Delta}(s,t)=\mathrm{\Delta}(t)-\left[p({t}_{L})\mathrm{\Delta}({t}_{L})+p({t}_{R})\mathrm{\Delta}({t}_{R})\right],$$

where Δ(*t*) is the conditional population variance

$$\mathrm{\Delta}(t)=\text{Var}\phantom{\rule{0.16667em}{0ex}}(Y\mid \mathbf{X}\in t),$$

and Δ(*t _{L}*) and Δ(*t _{R}*) are the conditional variances for the daughter nodes

$$\mathrm{\Delta}({t}_{L})=\text{Var}\phantom{\rule{0.16667em}{0ex}}(Y\mid X\le s,\mathbf{X}\in t),\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\mathrm{\Delta}({t}_{R})=\text{Var}\phantom{\rule{0.16667em}{0ex}}(Y\mid X>s,\mathbf{X}\in t),$$

and *p*(*t _{L}*) and *p*(*t _{R}*) are the conditional probabilities that a case falls in the left and right daughter nodes:

$$p({t}_{L})=\mathbb{P}\{X\le s\mid \mathbf{X}\in t\},\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}p({t}_{R})=\mathbb{P}\{X>s\mid \mathbf{X}\in t\}.$$

One can think of Δ(*s, t*) as the tree splitting rule under an infinite sample setting. We optimize the infinite sample splitting criterion in lieu of the data optimized one (2). Shortly we describe conditions showing that this solution corresponds to the limit of *ŝ _{N}*. The population analog to (2) is

$$D(s,t)=p({t}_{L})\mathrm{\Delta}({t}_{L})+p({t}_{R})\mathrm{\Delta}({t}_{R}).$$

(3)

Interestingly, there is a solution to (3) for the one-dimensional case (*p* = 1). We state this formally in the following result.

*Let* ℙ_{t} *denote the conditional distribution for X given that* **X** ∈ *t* = [*a, b*]*. Let* ℙ_{tL}(·) *and* ℙ_{tR}(·) *denote the conditional distributions of X given that* **X** ∈ *t _{L} and* **X** ∈ *t _{R}, respectively. Define*

$${\mathrm{\Psi}}_{t}(s)={\mathbb{P}}_{t}\{X\le s\}\phantom{\rule{0.16667em}{0ex}}{\left({\int}_{a}^{s}f(x){\mathbb{P}}_{{t}_{L}}(dx)\right)}^{2}+{\mathbb{P}}_{t}\{X>s\}\phantom{\rule{0.16667em}{0ex}}{\left({\int}_{s}^{b}f(x){\mathbb{P}}_{{t}_{R}}(dx)\right)}^{2}.$$

(4)

*If f*(*s*) *is continuous over t and* ℙ_{t} *has a continuous and positive density over t with respect to Lebesgue measure, then the maximizer of (4) is a solution to*

$$2f(s)={\int}_{a}^{s}f(x)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{{t}_{L}}(dx)+{\int}_{s}^{b}f(x)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{{t}_{R}}(dx).$$

(5)

*This solution is not always unique and is permissible only if a* ≤ *s* ≤ *b*.

In order to justify our infinite sample approach, we now state sufficient conditions for *ŝ _{N}* to converge to the population split-point. However, because the population split-point may not be unique or even permissible according to Theorem 1, we need to impose conditions to ensure a well defined solution. We shall assume that Ψ_{t} has a unique maximizer over [*a, b*].

Notice in the following result we have removed the requirement that *f* is continuous and replaced it with the lighter condition of square-integrability. Additionally, we only require that ℙ_{t} satisfies a positivity condition over its support.

*Assume that f is square-integrable with respect to* ℙ_{t}*, that* ℙ_{t} *satisfies a positivity condition over its support, and that* Ψ_{t} *has a unique maximizer over* [*a, b*]*. Then*

$${\widehat{s}}_{N}\stackrel{\mathrm{p}}{\to}{s}_{\infty}=\underset{a\le s\le b}{\text{argmax}}{\mathrm{\Psi}}_{t}(s).$$

*Note that s*_{∞}
*is unique.*

In this section, we look at Theorems 1 and 2 in detail by focusing on the class of polynomial functions. Implications of these findings to other types of functions are explored in Section 2.3. We begin by noting that an explicit solution to (5) exists when *f* is polynomial if *X* is assumed to be uniform.

*Suppose that* $f(x)={c}_{0}+{\sum}_{j=1}^{q}{c}_{j}{x}^{j}$. *If* ℙ_{t} *is the uniform distribution on t* = [*a, b*]*, then the solution to (5) satisfies the polynomial equation*

$$\sum _{j=0}^{q}({U}_{j}+{V}_{j}-2{c}_{j})\phantom{\rule{0.16667em}{0ex}}{s}^{j}=0,$$

(6)

*where* $U_j={\sum}_{k=j}^{q}\frac{{c}_{k}{a}^{k-j}}{k+1}$ *and* $V_j={\sum}_{k=j}^{q}\frac{{c}_{k}{b}^{k-j}}{k+1}$*. Moreover,*

$${\mathrm{\Psi}}_{t}(s)=\frac{1}{(b-a)(s-a)}{\left(\sum _{j=0}^{q}\frac{{c}_{j}}{j+1}({s}^{j+1}-{a}^{j+1})\right)}^{2}+\frac{1}{(b-a)(b-s)}{\left(\sum _{j=0}^{q}\frac{{c}_{j}}{j+1}({b}^{j+1}-{s}^{j+1})\right)}^{2}.$$

(7)

As a first illustration, suppose that *f*(*x*) = *c*_{0} + *c*_{1}*x* for *x* ∈ [*a, b*]. Then *U*_{0} = *c*_{0} + *ac*_{1}/2, *V*_{0} = *c*_{0} + *bc*_{1}/2 and *U*_{1} = *V*_{1} = *c*_{1}/2. Hence (6) becomes

$$\frac{{c}_{1}}{2}(a+b)-{c}_{1}s=0.$$

If *c*_{1} ≠ 0, then *s* = (*a* + *b*)/2; which is a permissible solution. Therefore for simple slope-intercept functions, node-splits are always at the midpoint.

Now consider a more complicated polynomial, *f*(*x*) = 2*x*^{3} − 2*x*^{2} − *x* for *x* ∈ [−3, 3]. We numerically solve (6) and (7). The solutions are displayed recursively in Figure 2. The first panel shows the optimal split over the root node [−3, 3]. There is one distinct solution, *s* = −1.924. The second panel shows the optimal splits over the daughters arising from the first panel. The third panel shows the optimal splits arising from the second panel, and so forth.

Theoretical split-points for *X* under weighted variance splitting (displayed using vertical red lines) for *f*(*x*) = 2*x*^{3} − 2*x*^{2} − *x* (in blue) assuming a uniform [−3, 3] distribution for *X*.

The derivative of *f* is *f*′(*x*) = 6*x*^{2} − 4*x* − 1. Inspection of the derivative shows that *f* is increasing most rapidly for −3 ≤ *x* ≤ −2, followed by 2 ≤ *x* ≤ 3, and then −2 < *x* < 2. The order of splits in Figure 2 follows this pattern, showing that node splitting tracks the curvature of *f*, with splits occurring first in regions where *f* is steepest, and last in places where *f* is flattest.
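The root-node split above can be reproduced by maximizing the closed form (7) on a fine grid. The helpers `psi_uniform` and `argmax_psi` below are our own sketch, assuming a polynomial *f* and uniform *X* on *t* = [*a, b*]:

```python
import numpy as np

def psi_uniform(s, coefs, a, b):
    """Criterion (7): f is a polynomial with coefficients coefs = [c_0, ..., c_q]
    and X is uniform on t = [a, b]."""
    c = np.asarray(coefs, dtype=float)
    j = np.arange(len(c))
    F = lambda x: np.sum(c / (j + 1) * x ** (j + 1))   # antiderivative of f
    left = (F(s) - F(a)) ** 2 / ((b - a) * (s - a))
    right = (F(b) - F(s)) ** 2 / ((b - a) * (b - s))
    return left + right

def argmax_psi(coefs, a, b, grid=6001):
    """Grid search for the maximizer of (7) over the interior of [a, b]."""
    s = np.linspace(a, b, grid)[1:-1]                  # interior candidates only
    vals = [psi_uniform(si, coefs, a, b) for si in s]
    return s[int(np.argmax(vals))]
```

For *f*(*x*) = 2*x*³ − 2*x*² − *x* on [−3, 3] (coefficients [0, −1, −2, 2]) the grid search lands near *s* = −1.924, and for a linear *f* it lands at the midpoint, as in Example 1.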

Our examples have assumed a one-dimensional (*p* = 1) scenario. To test how well our results extrapolate to higher dimensions we modified Example 2 as follows. We simulated *n* = 1000 values from

$${Y}_{i}=f({X}_{i})+{C}_{1}\sum _{k=1}^{d}{U}_{i,k}+{C}_{2}\sum _{k=d+1}^{D}{U}_{i,k}+{\epsilon}_{i},\quad i=1,\ldots,n,$$

(8)

using *f* as in Example 2, where (*ε _{i}*) are independent mean-zero errors and the *U _{i,k}* are additional feature variables, the first *d* entering the model with coefficient *C*_{1} and the remainder with coefficient *C*_{2}.

The near-exactness of the split-points of Figures 2 and 3 is a direct consequence of Theorem 2. To see why, note that with some rearrangement, (7) becomes

$${\mathrm{\Psi}}_{t}(s)=(s-a)\phantom{\rule{0.16667em}{0ex}}{\left(\sum _{j=0}^{q}{A}_{j}{s}^{j}\right)}^{2}+(b-s)\phantom{\rule{0.16667em}{0ex}}{\left(\sum _{j=0}^{q}{B}_{j}{s}^{j}\right)}^{2},$$

where *A _{j}* and *B _{j}* are constants depending on *a*, *b* and the coefficients of *f*.

To further amplify this point, Figure 4 illustrates how Ψ_{t′}(*s*) depends on *t*′ for the *f*(*x*) of Example 2. The first subpanel displays Ψ_{t}(*s*) over the root node.

The contiguous regions in Figure 4 (panels 3, 4 and 5) were chosen to match the stationary points of Ψ_{t} (see panel 2). Stationary points identify points of inflection and maxima of Ψ_{t}.

We now argue in general, regardless of whether *f* is a polynomial, that the maximum of Ψ_{t} depends heavily on the curvature of *f*. Suppose that ℙ_{t} is a discrete distribution placing equal mass on *K* points:

$${\mathbb{P}}_{t}\{X={\alpha}_{k}\}=\frac{1}{K},\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\text{where}\phantom{\rule{0.16667em}{0ex}}a\le {\alpha}_{1}<{\alpha}_{2}<\cdots <{\alpha}_{K}\le b.$$

It follows (this expression holds for all *f*):

$${\mathrm{\Psi}}_{t}(s)=\frac{1}{K{\sum}_{{\alpha}_{k}\le s}\phantom{\rule{0.16667em}{0ex}}}{\left(\sum _{{\alpha}_{k}\le s}f({\alpha}_{k})\right)}^{2}+\frac{1}{K{\sum}_{{\alpha}_{k}>s}\phantom{\rule{0.16667em}{0ex}}}{\left(\sum _{{\alpha}_{k}>s}f({\alpha}_{k})\right)}^{2},\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\text{where}\phantom{\rule{0.16667em}{0ex}}s\in \mathcal{X}.$$

(9)

Maximizing (9) results in a split-point *s*_{∞} such that the squared sum of *f* is large either to the left of *s*_{∞} or right of *s*_{∞} (or both). For example, if there is a contiguous region where *f* is substantially high, then Ψ* _{t}* will be maximized at the boundary of this region.

As a simple illustration, consider the step function *f*(*x*) = **1**_{{x > 1/2}} for *x* ∈ [0, 1]. Then,

$${\mathrm{\Psi}}_{t}(s)=\{\begin{array}{ll}{\left({\sum}_{{\alpha}_{k}>1/2}\phantom{\rule{0.16667em}{0ex}}\right)}^{2}/\left(K{\sum}_{{\alpha}_{k}>s}\phantom{\rule{0.16667em}{0ex}}\right)\hfill & \text{if}\phantom{\rule{0.16667em}{0ex}}s\le {\scriptstyle \frac{1}{2}}\hfill \\ {\left({\sum}_{1/2<{\alpha}_{k}\le s}\phantom{\rule{0.16667em}{0ex}}\right)}^{2}/\left(K{\sum}_{{\alpha}_{k}\le s}\phantom{\rule{0.16667em}{0ex}}\right)+{\left({\sum}_{{\alpha}_{k}>s}\phantom{\rule{0.16667em}{0ex}}\right)}^{2}/\left(K{\sum}_{{\alpha}_{k}>s}\phantom{\rule{0.16667em}{0ex}}\right)\hfill & \text{if}\phantom{\rule{0.16667em}{0ex}}s>{\scriptstyle \frac{1}{2}}.\hfill \end{array}$$

When *s* ≤ 1/2, the maximum of Ψ_{t} is achieved at the largest split-point value no greater than 1/2. Denote this value by *α*^{−}, and let *α*^{+} denote the smallest *α _{k}* greater than 1/2. Then

$${\mathrm{\Psi}}_{t}({\alpha}^{-})=\frac{{\left({\displaystyle \sum _{{\alpha}_{k}>1/2}}\phantom{\rule{0.16667em}{0ex}}\right)}^{2}}{K{\displaystyle \sum _{{\alpha}_{k}>{\alpha}^{-}}}\phantom{\rule{0.16667em}{0ex}}}=\frac{{\left({\displaystyle \sum _{{\alpha}_{k}\ge {\alpha}^{+}}}\phantom{\rule{0.16667em}{0ex}}\right)}^{2}}{K{\displaystyle \sum _{{\alpha}_{k}\ge {\alpha}^{+}}}\phantom{\rule{0.16667em}{0ex}}}=\frac{{\displaystyle \sum _{{\alpha}_{k}\ge {\alpha}^{+}}}\phantom{\rule{0.16667em}{0ex}}}{K}.$$

The following bound holds when *s* ≥ *α*^{+} > 1/2:

$${\mathrm{\Psi}}_{t}(s)<\frac{{\left({\displaystyle \sum _{1/2<{\alpha}_{k}\le s}}\phantom{\rule{0.16667em}{0ex}}\right)}^{2}}{K{\displaystyle \sum _{{\alpha}^{+}\le {\alpha}_{k}\le s}}\phantom{\rule{0.16667em}{0ex}}}+\frac{{\left({\displaystyle \sum _{{\alpha}_{k}>s}}\phantom{\rule{0.16667em}{0ex}}\right)}^{2}}{K{\displaystyle \sum _{{\alpha}_{k}>s}}\phantom{\rule{0.16667em}{0ex}}}=\frac{{\displaystyle \sum _{{\alpha}^{+}\le {\alpha}_{k}\le s}}\phantom{\rule{0.16667em}{0ex}}}{K}+\frac{{\displaystyle \sum _{{\alpha}_{k}>s}}\phantom{\rule{0.16667em}{0ex}}}{K}=\frac{{\displaystyle \sum _{{\alpha}_{k}\ge {\alpha}^{+}}}\phantom{\rule{0.16667em}{0ex}}}{K}={\mathrm{\Psi}}_{t}({\alpha}^{-}).$$

Therefore the optimal split point is *s*_{∞} = *α*^{−}: this is the value in the support of *X* closest to the point where *f* has the greatest increase; namely *s* = 1/2. Importantly, observe that *s*_{∞} coincides with a change in the sign of the derivative of Ψ_{t}, because Ψ_{t} increases up to *α*^{−} and decreases thereafter.
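The step-function analysis above can be verified by evaluating (9) directly on a small uniform grid of support points; with *K* = 10 equally spaced points on [0, 1], the maximizer is *α*^{−} = 0.45, the largest point no greater than 1/2. The helper `psi_discrete` is our own:

```python
import numpy as np

def psi_discrete(s, alphas, f):
    """Criterion (9): X uniform on the discrete support points alphas."""
    alphas = np.asarray(alphas)
    K = len(alphas)
    left = alphas <= s
    fl, fr = f(alphas[left]), f(alphas[~left])
    return fl.sum() ** 2 / (K * left.sum()) + fr.sum() ** 2 / (K * (~left).sum())

# Step function f(x) = 1{x > 1/2} on a grid of K = 10 points.
alphas = np.arange(0.05, 1.0, 0.10)        # 0.05, 0.15, ..., 0.95
f = lambda x: (x > 0.5).astype(float)
candidates = alphas[:-1]                   # both daughters must be nonempty
vals = [psi_discrete(s, alphas, f) for s in candidates]
s_opt = candidates[int(np.argmax(vals))]   # alpha^-: largest point <= 1/2
```

Here Ψ_{t}(*α*^{−}) = 5/10 = 0.5, matching the count formula above, and every other candidate split gives a strictly smaller value.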

As further illustration that Ψ_{t} depends on the curvature of *f*, Figure 5 displays data-optimized split-points for the Blocks, Bumps, HeaviSine and Doppler functions of Donoho and Johnstone (1994).

Data optimized split-points *ŝ*_{N} for X (in red) using weighted variance splitting for Blocks, Bumps, HeaviSine and Doppler simulations (Donoho and Johnstone, 1994). True functions are displayed in blue.

We end this section by noting evidence of ECP splitting occurring in Figure 5. For example, for Blocks and Bumps, splits are observed near the edges 0 and 1 even though Ψ_{t} has no singularities there. This occurs because once the tree finds the discernible boundaries of the spiky points in Bumps and the jumps in the step functions of Blocks (by discernible we mean the signal is larger than the noise), it has exhausted all informative splits, and so it begins to split near the edges. This is an example of ECP splitting, a topic we discuss next.

Example 1 showed that weighted variance splitting splits at the midpoint for simple linear functions *f*(*x*) = *c*_{0} + *c*_{1}*x*. This midpoint splitting behavior for a strong variable is in contrast to what happens for noisy variables. Consider the case when *f* is a constant, *f*(*x*) = *c*_{0}. This is the limit as *c*_{1} → 0 and corresponds to *X* being a noisy variable. One might think weighted variance splitting will continue to favor midpoint splits, since this is the case for arbitrarily small *c*_{1}, but it will be shown that edge splits are favored in this setting. As discussed earlier, this behavior is referred to as the ECP property.

*A splitting rule has the ECP property if it tends to split near the edge for a noisy variable. In particular, let ŝ _{N} be the optimized split-point for the variable X with candidate split-points x _{1}* < ⋯ < *x _{N}. If X is noisy, the ECP property asserts that ŝ _{N} lies near the edge values x _{1} or x _{N} with probability tending to one as N* → ∞.

To establish the ECP property for weighted variance splitting, first note that Theorem 1 will not help in this instance. The solution (5) is

$$2{c}_{0}={c}_{0}+{c}_{0},$$

which holds for all *s*. The solution is indeterminate because Ψ_{t}(*s*) is constant in *s*. The discrete setting is no different; from (9),

$${\mathrm{\Psi}}_{t}(s)=\frac{{c}_{0}^{2}{\sum}_{{\alpha}_{k}\le s}\phantom{\rule{0.16667em}{0ex}}}{K}+\frac{{c}_{0}^{2}{\sum}_{{\alpha}_{k}>s}\phantom{\rule{0.16667em}{0ex}}}{K}={c}_{0}^{2}.$$

The solution is again indeterminate because Ψ_{t}(*s*) does not depend on *s*. Thus, the infinite sample criterion cannot identify a split-point for a noisy variable, and a finite sample analysis is required.

To establish the ECP property we will use a large sample result due to Breiman et al. (Chapter 11.8; 1984). First, observe that (2) can be written as

$$\widehat{D}(s,t)=\frac{1}{N}\sum _{i\in {t}_{L}}{({Y}_{i}-{\overline{Y}}_{{t}_{L}})}^{2}+\frac{1}{N}\sum _{i\in {t}_{R}}{({Y}_{i}-{\overline{Y}}_{{t}_{R}})}^{2}\phantom{\rule{0ex}{0ex}}=\frac{1}{N}\sum _{i\in t}{Y}_{i}^{2}-\frac{{N}_{L}}{N}{\overline{Y}}_{{t}_{L}}^{2}-\frac{{N}_{R}}{N}{\overline{Y}}_{{t}_{R}}^{2}.$$

Therefore minimizing *D̂*(*s, t*) is equivalent to maximizing

$$\frac{1}{{N}_{L}}{\left(\sum _{i\in {t}_{L}}{Y}_{i}\right)}^{2}+\frac{1}{{N}_{R}}{\left(\sum _{i\in {t}_{R}}{Y}_{i}\right)}^{2}.$$

(10)

Consider the following result (see Theorem 10 for a generalization of this result).

*(Theorem 11.1;* Breiman et al., 1984*). Let* (*Z _{i}*)_{1≤*i*≤*N*} *be non-degenerate i.i.d. random variables with finite second moment, and define*

$${\xi}_{N,m}=\frac{1}{m}{\left(\sum _{i=1}^{m}{Z}_{i}\right)}^{2}+\frac{1}{N-m}{\left(\sum _{i=m+1}^{N}{Z}_{i}\right)}^{2},\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}1\le m\le N-1.$$

(11)

*Then for any* 0 < *δ* < 1/2 *and any* 0 < *τ* < ∞:

$$\underset{N\to \infty}{lim}\mathbb{P}\left\{\underset{1\le m\le N\delta}{max}{\xi}_{N,m}>\underset{N\delta <m<N(1-\delta )}{max}\tau {\xi}_{N,m}\right\}=1$$

(12)

*and*

$$\underset{N\to \infty}{lim}\mathbb{P}\left\{\underset{N(1-\delta )\le m\le N}{max}{\xi}_{N,m}>\underset{N\delta <m<N(1-\delta )}{max}\tau {\xi}_{N,m}\right\}=1.$$

(13)

Theorem 4 shows the maximizer of (11) will favor edge splits with probability tending to one. To see how this applies to (10), let us assume *X* is noisy. By Definition 1, this implies that the distribution of *Y* given **X** does not depend on *X*, and therefore the values *Y _{i}* within *t* are identically distributed regardless of how they are ordered by *X*. Applying Theorem 4 with *Z _{i}* = *Y _{i}* shows that (10) is maximized near the edge with probability tending to one. This establishes the following:

*Weighted variance splitting possesses the ECP property.*
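The edge preference of (11) is easy to observe by simulation. The sketch below (the helper `xi` is our own) evaluates the statistic at every split position for i.i.d. noise; inspecting `m_star` across seeds, the maximizing position typically lands in the outer ends of the index range:

```python
import numpy as np

def xi(z):
    """Statistic (11) evaluated at every split position m = 1, ..., N-1."""
    z = np.asarray(z, dtype=float)
    N = len(z)
    cs = np.cumsum(z)                     # partial sums S_1, ..., S_N
    m = np.arange(1, N)
    return cs[:-1] ** 2 / m + (cs[-1] - cs[:-1]) ** 2 / (N - m)

# With noisy (i.i.d.) responses, inspect where the maximum falls.
rng = np.random.default_rng(0)
vals = xi(rng.standard_normal(1000))
m_star = int(np.argmax(vals)) + 1         # maximizing split position in 1..N-1
```

Since Theorem 4 is asymptotic, any single draw may buck the trend, but repeated draws make the edge preference plain.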

Weighted variance splitting determines the best split by minimizing the weighted sample variance using weights proportional to the daughter sample sizes. We introduce a different type of splitting rule that avoids the use of weights. We refer to this new rule as unweighted variance splitting. The unweighted variance splitting rule is defined as

$${\widehat{D}}_{U}(s,t)=\widehat{\mathrm{\Delta}}({t}_{L})+\widehat{\mathrm{\Delta}}({t}_{R}).$$

(14)

The best split is found by minimizing ${\widehat{D}}_{U}(s,t)$ with respect to *s*. To see the structure of this rule, rewrite (14) as

$${\widehat{D}}_{U}(s,t)=\frac{1}{{N}_{L}}\sum _{i\in {t}_{L}}{Y}_{i}^{2}+\frac{1}{{N}_{R}}\sum _{i\in {t}_{R}}{Y}_{i}^{2}-\frac{1}{{N}_{L}^{2}}{\left(\sum _{i\in {t}_{L}}{Y}_{i}\right)}^{2}-\frac{1}{{N}_{R}^{2}}{\left(\sum _{i\in {t}_{R}}{Y}_{i}\right)}^{2}.$$

The following result shows that rules like this, which we refer to as unweighted splitting rules, possess the ECP property.

*Let* (*Z _{i}*) *be independent and identically distributed random variables, and define*

$${\zeta}_{N,m}=\frac{1}{m}\sum _{i=1}^{m}{Z}_{i}^{2}+\frac{1}{N-m}\sum _{i=m+1}^{N}{Z}_{i}^{2}-\frac{1}{{m}^{2}}{\left(\sum _{i=1}^{m}{Z}_{i}\right)}^{2}-\frac{1}{{(N-m)}^{2}}{\left(\sum _{i=m+1}^{N}{Z}_{i}\right)}^{2},\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}1\le m\le N-1.$$

(15)

*Then for any* 0 < *δ* < 1/2:

$$\underset{N\to \infty}{lim}\mathbb{P}\left\{\underset{1\le m\le N\delta}{min}{\zeta}_{N,m}<\underset{N\delta <m<N(1-\delta )}{min}{\zeta}_{N,m}\right\}=1$$

(16)

*and*

$$\underset{N\to \infty}{lim}\mathbb{P}\left\{\underset{N(1-\delta )\le m\le N}{min}{\zeta}_{N,m}<\underset{N\delta <m<N(1-\delta )}{min}{\zeta}_{N,m}\right\}=1.$$

(17)

We will see that unweighted variance splitting has a stronger ECP property than weighted variance splitting. Going in the opposite direction is heavy weighted variance splitting, which weights the node variances more aggressively by the squared daughter proportions. The heavy weighted variance splitting rule is

$${\widehat{D}}_{H}(s,t)=\widehat{p}{({t}_{L})}^{2}\widehat{\mathrm{\Delta}}({t}_{L})+\widehat{p}{({t}_{R})}^{2}\widehat{\mathrm{\Delta}}({t}_{R}).$$

(18)

The best split is found by minimizing ${\widehat{D}}_{H}(s,t)$ with respect to *s*.

Unlike weighted and unweighted variance splitting, heavy variance splitting does not possess the ECP property. To show this, rewrite (18) as

$${\widehat{D}}_{H}(s,t)=\frac{{N}_{L}}{{N}^{2}}\sum _{i\in {t}_{L}}{Y}_{i}^{2}+\frac{{N}_{R}}{{N}^{2}}\sum _{i\in {t}_{R}}{Y}_{i}^{2}-\frac{1}{{N}^{2}}{\left(\sum _{i\in {t}_{L}}{Y}_{i}\right)}^{2}-\frac{1}{{N}^{2}}{\left(\sum _{i\in {t}_{R}}{Y}_{i}\right)}^{2}.$$

This is an example of a heavy weighted splitting rule. The following result shows that such rules favor center splits for noisy variables. Therefore they are the greediest in the presence of noise.

**Theorem 7** *Let* (*Z _{i}*) *be independent and identically distributed random variables, and define*

$${\phi}_{N,m}=m\sum _{i=1}^{m}{Z}_{i}^{2}+(N-m)\sum _{i=m+1}^{N}{Z}_{i}^{2}-{\left(\sum _{i=1}^{m}{Z}_{i}\right)}^{2}-{\left(\sum _{i=m+1}^{N}{Z}_{i}\right)}^{2},\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}1\le m\le N-1.$$

(19)

*Then for any* 0 < *δ* < 1/2:

$$\underset{N\to \infty}{lim}\mathbb{P}\left\{\underset{1\le m<N\delta}{min}{\phi}_{N,m}>\underset{N\delta \le m\le N(1-\delta )}{min}{\phi}_{N,m}\right\}=1$$

(20)

*and*

$$\underset{N\to \infty}{lim}\mathbb{P}\left\{\underset{N(1-\delta )<m\le N}{min}{\phi}_{N,m}>\underset{N\delta \le m\le N(1-\delta )}{min}{\phi}_{N,m}\right\}=1.$$

(21)

The previous results show that the ECP property only holds for weighted and unweighted splitting rules, but not heavy weighted splitting rules. For convenience, we summarize the three splitting rules below:

*Splitting rules of the form (*11*) are called weighted splitting rules. Those like (*15*) are called unweighted splitting rules, while those of the form (*19*) are called heavy weighted splitting rules.*
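To make the three rule classes of this definition concrete, the sketch below (our own illustration, not code from the paper) evaluates ξ_{N,m}, ζ_{N,m} and φ_{N,m} from (11), (15) and (19) for every split of a node. On a noiseless change-point signal all three rules agree and split exactly at the change point; their differences emerge only in how they treat noise.

```python
import numpy as np

def xi(z):    # weighted rule, Eq. (11): maximize over m
    N = len(z)
    return np.array([z[:m].sum()**2 / m + z[m:].sum()**2 / (N - m)
                     for m in range(1, N)])

def zeta(z):  # unweighted rule, Eq. (15): minimize over m
    N = len(z)
    return np.array([(z[:m]**2).sum() / m + (z[m:]**2).sum() / (N - m)
                     - z[:m].sum()**2 / m**2 - z[m:].sum()**2 / (N - m)**2
                     for m in range(1, N)])

def phi(z):   # heavy weighted rule, Eq. (19): minimize over m
    N = len(z)
    return np.array([m * (z[:m]**2).sum() + (N - m) * (z[m:]**2).sum()
                     - z[:m].sum()**2 - z[m:].sum()**2
                     for m in range(1, N)])
```

For example, with *z* consisting of 50 zeros followed by 50 ones, the argmax of ξ and the argmins of ζ and φ all occur at the change point *m* = 50.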

To investigate the differences between our three splitting rules we used the following one-dimensional (*p* = 1) simulation. We simulated *n* = 100 observations from

$${Y}_{i}={c}_{0}+{c}_{1}{X}_{i}+{\epsilon}_{i},\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}i=1,\dots ,n,$$

where *X _{i}* was drawn independently from a uniform distribution on [−3, 3] and (*ε _{i}*) were independent mean-zero errors.

The simulation was repeated 10,000 times independently. The optimized split-point *ŝ _{N}* under weighted, unweighted and heavy weighted variance splitting was recorded in each instance. We also recorded *ŝ _{N}* under pure random splitting for comparison.

Figure 7: Density for *ŝ*_{N} under weighted variance (black), unweighted variance (red), heavy weighted variance (green) and random splitting (blue), where *f*(*x*) = *c*_{0} + *c*_{1}*x* for *c*_{0} = 1, *c*_{1} = 0 (left: noisy), *c*_{0} = 1, *c*_{1} = 0.5 (middle: weak signal) and *c*_{0} = 1, *c*_{1} = … (right: strong signal).

The example confirms our earlier hypothesis: weighted splitting is the most adaptive. In noisy scenarios it exhibits ECP tendencies but with even moderate signal it shuts off ECP splitting enabling it to recover signal.

We return to Example 4 and investigate the shape of Ψ* _{t}* under the three splitting rules. As before, we assume the feature values lie on the uniform grid *i*/*n* for *i* = 1, …, *n*, in which case

$${\mathrm{\Psi}}_{t}(i/n)=\begin{cases}\dfrac{1}{i}{\left({\sum}_{k\le i}f(k/n)\right)}^{2}+\dfrac{1}{n-i}{\left({\sum}_{i<k}f(k/n)\right)}^{2} & \text{(weighted)}\\[6pt] \dfrac{1}{{i}^{2}}{\left({\sum}_{k\le i}f(k/n)\right)}^{2}-\dfrac{1}{i}{\sum}_{k\le i}f{(k/n)}^{2}+\dfrac{1}{{(n-i)}^{2}}{\left({\sum}_{i<k}f(k/n)\right)}^{2}-\dfrac{1}{n-i}{\sum}_{i<k}f{(k/n)}^{2} & \text{(unweighted)}\\[6pt] {\left({\sum}_{k\le i}f(k/n)\right)}^{2}-i{\sum}_{k\le i}f{(k/n)}^{2}+{\left({\sum}_{i<k}f(k/n)\right)}^{2}-(n-i){\sum}_{i<k}f{(k/n)}^{2} & \text{(heavy)}\end{cases}$$

Ψ* _{t}* functions for the Blocks, Bumps, HeaviSine and Doppler functions of Example 4 are shown in Figure 8. For weighted splitting, Ψ* _{t}* is maximized within the contiguous regions defined by the singularity points of the target function, so optimal splits track the underlying signal.

The previous analyses looked at *p* = 1 scenarios. Here we consider a more complex *p* > 1 simulation as in (8). To facilitate this analysis, it will be helpful to define an ECP statistic quantifying the closeness of a split to an edge. Let *ŝ _{N}* be the optimized split for a given variable, and let *j*(*ŝ _{N}*) ∈ {1, …, *N* − 1} denote its rank among the ordered candidate split values. Define

$$\text{ecp}({\widehat{s}}_{N})=\frac{1}{2}-\frac{min\phantom{\rule{0.16667em}{0ex}}\left\{N-1-j({\widehat{s}}_{N}),\phantom{\rule{0.38889em}{0ex}}j({\widehat{s}}_{N})-1\right\}}{N-1}.$$

The ECP statistic is motivated by the following observations. The closest that *ŝ _{N}* can be to the right-most split is when *j*(*ŝ _{N}*) = *N* − 1, and to the left-most split when *j*(*ŝ _{N}*) = 1; in both cases ecp(*ŝ _{N}*) attains its maximum value of 1/2. A split at the center of the data yields a value near 0. Thus, values near 1/2 indicate end-cut splitting.
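A direct transcription of the ECP statistic (our own code; here `j` is the rank of the chosen split among the *N* − 1 candidate splits):

```python
def ecp(j, N):
    """ECP statistic for a split of rank j among the N-1 candidate splits.

    Values near 1/2 indicate an edge split; values near 0 a center split."""
    assert 1 <= j <= N - 1
    return 0.5 - min(N - 1 - j, j - 1) / (N - 1)
```

For instance, with *N* = 101 the left-most (j = 1) and right-most (j = 100) splits both score 1/2, while a split at the middle scores approximately 0.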

*n* = 1000 values were sampled from (8) using 25 noise variables (thus increasing the previous *D* = 13 to *D* = 35). Figure 9 displays ecp(*ŝ _{N}*) values as a function of node depth for *X*, the linear variable *U*_{1}, and a noisy variable.

For weighted splitting (top panel), ECP values are high for *X* near −1 and 1.5. This is because the observed values of *Y* are relatively constant in the range [−1, 1.5], which causes splits to occur relatively infrequently in this region, similar to Figure 3, and end-cut splits to occur at its edges. Almost all splits occur in [−3, −1) and (1.5, 3], where *Y* is non-linear in *X*, and many of these occur at relatively small depths, reflecting a strong *X* signal in these regions. For *U*_{1}, ECP behavior is generally uniform, although there is evidence of ECP splitting at the edges. The uniform behavior is expected because *U*_{1} contributes a linear term to *Y*, thus favoring splits at the midpoint, while edge splits occur because of the moderate signal: after a sufficient number of splits, *U*_{1}’s signal is exhausted and the tree begins to split at its edge. For the noisy variable, strong ECP behavior occurs near the edges −3 and 3.

Unweighted splitting (second row) exhibits aggressive ECP behavior for *X* across much of its range (excluding [−1, 1.5], where again splits of any kind are infrequent). The predominant ECP behavior indicates that unweighted splitting has difficulty discerning signal. Note the large node depths due to excessive end-cut splitting. For *U*_{1}, splits are more uniform but there is aggressive ECP behavior at the edges. Aggressive ECP behavior is also seen at the edges for the noisy variable. Heavy weighted splitting (third row) registers few large ECP values and ECP splitting is uniform for the noisy variable. Node depths are smaller compared to the other two rules.

The bottom panel displays results for restricted weighted splitting. Here weighted splitting was applied, but candidate split values *x*_{1} < · · · < *x _{N}* were restricted to those with ranks between *Nδ* and *N*(1 − *δ*) (with *δ* = 0.20), thereby suppressing edge splits.

To look more closely at the issue of split-depth, Table 1 displays the average depth at which a variable splits for the first time. This statistic, called minimal depth by Ishwaran et al. (2010, 2011), is useful for assessing a variable’s importance. Minimal depth for unweighted splitting is excessively large, so we focus on the other rules. Focusing on weighted, restricted weighted, and heavy weighted splitting, we find minimal depth identical for *X*, while minimal depth for the linear variables is roughly the same, although heavy weighted splitting’s value is smallest, consistent with that rule’s tendency to split towards the center, which favors linearity. Over noise variables, minimal depth is largest for weighted variance splitting: its ECP property produces deeper trees, which pushes splits for noise variables down the tree. It is notable how much larger this minimal depth is compared with the other two rules, in particular restricted weighted splitting. Combining the results of Table 1 with Figure 9, we conclude that restricted weighted splitting is closest to weighted splitting, but differs by its inability to produce ECP splits. Because of this feature, we will use restricted splitting in subsequent analyses to assess the benefit of the ECP property.

We used a large benchmark analysis to further assess the different splitting rules. In total, we used 36 data sets of differing size *n* and dimension *p* (Table 2). This included real data (in capitals) and synthetic data (in lower case). Many of the synthetic data were obtained from the
`mlbench` R-package (Leisch and Dimitriadou, 2009) (e.g., data sets listed in Table 2 starting with “friedman” are the class of Friedman simulations included in the package). The entry “simulation.8” is simulation (8) just considered. A RF regression (RF-R) analysis was applied to each data set using parameters (*ntree, mtry, nodesize*) = (1000, [*p*/3]^{+}, 5), where [*z*]^{+} rounds *z* up to the nearest integer. Weighted variance, unweighted variance, heavy weighted variance and pure random splitting rules were used for each data set. Additionally, we used the restricted weighted splitting rule described in the previous section (*δ* = .20). Mean-squared error (MSE) was estimated using 10-fold cross-validation. To facilitate comparison of MSE across data sets, we standardized MSE by dividing by the sample variance of *Y*. All computations were implemented using the
`randomForestSRC` R-package (Ishwaran and Kogalur, 2014).

Table 2: MSE performance of RF-R under different splitting rules. MSE was estimated using 10-fold cross-validation and has been standardized by the sample variance of *Y* and multiplied by 100.

To systematically compare performance we used univariate and multivariate non-parametric statistical tests described in Demsar (2006). To compare two splitting rules we used the Wilcoxon signed rank test applied to the difference of their standardized MSE values. To test for an overall difference among the various procedures we used the Iman and Davenport modified Friedman test (Demsar, 2006). The exact p-values for the Wilcoxon signed rank test are recorded along the upper diagonal of Table 3. The lower diagonal values record the corresponding test statistics, where small values indicate a difference. The diagonal values of the table record the average rank of each procedure and were used for the Friedman test.

Table 3: Performance of RF-R under different splitting rules. Upper diagonal values are Wilcoxon signed rank p-values comparing two procedures; lower diagonal values are the corresponding test statistics. Diagonal values record overall average rank.

The modified Friedman test of equality of ranks yielded a p-value < 0.00001, providing strong evidence of a difference between the methods. Overall, weighted splitting had the best overall rank, followed by restricted weighted splitting, unweighted splitting, heavy weighted splitting, and finally pure random splitting. To compare the performance of weighted splitting to each of the other rules, based on the p-values in Table 3, we used the Hochberg step-down procedure (Demsar, 2006), which controls for multiple testing. Under a familywise error rate (FWER) of 0.05, the procedure rejected each null hypothesis of equality between weighted splitting and another method. This demonstrates the superiority of weighted splitting. Other points worth noting in Table 3 are that while unweighted splitting’s overall rank is better than that of heavy weighted splitting, the difference appears marginal, and considering Table 2 we see there is no clear winner. In moderate-dimensional problems unweighted splitting is generally better, while heavy weighted splitting is sometimes better in high dimensions. The high-dimensional scenario is interesting and we discuss it in more detail below (Section 2.9.1). Finally, it is clearly evident from Table 3 that pure random splitting is substantially worse than all other rules. Considering Table 2, we find its performance deteriorates as *p* increases. One exception is “noise”, a synthetic data set with all noisy variables: all methods perform similarly here. In general, pure random splitting is on par with other rules only when *n* is large and *p* is small (e.g. the CMB data).

Figure 10 displays the average number of nodes by tree depth for each splitting rule. We observe the following patterns:

- Heavy weighted splitting (green) yields the most symmetric node distribution. Because it does not possess the ECP property, and splits near the middle, it grows shallower balanced trees.
- Unweighted splitting (red) yields the most skewed node distribution. It has the strongest ECP property and has the greatest tendency to split near the edge. Edge splitting promotes unbalanced deep trees.
- Random (blue), weighted (black), and restricted weighted (magenta) splitting have node distributions that fall between the symmetric distributions of heavy weighted splitting and the skewed distributions of unweighted splitting. Due to suppression of ECP splits, restricted weighted splitting is the least skewed of the three and is closest to heavy weighted splitting, whereas weighted splitting due to ECP splits is the most skewed of the three and closest to unweighted splitting.

To investigate performance differences in high dimensions, we ran the following two additional simulations. In the first, we simulated *n* = 250 observations from the linear model

$${Y}_{i}={C}_{0}+{C}_{1}{X}_{i}+{C}_{2}\sum _{k=1}^{d}{U}_{i,k}+{\epsilon}_{i},$$

(22)

where (*ε _{i}*) are independent mean-zero errors and the *U _{i,k}* are linear variables with common coefficient *C*_{2}.

The same forest parameters were used as in Table 2. To investigate the effect of dimensionality, we varied the total number of variables in small increments. The left panel of Figure 11 presents the results for (22). Unweighted splitting has poor performance in this example, possibly due to its overly strong ECP property. Restricted weighted splitting is slightly better than heavy weighted splitting, but weighted splitting has the best performance, and its advantage over heavy weighted and restricted weighted splitting grows with *p*. As discussed, these gains are a direct consequence of ECP splitting. The right panel of Figure 11 presents the results for “friedman2.bigp”. Interestingly, the results are similar, although MSE values are far smaller due to the strong non-linear signal.

Now we consider the effect of splitting in multiclass problems. As before, the learning data is $\mathcal{L}={({\mathbf{X}}_{i},{Y}_{i})}_{1\le i\le n}$, but the outcome *Y _{i}* is now a class label taking values in {1, …, *J*}.

We study splitting under the Gini index, a widely used CART splitting rule for classification. Let ${\widehat{\varphi}}_{j}(t)$ denote the relative frequency of class *j* within node *t*. The impurity of *t* is

$$\widehat{\mathrm{\Gamma}}(t)=\sum _{j=1}^{J}{\widehat{\varphi}}_{j}(t)(1-{\widehat{\varphi}}_{j}(t)).$$

As before, let *t _{L}* and *t _{R}* denote the left and right daughter nodes resulting from splitting *t* at the split-point *s*. The impurity of the left daughter is

$$\widehat{\mathrm{\Gamma}}({t}_{L})=\sum _{j=1}^{J}{\widehat{\varphi}}_{j}({t}_{L})(1-{\widehat{\varphi}}_{j}({t}_{L})),$$

where ${\widehat{\varphi}}_{j}({t}_{L})$ is the relative frequency of class *j* within *t _{L}*; the right daughter impurity $\widehat{\mathrm{\Gamma}}({t}_{R})$ is defined analogously. The decrease in node impurity from the split is

$$\widehat{\mathrm{\Gamma}}(s,t)=\widehat{\mathrm{\Gamma}}(t)-\left[\widehat{p}({t}_{L})\widehat{\mathrm{\Gamma}}({t}_{L})+\widehat{p}({t}_{R})\widehat{\mathrm{\Gamma}}({t}_{R})\right].$$

The quantity

$$\widehat{G}(s,t)=\widehat{p}({t}_{L})\widehat{\mathrm{\Gamma}}({t}_{L})+\widehat{p}({t}_{R})\widehat{\mathrm{\Gamma}}({t}_{R})$$

is the Gini index. To achieve a good split, we seek the split-point maximizing the decrease in node impurity: equivalently we can minimize *Ĝ*(*s*, *t*) with respect to *s*. Notice that because the Gini index weights the node impurity by the node size, it can be viewed as the analog of the weighted variance splitting criterion (2).

To theoretically derive the limiting behavior of *ŝ _{N}*, we again consider an infinite sample paradigm. In place of $\widehat{G}(s,t)$, consider its population analog

$$G(s,t)=p({t}_{L})\mathrm{\Gamma}({t}_{L})+p({t}_{R})\mathrm{\Gamma}({t}_{R}),$$

(23)

where Γ(*t _{L}*) and Γ(*t _{R}*) are the population node impurities

$$\mathrm{\Gamma}({t}_{L})=\sum _{j=1}^{J}{\varphi}_{j}({t}_{L})(1-{\varphi}_{j}({t}_{L})),\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\mathrm{\Gamma}({t}_{R})=\sum _{j=1}^{J}{\varphi}_{j}({t}_{R})(1-{\varphi}_{j}({t}_{R}))$$

where ${\varphi}_{j}({t}_{L})=\mathbb{P}\{Y=j\mid X\in {t}_{L}\}$ and ${\varphi}_{j}({t}_{R})=\mathbb{P}\{Y=j\mid X\in {t}_{R}\}$.

The following is the analog of Theorem 1 for the two-class problem.

**Theorem 8** *Let ϕ*(*s*) = ℙ{*Y* = 1|*X* = *s*}*. If ϕ*(*s*) *is continuous over t* = [*a*, *b*] *and* ℙ* _{t}* *has a continuous and positive density over t with respect to Lebesgue measure, then the value of s that minimizes* (23) *satisfies*

$$2\varphi (s)={\int}_{a}^{s}\varphi (x)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{{t}_{L}}(dx)+{\int}_{s}^{b}\varphi (x)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{{t}_{R}}(dx),\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}a\le s\le b.$$

(24)

Theorem 8 can be used to determine the optimal Gini split in terms of the underlying target function, *ϕ*(*x*). Consider a simple intercept-slope model

$$\varphi (x)={(1+exp(-f(x)))}^{-1}.$$

(25)

Assume ℙ* _{t}* is uniform over *t* = [*a*, *b*] and that *f*(*x*) = *c*_{0} + *c*_{1}*x* with *c*_{1} ≠ 0. Then (24) becomes

$$2{c}_{1}\varphi (s)=\frac{1}{s-a}log\phantom{\rule{0.16667em}{0ex}}\left(\frac{1-\varphi (a)}{1-\varphi (s)}\right)+\frac{1}{b-s}log\phantom{\rule{0.16667em}{0ex}}\left(\frac{1-\varphi (s)}{1-\varphi (b)}\right).$$

Unlike the regression case, the solution cannot be derived in closed form and does not equal the midpoint of the interval [*a*, *b*].
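This can be checked numerically. The sketch below (our own illustration under our own discretization, not the paper's code) brute-forces the population criterion (23) for the logistic model (25) with ℙ_t uniform on [*a*, *b*], using the two-class identity Γ(*t _{L}*) = 2*q _{L}*(1 − *q _{L}*), where *q _{L}* is the average of *ϕ* over the left daughter:

```python
import numpy as np

def gini_population_split(c0, c1, a=-3.0, b=3.0, n=1200):
    """Minimize the population Gini criterion G(s,t) of Eq. (23) on a grid,
    for phi(x) = 1/(1 + exp(-(c0 + c1*x))) and X uniform on [a, b]."""
    x = np.linspace(a, b, n)
    p = 1.0 / (1.0 + np.exp(-(c0 + c1 * x)))
    best_s, best_g = None, np.inf
    for m in range(1, n):
        pL = m / n                           # P_t(X <= s)
        qL, qR = p[:m].mean(), p[m:].mean()  # phi averaged over each daughter
        g = pL * 2 * qL * (1 - qL) + (1 - pL) * 2 * qR * (1 - qR)
        if g < best_g:
            best_g, best_s = g, x[m - 1]
    return best_s
```

For *c*_{0} = 1, *c*_{1} = 1 on [−3, 3], the minimizer lies well to the left of the midpoint 0, consistent with the claim above.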

It is straightforward to extend Theorem 2 to the classification setting, thus justifying the use of an infinite sample approximation. The square-integrability condition holds automatically due to the boundedness of *ϕ*(*s*). Therefore only the positive support condition for ℙ* _{t}* and the existence of a unique maximizer for Ψ* _{t}* need to be verified, where Ψ* _{t}*(*s*) in this setting equals

$${\left({\mathbb{P}}_{t}\{X\le s\}\right)}^{-1}{\left({\int}_{a}^{s}\varphi (x)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{t}(dx)\right)}^{2}+{\left({\mathbb{P}}_{t}\{X>s\}\right)}^{-1}{\left({\int}_{s}^{b}\varphi (x)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{t}(dx)\right)}^{2}.$$

Under these conditions it can be shown that *ŝ _{N}* converges to the unique population split-point maximizing Ψ* _{t}*.

Breiman (1996) also investigated optimal split-points for classification splitting rules. However, those results are different from ours. He studied the question of what configuration of class frequencies yields the optimal split for a given splitting rule. This is different because it does not involve the classification rule and therefore does not address the question of the optimal split-point for a given *ϕ*(*x*). The optimal split-point studied in Breiman (1996) may not even be realizable.

We show that Gini splitting possesses the ECP property. Noting that

$$\widehat{\mathrm{\Gamma}}({t}_{L})=\sum _{j=1}^{J}{\widehat{\varphi}}_{j}({t}_{L})(1-{\widehat{\varphi}}_{j}({t}_{L}))=1-\sum _{j=1}^{J}{\widehat{\varphi}}_{j}{({t}_{L})}^{2},$$

and that $\widehat{\mathrm{\Gamma}}({t}_{R})=1-{\sum}_{j=1}^{J}{\widehat{\varphi}}_{j}{({t}_{R})}^{2}$, we can rewrite the Gini index as

$$\widehat{G}(s,t)=\frac{{N}_{L}}{N}\left(1-\sum _{j=1}^{J}\frac{{N}_{j,L}^{2}}{{N}_{L}^{2}}\right)+\frac{{N}_{R}}{N}\left(1-\sum _{j=1}^{J}\frac{{N}_{j,R}^{2}}{{N}_{R}^{2}}\right),$$

where *N _{j,L}* and *N _{j,R}* denote the number of class *j* cases in *t _{L}* and *t _{R}*, respectively. Minimizing $\widehat{G}(s,t)$ is therefore equivalent to maximizing

$$\sum _{j=1}^{J}\frac{{N}_{j,L}^{2}}{{N}_{L}}+\sum _{j=1}^{J}\frac{{N}_{j,R}^{2}}{{N}_{R}}.$$

(26)

In the two-class problem, *J* = 2, it can be shown this is equivalent to maximizing

$$\frac{{N}_{1,L}^{2}}{{N}_{L}}+\frac{{N}_{1,R}^{2}}{{N}_{R}}=\frac{1}{{N}_{L}}{\left(\sum _{i\in {t}_{L}}{\mathbf{1}}_{\{{Y}_{i}=1\}}\right)}^{2}+\frac{1}{{N}_{R}}{\left(\sum _{i\in {t}_{R}}{\mathbf{1}}_{\{{Y}_{i}=1\}}\right)}^{2},$$

which is a member of the class of weighted splitting rules (11) required by Theorem 4 with ${Z}_{i}={\mathbf{1}}_{\{{Y}_{i}=1\}}$.
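The two-class reduction is easy to verify numerically. In the sketch below (illustrative names of our own), the Gini index Ĝ(*s*, *t*) is scored for every split of a labeled node alongside the weighted-rule criterion with *Z _{i}* = **1**_{{*Y _{i}* = 1}}; the two always select the same split-point, since Ĝ is an exact decreasing affine function of the criterion.

```python
import numpy as np

def gini_vs_weighted(y):
    """y: 0/1 class labels, ordered by the x-values of the node.

    Returns the Gini index G-hat for every split (to be minimized) and the
    weighted-rule criterion of Eq. (11) with Z_i = 1{Y_i = 1} (maximized)."""
    y = np.asarray(y)
    N = len(y)
    G = np.empty(N - 1)
    C = np.empty(N - 1)
    for m in range(1, N):
        qL, qR = y[:m].mean(), y[m:].mean()
        # two-class Gini index: weighted impurity 2q(1-q) of the daughters
        G[m - 1] = (m / N) * 2 * qL * (1 - qL) + ((N - m) / N) * 2 * qR * (1 - qR)
        # weighted splitting criterion with Z_i = 1{Y_i = 1}
        C[m - 1] = y[:m].sum() ** 2 / m + y[m:].sum() ** 2 / (N - m)
    return G, C
```

Algebraically, Ĝ = 2φ̄ − 2C/N with φ̄ the node's class-1 frequency, so minimizing the first vector and maximizing the second coincide.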

This shows Gini splitting has the ECP property when *J* = 2, but we now show that the ECP property applies in general for *J* ≥ 2. The optimization problem (26) can be written as

$$\sum _{j=1}^{J}\left[\frac{1}{{N}_{L}}{\left(\sum _{i\in {t}_{L}}{Z}_{i(j)}\right)}^{2}+\frac{1}{{N}_{R}}{\left(\sum _{i\in {t}_{R}}{Z}_{i(j)}\right)}^{2}\right]$$

where ${Z}_{i(j)}={\mathbf{1}}_{\{{Y}_{i}=j\}}$. For each class *j*, define

$${\xi}_{N,m,j}=\frac{1}{m}{\left(\sum _{i=1}^{m}{Z}_{i(j)}\right)}^{2}+\frac{1}{N-m}{\left(\sum _{i=m+1}^{N}{Z}_{i(j)}\right)}^{2}.$$

We compare the Gini index for an edge split to a non-edge split. Let

$${j}^{\ast}=\underset{1\le j\le J}{\text{argmax}}{\xi}_{N,j},\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\text{where}\phantom{\rule{0.16667em}{0ex}}{\xi}_{N,j}=\underset{N\delta <m<N(1-\delta )}{max}{\xi}_{N,m,j}.$$

For a left-edge split we have

$$\mathbb{P}\phantom{\rule{0.16667em}{0ex}}\left\{\underset{1\le m\le N\delta}{max}\left\{\sum _{j=1}^{J}{\xi}_{N,m,j}\right\}>\underset{N\delta <m<N(1-\delta )}{max}\left\{\sum _{j=1}^{J}{\xi}_{N,m,j}\right\}\right\}\phantom{\rule{0ex}{0ex}}\ge \mathbb{P}\left\{\underset{1\le m\le N\delta}{max}\left\{\sum _{j=1}^{J}{\xi}_{N,m,j}\right\}>J{\xi}_{N,{j}^{\ast}}\right\}\phantom{\rule{0ex}{0ex}}=\sum _{{j}^{\prime}=1}^{J}\mathbb{P}\left\{\underset{1\le m\le N\delta}{max}\left\{\sum _{j=1}^{J}{\xi}_{N,m,j}\right\}>J{\xi}_{N,{j}^{\prime}},\phantom{\rule{0.38889em}{0ex}}{j}^{\ast}={j}^{\prime}\right\}\phantom{\rule{0ex}{0ex}}\ge \sum _{j=1}^{J}\mathbb{P}\left\{\underset{1\le m\le N\delta}{max}{\xi}_{N,m,j}>J{\xi}_{N,j},\phantom{\rule{0.38889em}{0ex}}{j}^{\ast}=j\right\}.$$

Apply Theorem 4 with *τ* = *J* to each of the *J* terms separately. For each *j*, let ${A}_{N,j}$ denote the event $\{{max}_{1\le m\le N\delta}\phantom{\rule{0.16667em}{0ex}}{\xi}_{N,m,j}>J{\xi}_{N,j}\}$. By Theorem 4, ℙ(*A _{N,j}*) → 1 for each *j*, and since $\mathbb{P}({A}_{N,j},\phantom{\rule{0.16667em}{0ex}}{j}^{\ast}=j)\ge \mathbb{P}\{{j}^{\ast}=j\}-\mathbb{P}({A}_{N,j}^{c})$ and $\sum _{j=1}^{J}\mathbb{P}\{{j}^{\ast}=j\}=1$, the final sum above converges to one. The right-edge case follows by a symmetric argument, and we obtain the following result.

The Gini index possesses the ECP property.

Analogous to unweighted variance splitting, we define an unweighted Gini index splitting rule as follows

$${\widehat{G}}_{U}(s,t)=\widehat{\mathrm{\Gamma}}({t}_{L})+\widehat{\mathrm{\Gamma}}({t}_{R}).$$

(27)

Similar to unweighted variance splitting, the unweighted Gini index splitting rule possesses a strong ECP property.

For brevity we prove that (27) has the ECP property in two-class problems. Notice that we can rewrite (27) as follows

$$\frac{1}{2}{\widehat{G}}_{U}(s,t)=\left(\frac{{N}_{1,L}}{{N}_{L}}-\frac{{N}_{1,L}^{2}}{{N}_{L}^{2}}\right)+\left(\frac{{N}_{1,R}}{{N}_{R}}-\frac{{N}_{1,R}^{2}}{{N}_{R}^{2}}\right)\phantom{\rule{0ex}{0ex}}=\frac{1}{{N}_{L}}\sum _{i\in {t}_{L}}{Z}_{i}^{2}+\frac{1}{{N}_{R}}\sum _{i\in {t}_{R}}{Z}_{i}^{2}-\frac{1}{{N}_{L}^{2}}{\left(\sum _{i\in {t}_{L}}{Z}_{i}\right)}^{2}-\frac{1}{{N}_{R}^{2}}{\left(\sum _{i\in {t}_{R}}{Z}_{i}\right)}^{2},$$

where ${Z}_{i}={Z}_{i}^{2}={\mathbf{1}}_{\{{Y}_{i}=1\}}$, which is a member of the unweighted splitting rules (15).

We also define a heavy weighted Gini index splitting rule as follows

$${\widehat{G}}_{H}(s,t)=\widehat{p}{({t}_{L})}^{2}\widehat{\mathrm{\Gamma}}({t}_{L})+\widehat{p}{({t}_{R})}^{2}\widehat{\mathrm{\Gamma}}({t}_{R}).$$

Similar to heavy weighted splitting in regression, heavy weighted Gini splitting does not possess the ECP property. When *J* = 2, this follows directly from Theorem 7 by observing that

$$\frac{1}{2}{\widehat{G}}_{H}(s,t)=\frac{1}{{N}^{2}}({N}_{L}{N}_{1,L}-{N}_{1,L}^{2})+\frac{1}{{N}^{2}}({N}_{R}{N}_{1,R}-{N}_{1,R}^{2})\phantom{\rule{0ex}{0ex}}=\frac{{N}_{L}}{{N}^{2}}\sum _{i\in {t}_{L}}{Z}_{i}^{2}+\frac{{N}_{R}}{{N}^{2}}\sum _{i\in {t}_{R}}{Z}_{i}^{2}-\frac{1}{{N}^{2}}{\left(\sum _{i\in {t}_{L}}{Z}_{i}\right)}^{2}-\frac{1}{{N}^{2}}{\left(\sum _{i\in {t}_{R}}{Z}_{i}\right)}^{2},$$

which is a member of the heavy weighted splitting rules (19) with ${Z}_{i}={Z}_{i}^{2}={\mathbf{1}}_{\{{Y}_{i}=1\}}$.

To investigate the differences between the Gini splitting rules we used the following one-dimensional two-class simulation. We simulated *n* = 100 observations for *ϕ*(*x*) specified as in (25) where *f*(*x*) = *c*_{0} + *c*_{1}*x* and *X* was uniform [−3, 3]. We considered noisy, moderate signal, and strong signal scenarios, similar to our regression analysis of Figure 7. The experiment was repeated 10,000 times independently.

Figure 12 reveals a pattern similar to Figure 7. Once again, weighted splitting is the most adaptive. It exhibits ECP tendencies, but in the presence of even moderate signal it shuts off ECP splitting. Unweighted splitting is also adaptive but with a more aggressive ECP behavior.

To further assess differences in the splitting rules we ran a large benchmark analysis comprised of 36 data sets of varying dimension and number of classes (Table 4). As in our regression benchmark analysis of Table 2, real data sets are indicated with capitals and synthetic data in lower case. The latter were all obtained from the mlbench R-package (Leisch and Dimitriadou, 2009). A RF classification (RF-C) analysis was applied to each data set using the same forest parameters as Table 2. Pure random splitting as well as weighted, unweighted and heavy weighted Gini splitting was employed. Restricted Gini splitting, defined as in the regression case, was also used (*δ* = .20).

Table 4: Brier score performance (×100) of RF-C under different splitting rules. Performance was estimated using 10-fold cross-validation. To interpret the Brier score, the benchmark value of 25 represents the performance of a random classifier.

Performance was assessed using the Brier score (Brier, 1950) and estimated by 10-fold cross-validation. Let ${\widehat{p}}_{i,j}:=\widehat{\mathbb{P}}\{Y=j\mid \mathbf{X}={\mathbf{X}}_{i}\}$ be the forest's predicted probability that case *i* has class label *j*, and let $\mathcal{J}$ denote the held-out fold. Then

$$\text{Brier}\phantom{\rule{0.16667em}{0ex}}\text{Score}=\frac{1}{J\mid \mathcal{J}\mid}\sum _{i\in \mathcal{J}}\sum _{j=1}^{J}{\left({\mathbf{1}}_{\{{Y}_{i}=j\}}-{\widehat{p}}_{i,j}\right)}^{2}.$$

The Brier score was used rather than misclassification error because it directly measures accuracy in estimating the true conditional probability ℙ{*Y* = *j*|**X**}. We are interested in the true conditional probability because a method that is consistent for estimating this value is immediately Bayes risk consistent, but not vice-versa. See Gyorfi et al. (2002, Theorem 1.1).
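A direct implementation of the Brier score above (our own, written for clarity rather than efficiency, with classes indexed 0, …, *J* − 1 for convenience):

```python
import numpy as np

def brier_score(y, prob):
    """y: length-n array of class labels in {0, ..., J-1};
    prob: (n, J) array of predicted class probabilities.

    Returns the multiclass Brier score averaged over cases and classes."""
    n, J = prob.shape
    onehot = np.zeros((n, J))
    onehot[np.arange(n), y] = 1.0
    return ((onehot - prob) ** 2).sum() / (J * n)
```

A random two-class classifier (p̂ = 1/2 everywhere) scores 0.25, matching the ×100 benchmark value of 25 quoted with Table 4, while a perfect classifier scores 0.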

Tables 4 and 5 reveal patterns consistent with Tables 2 and 3. As in Table 2, random splitting is consistently poor, with performance degrading with increasing *p*. The rank of splitting rules in Table 5 is consistent with Table 3, however the statistical significance of pairwise comparisons is not as strong. The Hochberg step-down procedure comparing weighted splitting to each of the other methods did not reject the null hypothesis of equality between weighted and unweighted splitting at a 0.05 FWER; however, increasing the FWER to 16%, which matches the observed p-value for unweighted splitting, led to all hypotheses being rejected. The modified Friedman test of difference in ranks yielded a p-value < 0.00001, thus indicating a strong difference in performance of the methods. We conclude that the splitting rules generally exhibit the same behavior as in the regression setting, but the performance gains for weighted splitting are not as strong.

Regarding the issue of dimensionality, there appears to be no winner over the high-dimensional examples in Table 4: aging, brain, colon, leukemia, lymphoma, prostate and srbct. However, these are all microarray data sets and this could simply be an artifact of this type of data. To further investigate how *p* affects performance, we added noise variables to mlbench synthetic data sets (Figure 13). The dimension was increased systematically in each instance. We also included a linear model simulation similar to (22) with *ϕ*(*x*) specified as in (25) (see top left panel, “linear.bigp”). Figure 13 shows that when performance differences exist between rules, weighted splitting and unweighted splitting, which possess the ECP property, generally outperform restricted weighted and heavy weighted splitting. Furthermore, there is no example where these latter rules outperform weighted splitting.

Our results have shown that pure random splitting is rarely as effective as adaptive splitting. It does not possess the ECP property, nor does it adapt to signal. On the other hand, randomized rules are desirable because they are computationally efficient. Therefore, as a means to improve computational efficiency while maintaining the adaptivity of a split-rule, we consider randomized adaptive splitting. In this approach, in place of deterministic splitting in which the splitting rule is calculated for the entire set of *N* available split-points for a variable, the splitting rule is confined to a set of split-points indexed by a randomly chosen subset ${I}_{N}\subseteq \{1,\dots ,N\}$.

For brevity, we confine our analysis to the class of weighted splitting rules. Deterministic (non-random) splitting seeks the value 1 ≤ *m* ≤ *N* − 1 maximizing (11). In contrast, randomized adaptive splitting maximizes the split-rule by restricting *m* to *I _{N}*. The optimal split-point is determined by maximizing the restricted splitting rule:

$${\xi}_{N,m}^{r}=\frac{1}{m}{\left(\sum _{i=1}^{m}{Z}_{N,i}\right)}^{2}+\frac{1}{{R}_{N}-m}{\left(\sum _{i=m+1}^{{R}_{N}}{Z}_{N,i}\right)}^{2},\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}1\le m\le {R}_{N}-1,$$

(28)

where ${R}_{N}=|{I}_{N}|$ and ${Z}_{N,1},\dots ,{Z}_{N,{R}_{N}}$ denote the values corresponding to the randomly selected split-points.

In principle, *I _{N}* can be selected in any manner. The method we study empirically selects *nsplit* = *R _{N}* candidate split-points at random, without replacement, from the available split-points.

**Theorem 10** *Let* (*Z _{i}*) *be independent and identically distributed random variables, and assume R _{N}* → ∞ *as N* → ∞*. Then for any* 0 < *δ* < 1/2 *and any* 0 < *τ* < ∞:

$$\underset{N\to \infty}{lim}\mathbb{P}\left\{\underset{1\le m\le {R}_{N}\delta}{max}{\xi}_{N,m}^{r}>\underset{{R}_{N}\delta <m<{R}_{N}(1-\delta )}{max}\tau {\xi}_{N,m}^{r}\right\}=1$$

(29)

*and*

$$\underset{N\to \infty}{lim}\mathbb{P}\left\{\underset{{R}_{N}(1-\delta )\le m\le {R}_{N}}{max}{\xi}_{N,m}^{r}>\underset{{R}_{N}\delta <m<{R}_{N}(1-\delta )}{max}\tau {\xi}_{N,m}^{r}\right\}=1.$$

(30)

As a special case, Theorem 10 yields Theorem 4 for the sequence ${I}_{N}=\{1,\dots ,N\}$.

Theorem 10 shows that the ECP property holds if *nsplit* → ∞. Because any rate is possible, the condition is mild and gives justification for *nsplit*-randomization. However, notice that *nsplit* = 1, corresponding to the extremely randomized tree method of Geurts et al. (2006), does not satisfy the rate condition.
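The mechanics of *nsplit*-randomization are simple, as the sketch below shows for the weighted rule (our own illustration; the `randomForestSRC` package exposes this idea through its `nsplit` option). Rather than scoring all *N* − 1 splits, only *nsplit* randomly chosen ones are scored:

```python
import numpy as np

def randomized_weighted_split(y, nsplit, rng=None):
    """Best of `nsplit` randomly chosen candidate split-points under the
    weighted criterion (11); y is assumed ordered by the x-values."""
    if rng is None:
        rng = np.random.default_rng()
    N = len(y)
    # randomly chosen candidate split-points I_N (without replacement)
    candidates = rng.choice(np.arange(1, N), size=min(nsplit, N - 1), replace=False)

    def score(m):  # weighted criterion, Eq. (11)
        return y[:m].sum() ** 2 / m + y[m:].sum() ** 2 / (N - m)

    return int(max(candidates, key=score))
```

With *nsplit* = *N* − 1 every candidate is scored and the deterministic optimum is recovered; smaller values of *nsplit* trade a little optimality for a proportional reduction in splitting-rule evaluations.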

To demonstrate the effectiveness of randomized adaptive splitting, we re-ran the RF-R benchmark analysis of Section 2. All experimental parameters were kept the same. Randomized weighted splitting was implemented using *nsplit* = 1, 5, 10. Performance values are displayed in Table 6 based on the Wilcoxon signed rank test and overall rank of a procedure.

Table 6 shows that the rank of a procedure improves steadily with increasing *nsplit*. The modified Friedman test of equality of ranks rejects the null (p-value < 0.00001) while the Hochberg step-down procedure, which tests equality of weighted splitting to each of the other methods, cannot reject the null hypothesis of performance equality between weighted and randomized weighted splitting for *nsplit* = 10 at any reasonable FWER. This demonstrates the effectiveness of *nsplit*-randomization. Table 7 displays the results from applying *nsplit*-randomization to the classification analysis of Table 4. The results are similar to Table 6 (modified Friedman test p-value < 0.00001; step-down procedure did not reject equality between weighted and randomized weighted for *nsplit* = 10).

For brevity we have presented results of *nsplit*-randomization only in the context of weighted splitting, but we have observed that the properties of all our splitting rules remain largely unaltered under randomization: randomized unweighted variance splitting maintains a more aggressive ECP behavior, while randomized heavy weighted splitting does not exhibit the ECP property at all.

Of the various splitting rules considered, the class of weighted splitting rules, which possess the ECP property, performed the best in our empirical studies. The ECP property, which is the property of favoring edge-splits, is important because it conserves the sample size of a parent node under a bad split. Bad splits generally occur for noisy variables but they can also occur for strong variables (for example, the parent node may be in a region of the feature space where the signal is low). On the other hand, non-edge splits are important when strong signal is present. Good splitting rules therefore have the ECP behavior for noisy or weak variables, but split away from the edge when there is strong signal.

Weighted splitting has this optimality property. In noisy scenarios it exhibits ECP tendencies, but in the presence of signal, it can shut off ECP splitting. To understand how this adaptivity arises, we found that optimal splits under weighted splitting occur in the contiguous regions defined by the singularity points of the population optimization function Ψ* _{t}*—thus, weighted splitting tracks the underlying true target function. To illustrate this point, we examined Ψ* _{t}* in detail.

Randomized adaptive splitting is an attractive compromise to deterministic (non-randomized) splitting. It is computationally efficient and yet does not disrupt the adaptive properties of a splitting rule. The ECP property can be guaranteed under fairly weak conditions. Pure random splitting, however, is not recommended. Its lack of adaptivity and non-ECP behavior yield inferior performance in almost all instances except large sample settings with low dimensionality. Although large sample consistency and asymptotic properties of forests have been investigated under the assumption of pure random splitting, these results show that such studies must be viewed only as a first (but important) step to understanding forests. Theoretical analysis of forests under adaptive splitting rules is challenging, yet future theoretical investigations which consider such rules are anticipated to yield deeper insight into forests.

While CART weighted variance splitting and Gini index splitting are known to be equivalent (Wehenkel, 1996), many RF users may not be aware of their interchangeability: our work reveals both are examples of weighted splitting and therefore share similar properties (in the case of two-class problems, they are equivalent). Related to this is work by Malley et al. (2012), who considered probability machines, defined as learning machines that estimate the conditional probability function for a binary outcome. They outlined advantages of treating two-class data as a nonparametric regression problem rather than as a classification problem. They described a RF regression method to estimate the conditional probability—an example of a probability machine. In place of Gini index splitting they used weighted variance splitting and found performance of the modified RF procedure to compare favorably to boosting, *k*-nearest neighbors, and bagged nearest neighbors. Our results, which establish a connection between the two types of splitting rules, shed light on these findings.
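The two-class equivalence is easy to verify directly: for 0–1 labels the node variance equals p(1 − p), exactly half the two-class Gini index, so the two impurity-decrease criteria are proportional and rank every candidate split identically. A minimal sketch:

```python
def gini(y):
    """Two-class Gini index: 2p(1 - p), with p the class-1 frequency."""
    p = sum(y) / len(y)
    return 2.0 * p * (1.0 - p)

def variance(y):
    """Node variance; equals p(1 - p) when y consists of 0-1 labels."""
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y) / len(y)

def decrease(y, m, impurity):
    """Impurity decrease for splitting y into y[:m] and y[m:]."""
    n = len(y)
    return (impurity(y)
            - (m / n) * impurity(y[:m])
            - ((n - m) / n) * impurity(y[m:]))
```

Because `gini(y) == 2 * variance(y)` for 0–1 labels, the two decreases are proportional at every split point, so Gini index splitting and weighted variance splitting select the same split.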

Dr. Ishwaran’s work was funded in part by DMS grant 1148991 from the National Science Foundation and grant R01CA163739 from the National Cancer Institute. The author gratefully acknowledges three anonymous referees whose reviews greatly improved the manuscript.

Let ${\mathbb{P}}_{\epsilon}$ denote the measure for $\epsilon$. Note that for each measurable set *A*,

$${\mathbb{P}}_{{t}_{L}}(A)=\frac{\mathbb{P}\{A,X\le s,X\in t\}}{\mathbb{P}\{X\le s,X\in t\}}=\frac{{\mathbb{P}}_{t}\{A,X\le s\}}{{\mathbb{P}}_{t}\{X\le s\}}=\underset{A\cap [a,s]}{\int}\frac{{\mathbb{P}}_{t}(dx)}{{\mathbb{P}}_{t}\{X\le s\}}.$$

(31)

Setting *Y* = *f*(*X*) + *ε*, it follows that

$$p({t}_{L})\mathrm{\Delta}({t}_{L})={\mathbb{P}}_{t}\{X\le s\}\text{Var}(Y\mid X\le s,X\in t)\phantom{\rule{0ex}{0ex}}={\mathbb{P}}_{t}\{X\le s\}\phantom{\rule{0.16667em}{0ex}}\left[\mathbb{E}({Y}^{2}\mid X\le s,X\in t)-\mathbb{E}{(Y\mid X\le s,X\in t)}^{2}\right]\phantom{\rule{0ex}{0ex}}={\mathbb{P}}_{t}\{X\le s\}\int \int {(f(x)+\epsilon )}^{2}{\mathbb{P}}_{{t}_{L}}(dx)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{\epsilon}(d\epsilon )-{\mathbb{P}}_{t}\{X\le s\}\phantom{\rule{0.16667em}{0ex}}{\left(\int \int (f(x)+\epsilon )\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{{t}_{L}}(dx)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{\epsilon}(d\epsilon )\right)}^{2}\phantom{\rule{0ex}{0ex}}=\int {\int}_{a}^{s}{(f(x)+\epsilon )}^{2}{\mathbb{P}}_{t}(dx)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{\epsilon}(d\epsilon )-{\left({\mathbb{P}}_{t}\{X\le s\}\right)}^{-1}{\left(\int {\int}_{a}^{s}(f(x)+\epsilon )\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{t}(dx)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{\epsilon}(d\epsilon )\right)}^{2},$$

where we have used (31) in the last line. Recall that $\mathbb{E}(\epsilon )=0$ and $\mathbb{E}({\epsilon}^{2})={\sigma}^{2}$. Hence

$$\int {\int}_{a}^{s}{(f(x)+\epsilon )}^{2}{\mathbb{P}}_{t}(dx)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{\epsilon}(d\epsilon )={\int}_{a}^{s}f{(x)}^{2}{\mathbb{P}}_{t}(dx)+{\sigma}^{2}{\mathbb{P}}_{t}\{X\le s\}$$

and

$$\int {\int}_{a}^{s}(f(x)+\epsilon )\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{t}(dx)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{\epsilon}(d\epsilon )={\int}_{a}^{s}f(x)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{t}(dx).$$

Using a similar argument for $p({t}_{R})\mathrm{\Delta}({t}_{R})$, deduce that

$$D(s,t)={\int}_{a}^{b}f{(x)}^{2}{\mathbb{P}}_{t}(dx)+{\sigma}^{2}-{\left({\mathbb{P}}_{t}\{X\le s\}\right)}^{-1}{\left({\int}_{a}^{s}f(x)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{t}(dx)\right)}^{2}-{\left({\mathbb{P}}_{t}\{X>s\}\right)}^{-1}{\left({\int}_{s}^{b}f(x)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{t}(dx)\right)}^{2}.$$

(32)

We seek to minimize *D*(*s, t*). However, if we drop the first two terms in (32), multiply by −1, and rearrange the resulting expression, it suffices to maximize ${\mathrm{\Psi}}_{t}(s)$, where

$${\mathrm{\Psi}}_{t}(s)={\mathbb{P}}_{t}{\{X\le s\}}^{-1}{\left({\int}_{a}^{s}f(x)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{t}(dx)\right)}^{2}+{\mathbb{P}}_{t}{\{X>s\}}^{-1}{\left({\int}_{s}^{b}f(x)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{t}(dx)\right)}^{2}.$$

The assumption that *f*(*s*) is continuous ensures that the above integrals are continuous and differentiable over $s\in [a,b]$ by the fundamental theorem of calculus. Another application of the fundamental theorem of calculus, making use of the assumption that ${\mathbb{P}}_{t}$ has a continuous and positive density, ensures that ${\mathrm{\Psi}}_{t}(s)$ is differentiable over (*a, b*).

Let *h*(*s*) denote the density for ${\mathbb{P}}_{t}$. For *a < s < b*,

$$\frac{\partial}{\partial s}{\mathrm{\Psi}}_{t}(s)=2f(s)h(s){\int}_{a}^{s}f(x)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{{t}_{L}}(dx)-h(s)\phantom{\rule{0.16667em}{0ex}}{\left({\int}_{a}^{s}f(x)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{{t}_{L}}(dx)\right)}^{2}-2f(s)h(s){\int}_{s}^{b}f(x)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{{t}_{R}}(dx)+h(s)\phantom{\rule{0.16667em}{0ex}}{\left({\int}_{s}^{b}f(x)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{{t}_{R}}(dx)\right)}^{2}.$$

Keeping in mind our assumption *h*(*s*) *>* 0, the two possible solutions that make the above derivative equal to zero are (5) and

$${\int}_{a}^{s}f(x)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{{t}_{L}}(dx)={\int}_{s}^{b}f(x)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{{t}_{R}}(dx).$$

(33)

Because ${\mathrm{\Psi}}_{t}(s)$ is differentiable over (*a, b*), any interior maximizer must satisfy one of these two conditions.

We will show that the maximizer for ${\mathrm{\Psi}}_{t}(s)$ cannot occur at the boundary. Observe that

$${\mathrm{\Psi}}_{t}(a)={\mathrm{\Psi}}_{t}(b)\phantom{\rule{0ex}{0ex}}={\left({\int}_{a}^{b}f(x)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{t}(dx)\right)}^{2}\phantom{\rule{0ex}{0ex}}={\left({\mathbb{P}}_{t}\{X\le s\}{\int}_{a}^{s}f(x)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{{t}_{L}}(dx)+{\mathbb{P}}_{t}\{X>s\}{\int}_{s}^{b}f(x)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{{t}_{R}}(dx)\right)}^{2}\phantom{\rule{0ex}{0ex}}\le {\mathrm{\Psi}}_{t}(s),$$

where the last line holds for any *a < s < b* due to Jensen’s inequality. Moreover, the inequality is strict with equality occurring only when (33) holds. Thus, the maximizer for ${\mathrm{\Psi}}_{t}(s)$ must occur in the interior of [*a, b*].
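As a numerical sanity check of this characterization, take the illustrative choice f(x) = x with ${\mathbb{P}}_{t}$ uniform on [0, 1] (an assumption made only for this sketch): the grid maximizer of ${\mathrm{\Psi}}_{t}(s)$ lands at the interior point s = 1/2, the unique solution of the first-order condition.

```python
def psi(s, f, a=0.0, b=1.0, n=4000):
    """Psi_t(s) for uniform P_t on [a, b]; integrals via midpoint sums."""
    h = (b - a) / n
    xs = [a + (i + 0.5) * h for i in range(n)]
    left = sum(f(x) for x in xs if x <= s) * h / (b - a)
    right = sum(f(x) for x in xs if x > s) * h / (b - a)
    p_left = (s - a) / (b - a)
    return left ** 2 / p_left + right ** 2 / (1.0 - p_left)

f = lambda x: x
grid = [i / 200 for i in range(1, 200)]       # interior split points only
s_star = max(grid, key=lambda s: psi(s, f))   # maximizer of Psi_t
```

For this f one can check by hand that ${\mathrm{\Psi}}_{t}(s) = {s}^{3}/4 + (1-s){(1+s)}^{2}/4$, whose derivative $(1-2s)/4$ vanishes at s = 1/2; the grid search reproduces this interior maximum.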

Let $\stackrel{\sim}{X},{X}_{1},\dots ,{X}_{N}$ be i.i.d. with distribution ${\mathbb{P}}_{t}$. By the strong law of large numbers,

$$\widehat{p}({t}_{L})=\frac{1}{N}\sum _{i=1}^{N}{\mathbf{1}}_{\{{X}_{i}\le s\}}\stackrel{\mathrm{a}.\mathrm{s}.}{\to}\mathbb{P}\{\stackrel{\sim}{X}\le s\}=\mathbb{P}\{X\le s\mid X\in t\}.$$

(34)

Next we apply the strong law of large numbers to $\widehat{\mathrm{\Delta}}({t}_{L})$. First note that

$$\mathbb{E}\phantom{\rule{0.16667em}{0ex}}\left({\mathbf{1}}_{\{\stackrel{\sim}{X}\le s\}}{Y}^{2}\right)=\int {\int}_{a}^{s}{(f(x)+\epsilon )}^{2}{\mathbb{P}}_{t}(dx)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{\epsilon}(d\epsilon )\phantom{\rule{0ex}{0ex}}={\int}_{a}^{s}f{(x)}^{2}{\mathbb{P}}_{t}(dx)+{\sigma}^{2}{\mathbb{P}}_{t}\{X\le s\}.$$

The right-hand side is finite because ${\sigma}^{2}<\infty $ and *f*^{2} is integrable (both by assumption). A similar argument shows that $\mathbb{E}\left({\mathbf{1}}_{\{\stackrel{\sim}{X}\le s\}}Y\right)<\infty $. Appealing once again to the strong law of large numbers, deduce that for $s\in (a,b)$

$$\widehat{\mathrm{\Delta}}({t}_{L})=\frac{{\sum}_{i=1}^{N}{\mathbf{1}}_{\{{X}_{i}\le s\}}{Y}_{i}^{2}}{{\sum}_{i=1}^{N}{\mathbf{1}}_{\{{X}_{i}\le s\}}}-{\left(\frac{{\sum}_{i=1}^{N}{\mathbf{1}}_{\{{X}_{i}\le s\}}{Y}_{i}}{{\sum}_{i=1}^{N}{\mathbf{1}}_{\{{X}_{i}\le s\}}}\right)}^{2}\phantom{\rule{0ex}{0ex}}\stackrel{\mathrm{a}.\mathrm{s}.}{\to}\frac{\mathbb{E}\phantom{\rule{0.16667em}{0ex}}\left({\mathbf{1}}_{\{\stackrel{\sim}{X}\le s\}}{Y}^{2}\right)}{\mathbb{P}\{\stackrel{\sim}{X}\le s\}}-{\left(\frac{\mathbb{E}\phantom{\rule{0.16667em}{0ex}}\left({\mathbf{1}}_{\{\stackrel{\sim}{X}\le s\}}Y\right)}{\mathbb{P}\{\stackrel{\sim}{X}\le s\}}\right)}^{2}\phantom{\rule{0ex}{0ex}}=\mathbb{E}({Y}^{2}\mid X\le s,X\in t)-{\left(\mathbb{E}(Y\mid X\le s,X\in t)\right)}^{2},$$

where we have used that the denominators in the above expression are strictly positive by our positivity assumption for ${\mathbb{P}}_{t}$. Noting that the last line above equals Var(*Y* | *X* ≤ *s*, *X* ∈ *t*), and combining with (34), deduce that

$$\widehat{p}({t}_{L})\widehat{\mathrm{\Delta}}({t}_{L})\stackrel{\mathrm{a}.\mathrm{s}.}{\to}\mathbb{P}\{X\le s\mid X\in t\}\text{Var}(Y\mid X\le s,X\in t).$$

The above convergence can be shown to be uniform on compact sets $[{a}^{\prime},{b}^{\prime}]\subset (a,b)$ by appealing to a uniform law of large numbers. For example, the Glivenko-Cantelli theorem immediately guarantees that convergence of (34) is uniform over [*a, b*]. See Chapter 2 of Pollard (1984) for background on uniform convergence of empirical measures. Applying a symmetrical argument for the right daughter node *t _{R}*, deduce that

$$\widehat{D}(s,t)\stackrel{\mathrm{a}.\mathrm{s}.}{\to}D(s,t),\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\text{uniformly}\phantom{\rule{0.16667em}{0ex}}\text{on}\phantom{\rule{0.16667em}{0ex}}\text{compacta}.$$

The minimizer of *D*(*s, t*) is equivalent to the maximizer of ${\mathrm{\Psi}}_{t}(s)$; the result now follows from the uniform convergence established above.

By Theorem 1, and using the fact that ${\mathbb{P}}_{t}$ is a uniform distribution, the global minimum to (3) is the solution to

$$2f(s)=F(a,s)+F(s,b),$$

(35)

where
$F(\alpha ,\beta )={\int}_{\alpha}^{\beta}f(x)\phantom{\rule{0.16667em}{0ex}}dx/(\beta -\alpha )$ for *a* ≤ *α < β* ≤ *b*. Multiplying the right-hand side by (*s* − *a*)(*b* − *s*), substituting *f*(*x*), and simplifying yields

$$(b-s)\phantom{\rule{0.16667em}{0ex}}\left(\sum _{j=0}^{q}\frac{{c}_{j}}{j+1}({s}^{j+1}-{a}^{j+1})\right)+(s-a)\phantom{\rule{0.16667em}{0ex}}\left(\sum _{j=0}^{q}\frac{{c}_{j}}{j+1}({b}^{j+1}-{s}^{j+1})\right).$$

Divide by (*s* − *a*)(*b* − *s*). Deduce that the right-hand side is

$$\sum _{j=0}^{q}\frac{{a}^{j}{c}_{j}}{j+1}(1+\cdots +{u}^{j})+\sum _{j=0}^{q}\frac{{b}^{j}{c}_{j}}{j+1}(1+\cdots +{v}^{j}),$$

where *u* = *s/a* and *v* = *s/b* (if *a* = 0 the identity continues to hold under the convention that ${0}^{j}/{0}^{j}=1$).

To determine which solution from (35) minimizes (3), choose that value which maximizes (4). Algebraic manipulation allows one to express (4) as (7).
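To make this recipe concrete, the following sketch solves 2f(s) = F(a, s) + F(s, b) by bisection for the illustrative quadratic f(x) = x² on [0, 1] (our example, not one of the paper's benchmark functions); here the equation reduces to 4s² − s − 1 = 0, whose root in (0, 1) is s = (1 + √17)/8 ≈ 0.640.

```python
def F(f, alpha, beta, n=2000):
    """Average of f over [alpha, beta] via the midpoint rule."""
    h = (beta - alpha) / n
    return sum(f(alpha + (i + 0.5) * h) for i in range(n)) * h / (beta - alpha)

def solve_split(f, a=0.0, b=1.0, lo=0.5, hi=1.0 - 1e-9, tol=1e-9):
    """Bisection for the root of g(s) = 2 f(s) - F(a, s) - F(s, b)."""
    g = lambda s: 2.0 * f(s) - F(f, a, s) - F(f, s, b)
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if g(lo) * g(mid) <= 0:       # root lies in the left half
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

s = solve_split(lambda x: x * x)      # approx (1 + sqrt(17)) / 8
```

The bracketing interval [0.5, 1) is chosen by inspection for this particular f; a general-purpose implementation would first locate sign changes of g.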

The following is a slightly modified version of the proof given in Breiman et al. (1984). We provide a proof not only for the convenience of the reader, but also because parts of the proof will be reused later.

To start, we first show there is no loss of generality in assuming $\mathbb{E}({Z}_{1})=0$. Let
${S}_{m}={\sum}_{i=1}^{m}({Z}_{i}-\mu )$ and
${S}_{m}^{\ast}={\sum}_{i=m+1}^{N}({Z}_{i}-\mu )$, where $\mu =\mathbb{E}({Z}_{1})$. Then

$${\xi}_{N,m}=\frac{1}{m}{({S}_{m}+m\mu )}^{2}+\frac{1}{N-m}{({S}_{m}^{\ast}+(N-m)\mu )}^{2}\phantom{\rule{0ex}{0ex}}=\frac{1}{m}{S}_{m}^{2}+\frac{1}{N-m}{S}_{m}^{\ast 2}+2\mu ({S}_{m}+{S}_{m}^{\ast})+N{\mu}^{2}$$

Because ${S}_{m}+{S}_{m}^{\ast}={\sum}_{i=1}^{N}({Z}_{i}-\mu )$ and $N{\mu}^{2}$ do not depend on *m*, maximizing ${\xi}_{N,m}$ is equivalent to maximizing

$$\frac{1}{m}{S}_{m}^{2}+\frac{1}{N-m}{S}_{m}^{\ast 2}=\frac{1}{m}{\left(\sum _{i=1}^{m}({Z}_{i}-\mu )\right)}^{2}+\frac{1}{N-m}{\left(\sum _{i=m+1}^{N}({Z}_{i}-\mu )\right)}^{2}.$$

Therefore, we can assume $\mathbb{E}({Z}_{1})=0$. Hence,
${S}_{m}={\sum}_{i=1}^{m}{Z}_{i},{S}_{m}^{\ast}={\sum}_{i=m+1}^{N}{Z}_{i}$ and
${\xi}_{N,m}={S}_{m}^{2}/m+{S}_{m}^{\ast 2}/(N-m)$. Let *C* > 0 be an arbitrary constant. Kolmogorov’s inequality asserts that for independent mean-zero variables (*U_{i}*)

$$\mathbb{P}\left\{\underset{1\le m\le n}{max}\left|\sum _{1\le i\le m}{U}_{i}\right|\ge C\right\}\le \frac{1}{{C}^{2}}\sum _{1\le i\le n}\mathbb{E}({U}_{i}^{2}).$$

Let
${\sigma}^{2}=\mathbb{E}({Z}_{1}^{2})$. Because *Z _{i}* are independent with mean zero, deduce that

$$\mathbb{P}\left\{\underset{N\delta <m<N(1-\delta )}{max}\left(\frac{\tau {S}_{m}^{2}}{m}\right)\ge \frac{{\sigma}^{2}}{\delta C}\right\}\le \mathbb{P}\left\{\underset{N\delta <m<N(1-\delta )}{max}{S}_{m}^{2}\ge \frac{N\delta {\sigma}^{2}}{\tau \delta C}\right\}\phantom{\rule{0ex}{0ex}}\le \frac{\tau C}{N{\sigma}^{2}}\sum _{1\le i\le N(1-\delta )}\mathbb{E}({Z}_{i}^{2})\phantom{\rule{0ex}{0ex}}\le \tau C.$$

Similarly,

$$\mathbb{P}\left\{\underset{N\delta <m<N(1-\delta )}{max}\phantom{\rule{0.16667em}{0ex}}\left(\frac{\tau {S}_{m}^{\ast 2}}{N-m}\right)\ge \frac{{\sigma}^{2}}{\delta C}\right\}\le \frac{\tau C}{N{\sigma}^{2}}\sum _{N\delta +1\le i\le N}\mathbb{E}({Z}_{i}^{2})\le \tau C.$$

Therefore,

$$\mathbb{P}\left\{\underset{N\delta <m<N(1-\delta )}{max}\tau {\xi}_{N,m}\ge \frac{2{\sigma}^{2}}{\delta C}\right\}\le 2\tau C.$$

(36)

Let ${L}_{m}=\sqrt{mlog(logm)}$. By the law of the iterated logarithm (LIL) (Hartman and Wintner, 1941)

$$\underset{m\to \infty}{limsup}{\left(\frac{{S}_{m}}{{L}_{m}}\right)}^{2}=2{\sigma}^{2},\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\text{almost}\phantom{\rule{0.16667em}{0ex}}\text{surely},$$

which implies that for any 0 *< θ <* 2 and any integer ${m}_{0}>2$

$$\underset{N\to \infty}{lim}\mathbb{P}\left\{\underset{{m}_{0}\le m\le N\delta}{max}\phantom{\rule{0.16667em}{0ex}}{\left(\frac{{S}_{m}}{{L}_{m}}\right)}^{2}>\theta {\sigma}^{2}\right\}=1.$$

Hence for *m*_{0} chosen such that *δC* log(log*m*_{0}) *>* 2*/θ*

$$\underset{N\to \infty}{lim}\mathbb{P}\left\{\underset{1\le m\le N\delta}{max}{\xi}_{N,m}>\frac{2{\sigma}^{2}}{\delta C}\right\}\phantom{\rule{0ex}{0ex}}\ge \underset{N\to \infty}{lim}\mathbb{P}\left\{\underset{1\le m\le N\delta}{max}\phantom{\rule{0.16667em}{0ex}}\left(\frac{{S}_{m}^{2}}{m}\right)>\frac{2{\sigma}^{2}}{\delta C}\right\}\phantom{\rule{0ex}{0ex}}\ge \underset{N\to \infty}{lim}\mathbb{P}\left\{\underset{{m}_{0}\le m\le N\delta}{max}\phantom{\rule{0.16667em}{0ex}}\left(\frac{{S}_{m}^{2}}{mlog(log({m}_{0}))}\right)>\frac{2{\sigma}^{2}}{\delta Clog(log({m}_{0}))}\right\}\phantom{\rule{0ex}{0ex}}\ge \underset{N\to \infty}{lim}\mathbb{P}\left\{\underset{{m}_{0}\le m\le N\delta}{max}\phantom{\rule{0.16667em}{0ex}}{\left(\frac{{S}_{m}}{{L}_{m}}\right)}^{2}>\theta {\sigma}^{2}\right\}\phantom{\rule{0ex}{0ex}}=1.$$

(37)

Because *C* can be made arbitrarily small, deduce from (37) and (36) that (12) holds. A symmetrical argument yields (13).
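To see the two regimes of ${\xi}_{N,m}$ side by side, the following sketch computes ${\xi}_{N,m}$ by prefix sums. For a centered step sequence (pure signal) the maximizer sits exactly at the change point; for i.i.d. noise the maximizer instead tends toward the edges, which is the end-cut behavior the theorem describes (the noise case is left as a comment, since its exact location is random).

```python
def xi_values(z):
    """xi_{N,m} = S_m^2 / m + (S_N - S_m)^2 / (N - m), for m = 1..N-1."""
    n, total = len(z), sum(z)
    vals, s = [], 0.0
    for m in range(1, n):
        s += z[m - 1]
        vals.append(s * s / m + (total - s) ** 2 / (n - m))
    return vals

# Pure signal: a centered step of height 1 with change point at m = 500.
z = [-0.5] * 500 + [0.5] * 500
vals = xi_values(z)
m_star = vals.index(max(vals)) + 1   # maximizer of xi_{N,m}
# m_star == 500: with signal, the criterion recovers the change point.
# For z drawn i.i.d. N(0, 1) instead, the maximizer typically falls near
# m = 1 or m = N - 1, illustrating the end-cut preference for noise.
```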

We will assume $\mathbb{E}({Z}_{1})=0$ and later show that the assumption holds without loss of generality. Let
${\sigma}^{2}=\mathbb{E}({Z}_{1}^{2})$. With a little bit of rearrangement we obtain

$$-\sqrt{N}{\zeta}_{N,m}=-2\sqrt{N}{\sigma}^{2}+{A}_{N,m}+{B}_{N,m}$$

where

$${A}_{N,m}=\frac{\sqrt{N}}{m}\sum _{i=1}^{m}{\stackrel{\sim}{Z}}_{i}+\frac{\sqrt{N}}{N-m}\sum _{i=m+1}^{N}{\stackrel{\sim}{Z}}_{i},$$

${\stackrel{\sim}{Z}}_{i}={\sigma}^{2}-{Z}_{i}^{2}$ are i.i.d. with mean zero, and

$${B}_{N,m}=\frac{\sqrt{N}}{{m}^{2}}{\left(\sum _{i=1}^{m}{Z}_{i}\right)}^{2}+\frac{\sqrt{N}}{{(N-m)}^{2}}{\left(\sum _{i=m+1}^{N}{Z}_{i}\right)}^{2}.$$

We will maximize ${A}_{N,m}+{B}_{N,m}$, which is equivalent to minimizing ${\zeta}_{N,m}$.

We begin with *B _{N,m}*. We consider its behavior away from an edge. Let
${S}_{m}={\sum}_{i=1}^{m}{Z}_{i}$ and
${S}_{m}^{\ast}={\sum}_{i=m+1}^{N}{Z}_{i}$. Arguing as in the proof of Theorem 4, we have for any constant *C* > 0

$$\mathbb{P}\left\{\underset{N\delta <m<N(1-\delta )}{max}\phantom{\rule{0.16667em}{0ex}}\left(\frac{\sqrt{N}{S}_{m}^{2}}{{m}^{2}}\right)\ge \frac{{\sigma}^{2}}{{\delta}^{2}C}\right\}\le \frac{{\delta}^{2}C\sqrt{N}}{{(N\delta )}^{2}{\sigma}^{2}}\sum _{1\le i\le N(1-\delta )}\mathbb{E}({Z}_{i}^{2})\le \frac{C}{\sqrt{N}}.$$

Applying a similar argument for ${S}_{m}^{\ast 2}/{(N-m)}^{2}$, deduce that

$$\mathbb{P}\left\{\underset{N\delta <m<N(1-\delta )}{max}{B}_{N,m}\ge \frac{2{\sigma}^{2}}{{\delta}^{2}C}\right\}\le \frac{2C}{\sqrt{N}}.$$

Therefore we have established that

$$\underset{N\delta <m<N(1-\delta )}{max}{B}_{N,m}={O}_{p}(1/\sqrt{N}).$$

(38)

Now consider *A _{N,m}*. We first consider its behavior away from an edge. Let
${\stackrel{\sim}{\sigma}}^{2}=\mathbb{E}({\stackrel{\sim}{Z}}_{1}^{2})$, which is finite by our assumption
$\mathbb{E}({Z}_{1}^{4})<\infty $. Let
${\stackrel{\sim}{S}}_{m}={\sum}_{i=1}^{m}{\stackrel{\sim}{Z}}_{i}$ and
${\stackrel{\sim}{S}}_{m}^{\ast}={\sum}_{i=m+1}^{N}{\stackrel{\sim}{Z}}_{i}$. By Kolmogorov’s inequality, for any constant *C* > 0,

$$\mathbb{P}\left\{\underset{N\delta <m<N(1-\delta )}{max}\phantom{\rule{0.16667em}{0ex}}\left(\frac{\sqrt{N}{\stackrel{\sim}{S}}_{m}}{m}\right)\ge \frac{\stackrel{\sim}{\sigma}}{\delta C}\right\}\phantom{\rule{0ex}{0ex}}\le \mathbb{P}\left\{\underset{N\delta \le m<N(1-\delta )}{max}\mid {\stackrel{\sim}{S}}_{m}\mid \phantom{\rule{0.16667em}{0ex}}\ge \frac{\sqrt{N}\stackrel{\sim}{\sigma}}{C}\right\}\phantom{\rule{0ex}{0ex}}\le \frac{{C}^{2}N(1-\delta ){\stackrel{\sim}{\sigma}}^{2}}{N{\stackrel{\sim}{\sigma}}^{2}}\phantom{\rule{0ex}{0ex}}\le {C}^{2}.$$

Using a similar argument for ${\stackrel{\sim}{S}}_{m}^{\ast}/(N-m)$,

$$\mathbb{P}\left\{\underset{N\delta <m<N(1-\delta )}{max}{A}_{N,m}\ge \frac{2\stackrel{\sim}{\sigma}}{\delta C}\right\}\le 2{C}^{2}.$$

(39)

Now we consider the behavior of *A _{N,m}* near an edge. As in the proof of Theorem 4, let
${L}_{m}=\sqrt{mlog(logm)}$. Choose
$0<\theta <\sqrt{2}$ and let $({r}_{m})$ be any sequence of constants with ${r}_{m}\ge 1$. By the LIL,

$$\underset{N\to \infty}{lim}\mathbb{P}\left\{\underset{{m}_{0}\le m\le N\delta}{max}\phantom{\rule{0.16667em}{0ex}}\left(\frac{{r}_{m}{\stackrel{\sim}{S}}_{m}}{{L}_{m}}\right)>\theta \stackrel{\sim}{\sigma}\right\}\phantom{\rule{0ex}{0ex}}\ge \underset{N\to \infty}{lim}\mathbb{P}\left\{\underset{{m}_{0}\le m\le N\delta}{max}\phantom{\rule{0.16667em}{0ex}}\left(\frac{{\stackrel{\sim}{S}}_{m}}{{L}_{m}}\right)>\theta \stackrel{\sim}{\sigma}\right\}\phantom{\rule{0ex}{0ex}}=1.$$

(40)

We will need a bound for the following quantity

$${\mathrm{\Omega}}_{N}^{\ast}=\underset{{m}_{0}\le m\le N\delta}{max}\phantom{\rule{0.16667em}{0ex}}\left(\frac{\sqrt{N}\mid {\stackrel{\sim}{S}}_{m}^{\ast}\mid}{N-m}\right).$$

By Kolmogorov’s inequality, for any constant *K >*0,

$$\mathbb{P}\left\{{\mathrm{\Omega}}_{N}^{\ast}>K\right\}\le \mathbb{P}\left\{\underset{{m}_{0}\le m<N\delta}{max}\mid {\stackrel{\sim}{S}}_{m}^{\ast}\mid \phantom{\rule{0.16667em}{0ex}}\ge \sqrt{N}(1-\delta )K\right\}\phantom{\rule{0ex}{0ex}}\le \frac{N\delta {\stackrel{\sim}{\sigma}}^{2}}{N{(1-\delta )}^{2}{K}^{2}}\phantom{\rule{0ex}{0ex}}\le \frac{2{\stackrel{\sim}{\sigma}}^{2}}{{K}^{2}}.$$

(41)

The following lower bounds hold:

$$\mathbb{P}\left\{\underset{1\le m\le N\delta}{max}{A}_{N,m}>\frac{2\stackrel{\sim}{\sigma}}{\delta C}\right\}\phantom{\rule{0ex}{0ex}}=\mathbb{P}\left\{\underset{1\le m\le N\delta}{max}\phantom{\rule{0.16667em}{0ex}}\left(\frac{\sqrt{N}{\stackrel{\sim}{S}}_{m}}{m}+\frac{\sqrt{N}{\stackrel{\sim}{S}}_{m}^{\ast}}{N-m}\right)>\frac{2\stackrel{\sim}{\sigma}}{\delta C}\right\}\phantom{\rule{0ex}{0ex}}\ge \mathbb{P}\left\{\underset{{m}_{0}\le m\le N\delta}{max}\phantom{\rule{0.16667em}{0ex}}\left(\frac{\sqrt{N\delta}{\stackrel{\sim}{S}}_{m}}{m{l}_{0}}\right)-\frac{\sqrt{\delta}{\mathrm{\Omega}}_{N}^{\ast}}{{l}_{0}}>\frac{2\stackrel{\sim}{\sigma}}{C{l}_{0}\sqrt{\delta}}\right\},\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}{l}_{0}=\sqrt{log(log{m}_{0})}\phantom{\rule{0ex}{0ex}}\ge \mathbb{P}\left\{\left\{\underset{{m}_{0}\le m\le N\delta}{max}\phantom{\rule{0.16667em}{0ex}}\left(\frac{\sqrt{N\delta}{\stackrel{\sim}{S}}_{m}}{m{l}_{0}}\right)\ge \frac{\sqrt{\delta}{\mathrm{\Omega}}_{N}^{\ast}}{{l}_{0}}+\frac{2\stackrel{\sim}{\sigma}}{C{l}_{0}\sqrt{\delta}}\right\}\cap \left\{{\mathrm{\Omega}}_{N}^{\ast}\le K\right\}\right\}\phantom{\rule{0ex}{0ex}}\ge \mathbb{P}\left\{\underset{{m}_{0}\le m\le N\delta}{max}\phantom{\rule{0.16667em}{0ex}}\left(\frac{\sqrt{N\delta}{\stackrel{\sim}{S}}_{m}}{m{l}_{0}}\right)\ge \frac{K\sqrt{\delta}}{{l}_{0}}+\frac{2\stackrel{\sim}{\sigma}}{C{l}_{0}\sqrt{\delta}}\right\}-\mathbb{P}\left\{{\mathrm{\Omega}}_{N}^{\ast}>K\right\}.$$

(42)

The last line follows from $\mathbb{P}(A\cap B)=\mathbb{P}(A)-\mathbb{P}(A\cap {B}^{c})\ge \mathbb{P}(A)-\mathbb{P}({B}^{c})$. Now choose ${m}_{0}$ large enough that

$$\frac{K\sqrt{\delta}}{{l}_{0}}+\frac{2\stackrel{\sim}{\sigma}}{C{l}_{0}\sqrt{\delta}}=\frac{1}{\sqrt{log(log{m}_{0})}}\left[K\sqrt{\delta}+\frac{2\stackrel{\sim}{\sigma}}{C\sqrt{\delta}}\right]<\theta \stackrel{\sim}{\sigma}.$$

Then the first term on the last line of (42) is bounded below by

$$\mathbb{P}\left\{\underset{{m}_{0}\le m\le N\delta}{max}\phantom{\rule{0.16667em}{0ex}}\left(\frac{\sqrt{N\delta}{\stackrel{\sim}{S}}_{m}}{m{l}_{0}}\right)>\theta \stackrel{\sim}{\sigma}\right\}\ge \mathbb{P}\left\{\underset{{m}_{0}\le m\le N\delta}{max}\phantom{\rule{0.16667em}{0ex}}\left(\frac{{\stackrel{\sim}{S}}_{m}}{\sqrt{m}{l}_{0}}\right)>\theta \stackrel{\sim}{\sigma}\right\},\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\text{because}\phantom{\rule{0.16667em}{0ex}}\sqrt{m}\le \sqrt{N\delta},$$

which converges to 1 due to (40) with ${r}_{m}=\sqrt{log(logm)/log(log{m}_{0})}$. Because ${B}_{N,m}\ge 0$, it follows that

$$\underset{N\to \infty}{lim}\mathbb{P}\left\{\underset{1\le m\le N\delta}{max}\phantom{\rule{0.16667em}{0ex}}({A}_{N,m}+{B}_{N,m})>\underset{N\delta <m<N(1-\delta )}{max}{A}_{N,m}\right\}\phantom{\rule{0ex}{0ex}}\ge \underset{N\to \infty}{lim}\mathbb{P}\left\{\underset{1\le m\le N\delta}{max}{A}_{N,m}>\underset{N\delta <m<N(1-\delta )}{max}{A}_{N,m}\right\}\phantom{\rule{0ex}{0ex}}=1.$$

(43)

The limits (16) and (17) follow by combining results from above. To prove (16), note by (38) we have

$$\underset{N\delta <m<N(1-\delta )}{max}({A}_{N,m}+{B}_{N,m})\le \underset{N\delta <m<N(1-\delta )}{max}{A}_{N,m}+\underset{N\delta <m<N(1-\delta )}{max}{B}_{N,m}\phantom{\rule{0ex}{0ex}}=\underset{N\delta <m<N(1-\delta )}{max}{A}_{N,m}+{o}_{p}(1).$$

Combining this with (43) yields (16). The limit (17) follows by symmetry. Therefore, this concludes the proof under the assumption $\mathbb{E}({Z}_{1})=0$. To show such an assumption holds without loss of generality, let $\mu =\mathbb{E}({Z}_{1})$ and define

$${S}_{m}=\sum _{i=1}^{m}({Z}_{i}-\mu ),\phantom{\rule{0.38889em}{0ex}}{S}_{m}^{\ast}=\sum _{i=m+1}^{N}({Z}_{i}-\mu ),\phantom{\rule{0.38889em}{0ex}}{T}_{m}=\sum _{i=1}^{m}{({Z}_{i}-\mu )}^{2},\phantom{\rule{0.38889em}{0ex}}{T}_{m}^{\ast}=\sum _{i=m+1}^{N}{({Z}_{i}-\mu )}^{2}.$$

Rewrite *ζ _{N,m}* as follows

$${\zeta}_{N,m}=\frac{1}{m}\sum _{i=1}^{m}{({Z}_{i}-\mu +\mu )}^{2}+\frac{1}{N-m}\sum _{i=m+1}^{N}{({Z}_{i}-\mu +\mu )}^{2}-\frac{1}{{m}^{2}}{\left(\sum _{i=1}^{m}({Z}_{i}-\mu )+m\mu \right)}^{2}-\frac{1}{{(N-m)}^{2}}{\left(\sum _{i=m+1}^{N}({Z}_{i}-\mu )+(N-m)\mu \right)}^{2}.$$

Simplifying, it follows that

$${\zeta}_{N,m}=\frac{1}{m}{T}_{m}+\frac{1}{N-m}{T}_{m}^{\ast}-\frac{1}{{m}^{2}}{S}_{m}^{2}-\frac{1}{{(N-m)}^{2}}{S}_{m}^{\ast 2}$$

and therefore *μ* = 0 can be assumed without loss of generality.

We can assume without loss of generality that $\mathbb{E}({Z}_{1})=0$ (the proof is similar to the proof used for Theorem 6 given above). Let
${\sigma}^{2}=\mathbb{E}({Z}_{1}^{2})$. Some rearrangement yields

$$-\frac{1}{N}{\phi}_{N,m}+N{\sigma}^{2}={A}_{N,m}+{B}_{N,m}+{C}_{N,m}$$

where ${A}_{N,m}=-2{\sigma}^{2}{(m-N/2)}^{2}/N+N{\sigma}^{2}/2$,

$${B}_{N,m}=\frac{m}{N}\sum _{i=1}^{m}{\stackrel{\sim}{Z}}_{i}+\frac{N-m}{N}\sum _{i=m+1}^{N}{\stackrel{\sim}{Z}}_{i},$$

${\stackrel{\sim}{Z}}_{i}={\sigma}^{2}-{Z}_{i}^{2}$ are i.i.d. with mean zero and finite variance ${\stackrel{\sim}{\sigma}}^{2}=\mathbb{E}({\stackrel{\sim}{Z}}_{1}^{2})$ (finiteness holds by our assumption of a fourth moment), and

$${C}_{N,m}=\frac{1}{N}{\left(\sum _{i=1}^{m}{Z}_{i}\right)}^{2}+\frac{1}{N}{\left(\sum _{i=m+1}^{N}{Z}_{i}\right)}^{2}.$$

In place of minimizing ${\phi}_{N,m}$ we will maximize ${A}_{N,m}+{B}_{N,m}+{C}_{N,m}$. We will show that

$$\underset{N/2-1\le m\le N/2+1}{max}{A}_{N,m}\gg \underset{1\le m\le N}{max}\mid {B}_{N,m}\mid +\underset{1\le m\le N}{max}{C}_{N,m}.$$

The result will follow from the asymptotic behavior of *A _{N,m}*.

For brevity we only provide a sketch of the proof since many of the technical details are similar to those used in the proof of Theorem 6. We start with a bound for *C_{N,m}*. By the LIL

$$\underset{1\le m\le N}{max}\frac{1}{N}{\left(\sum _{i=1}^{m}{Z}_{i}\right)}^{2}\le \underset{1\le m\le N}{max}{\left(\frac{1}{\sqrt{m}}\sum _{i=1}^{m}{Z}_{i}\right)}^{2}\phantom{\rule{0ex}{0ex}}\asymp 2{\sigma}^{2}log(logN),\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\text{almost}\phantom{\rule{0.16667em}{0ex}}\text{surely}.$$

A similar analysis for the second term in *C_{N,m}* yields

$$\underset{1\le m\le N}{max}{C}_{N,m}\le {O}_{p}\phantom{\rule{0.16667em}{0ex}}(log(logN)).$$

Now we bound *B _{N,m}*. Applying the LIL

$$\underset{1\le m\le N}{max}\phantom{\rule{0.16667em}{0ex}}\left(\frac{m}{N}\sum _{i=1}^{m}{\stackrel{\sim}{Z}}_{i}\right)\le \sqrt{N}\underset{1\le m\le N}{max}\left|\frac{1}{\sqrt{m}}\sum _{i=1}^{m}{\stackrel{\sim}{Z}}_{i}\right|\phantom{\rule{0ex}{0ex}}\asymp \sqrt{2{\stackrel{\sim}{\sigma}}^{2}Nlog(logN)},\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\text{almost}\phantom{\rule{0.16667em}{0ex}}\text{surely.}$$

Applying a similar analysis for the second term in *B _{N,m}*, deduce that

$$\underset{1\le m\le N}{max}\mid {B}_{N,m}\mid \phantom{\rule{0.16667em}{0ex}}\le {O}_{p}\phantom{\rule{0.16667em}{0ex}}\left(\sqrt{Nlog(logN)}\right).$$

To complete the proof we show that *A _{N,m}* is the dominating term. Collecting terms,

$$\frac{N}{{\sigma}^{2}}{A}_{N,m}=-2{(m-N/2)}^{2}+{N}^{2}/2.$$

The function *g*(*m*) = −2(*m* − *N/*2)^{2} is concave (quadratic) in *m* with a unique maximum at *m* = *N/*2. Furthermore,

$${A}_{N,N/2}=\frac{N{\sigma}^{2}}{2}.$$

Thus, ${A}_{N,N/2}$ is of order *N*, dominating the ${O}_{p}(\sqrt{Nlog(logN)})$ and ${O}_{p}(log(logN))$ bounds for ${B}_{N,m}$ and ${C}_{N,m}$, and the result follows.
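A quick simulation makes the contrast with end-cut behavior visible. Assuming the heavy weighted statistic takes the form ${\phi}_{N,m}={m}^{2}{\widehat{\sigma}}_{L}^{2}+{(N-m)}^{2}{\widehat{\sigma}}_{R}^{2}$ (consistent with the decomposition above), its minimizer under pure noise concentrates near the center m = N/2 rather than at an edge:

```python
import random

def phi_values(z):
    """Heavy weighted statistic phi_{N,m} = m^2 var_L + (N-m)^2 var_R."""
    n = len(z)
    tot, tot2 = sum(z), sum(v * v for v in z)
    vals, s, s2 = [], 0.0, 0.0
    for m in range(1, n):
        s += z[m - 1]
        s2 += z[m - 1] ** 2
        var_l = s2 / m - (s / m) ** 2
        var_r = (tot2 - s2) / (n - m) - ((tot - s) / (n - m)) ** 2
        vals.append(m * m * var_l + (n - m) ** 2 * var_r)
    return vals

rng = random.Random(1)
z = [rng.gauss(0.0, 1.0) for _ in range(2000)]
vals = phi_values(z)
m_star = vals.index(min(vals)) + 1   # minimizer of phi_{N,m}
# The deterministic part A_{N,m} is a concave quadratic in m peaking at
# m = N/2, so for noise the minimizer lands well inside the interior:
# heavy weighted splitting does not exhibit the ECP property.
```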

For each measurable set *A*

$$\mathbb{P}\{Y=1\mid X\in A,X\in t\}=\frac{\mathbb{P}\{Y=1,X\in A,X\in t\}}{\mathbb{P}\{X\in A,X\in t\}}\phantom{\rule{0ex}{0ex}}=\frac{{\mathbb{P}}_{X}\phantom{\rule{0.16667em}{0ex}}\left[{\mathbf{1}}_{\{X\in A,X\in t\}}{\mathbb{P}}_{Y\mid X}{\mathbf{1}}_{\{Y=1\}}\right]}{\mathbb{P}\{X\in A,X\in t\}}\phantom{\rule{0ex}{0ex}}=\frac{{\mathbb{P}}_{X}\phantom{\rule{0.16667em}{0ex}}\left[{\mathbf{1}}_{\{X\in A,X\in t\}}\varphi (X)\right]}{\mathbb{P}\{X\in A,X\in t\}}\phantom{\rule{0ex}{0ex}}=\frac{{\mathbb{P}}_{t}\phantom{\rule{0.16667em}{0ex}}\left[{\mathbf{1}}_{\{X\in A\}}\varphi (X)\right]}{{\mathbb{P}}_{t}\{X\in A\}}.$$

Because $\mathrm{\Gamma}({t}_{L})=2{\varphi}_{1}({t}_{L})(1-{\varphi}_{1}({t}_{L}))$,

$$\frac{1}{2}p({t}_{L})\mathrm{\Gamma}({t}_{L})\phantom{\rule{0ex}{0ex}}={\mathbb{P}}_{t}\{X\le s\}{\varphi}_{1}({t}_{L})(1-{\varphi}_{1}({t}_{L}))\phantom{\rule{0ex}{0ex}}={\mathbb{P}}_{t}\{X\le s\}\phantom{\rule{0.16667em}{0ex}}\left[\mathbb{P}\{Y=1\mid X\le s,X\in t\}-{\left(\mathbb{P}\{Y=1\mid X\le s,X\in t\}\right)}^{2}\right]\phantom{\rule{0ex}{0ex}}={\mathbb{P}}_{t}\{X\le s\}\phantom{\rule{0.16667em}{0ex}}\left[\frac{{\mathbb{P}}_{t}\phantom{\rule{0.16667em}{0ex}}\left[{\mathbf{1}}_{\{X\le s\}}\varphi (X)\right]}{{\mathbb{P}}_{t}\{X\le s\}}-{\left(\frac{{\mathbb{P}}_{t}\phantom{\rule{0.16667em}{0ex}}\left[{\mathbf{1}}_{\{X\le s\}}\varphi (X)\right]}{{\mathbb{P}}_{t}\{X\le s\}}\right)}^{2}\right]\phantom{\rule{0ex}{0ex}}={\int}_{a}^{s}\varphi (x)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{t}(dx)-{\left({\mathbb{P}}_{t}\{X\le s\}\right)}^{-1}{\left({\int}_{a}^{s}\varphi (x)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{t}(dx)\right)}^{2}.$$

Using a similar argument for $p({t}_{R})\mathrm{\Gamma}({t}_{R})$, deduce that

$$\frac{1}{2}G(s,t)={\int}_{a}^{b}\varphi (x){\mathbb{P}}_{t}(dx)-{\left({\mathbb{P}}_{t}\{X\le s\}\right)}^{-1}{\left({\int}_{a}^{s}\varphi (x)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{t}(dx)\right)}^{2}-{\left({\mathbb{P}}_{t}\{X>s\}\right)}^{-1}{\left({\int}_{s}^{b}\varphi (x)\phantom{\rule{0.16667em}{0ex}}{\mathbb{P}}_{t}(dx)\right)}^{2}.$$

(44)

Notice that this has a similar form to (32) with *ϕ*(*x*) playing the role of *f*(*x*) (the first term on the right of (44) and the first two terms on the right of (32) play no role). Indeed, we can simply follow the remainder of the proof of Theorem 1 to deduce the result.

The proof is nearly identical to that of Theorem 4 except for the modifications required to deal with triangular arrays. Assume without loss of generality that $\mathbb{E}({Z}_{N,i})=0$. Let
${\sigma}^{2}=\mathbb{E}({Z}_{N,i}^{2}),\phantom{\rule{0.16667em}{0ex}}{S}_{m}={\sum}_{i=1}^{m}{Z}_{N,i}$ and
${S}_{m}^{\ast}={\sum}_{i=m+1}^{{R}_{N}}{Z}_{N,i}$. Splits away from an edge are handled as in Theorem 4 with

$$\mathbb{P}\left\{\underset{{R}_{N}\delta <m<{R}_{N}(1-\delta )}{max}\tau {\xi}_{N,m}^{r}\ge \frac{2{\sigma}^{2}}{\delta C}\right\}\le 2\tau C.$$

(45)

Now we consider the contribution from a left-edge split. To do so, we make use of a LIL for weighted sums, Theorem 1 of Lai and Wei (1982). Using their notation, we write
${S}_{N}={\sum}_{i=-\infty}^{\infty}{a}_{N,i}{Z}_{i}$, where ${a}_{N,i}=1$ for $1\le i\le {R}_{N}$ and ${a}_{N,i}=0$ otherwise. Their result implies that for any $0<\theta <2$

$$\underset{N\to \infty}{limsup}\frac{{S}_{N}^{2}}{{A}_{N}log(log{A}_{N})}>\theta {\sigma}^{2},\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\text{almost}\phantom{\rule{0.16667em}{0ex}}\text{surely},$$

where ${A}_{N}={\sum}_{i=-\infty}^{\infty}{a}_{N,i}^{2}={R}_{N}\to \infty $. Now arguing as in the proof of Theorem 4, this implies

$$\underset{N\to \infty}{lim}\mathbb{P}\left\{\underset{1\le m\le {R}_{N}\delta}{max}{\xi}_{N,m}^{r}>\frac{2{\sigma}^{2}}{\delta C}\right\}=1.$$

(46)

Because *C* can be made arbitrarily small, deduce from (46) and (45) that (29) holds. The limit (30) for a right-edge split follows by symmetry.

- Biau G. Analysis of a random forests model. J Machine Learning Research. 2012;13:1063–1095.
- Biau G, Devroye L, Lugosi G. Consistency of random forests and other classifiers. J Machine Learning Research. 2008;9:2039–2057.
- Breiman L. Technical note: some properties of splitting criteria. Machine Learning. 1996;24:41–47.
- Breiman L. Random forests. Machine Learning. 2001;45:5–32.
- Breiman L. Technical Report 670. University of California, Statistics Department; 2004. Consistency for a simple model of random forests.
- Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees. Belmont, California: Wadsworth; 1984.
- Brier GW. Verification of forecasts expressed in terms of probabilities. Monthly Weather Review. 1950;78:1–3.
- Buhlmann P, Yu B. Analyzing bagging. Ann Statist. 2002;30(4):927–961.
- Cutler A, Zhao G. Pert - perfect random tree ensembles. Computing Science and Statistics. 2001;33:490–497.
- Demsar J. Statistical comparisons of classifiers over multiple data sets. J Machine Learning Research. 2006;7:1–30.
- Dietterich TG. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning. 2000;40:139–157.
- Donoho DL, Johnstone IM. Ideal spatial adaptation by wavelet shrinkage. Biometrika. 1994;81:425–455.
- Geurts P, Ernst D, Wehenkel L. Extremely randomized trees. Machine Learning. 2006;63:3–42.
- Genuer R. Variance reduction in purely random forests. J Nonparam Statist. 2012;24(3):543–562.
- Gyorfi L, Kohler M, Krzyzak A, Walk H. A Distribution-Free Theory of Non-parametric Regression. Springer; 2002.
- Hartman P, Wintner A. On the law of the iterated logarithm. Amer J Math. 1941;63:169–176.
- Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Ann Appl Stat. 2008;2:841–860.
- Ishwaran H, Kogalur UB, Gorodeski EZ, Minn AJ, Lauer MS. High-dimensional variable selection for survival data. J Amer Stat Assoc. 2010;105:205–217.
- Ishwaran H, Kogalur UB, Chen X, Minn AJ. Random survival forests for high-dimensional data. Statistical Analysis and Data Mining. 2011;4:115–132.
- Ishwaran H, Kogalur UB. randomForestSRC: Random Forests for Survival, Regression and Classification (RF-SRC) R package version 1.4.0. 2014 http://cran.r-project.org.
- Kohavi R, John G. Wrappers for feature subset selection. Artificial Intelligence. 1997;97:273–324.
- Kim J, Pollard D. Cube root asymptotics. Ann Stat. 1990;18:191–219.
- Lai TL, Wei CZ. A law of the iterated logarithm for double arrays of independent random variables with applications to regression and time series models. Ann Prob. 1982;10:320–335.
- Leisch F, Dimitriadou E. mlbench: Machine Learning Benchmark Problems. R package version 1.1–6. 2009.
- Lin Y, Jeon Y. Random forests and adaptive nearest neighbors. J Amer Statist Assoc. 2006;101:578–590.
- Malley JD, Kruppa J, Dasgupta A, Malley KG, Ziegler A. Probability machines: consistent probability estimation using nonparametric learning machines. Methods Inform Med. 2012;51(1):74–81. doi: 10.3414/ME00-01-0052.
- Morgan JN, Messenger RC. THAID: a Sequential Search Program for the Analysis of Nominal Scale Dependent Variables. Survey Research Center, Institute for Social Research, University of Michigan; 1973.
- Pollard D. Convergence of Stochastic Processes. Springer-Verlag; 1984.
- Stone CJ. Optimal rates of convergence for nonparametric estimators. Ann Stat. 1980;8:1348–1360.
- Torgo L. A study on end-cut preference in least squares regression trees. Progress in Artificial Intelligence Lecture Notes in Computer Science. 2001;2258:104–115.
- Wehenkel L. On uncertainty measures used for decision tree induction. Proceedings of the International Congress on Information Processing and Management of Uncertainty in Knowledge based Systems, IPMU96; 1996. pp. 413–418.
