Sci China Ser A Math. 2009 July; 52(7): 1525–1538. doi: 10.1007/s11425-009-0102-y. PMCID: PMC2929990

On the rate of convergence of the maximum likelihood estimator of a k-monotone density


Bounds for the bracketing entropy of the classes of bounded k-monotone functions on [0, A] are obtained under both the Hellinger distance and the Lp(Q) distance, where 1 ≤ p < ∞ and Q is a probability measure on [0, A]. The result is then applied to obtain the rate of convergence of the maximum likelihood estimator of a k-monotone density.

Keywords: bracketing metric entropy, maximum likelihood estimator, Hellinger distance

1 Introduction

A function f on (0, ∞) is called k-monotone if (−1)^j f^{(j)}(x) is non-negative, non-increasing, and convex for 0 ≤ j ≤ k − 2 when k ≥ 2, and if f is non-negative and non-increasing when k = 1. These functions fill the gap between monotone functions and completely monotone functions. They arise frequently in nonparametric estimation, for instance through renewal theory and through mixtures of uniform distributions, where the maximum likelihood estimator (MLE) is studied. Indeed, k-monotone functions have been studied since at least the 1950s; for example, Williamson [1] gave a characterization of multiply monotone functions on (0, ∞) in 1956. In recent years, there has been renewed interest in statistics regarding this class of functions. We refer to [2] and the references therein for recent results and their statistical applications.
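For smooth f these conditions are equivalent to requiring (−1)^j f^{(j)} ≥ 0 for all 0 ≤ j ≤ k. As a numerical aside (not part of the original argument), the definition can be checked with finite differences; the test functions, grid, and tolerance below are illustrative choices:

```python
import math

def is_k_monotone(f, k, a=0.01, b=1.0, n=400, tol=1e-9):
    """Numerically test k-monotonicity of f on [a, b]: for smooth f this is
    equivalent to (-1)**j * f^(j) >= 0 for 0 <= j <= k, which we check via
    j-th order forward differences on a uniform grid."""
    h = (b - a) / n
    vals = [f(a + i * h) for i in range(n + 1)]
    for j in range(k + 1):
        if any((-1) ** j * v < -tol for v in vals):
            return False
        # forward difference: proportional to the (j+1)-st derivative
        vals = [vals[i + 1] - vals[i] for i in range(len(vals) - 1)]
    return True

# exp(-t) is completely monotone, hence k-monotone for every k:
print(is_k_monotone(lambda t: math.exp(-t), 5))   # True
# 1 - t^2 is non-negative and non-increasing on [0, 1], so it is
# 1-monotone, but it is concave and therefore not 2-monotone:
print(is_k_monotone(lambda t: 1 - t * t, 1))      # True
print(is_k_monotone(lambda t: 1 - t * t, 2))      # False
```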

Note that a k-monotone function may not be bounded near t = 0. In order to study the metric entropy, we restrict ourselves to the subclass that consists of only the functions that are continuous at t = 0. We refer to this subclass as the class of k-monotone functions on [0, ∞). We denote by Mk(I) the class of k-monotone functions on I, and by Fk(I) the class of probability densities on I that are k-monotone.

For statistical applications, we wish to estimate the bracketing entropy of Fk(R+) and Mk(R+) under all Lp(Q) distances, where 1 ≤ p < ∞ and Q is any probability measure on R+, and under the Hellinger distance h which is defined by

h(p, q) = ( ∫ (√p(x) − √q(x))² dx )^{1/2}.    (1)
Recall that the bracketing metric entropy of a function class F under a distance ρ is defined as log N[](ε, F, ρ), where N[](ε, F, ρ) is the smallest number N of pairs (fi^L, fi^U) with ρ(fi^L, fi^U) ≤ ε such that every f ∈ F satisfies fi^L ≤ f ≤ fi^U for some 1 ≤ i ≤ N.

It is easy to see that both Fk(R+) and Mk(R+) are bounded under the Hellinger distance. However, they are not compact: for any δ < √2, we can find infinitely many functions in Fk(R+) with mutual Hellinger distance at least δ. In fact, the scaled exponential densities pn(t) = 2^n e^{−2^n t} clearly belong to Fk(R+), and for m > n,

h²(pn, pm) = 2 − 2 ∫₀^∞ √(pn(t) pm(t)) dt = 2 − 4·2^{(m+n)/2}/(2^m + 2^n) ≥ 2 − 2^{2−(m−n)/2},

which approaches 2 as m − n → ∞; a sufficiently sparse subsequence therefore has mutual Hellinger distance at least δ.
Note that the sequence {pn(t)}n≥1 is unbounded near the origin. This suggests that for the Hellinger distance we need to restrict ourselves to k-monotone functions whose values are bounded near the origin. However, the sequence {pn(t)}n<0 is uniformly bounded by 1. Thus, this example also indicates that the non-compactness of Fk(R+) under the Hellinger distance is partly due to the unboundedness of the interval. Hence, we should also restrict ourselves to bounded intervals. Therefore, in what follows, we consider the subclasses FkB([0,A]) and MkB([0,A]) instead, where FkB(I) and MkB(I) denote the classes of functions that are bounded by B and belong to Fk(I) and Mk(I), respectively.
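As a numerical aside, the separation can be verified directly, taking pn(t) = 2^n e^{−2^n t} as a concrete instance of such a sequence (our reading of the example above); the closed form for the Hellinger affinity of two exponential densities is standard:

```python
import math

def hellinger_exp_closed(lam, mu):
    """h(p, q) for exponential densities with rates lam and mu, using
    h^2 = 2 - 2*integral sqrt(p*q) and integral sqrt(p*q) = 2*sqrt(lam*mu)/(lam+mu)."""
    return math.sqrt(2.0 - 4.0 * math.sqrt(lam * mu) / (lam + mu))

def hellinger_exp_numeric(lam, mu, n=200000, tmax=10.0):
    """Midpoint-rule evaluation of ( integral (sqrt(p) - sqrt(q))^2 dt )^{1/2}."""
    dt = tmax / n
    s = 0.0
    for i in range(n):
        t = (i + 0.5) * dt
        p = lam * math.exp(-lam * t)
        q = mu * math.exp(-mu * t)
        s += (math.sqrt(p) - math.sqrt(q)) ** 2 * dt
    return math.sqrt(s)

# rates 2 and 32 (i.e. p_1 vs p_5): quadrature agrees with the closed form,
print(abs(hellinger_exp_numeric(2, 32) - hellinger_exp_closed(2, 32)) < 1e-3)  # True
# and h(p_n, p_m) approaches sqrt(2) from below as m - n grows:
print(hellinger_exp_closed(2, 2 ** 20))  # just below sqrt(2) ≈ 1.414
```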

By changing variables, it is easy to see that


Hence, we only need to consider the case A = 1.

Let us remark that for k = 1 the problem was studied by van de Geer [3], based on earlier work of Birman and Solomjak [4]. For example, it was proved that (see also [5], Theorem 2.7.5)


for some absolute constant C > 0, and

log N[](ε, M11(R+), || · ||p,Q) ≤ Cp ε^{−1}    (3)
for some positive constant Cp depending only on p, where 1 ≤ p < ∞, and Q is any probability measure on R+. For simpler proofs, see also [6, 7]. In particular, the iteration method used in [6] is useful in our argument in this paper.
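The order ε^{−1} for k = 1 can be made plausible by a crude counting sketch (an illustration, not the proof in [3, 4]): discretize [0, 1] and the range into about 1/ε cells each; every non-increasing function is then sandwiched between two monotone lattice paths, and the number of such paths is a binomial coefficient whose logarithm grows like 1/ε:

```python
from math import ceil, comb, log

def log_bracket_count(eps):
    """log of a crude upper bound on the number of eps-brackets (in L1)
    for non-increasing functions [0,1] -> [0,1]: with n = ceil(2/eps)
    cells in each direction, every such function lies in a bracket of
    L1-width <= 2/n <= eps determined by a monotone lattice path, and
    there are comb(2n, n) monotone paths."""
    n = ceil(2.0 / eps)
    return log(comb(2 * n, n))

# eps * log N stays bounded, consistent with log N ~ C/eps
# (it approaches 4*log(2) ≈ 2.77 for this particular grid):
for eps in (0.1, 0.01, 0.001):
    print(round(eps * log_bracket_count(eps), 2))
```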

For k > 1, Gao[8] also established the following metric entropy bound for MkB([0,1]):

log N(ε, MkB([0,1]), || · ||₂) ≍ (B/ε)^{1/k}.
The method revealed a nice connection between the metric entropy of these function classes and the small ball probability of k-times integrated Brownian motions. However, because for k > 1 the square root of a k-monotone function may not be k-monotone, the metric entropy estimate under L2 distance does not yield an estimate under the Hellinger distance. Furthermore, that method cannot produce any result on bracketing metric entropy. Thus, it cannot be readily used to determine the convergence rate of the MLE of a k-monotone density.

In this paper, we directly estimate the bracketing metric entropy of these function classes under the Hellinger distance and under all Lp(Q) distances, where 1 ≤ p < ∞, and apply these estimates to statistical settings.

Our main tool is the following lemma, which provides a useful method to estimate bracketing entropy. An extension to more general integral operators will appear elsewhere.

Lemma 1. Let F be a class of functions on [0, 1], and let G be the class of functions on [0, 1] defined by G = { ∫₀^{α(x)} f(t) dt : f ∈ F }, where 0 ≤ α(x) ≤ 1 is an increasing function on [0, 1]. If log N(ε, F, || · ||₁) ≤ ϕ(ε), where || · ||₁ stands for the L1 distance under the Lebesgue measure on [0, 1], then

  1. There exists a constant C depending only on p, such that for any probability measure Q on [0, 1],
     log N[](Cε/ϕ(ε), G, || · ||p,Q) ≤ Cϕ(ε),
     where ||f||p,Q ≡ (∫₀¹ |f(x)|^p dQ(x))^{1/p}.
  2. If we further assume that g(x) ≥ δ for all g ∈ G and x ∈ [0, 1], then there exists a constant C, such that
    where h(f, g) is defined by (1).

Proof. Let {fi}, 1 ≤ i ≤ e^{ϕ(ε)}, be an ε-net for F in the L1 distance under the Lebesgue measure on [0, 1]. For each i and each f ∈ F, we can write


Thus, if we define


we have


Note that Gi+ and Gi− both consist of non-negative increasing functions bounded by ε. Thus, by (3), we can find e^{ϕ(ε)} many ε/ϕ(ε)-brackets (with respect to || · ||p,Q) that cover Gi+ ∪ Gi−. Say these brackets are [hi,j−, hi,j+], 1 ≤ j ≤ e^{ϕ(ε)}. Then clearly the brackets


cover G. Statement (i) follows by noticing that these are ε/ϕ(ε)-brackets under the || · ||p,Q distance.

To prove Statement (ii), we notice that, with the additional assumption g ≥ δ, we have for any g1, g2 ∈ G,




Thus, Statement (ii) follows from Statement (i).

2 Under Hellinger distance

Note that by scaling arguments, we can easily show that

N[](ε, MkB([0,A]), h) = N[]((AB)^{−1/2} ε, Mk1([0,1]), h).
Hence, we only need to consider the classes Mk1([0,1]) and FkB([0,1]).

Before we proceed with the detailed calculation, we make some observations that simplify the later arguments. Firstly, because k-monotone C^∞ functions are dense in Mk1([0,1]) (cf. [8]), we can and will assume that all the densities in Mk1([0,1]) are k-times continuously differentiable. Secondly, if for I = [a, b] ⊂ [0, 1] we define


then for every f ∈ Hk1([0,1]),


for all 0 ≤ j ≤ k and all u ∈ [0, 1]. For f ∈ Hk1([0,1]), we can write


All the terms on the right-hand side are non-negative. The sum of the first k − 1 terms is a polynomial in u of degree k − 2 with non-negative coefficients.

2.1 Bounded k-monotone functions

For f ∈ Hk1([0,1]), because f(1) ≤ 1, we have




and denote by H~kB(I) the class of functions g on I ⊂ [0, 1] that satisfy 0 ≤ g ≤ 1 and are of the form


where f is a non-negative increasing function on I. By (5), we have Hk1([0,1]) ⊂ Pk + H~k1([0,1]), and thus


Note that the set


forms a √(k/N)-net for Pk under the Hellinger distance. Indeed, for any p = a0 + a1u + ⋯ + ak−2 u^{k−2} ∈ Pk, choose




which implies that h(p, p̃) ≤ √(k/N). Note that there are no more than N^{k−1} elements in this set. By choosing N = ⌈4kε^{−2}⌉, we obtain


Of course, because ||p − p̃||₂ ≤ k/N, we also have


Our next goal is to estimate N[](ε/2, H~k1([0,1]), h). To this end, we first consider


For g ∈ H~k1([0,1]), we have


Because g(1) is bounded by 1, and f is increasing, for all 0 ≤ u ≤ 1/2 we have


Define f1 = min{f, 2^{k−1}(k − 1)!} and f2 = f − f1. Then, by the above argument, f2 is non-negative and increasing, and is supported on [1/2, 1]. For g ∈ H~k1([0,1]) and 0 ≤ u ≤ 1, we can write


We construct two function classes:


Then the decomposition above gives H~k1([0,1]) ⊂ Uk + Vk, and thus, for any 0 < θ < 1,


We first claim that

log N[](ε, Uk, || · ||₂) ≤ Cε^{−1/k}    (10)
for some constant C depending only on k. Indeed, the claim is clearly true for k = 1, because in this case Uk consists of monotone functions that are bounded by 1; thus, when k = 1, inequality (10) is a special case of (3).

Suppose that (10) is true for k = r; that is, N[](ε, Ur, || · ||₂) ≤ exp(Cε^{−1/r}). Because


by applying Lemma 1 with the L2 norm under the Lebesgue measure, we have


which implies that N[](ε, Ur+1, || · ||₂) ≤ exp(Cε^{−1/(r+1)}) with a different constant C. Hence (10) holds for all k ≥ 1. Therefore,


Next, we prove that


Indeed, if [hi−, hi+], 1 ≤ i ≤ N, are 2θε-brackets under the || · ||₂ distance that cover H~k1([0,1]), then the brackets


1 ≤ i ≤ N, clearly cover Vk. To see that they are θε-brackets, we notice that


Applying (12) and (13) to (9), we obtain


Choosing 1/2 < θ < 1 and iterating, we obtain


for a different constant C depending only on k. Plugging (14) and (8) into (6), we obtain


If we let


then, because for all g1, g2 ∈ M~kδ([0,1]) we have h(g1, g2) ≤ (2√δ)^{−1} ||g1 − g2||₂, we obtain


We now return to our goal of estimating N[](ε, H~k1([0,1]), h). We define







Note that


Also note that


where the last inequality follows from (7). We choose θ and δ so that 2(1 − θ)² = 1.4 and θ/δ = L, where L is a large number to be fixed later. Thus, by plugging the two bounds above into (17), we obtain


for some constant C1.

On the other hand, by choosing η so that 2(1 − η) = 1.4, we have


Recall that by (16) we have


with a constant C2 depending on L. Plugging this into the inequality above, we obtain


Since H~k1([0,1]) ⊂ A ∪ B, (19) and (20) imply that


Let Z(ε) = ε^{1/k} log N[](ε, H~k1([0,1]), h). Then the inequality above implies that


for some constant C, which further implies that


By choosing L large enough that 1.4^{−1/k} + L^{−1/k} < 1, we immediately obtain


for some constant C. Together with (7), we have


Summarizing, we obtain

Theorem 2. There exists a constant C depending only on k, such that

log N[](ε, MkB([0,A]), h) ≤ C (AB)^{1/(2k)} ε^{−1/k}.
2.2 Bounded k-monotone densities

Now we consider N[](ε, FkB([0,1]), h). Because FkB([0,1]) ⊂ MkB([0,1]), the result of Theorem 2 applies to FkB([0,1]). Our goal is to improve the constant CB^{1/(2k)} by using the extra fact that ∫₀¹ g(u) du = 1 for g ∈ FkB([0,1]). For convenience, we will actually relax the condition ∫₀¹ g(u) du = 1 in the definition of FkB([0,1]) to the condition ∫₀¹ g(u) du ≤ 1.

Assume B > 2. For 1/B < δ < 1, we define


Then FkB([0,1]) = Xδ ∪ Yδ. For any f ∈ FkB([0,1]), because f is non-increasing and ∫₀^{1/2} f(u) du ≤ 1, we have f(u) ≤ 2 for all u ∈ [1/2, 1]. Hence, we can decompose Xδ and Yδ into two classes of functions with disjoint supports:


Therefore, we have


By iteration, we obtain


Let δ = 1/log₂ B and choose m = ⌈log₂ B⌉; then


We summarize this bound in the following theorem:

Theorem 3. Let FkB([0,A]) be the class of k-monotone densities on [0, A] that are bounded by B. Then


where C is a constant depending only on k.

3 Under Lp(Q) distances

In this section we will consider the bracketing entropy of MkB([0,A]) and FkB([0,A]) under the Lp(Q) distance, where 1 ≤ p < ∞ and Q is any probability measure on [0, A]. We will prove the following theorem:

Theorem 4. (i) There exists a constant C depending only on p and k, such that for any probability measure Q on [0, A] that is absolutely continuous with respect to Lebesgue measure with bounded density q,



(ii) Let GkB([0,A]) consist of k-monotone functions g on [0, A], such that g’(0) ≥ −B, and let G~kB([0,A]) consist of probability densities that belong to GkB([0,A]). Then there exists a constant C depending only on p and k, such that for any probability measure Q on [0, A],



Remark 5. In view of the result proved in [8], the rate ε^{−1/k} is sharp when p = 2.

Proof. The result is known for the case k = 1. Thus, we only need to consider the case k > 1.

To prove the first inequality in the statement (i), we note that


where || · ||p is the Lp distance under the Lebesgue measure. Thus, it suffices to prove that for any η > 0, log N[](η, Mk1([0,1]), || · ||p) ≤ Cη^{−1/k}. However, this follows from the same argument as in the proof of Theorem 2. Indeed, the only change needed is to replace (11) by


which leads to N[](ε, Uk, || · ||p) ≤ e^{Cε^{−1/k}}. The first inequality in statement (i) then follows by iteration. The proof of the second inequality is similar to that of Theorem 3.

To prove the first inequality in the statement (ii), we note that by the change of variables x = Au, we have


where P is the probability measure on [0, 1] defined by P([0, u]) = Q([0, Au]). Thus, it suffices to consider the case AB^p = 1. Furthermore, by approximation, we can assume that P is absolutely continuous with respect to the Lebesgue measure on [0, 1]. Let α be the inverse function of x ↦ P([0, x]). By the change of variable u = P([0, x]), it is easy to see that




By (23) we have


Applying Lemma 1, we obtain


which leads to inequality (25). The proof of (26) follows the same argument as that of Theorem 3 and is thus omitted.
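As a numerical aside, the change of variable u = P([0, x]) used above, which turns || · ||p,P into an Lp norm under the Lebesgue measure, can be sanity-checked on a concrete (hypothetical) example: a measure with density q(x) = 2x on [0, 1], so that P([0, x]) = x² and α(u) = √u:

```python
import math

def midpoint(g, a=0.0, b=1.0, n=100000):
    """Midpoint-rule quadrature of g over [a, b]."""
    dt = (b - a) / n
    return sum(g(a + (i + 0.5) * dt) for i in range(n)) * dt

p = 2
f = lambda x: math.exp(-x)
# left side: integral of |f(x)|^p dQ(x) with dQ(x) = q(x) dx = 2x dx
lhs = midpoint(lambda x: abs(f(x)) ** p * 2 * x)
# right side: integral over [0,1] of |f(alpha(u))|^p du, alpha(u) = sqrt(u)
rhs = midpoint(lambda u: abs(f(math.sqrt(u))) ** p)
print(abs(lhs - rhs) < 1e-5)  # True: the two integrals coincide
```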

4 Rates of convergence for the maximum likelihood estimator of a bounded k-monotone density

Let p̂n,k be the MLE of a k-monotone density p0 on [0, A] based on X1, …, Xn i.i.d. with density p0 ∈ FkB([0,A]) for some 0 < A, B < ∞. Thus p0 is bounded and concentrated on [0, A].

From [2] (see also [9]) we know that p̂n,k is characterized by

1 ≥ ∫₀^y (k/y^k) (y − x)^{k−1} / p̂n,k(x) dFn(x)   for all y > 0
  ≥ (1/p̂n,k(0)) ∫₀^y (k/y^k) (y − x)^{k−1} dFn(x) = (1/p̂n,k(0)) ∫₀^y (k/y) (1 − x/y)^{k−1} dFn(x),    (27)

with equality in the first inequality in (27) at the points τ ∈ supp(Ĝn,k) = {τ1, τ2, …, τm}, where we may assume that 0 < τ1 < ⋯ < τm (with m random) and where



p̂n,k(0) ≥ ∫₀^∞ (k/y^k) (y − x)₊^{k−1} dFn(x)   for all y > 0.

To apply our entropy bounds we need to show that p^n,k is bounded with (arbitrarily) high probability when p0 is bounded. This is the content of the following proposition.

Proposition 6. Suppose that p0 ∈ FkB([0,A]). Then the MLE p̂n,k satisfies


Proof. The characterization of the MLE implies that

1 ≥ (k/y^k) ∫₀^y (y − x)^{k−1} / p̂n,k(x) dFn(x)   for all 0 < y ≤ τ1,


with equality at y = τ1:


Now note that the support of Ĝn,k is contained in [τ1, ∞), so x/y ≤ x/τ1, i.e. (1 − x/y) ≥ (1 − x/τ1), for y ≥ τ1 and 0 ≤ x ≤ τ1. Thus it follows that


Combining the last two displays we find that


which yields


by Daniels’ inequality; see [10], Theorem 9.1.2, page 345.
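Daniels' inequality in its exact form for uniform samples states that P(sup_{0<t≤1} Fn(t)/t ≥ λ) = 1/λ for λ ≥ 1; a quick Monte Carlo check (the sample size, replication count, and seed below are arbitrary choices):

```python
import random

def daniels_tail_estimate(n=100, lam=2.0, reps=4000, seed=7):
    """Monte Carlo estimate of P( sup_{0<t<=1} F_n(t)/t >= lam ) for n
    i.i.d. Uniform(0,1) observations; the supremum equals
    max_i (i/n) / U_(i) over the order statistics U_(1) <= ... <= U_(n)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        u = sorted(rng.random() for _ in range(n))
        sup = max((i + 1) / (n * u[i]) for i in range(n))
        hits += sup >= lam
    return hits / reps

print(daniels_tail_estimate())  # close to 1/lam = 0.5
```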

Now suppose that P is a collection of densities with respect to a sigma-finite measure μ. The following theorem is a simplified version of Theorem 3.4.4 of [5] page 327. Our rate theorem for the MLE p^n,k over the class Pk([0,A]), the class of k-monotone densities on [0, A], will be proved by combining the upper bound of this theorem with (an easy modification of) the rate results given in [5], Theorem 3.2.5, page 289.

Theorem 7. Suppose that X1, …, Xn are i.i.d. P0 with density p0 ∈ P. Let h be the Hellinger distance between densities, and let mp be defined, for p ∈ P, by




Furthermore, with Mδ = {mp − mp0 : h(p, p0) ≤ δ, p ∈ P}, we also have




Here is our main result of this section:

Theorem 8. Suppose that p0 ∈ FkB([0,A]) ⊂ Pk([0,A]) for some 0 < A, B < ∞. Then the maximum likelihood estimator p̂n,k of p0 in Pk([0,A]) satisfies

n^{k/(2k+1)} h(p̂n,k, p0) = Op(1).
Remark 9. This generalizes the rate result of [3] (with resulting rate of convergence n^{−1/3}) to k > 1. For the case k = 1, closely related results, with the Hellinger metric replaced by the L1 metric, were obtained in [11–13]. The rate established in Theorem 8 is apparently consistent with the local rate result of n^{−k/(2k+1)} established (up to an envelope conjecture) in [2].

Proof. For simplicity, we write p̂n ≡ p̂n,k. Let M > 0 and K > 0. Then


where, by Proposition 6,


by Daniels’ inequality, and hence IIn can be made arbitrarily small (uniformly in n) by choosing K large. Now we essentially follow the proof of Theorem 3.2.5 of [5] (with θ identified with p), but exploit the fact that p̂n(0+) ≤ K. Thus, letting Mn(p) ≡ Pn mp, we have for any large η > 0,


where the shells Sj,n are now defined with the additional restriction that p(0+) ≤ K:


Here the term IB,n can be made arbitrarily small for all large n by the consistency of p̂n established in [9]. Thus the same argument as in [9] yields, with P = FkK in (32), and ϕn(δ, FkK) ≡ ϕn(δ),


By (32)


A direct calculation using Theorem 3 gives


for the same constant C as in Theorem 3, where D ≡ |log(AK)|^{1/(2k)}. This implies


By taking


where c0 ≡ (√5 − 1)/2, we have rn² ϕn(1/rn) ≤ √n. Note that the functions δ ↦ ϕn(δ)/δ are decreasing; therefore, for any j > 0,


Hence, (33) can be estimated by


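The rate algebra above can be sanity-checked numerically: assuming ϕn(δ) is proportional to δ^{1−1/(2k)} (the shape obtained by integrating the square root of an ε^{−1/k} entropy bound), solving rn² ϕn(1/rn) = √n forces rn = n^{k/(2k+1)} up to constants:

```python
import math

def rate_exponent(k):
    """Exponent gamma such that r_n = n**gamma solves
    r**2 * phi(1/r) = sqrt(n) for phi(delta) = delta**(1 - 1/(2*k)):
    r**(2 - (2k-1)/(2k)) = n**(1/2), i.e. gamma = (1/2) / ((2k+1)/(2k))."""
    return 0.5 / (2.0 - (2.0 * k - 1.0) / (2.0 * k))

for k in range(1, 6):
    gamma = rate_exponent(k)
    # matches k/(2k+1) exactly:
    assert abs(gamma - k / (2 * k + 1)) < 1e-12
    # direct check at n = 10^6 that r^2 * phi(1/r) = sqrt(n):
    n = 1.0e6
    r = n ** gamma
    assert abs(r ** 2 * r ** (-(1.0 - 1.0 / (2.0 * k))) - math.sqrt(n)) < 1e-6
print([rate_exponent(k) for k in range(1, 4)])  # 1/3, 2/5, 3/7 as floats
```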

This work was supported by the National Science Foundation of the USA (Grant Nos. DMS-0405855 and DMS-0804587).


MSC(2000): 62G05, 46B50


1. Williamson RE. Multiply monotone functions and their Laplace transforms. Duke Math J. 1956;23:189–207.
2. Balabdaoui F, Wellner JA. Estimation of a k-monotone density: limit distribution theory and the spline connection. Ann Statist. 2007;35(6):2536–2564.
3. van de Geer S. Hellinger-consistency of certain nonparametric maximum likelihood estimators. Ann Statist. 1993;21(1):14–44.
4. Birman MŠ, Solomjak MZ. Piecewise polynomial approximations of functions of the classes Wpα. Mat Sb. 1967;73(115):331–355.
5. van der Vaart AW, Wellner JA. Weak convergence and empirical processes. Springer-Verlag; New York: 1996. (Springer Series in Statistics).
6. Blei R, Gao F, Li W. Metric entropy of high dimensional distributions. Proc Amer Math Soc. 2007;135(12):4009–4018.
7. Gao F, Wellner JA. Entropy estimate for high-dimensional monotonic functions. J Multivariate Anal. 2007;98(9):1751–1764.
8. Gao F. Entropy estimate for k-monotone functions via small ball probability of integrated Brownian motion. Electron Commun Probab. 2008;13:121–130.
9. Balabdaoui F, Wellner JA. Estimation of a k-monotone density, part 1: characterizations, consistency, and minimax lower bounds. Technical Report, University of Washington, Department of Statistics; 2004.
10. Shorack GR, Wellner JA. Empirical processes with applications to statistics. John Wiley & Sons Inc; New York: 1986. (Wiley Series in Probability and Mathematical Statistics: Probability and Mathematical Statistics).
11. Groeneboom P. Some current developments in density estimation. Mathematics and Computer Science (Amsterdam, 1983) 1986;1:163–192.
12. Birgé L. On estimating a density using Hellinger distance and some other strange facts. Probab Theory Related Fields. 1986;71(2):271–291.
13. Birgé L. The Grenander estimator: a nonasymptotic approach. Ann Statist. 1989;17(4):1532–1549.