Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
Stat Neerl. Author manuscript; available in PMC 2010 October 1.
Published in final edited form as:
Stat Neerl. 2010 February 1; 64(1): 45–70.
doi:  10.1111/j.1467-9574.2009.00438.x
PMCID: PMC2860328

Estimation of a k-monotone density: characterizations, consistency and minimax lower bounds


The classes of monotone or convex (and necessarily monotone) densities on R+ can be viewed as special cases of the classes of k-monotone densities on R+. These classes bridge the gap between the classes of monotone (1-monotone) and convex decreasing (2-monotone) densities for which asymptotic results are known, and the class of completely monotone (∞-monotone) densities on R+. In this paper we consider non-parametric maximum likelihood and least squares estimators of a k-monotone density g0.We prove existence of the estimators and give characterizations. We also establish consistency properties, and show that the estimators are splines of degree k − 1 with simple knots. We further provide asymptotic minimax risk lower bounds for estimating the derivativesg0(j)(x0),0=1,,k1, at a fixed point x0 under the assumption that (1)kg0(k)(x0)>0.

Keywords: completely monotone, least squares, maximum likelihood, minimax risk, mixture models, multiply monotone, non-parametric estimation, rates of convergence, shape constraints

1 Introduction

Densities with monotone or convex shape are encountered in many non-parametric estimation problems. Monotone densities arise naturally via connections with renewal theory and uniform mixing; see Vardi, (1989) and Woodroofe and Sun (1993), for examples of the former, and Woodroofe and Sun (1993), for the latter in an astronomical context. Estimation of monotone densities on (0, ∞) was initiated by Grenander (1956a,b) with related work by Ayer et al. (1955), Brunk (1958), and Van Eeden (1957a,b). Asymptotic theory of the maximum likelihood estimation (MLE) was developed by Prakasa Rao (1969)with later contributions by Groeneboom (1985, 1989), and Kim and Pollard (1990).

Convex densities arise in connection with Poisson process models for bird migration and scale mixtures of triangular densities; see, for example, Hampel, (1987) and Anevski, (2003). Estimation of convex densities on (0, ∞) was apparently initiated by Anevski (1994) (see also Anevski, 2003), and was pursued by Jongbloed (1995). The limit distribution theory for the MLE and least square (LS) estimators and their first derivative at a fixed point was obtained by Groeneboom, Jongbloed, and Wellner (2001). For consistent estimation of the estimators at the origin, see. Balabdaoui (2007).

Estimation in the class of k-monotone densities on R+, denoted hereafter by Dk, has been very recently considered in Balabdaoui and Wellner (2007) and has several motivating components. By definition, g is k-monotone on (0, ∞) if g is non-negative and (−1)lg(l) is non-increasing and convex for l [set membership] {0, …, k − 2} for k ≥ 2, and simply non-negative and non-increasing when k = 1. As will be shown in section 2, it follows from the results of Williamson (1956), Lévy (1962), and Gneiting (1999) that g is a k-monotone density if and only if it can be represented as a scale mixture of beta(1, k) densities. For k = 1 this recovers the well-known facts that monotone densities are scale mixtures of uniform densities, and, for k = 2, that convex decreasing densities are scale mixtures of the triangular, or beta(1, 2), densities. Besides the obvious goal of generalizing the existing theory for the 1-monotone (i.e., monotone) and 2-monotone (i.e., convex and decreasing) classes D1 and D2, these classes provide a potential link to the important limiting case of the k-monotone classes, namely the class D of completely monotone densities. Densities g in D have the property that (−1)lg(l)(x) ≥ 0 for all x [set membership] (0, ∞) and l [set membership] {0, 1, …}. It follows from Bernstein’s theorem (see, e.g., Feller, 1971, p. 439, or Gneiting, 1998) that g [set membership] D if and only if it can be represented as a scale mixture of exponential densities. Completely monotone densities arise naturally in connection with mixtures of Poisson processes and have been used in reliability theory and empirical Bayes estimation, see Jewell (1982) and the references therein, and Balabdaoui and Wellner (2007) for further motivation and references.

In Balabdaoui and Wellner (2007), the joint limit distribution theory for the MLE and LSE of a k-monotone density and their higher derivatives up to degree k − 1 at a fixed point is established modulo a spline conjecture. The rate of convergence of the j-th derivative, j = 0,…,k − 1 is shown to be n(kj)/(2k + 1). Note that these rates coincide with the minimax lower bounds obtained here. As for the joint limiting distribution, it depends on a Gaussian process Hk defined uniquely almost surely as follows:

  • Hk(t)≥Yk(t), t [set membership] R.
  • (−1)k Hk is 2k-convex; that is, (1)kHk(2k2) exists and is convex.
  • The process Hk satisfies


where Yk is the (k − 1)-th fold integral of a two-sided Brownian motion plus (−1)kk!/(2k)!t2k, t [set membership] R. Jewell (1982) initiated the study of MLE in the family D and succeeded in showing that the MLE Fn of the mixing distribution function is almost surely weakly consistent. Although consistency of the MLE follows now rather easily from the results of Pfanzagl (1988) and van de Geer (1993), little is known about rates of convergence or asymptotic distribution theory for either the estimator of the mixed density or the estimator of the mixing distribution function. As noted in Balabdaoui and Wellner (2007), it may be possible to obtain some insight into the asymptotics of the MLE of a completely monotone density by better understanding the behavior of the MLE of a k-monotone density for arbitrary k>2. Indeed, as the class D is the intersection of all of the Dks, it can be well approximated by Dk with a large k.

Existence of the MLE and LSE of a k-monotone density, their characterization, their structure (splines of degree k − 1 and with simple knots), and consistency of their derivatives up to degree k − 1 are used in Balabdaoui and Wellner (2007). In this paper, we give proofs of those essential properties in sections 2 and 3. In section 4, we establish asymptotic minimax lower bounds for estimation of g0(j)(x0),j=0,,k1 under the assumption that g0(k)(x0) exists and is non-zero.

In the sequel, X1, …, Xn are i.i.d. random variables with density g0 [set membership] Dk, Gn is the corresponding empirical distribution function. We write X(1),…, X(n) for the order statistics of X1,…, Xn, use the notation z+ = z1[z≥0], and write λ for Lebesgue measure on R.

2 Existence and characterizations

2.1 Mixture representation

Lemma 1 characterizing integrable k-monotone functions and giving an inversion formula follows from the results of Williamson (1956).


(Integrable k-monotone characterization) A function g is an integrable k-monotone function if and only if it is of the form


where F is non-decreasing and bounded on (0, ∞). Thus g is a k-monotone density if and only if it is of the form of Equation 1 for some distribution function F on (0, ∞). If F in Equation 1 satisfieslimtF(t)=0g(x)dx, then at a continuity point t > 0, F is given by


where G(t)=0tg(x)dx.


The representation in equation 1 follows from theorem 5 of Lévy (1962) by taking k = n + 1 and f [equivalent] 0 on (−∞, 0]. The inversion formula 2 follows from lemma 1 in Williamson (1956) together with an integration by parts argument.

For k = 1 (k = 2), note that the characterization matches with the well-known fact that a density is non-decreasing (non-decreasing and convex) on (0, ∞) if and only if it is a mixture of uniform densities (triangular densities). More generally, the characterization establishes a one-to-one correspondence between the class of k-monotone densities and the class of scale mixture of beta densities with parameters 1 and k. From the inversion formula in equation 2, one can see that a natural estimator for the mixing distribution F is obtained by plugging in an estimator for the density g and it becomes clear that the rate of convergence of estimators of F will be controlled by the corresponding rate of convergence for estimators of the highest derivative g(k−1) of g. When k increases the densities become smoother, and therefore the inverse problem of estimating the mixing distribution F becomes harder.

2.2 Existence and characterization of the estimators

We now consider the MLE and LSE of a k-monotone density g0. We show that these estimators exist and give characterizations thereof. In the following, λ is the Lebesgue measure, 𝓜k is the class of all k-monotone functions on (0, ∞), and 𝓜k [subset or is implied by] L1(λ) is the class of all integrable k-monotone functions. Note that Dk [subset or is implied by] 𝓜kL1(λ).



be the log-likelihood function (really n−1 times the log-likelihood function). We want to maximize ln(g) over g [set membership] Dk. To do this, we change the optimization problem to one over the whole cone 𝓜kL1(λ). This can be done by introducing the ‘adjusted likelihood function’ ψn(g) defined (as in Silverman, 1982) as follows:


for g [set membership] 𝓜kL1(λ). Note that, using the one-to-one correspondence between the mixed k-monotone function g = gF and its corresponding mixing distribution F, maximizing ψn over the space 𝓜kL1(λ) is equivalent to maximizing


over the space of bounded and non-increasing functions F on (0, ∞).


The maximizer ĝn of ψn over 𝓜k ∩L1(λ) exists and belongs to Dk (and hence is a density). Furthermore, ĝn is of the form


where m [set membership] N \ {0}, and ŵ1, (...), ŵm and â1, (...),âm are respectively the weights and the support points of the maximizing (discrete) mixing distribution Fn.


It follows from lemma 2.2 that the MLE ĝn is a k-monotone spline of degree k − 1 with m simple knots â1, (...),âm (for a definition of splines and multiplicity of the knots, see, e.g., de Boor, 1978, and De Vore and Lorentz, 1993). Note that this is also equivalent to saying that ĝn is a finite mixture of beta’s with parameters 1 and k.


It can be shown that the support points of the mixing distribution Fn fall stricty between the order statistics X(1),‥‥,X(n) with at most one support point in (X(i), X(i+1)). Note also that by definition of the MLE, the last support point has to be strictly larger than X(n).


From Lindsay (1983), we conclude that there exists a unique maximizer of ln and the maximum is achieved by a finite mixture of at most n beta densities with parameters 1 and k. We denote this maximizer by fn.

By arguing as in Groeneboom et al. (2001, p. 1662), let g [set membership] 𝓜kL1(λ) such that0g(x)dx=c. Then g/c [set membership] Dk, and we can write


Hence, ĝn = fn.


Considering maximization over the bigger set 𝓜kL1(λ) is motivated by the fact that this set is a cone. Characterization of the MLE takes then a simpler form than the one we would obtain with Dk.

Lemma 3 gives a necessary and sufficient condition for a function ĝn [set membership] 𝓜kL1(λ) to be the MLE.

For k ≥ 3 it generalizes lemma 2.4 of Groeneboom et al. (2001).


Let X1, (...), Xn be i.i.d. random variables from the true density g0. A k-monotone spline ĝn of degree k − 1 and simple knots â1, (...), âm is the MLE if and only if for all t > 0




See the Appendix.


Note that t is a knot in {â1, (...), âm} if and only if (1)k1g^n(k1)(t)<(1)k1g^n(k1)(t+). Thus, the equality condition in Equation 4 can re-expressed in terms of the left and right (k − 1)-th derivative of ĝn as in Lemma 2.4 of Groeneboom et al. (2001) in the particular case of k = 2.

The MLE ĝn can be computed by means of the support reduction algorithm of Groeneboom, Jongbloed, and Wellner (2008); also see Baladaoui and Wellner (2004) for further details.

Now, we briefly consider the LSE. The LS criterion is:


We want to minimize this over g [set membership] DkL2(λ), the subset of square integrable k-monotone densities. Although existence of a minimizer of Qn over DkL2(λ) is quite easily established, the minimizer has a somewhat complicated characterization owing to the density constraint0g(x)dx=1. Therefore, we will actually consider the alternative optimization problem of minimizing Qn(g) over 𝓜kL2(λ). Here, one might wonder why consider the LSE when the MLE is a ‘natural’ density estimator. It turns out that the random processes involved in the characterization of the LSE for a finite sample size n gives a great insight into the limiting distribution of the estimator and its derivatives up to degree k − 1. Thus, even though the MLE and LSE are asymptotically equivalent, it is easier to understand and establish the asymptotic theory of the MLE through the LSE: compare the characterization of the MLE in lemma 3 with the characterization of the LSE (over 𝓜kL2(λ)) given in lemma A.2. (For more details, see Balabdaoui and Wellner, 2007). We will not develop the characterization of the LSE further here, but postpone this study to the Appendix.

3 Consistency

In this section, we will prove that both the MLE and LSE are strongly consistent. Furthermore, we will show that this consistency is uniform on intervals of the form [c, ∞), where c > 0.

Consistency of the MLEs for the classes Dk in the sense of Hellinger convergence of the mixed density is a relatively simple straightforward consequence of the methods of Pfanzagl (1988) and van de Geer (1993). As usual, the Hellinger distance H is given by H2(p,q)=(1/2)(pq)2dμ for any common dominating measure μ.


Suppose that ĝn is the MLE of g0 in the class Dk. Then,


Furthermore, Fnd F0 almost surely where Fn is the MLE of the mixing distribution function F0


Note that math Dk = {gF :F is a d.f. on (0, ∞)} with


and kx is the the scaled beta(1, k) kernel; that is, kx(t)=t(1tx/k)+k1. For all x > 0, the kernel kx is bounded, continuous, and it is easy to see that it satisfies limt[SE pointing arrow]0kx(t) = limt→∞kx(t) = 0. Hence, the map F [mapsto] gF is continuous with respect to the vague topology for all x > 0. This implies that the class

𝒢={gFgF+gF0,Fis a d.f.on(0,)}

is continuous in F with respect to the vague topology for every x > 0. Now, as the family of sub-distributions F on (0, ∞) is compact for the vague topology (see, e.g., Bauer, 1981), and the class G is uniformly bounded by 1, we conclude by lemma 5.1 of van der Geer (1993) that g is P0-Glivenko–Cantelli. It follows by corollary 1 of van der Vaart and Wellner (2000) that H(ĝn, g0)→a.s. 0. The second assertion of the proposition follows from lemma 5.2 of van de Geer (1993).

Lemma 4 establishes a useful bound for k-monotone densities.


If g is a k-monotone density function for k ≥ 2, then


for all x > 0.


We have


by an easy calculation. [Note that when k = 2, this bound equals 1/(2x) which agrees with the bound given by Jongbloed (1995, p. 117) and Groeneboom et al., (2001, p. 1669) in this case.]


Let c > 0. Then for j = 0, 1, (...), k − 2


and for each x > 0 at which g0 is (k − 1)-times differentiable, g^n(k1)(x)a.s.g0(k1)(x).


Using the first part in the characterization of the MLE, we have


Let Fn denote again the MLE of the mixing distribution. By the Helly–Bray theorem, there exists a subsequence {Fl} that converges weakly to some distribution function F and hence for all x > 0 ĝl (x) → ĝ(x) as l → ∞ where


The previous convergence is uniform on [c, ∞), c > 0. This follows as ĝl and ĝ are monotone and ĝ is continuous.

Using the inequality 6 we can show that the limit ĝ and g0 have to be the same, which implies the consistency result. The proof follows along the lines of Groeneboom et al. (2001, p. 1674–1675; see the Appendix). Consistency of the higher derivatives can be shown recursively using convexity of (1)jg^n(j) for j = 1, …, k − 1 in the same way as in the proof of lemma 3.1 of Groeneboom et al. (2001): for small h > 0, convexity of (1)jg^n(j) allows us to write, for j = 0, …, k −2,


By letting n → ∞, this implies that


By letting h [SE pointing arrow] 0, we conclude consistency ofg^n(j)(x),j=0,,k1 for x [set membership] (0, ∞). Note that consistency of (1)jg^n(j),j=1,,k2 is uniform on intervals of the form [c, ∞) because of continuity of those derivatives. For k − 1, only pointwise strong consistency of (1)k1g^n(k1) can be claimed.

We also have strong and uniform consistency of the LSE gn on intervals of the form [c, ∞), c > 0. The relevant result and proof are deferred to the Appendix.

4 Asymptotic minimax risk lower bounds for the rates of convergence

In this section, our goal is to derive minimax lower bounds for the behavior of any estimator of a k-monotone density g and its first k − 1 derivatives at a point x0 for which the k-th derivative exists and is non-zero. The proof will rely on the basic lemma 4.1 of Groeneboom (1996); see also Jongbloed (2000). This basic method seems to go back to Donoho and Liu (1987, 1991).

As before, let Dk denote the class of k-monotone densities on [0, ∞). Here is the notation we will need. Consider estimation of the j-th derivative of g [set membership] Dk at x0 for j [set membership] {0, 1, …, k − 1}. If Tn is an arbitrary estimator of the real-valued functional T of g, then the (L1-minimax risk based on a sample X1,…,Xn of size n from g which is known to be in a suitable subset Dk,n of Dk is defined by


Here the infimum ranges over all possible measurable functions tn : R nR, and Tn = tn(X1,…,Xn). When the subclasses Dk,n are taken to be shrinking to one fixed g0 [set membership] Dk, the minimax risk is called local at g0. The shrinking classes (parameterized by τ > 0) used here are Hellinger balls centered at g0:


The behavior, for n → ∞ of such a local minimax risk MMR1 will depend on n (rate of convergence to zero) and the density g0 toward which the subclasses shrink. Lemma 5 is the basic tool for proving such a lower bound.


Assume that there exists some subset {gε : ε > 0} of densities in Dk,nsuch that, as ε ↓ 0,


for some c > 0 and r > 0. Then



See Groeneboom (1996) and Jongbloed (2000).

Here is the main result of this section.


Let g0 [set membership] Dk and x0 be a fixed point in (0, ∞) such that g0 is k-times continuously differentiable at x0 (k ≥ 2). An asymptotic lower bound for the local minimax risk of any estimator Tn,j for estimating the functional Tjg0=g0(j)(x0) is given by:


where dk,j > 0, j [set membership] {0, …, k − 1}. Here




Proposition 3 also yields lower bounds for estimation of the corresponding mixing distribution function F at a fixed point.


Let g0 [set membership] Dk and let x0 be a fixed point in (0, ∞) such that g0 is k-times continuously differentiable at x0, k ≥ 2. Then, for estimating T g0 = F0(x0) where F0 is given in terms of g0 by 2,

supτ>0lim infnn1/2k+1MMR1(n,T,𝒟k,n,τ){|g0(k)(x0)|2k1g0(x0)}1/(2k+1)x0kk!dk,k1.


See the Appendix.

Both the rates of convergence n(kj)/(2k + 1) and the dependence of our lower bound on the constants g0(x0) and g0(k)(x0) match with the known results for k = 1 and k = 2 owing to Groeneboom (1985) and Groeneboom et al.. (2001), and reappears in the limit distribution theory for k ≥ 3 in Balabdaoui and Wellner (2007).


The research of the second author was supported in part by NSF grants DMS-0203320 and DMS-0503822, and by NI-AID grant 2R01 AI291968-04. The authors gratefully acknowledge helpful conversations with Carl de Boor, Nira Dyn, Tilmann Gneiting and Piet Groeneboom.



The arguments generalize those in the proof of lemma 2.4 of Groeneboom et al. (2001). If ĝn is the MLE, let gt(x)=k(tx)+k1/tk for some t > 0. For all ε > 0, we have that ĝn + εgt [set membership] 𝓜kL1(λ), and hence


yielding the inequality in Equation 3. If t [set membership] {â1, (...), âm}, then for ε [set membership] R such that |ε| is small enough, ĝn + εgt [set membership] 𝓜 kL1(λ), and hence


yielding the identity in Equation 4.

Suppose now that ĝn is a k-monotone spline of degree k − 1 and with simple knots satisfying the condition in Equation 3, and let g [set membership] 𝓜kL1(λ). By lemma 1, there exists a non-decreasing and bounded function F on (0, ∞) such that


We can write


using the inequality log z ≥ 1 − 1/z, z > 0

=0g(x)g^n(x)d𝔾n(x)+0g(x)dx=00k(tx)+k1tkg^n(x)dF(t)d𝔾n(x)+0dF(t)=0(0k(tx)+k1tkg^n(x)d𝔾n(x)+1)dF(t)0by Equation3.

Hence, ĝn is the MLE.

Establishing the characterization and structure of the LSE

In this optimization problem, existence requires more work because there is no available theory as in the case of the MLE. However, we will show that even though the resulting estimator does not necessarily have total mass one, it does have total mass converging almost surely to one and it consistently estimates g0 [set membership] Dk.

Using arguments similar to those in the proof of theorem 1 in Williamson (1956), one can show that g [set membership] 𝓜k if and only if


for a positive measure μ on (0, ∞). Thus, we can rewrite the criterion Qn in terms of the corresponding measures μ: by Fubini’s theorem


whererk(t,t)=0tΛt(tx)k1(tx)k1dx, and


wheresn,k(t)0(tx)+k1d𝔾n(x). Hence it follows that, with g = gμ


Now, we want to minimize Φn over the set χ of all non-negative measures μ on (0, ∞).


The functional Φn admits a unique minimizer [mu], and hence the LSE gn exists and is unique.


Uniqueness follows from strict convexity of Φn. To prove existence, it can be shown that Φn can be restricted to a subset C of χ on which it is lower semicontinuous, and hence the minimization problem admits a solution by applying theorem 38.B of Zeidler (1985, p. 152). In the following, we will exhibit the subset C, and show that the conditions of Zeidler’s theorem are satisfied.

We begin by checking the hypotheses of Zeidler’s theorem 38.B (Zeidler, p. 152 1985). We identity X of Zeidler’s theorem with the space χ of non-negative measures on [0, ∞), and we show that we can take M of Zeidler’s theorem to be


for some constant D < ∞.

First, we can, without loss, restrict the minimization to the space of non-negative measures on [X(1), ∞), where X(1) > 0 is the first-order statistic of the data. To see this, note that we can decompose any measure μ as μ = μ1 + μ2, where μ1 is concentrated on [0, X(1)) and μ2 is concentrated on [X(1), ∞). As the second term of Qn is zero for μ1, the contribution of the μ1 component to Qn(μ) is always non-negative, so we make inf Qn(μ) no larger by restricting to measures on [X(1), ∞).

We can restrict further to measures μ with 0tk1dμ(t)D for some finite D = Dω. To show this, we first give a lower bound for rk(s, t). For s, tt0 > 0 we have


where v0 ≈ 1.59. To prove Equation A.1, we will use the inequality

(1v/k)k1ev,  0vv0,k2.

[This inequality holds by straightforward computation; see Hall and Wellner (1979), especially their proposition 2.]

Thus, we compute




But, we also have


for s, tt0, so we conclude that Equation A.1 holds.

From the inequality A.1, we conclude that for measures μ concentrated on [X(1), ∞) we have


In contrast,


Combining these two inequalities it follows that for any measure μ concentrated on [X(1), ∞) we have


This lower bound is strictly positive if


But for such measures μ we can make Φ smaller by taking the zero measure. Thus, we may restrict the minimization problem to the collection of measures μ satisfying


Now we decompose any measure μ on [X(1), ∞) as μ = μ1 + μ2 where μ1 is concentrated on [X(1), MX(n)] and μ2 is concentrated on (MX(n), ∞) for some (large) M > 0. Then, it follows that




and hence we can restrict to measures μ with


for every M ≥ 1.

But this implies that μ satisfies


for some 0 < D = Dω < ∞, and this implies that tk−1 is uniformly integrable over μ [set membership] C.

Alternatively, for λ ≥ 1 we have


uniformly in μ [set membership] C.

This implies that for {μm} [subset or is implied by] C satisfying μm [implies] μ0 we have


and hence Φn is lower semicontinuous on C:


As Qn is lower semi-compact (i.e., the sets Cr [equivalent]{μ [set membership] C : Φn(μ) ≤ r} are compact for r [set membership] R), the existence of a minimum follows from Zeidler (1985, theorem 38.B, p. 152).

Lemma A.2 characterizes the LSE.


For k ≥ 1 define Yn and Hn respectively by




for t ≥ 0. Then gn is the LSE over 𝓜k ∩ L2(λ) if and only if the following conditions are satisfied:



The arguments are very similar to those used in the proof of lemma 3.

Now, to prove that the LSE is a spline of degree k − 1 with simple knots, we need the following intermediate result.


Let [a, b]] [subset, dbl equals] (0, ∞) and let g be a non-negative and non-increasing function on [a, b]. For any polynomial Pk − 1 of degreek − 1 on [a, b], if the function


admits infinitely many zeros in [a, b], then there exists t0 [set membership] [a, b] such that g [equivalent] 0 on [t0, b] and g > 0 on [a, t0) if t0 > a.


By applying the mean value theorem k times, it follows that (k − 1)!g = Δ(k) admits infinitely many zeros in [a, b]. But as g is assumed to be non-negative and non-increasing, this implies that if t0 is the smallest zero of g in [a, b], then g [equivalent] 0 on [t0, b]. By definition of t0, g > 0 on [a, t0) if t0 > a.

Now we will use the characterization of the LSE gn together with the previous proposition to show that it is a finite mixture of beta(1, k)s. We know from lemma A.1 that gn is the LSE if and only if Equation A.4 holds. The equality condition in the second part of equation A.4 implies that Hn and Yn have to be equal at any point of increase of the monotone function(1)k1g˜n(k1). Therefore, the set of points of increase of (1)k1g˜n(k1) is included in the set of zeros of the function [Delta with tilde]n = HnYn.

Now, note that Yn can be given by the explicit expression:


In other words, Yn is a spline of degree k − 1 with simple knots at X(1), …, X(n), the order statistics of X1, …, Xn. Also note that the function (1)k1g˜n(k1) cannot have a positive density with respect to Lebesgue measure λ. Indeed, if we assume otherwise, then we can find 0 ≤jn and an interval 𝐼 [subset or is implied by] (X(j), X(j + 1))(with X(0) = 0 and X(n + 1) = ∞) such that 𝐼 has a non-empty interior, and Hn [equivalent] Yn on 𝐼. This implies that H˜n(k)𝕐n(k)0, as Yn is a polynomial of degree k − 1 on 𝐼, and hence gn [equivalent] 0 on 𝐼. But the latter is impossible as it was assumed that (1)k1g˜n(k1) was strictly increasing on I. Thus, the monotone function (1)k1g˜n(k1) can have only two components: discrete and singular. In the following, we will prove that it is actually discrete with finitely many points of jump.


There exists m [set membership] N\{0}, ã1, (...), ãm and w1, (...), wm such that for all x > 0, the LSE gn is given by


Consequently, the equality part in Equation A.4 can be re-expressed as Hn(t) = Yn(t) if t is a point in the support of the minimizing (mixing) measure Fn (or a knot of gn).


We need to consider two cases:

  1. The number of zeros of [Delta with tilde]n = HnYn is finite. This implies by the equality condition in Equation A.4 that the number of points of increase of (1)k1g˜n(k1) is also finite. Therefore, (1)k1g˜n(k1) is discrete with finitely many jumps and hence gn is of the form given in Equation A.5.
  2. Now, suppose that [Delta with tilde]n has infinitely many zeros. Let j be the smallest integer in {0, (...), n − 1} such that [X(j), X(j + 1)] contains infinitely many zeros of [Delta with tilde]n,k (with X(0) = 0 and X(n + 1) = ∞). By proposition A.2, if tj is the smallest zero of gn in [X(j), X(j + 1)], then gn [equivalent] 0 on [tj, X(j + 1)] and gn > 0 on [X(j), tj) if tj > X(j). Note that from the proof of proposition A.1, we know that the minimizing measure [mu]n does not put any mass on (0, X(1)], and hence the integer j has to be strictly greater than 0.

Now, by definition of j, [Delta with tilde]n has finitely many zeros to the left of X(j), which implies that (1)k1g˜n(k1) has finitely many points of increase in (0, X(j)). We also know that gn [equivalent] 0 on [tj, ∞). Thus we only need to show that the number of points of increase of (1)k1g˜n(k1) in [X(j), tj) is finite, when tj > X(j). This can be argued as follows. Consider zj to be the smallest zero of [Delta with tilde]n in [X(j), X(j + 1)). If zjtj, then we cannot possibly have any point of increase of (1)k1g˜n(k1) in [X(j), tj) because it would imply that we have a zero of [Delta with tilde]n that is strictly smaller than zj. If zj < tj, then for the same reason, (1)k1g˜n(k1) has no point of increase in [X(j), zj). Finally, (1)k1g˜n(k1) cannot have infinitely many points of increase in [zj, tj) because that would imply that [Delta with tilde]n has infinitely many zeros in (zj, tj), and hence by lemma A.1, we can find tj(zj,tj) such that gn [equivalent] 0 on [tj,tj]. But, this is impossible as gn > 0 on [X(j), tj).


ĝ = g0 (proof of Proposition 2). For 0 < α < 1 defineηα=G01(1α). Let ε > 0 be small so that ε < ηε.

By Equation 6, there exists a number Dε > 0 such that ĝlε) ≥ Dε for sufficiently large l. To see this, note that Equation 6, implies that


and hence


by the choice of ηε, and the claim follows by taking Dε=ηεg0(x)dG0(x)/2. Hence, by the bound in lemma 4, we have


It follows that g0/ĝl is uniformly bounded on the interval [ε,ηε]; that is, there exist two constants ε and cε such that for all x [set membership] [ε, ηε]


In fact,






uniformly on [ε, ηε]. Using Equation 6, we have for sufficiently large l and


But as Gl converges weakly to G0 the distribution function of g0 and g0/ĝ is continuous and bounded on [ε, ηε]; we conclude that


Now, by Lebesgue’s monotone convergence theorem, we conclude that


which is equivalent to


Define τ=0g^(x)dx. Then, ĥ = τ−1ĝ is a k-monotone density. By Equation A.6, we have that


Now, consider the function


defined on the class Cd of all continuous densities g on [0, ∞). Minimizing K is equivalent to minimizing


It is easy to see that the integrand is minimized pointwise by taking g(x) = g0(x).Hence, inf Cd K(g) ≥ 1. In particular, K(ĥ) ≥ 1 which implies that τ = 1.

Now, if gg0 at a point x, it follows that gg0 on an interval of positive length. Hence, g0g [implies] K(g) > 1. We conclude that we have necessarily ĥ = ĝ = g0.


Fix c > 0 and suppose that the true k-monotone density g0 satisfies0x1/2g0(x)dx<. Then, ‖gng02a.s. 0,

supx[c,)|g˜n(j)(x)g0(j)(x)|a.s.0, asn,

for j = 0, 1, (...), k − 2, and, for each x > 0 at which g0 is (k − 1)-times differentiable, g˜n(k1)(x)a.s.g0(k1)(x). Here, ‖ · ‖2 denotes the L2-norm.


The main difficulty here is that the LSE gn is not necessarily a density in that it may integrate to more than one; indeed it can be shown that 0g˜1(x)dx=((2k2)/k)(11/(2k1))k2>1 for k ≥ 3. However, once we show that gn stays bounded in L2 with high probability, the proof of consistency will be much like the one used for k = 2; that is, consistency of the LSE of a convex and decreasing density (see Groeneboom et al., 2001). The proof for k = 2 is based on the very important fact that the LSE is a density, which helps in showing that gn at the last jump point τn [set membership] [0, δ] of g˜n for a fixed δ > 0 is uniformly bounded. The proof would have been similar if we only knew that 0g˜n(x)dx=Op(1).

Here we will first show that 0g˜n2dλ=Op(1). Note that the equality part in Equation (A.4) can be re-written as 0g˜n2(x)dx=0g˜n(x)d𝔾n(x) and hence


where ũn [equivalent] gn/‖gn2 satisfies ‖ũn2 = 1. Take Fk to be the class of functions


In the following, we show that Fk has an envelope G [set membership] L1(G0). Note that for g [set membership] Fk we have


as g is decreasing. Therefore, g(x) ≤ 1/√x [equivalent] G(x) for all x > 0 and g [set membership] Fk; that is G is an envelope for the class Fk. As G [set membership] L1(G0) (by our hypothesis) it follows from the strong law that

0u˜n(x)d𝔾n(x)0G(x)d𝔾n(x)a.s.0G(x)dG0(x), asn

and hence by Equation A.7 the integral 0g˜n2dλ is bounded (almost surely) by some constant Mk.

Now we are ready to complete the proof. Let δ > 0 and τn be the last jump point of g˜n(k1) if there are jump points in the interval (0, δ]; otherwise, we take τn to be 0. To show that the sequence (gnn))n stays bounded, we consider two cases:

  1. τn ≥ δ/2. Let n be large enough so that0g˜n2dλMk. We have
  2. τn < δ/2. We have
    Using that gn is a polynomial of degree k − 1 on the interval [τn, δ], we have
    and henceg˜n(τn)2kMk/δ. By combining the bounds, we have for largen,g˜n(τn)2kMk/δ=Ck. Now, as gn(δ) ≤ gnn), the sequence gn(x) is uniformly bounded almost surely for all x ≥ δ. Using a Cantor diagonalization argument, we can find a subsequence {nl} so that, for each x ≥ δ, gnl(x) → g(x), as l → ∞. By Fatou’s lemma, we have
    However, the characterization of gn implies that Qn(gn) ≤ Qn(g0), and this yields
    Thus, we can write
    as l → ∞. The last convergence is justified as follows: as 0g˜nl2dλ is bounded almost surely, we can find a constant C > 0 such that gnlg0 admitsG(x)=Clx,x>0, as an envelope. Since G [set membership] L1(G0) by hypothesis and as the class of functions {(gg0)1[GM] : g [set membership] 𝓜kL2(λ)} is a Glivenko-Cantelli class for every M > 0 (each element is a difference of two bounded monotone functions) Equation A.10 holds. From Equation A.9, we conclude that δ(g˜(x)g0(x))2dx0, and therefore, g [equivalent] g0 on (0, ∞) as δ > 0 can be chosen arbitrarily small. We have proved that there exists Ω0 with P0) = 1 and such that for each ω [set membership] Ω0 and any given subsequence gnk (·, ω), we can extract a further subsequence gnl (·, ω) that converges to g0 on (0, ∞). It follows that gn converges to g0 on (0, ∞), and this convergence is uniform on intervals of the form [c, ∞), c > 0 by the monotonicity and continuity of g0. As for the MLE, consistency of the higher derivatives can be shown recursively using the convexity of (1)jg˜n(j) for j =1, …, k − 2.


Let μ be a positive number and consider the function gμ defined by


Now, consider the perturbation


where s(μ) is a scale to be determined later. If μ is chosen small enough so that the true density g0 is k-times continuously differentiable on [x0μ, x0 + μ], the perturbed function gμ is also k-times differentiable on [x0μ, x0 + μ] with a continuous k-th derivative. Now, let r be the function defined on (0, ∞) by


Then, we can write gμ as gμ(x) = μ2k + 3r((xx0)/μ). Then, for 0 ≤jk


The scale s(μ) should be chosen so that (1)jgμ(j)(x)>0 for all 0 ≤jk, for x [set membership] [x0μ, x0 + μ]. But for μ small enough, the sign of (1)jgμ(j) will be that of (1)jg0(j)(x0), and hence gμ is k-monotone. For j = k, gμ(k)(x0)=g0(k)(x0)+s(μ)μk+3r(k)(0). Assume that r(k)(0) ≠ 0. Set s(μ)=g0(k)(x0)/(μk+3r(k)(0)). Then, for 0 ≤jk − 1


as μ → 0, and


To compute r(j)(0), note that for m ≥ 2 and 2nm we have


where in the last equality we used Leibniz’s formula for the derivatives of a product; see, for example, Apostol (1957, p. 99). Evaluating the last expression at x = 0 yields


If m is even, we find that


as xnm/2,0 = 1. Similarly, when m is odd,


as xn − (m−1)/2,1 = 0. Now we have, for 1 ≤jk,


and hence


Therefore, when j is even the second term vanishes and


When j is odd, the first term vanishes and


Summarizing, we have shown that


We set Ck,j = r(j)(0) for 1 ≤jk. Then, Ck,k becomes

Ck,k={(2)k/2i=0k/21(k+1i)×i=0k/21(k2i1)ifkis even(2)(k1)/2i=0(k1)/21(k+1i)×i=0(k1)/2(k2i)ifkis odd.

The previous expressions can be given in a more compact form. After some algebra, we find that


We have for 0 ≤jk − 1,


where we defined λk,1(j)=|Ck,j/Ck,k| for j [set membership] {0, …, k − 1}. Furthermore, by computation and change of variables,


as μ [SE pointing arrow] 0. This gives control of the Hellinger distance as well in view of Jongbloed (2000, lemma 2, p. 282), or Jongbloed (1995, corollary 3.2, pp. 30–31). We set


Now, by using the change of variable ε = μ2k + 1(bk + o(1)), where


so that μ = (ε/bk)1/(2k + 1)(1 + o(1)), then for 0 ≤jk − 1, the modulus of continuity, mj, of the functional Tj satisfies


The result is that mj (ε) ≥ (rk,jε)(kj)/2k + 1/(1+o(1)),where rk,j=(λk,1(j)g0(k)(x0))(2k+1)/(kj)/bk and hence


which can be rewritten as


for j = 0, (...), k − 1. Finally, note that the fact that the function gμ is not exactly a density will not affect the obtained constants its integral converges to 1 as μ → 0.


Let Gμ(x)=0xgμ(t)dt. Using the inversion formula in Equation 2, we have


For j = 1, …, k, we have already established before that|Tj1(gμ)Tj1(g0)|=λk,1(j)|g0(k)(x0)|μkj. In constrast, we have for μ > 0 small enough




Using again the change of variable ε = μ2k + 1 (bk + o(1)), we obtain the claimed lower bound in the same way as in proprosition 3.

Contributor Information

Fadoua Balabdaoui, CEREMADE, Université Paris-Dauphine, Place du Maréchal de Lattre de Tassigny, 75775, Paris, CEDEX 16, France.

Jon A. Wellner, Department of Statistics, University of Washington, Box 354322, Seattle, WA 98195-4322, USA.


  • Anevski D. Estimating the derivative of a convex density. University of Lund: Department of Mathematical Statistics; 1994. Technical Report 1994:8.
  • Anevski D. Estimating the derivative of a convex density. Statistica Neerlandica. 2003;57:245–257.
  • Apostol TM. Mathematical analysis: a modern approach to advanced calculus. Reading, MA: Addision-Wesley Publishing Company, Inc.; 1957.
  • Ayer M, Brunk HD, Ewing GM, Reid WT, Silverman E. An empirical distribution function for sampling with incomplete information. The Annals of Mathematcial Statistics. 1955;26:641–647.
  • Balabdaoui F. Consistent estimation of a convex density at the origin. Mathematical Methods of Statistics. 2007;16:77–95.
  • Balabdaoui F, Wellner JA. Estimation of a K-monotone density, part 2: algorithm for computation and numerical results. University of Washington: Department of Statistics; 2004. Technical Report 460.
  • Balabdaoui F, Wellner JA. Estimation of a K-monotone density: Limit distribution theory and the spline connection. The Annals of Statistics. 2007;35:2536–2564.
  • Brunk HD. Probability theory and elements of measure theory. London: Academic Press Inc. [Harcourt Brace Jovanovich Publisher]; 1981. Second edition of the translation by R. B. Burckel from the third German edition, Probability and Mathematical Statistics.
  • Brunk HD. On the estimation of parameters restricted by inequalities. Annals of Mathematical Statistics. 1958;29:437–454.
  • De Boor C. A practical guide to splines, vol. 27 of Applied Mathematical Science. New York: Springer-Verlag; 1978.
  • DeVore RA, Lorentz GG. Constructive approximation, vol. 303 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences] Berlin: Springer-Verlag; 1993.
  • Donoho DL, Liu RC. Geometrizing rates of convergence. i. University of Califonia, Berkeley: Department of Statistics; 1987. Technical Report. 137.
  • Donoho DL, Liu RC. Geometrizing rates of convergence. II III. The Annals of Statistics. 1991;19:633–667. 668–701.
  • van Eeden C. Maximum likelihood estimation of partially or completely ordered parameters. I, Koninklijke Nederlandse Akademie van Wetenschappen Proceedings. Series A Mathematical Sciences. 1957a;19:128–136.
  • van Eeden C. Maximum likelihood estimation of partially or completely ordered parameters. II, Koninklijke Nederlandse Akademie van Wetenschappen Proceedings. Series A Mathematical Sciences. 1957b;19:201–211.
  • Feller W. An introduction to probability theory and its applications. 2nd edition. vol. II. New York: John Wiley & Sons Inc.; 1971.
  • van de Geer S. Hellinger-consistency of certain nonparametric maximum likelihood estimators. The Annals of Statistics. 1993;21:14–44.
  • Gneiting T. On the Bernstein–Hausdorff–Widder conditions for completely monotone function. Expositionces Mathaticae. 1998;16:88–119.
  • Gneiting T. Radial positive definite functions generated by Euclid’s hat. Journal of Multivariate Analysis. 1999;69:88–119.
  • Grenander U. On the theory of mortality measurement. I Skandinavisk Aktuarietidskrift. 1956a;39:70–96.
  • Grenander U. On the theory of mortality measurement. II, Skandinavisk Aktuarietidskrift. 1956b;39:125–153.
  • Groeneboom P. Estimating a monotone density. Proceedings of the Berkeley conference in honor of Jerzy Neyman and Jack Kiefer Vol. II; Berkeley, CA. 1983. Wadsworth, Belmont, CA: Wadsworth Statist./Probab. Ser.; 1985.
  • Groeneboom P. Brownian motion with a parabolic drift and Airy functions. Probability Theory and Related Fields. 1989;81:79–109.
  • Groeneboom P. Lectures on probability theory and statistics (Saint-Flour 1994) Berlin: Springer; 1996. Lectures on inverse problems; pp. 67–164. vol. 1648 of Lecture Notes in Mathematics.
  • Groeneboom P, Jongbloed G, Wellner JA. Estimation of a convex function: characterizations and asymptotic theory. The Annals of Statistics. 2001;29:1653–1698.
  • Groeneboom P, Jongbloed G, Wellner JA. The support reduction algorithm for computing non-parametric function estimates in mixture models. Scandinavian Journal of Statistics. 2008;35:385–399. [PMC free article] [PubMed]
  • Hall WJ, Wellner JA. The rate of convergence in law of the maximum of an exponential sample. Statistica Neerlandica. 1979;33:151–154.
  • Hampel FR. Design modelling and anlysis of some biological datasets. In: Mallows C, editor. design, data and analysis, by some friends of Cuthbert Daniel. New york: Wiley; 1987. pp. 111–115.
  • Jewell NP. Mixtures of exponential distributions. The Annals of Statistics. 1982;10:479–484.
  • Jongbloed G. Ph.D. Thesis. Delft University of Technology; 1995. Three statistical inverse problems.
  • Jongbloed G. Minimax lower bounds and moduli of continuity. Statistics & Probability Letters. 2000;50:279–284.
  • Kim J, Pollard D. Cube root asymptotics. The Annals of Statistics. 1990;18:191–219.
  • Lévy P. Extensions d’un théorème de D. Dugué et M, Girault. Probability Theory and Related Fields. 1962;1:159–173.
  • Lindsay BG. The geometry of mixture likelihoods: a general theory. The Annals of Statistics. 1983;11:86–94.
  • Pfanzagl J. Consistency of maximum likelihood estimators for certain nonparametric families in particular: mixtures. Journal of Statistical Planning and Inference. 1988;19:137–158.
  • Prakasa Rao BLS. Estimation of a unimodal density. Sankhyā Series A. 1969;31:23–36.
  • Silverman BW. On the estimation of a probability density function by the maximum penalized likelihood method. The Annals of Statistics. 1982;10:795–810.
  • van der vaart A, Wellner JA. Hign dimensional probability, II (Seattle, WA, 1999), vol. 47 of Progr. Probab. Boston, MA: BirKhäuser Boston; 2000. Preservation theorems for Glivenko-Cantelli and unifrom Gilvenko-Cantelli classes; pp. 115–133.
  • Vardi Y. Multiplicative censoring, renewal process, deconvolution and decreasing density: nonparametric estimation. Biometrika. 1989;76:751–761.
  • Williamson RE. Multiply monotone functions and their Laplace transforms. Duke Mathematical Journal. 1956;23:189–207.
  • Woodroofe M, Sun J. A penalized maximum likelihood estimate of f(0+) when f is nonincreasing. Statistics Sinica. 1993;3:501–515.
  • Zeidler E. Nonlinear functional analysis and its applications. III. New york: Springer-Verlag; 1985. Variational methods and optimization Translated from German by Leo F. Boron.