HHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal
Sankhya Ser B. Author manuscript; available in PMC 2010 August 1.
Published in final edited form as:
Sankhya Ser B. 2009 August 1; 71-A(2): 331–353.
PMCID: PMC2903063
NIHMSID: NIHMS195719

What’s So Special About Semiparametric Methods?

Abstract

The number of scientific publications on semiparametric methods per year has been steadily increasing since the early 1980s. This increased interest has happened in spite of the fact that the novelty of semiparametrics for its own sake has run its course, and semiparametric methods are by now considered classical. The underlying reasons for this continued interest include the genuine scientific utility of semiparametric models combined with the breadth and depth of the many theoretical questions that remain to be answered. Empirical process techniques are an essential research tool for many of these questions. Moreover, both semiparametric methods and empirical processes are playing an increasingly valuable role in high dimensional data analysis and in other emerging areas in statistics. The topics are very fruitful and intriguing for new researchers to engage in. Graduate programs in statistics, biostatistics and econometrics can and should include more empirical processes and semiparametrics in their teaching in order to ensure a sufficient supply of suitably qualified researchers.

Keywords: Biostatistics, Data mining, Econometrics, Empirical processes, High dimensional data, Semiparametric models, Statistics education

1 Introduction

Semiparametrics has by now become a well established research area in statistics, biostatistics and econometrics. The success and impact of semiparametric methods are due to both their excellent scientific utility and their intriguing theoretical complexity. An excellent history and assessment of semiparametrics up through about 2005 is given in Wellner, Klaassen and Ritov (2006), hereafter referred to as WKR, who focus on theoretical developments since the landmark book, Bickel, Klaassen, Ritov and Wellner (1993) (hereafter referred to as BKRW). We encourage the interested reader to review WKR as background for the present article. The purpose of the present article is not to duplicate WKR but to attempt to expand the scope of the review into new domains, including scientific philosophy and graduate education, as well as to touch on a few additional theoretical aspects not discussed previously.

A very interesting graphical analysis of the appearance of “semiparametric” in the title, keywords and/or abstract of scientific publications is given in Figure 1 of WKR. In their figure, the rise in number of such papers appears to accelerate between 1984 and 1993 and to increase approximately linearly up through 2004. A variation of this analysis for the years 1983 up through 2008 is given in Figure 1. In this figure, we only present the numbers of scientific papers with “semiparametric” in the title. The same three sources used in WKR are used here: ISI Web of Science (ISI), MathSciNet (MSN) and the Current Index of Statistics (CIS). The MSN source lags about a year behind ISI, while CIS lags about three or four years behind ISI. Lowess lines, computed using R (http://www.r-project.org/), have also been added. For both the ISI and MSN sources, a convex curve is apparent, indicating a sustained acceleration in the number of such papers. The lack of comparable convexity in the CIS curve is easily explained by the previously mentioned lag in reporting of publications in CIS. Thus, it appears that research interest in semiparametric methods has not yet reached its zenith and will likely continue to increase in importance for the foreseeable future.

Figure 1
Number of publications with “semiparametric” in the title per calendar year from 1983–2008 according to ISI Web of Science (ISI), MathSciNet (MSN) and Current Index of Statistics (CIS). Lowess smoothed lines are also included: ...

It is also important to study the frequency of semiparametric publications relative to the total number of statistical or econometric publications, since that total has also increased dramatically during the same time period. A plot of the percent of publications with “semiparametric” in the title relative to the total number of publications with the string “statistic” or “econometric” anywhere in the title, key words or journal title is provided in Figure 2 for the same calendar years according to the MSN source. We only use MSN since the MSN search routine allows specifying strings that can occur anywhere in the title, key words or journal title. While this measure is not perfect, it seems to be a reasonable surrogate for the true total number of statistical or econometric publications and should capture the correct order of magnitude of such publications. In contrast, the number of semiparametric publications is grossly underestimated because only titles of papers are considered, and many papers in semiparametrics do not include “semiparametric” in the title. Nevertheless, the percentage values displayed in the figure should approximately reflect the relative changes over time in the proportion of semiparametric publications. It is interesting to note that even after this adjustment, the proportion of semiparametric publications steadily increased throughout the time period, although the slope appears to decrease somewhat in 1996. Note that the low outlier in 2008 is simply attributable to the one year lag in MSN mentioned previously and does not change the overall conclusion that both the net and relative amounts of research activity in semiparametric models are continuing to increase over time.

Figure 2
Percent of publications with “semiparametric” in the title relative to all publications with the string “statistic” or “econometric” anywhere in the title, key words or journal title per calendar year from ...

This continued interest in semiparametrics has been sustained in spite of (or because of) the fact that semiparametric methods have been an active research area for well over two decades. It is safe to say that this area is by now a classical branch of statistics and should be included as part of standard graduate education in statistics, biostatistics and econometrics. It is also safe to say that the mere presence of semiparametrics in a paper no longer merits publication on its own without either compelling scientific impact or genuine theoretical innovation. The standards for semiparametric research are now higher than they used to be.

Much of the theoretical work in semiparametrics over the years has relied to some degree on empirical process techniques, but, for the new generation of semiparametric research, these techniques have become paramount. Moreover, both semiparametric methods and empirical processes are becoming valuable tools in high dimensional data analysis and in other emerging areas in statistics and data mining. The topic is very fruitful and intriguing for new researchers to engage in. Hence a compelling case can be made for graduate programs in statistics, biostatistics and econometrics to not only discuss semiparametric models and applications of semiparametrics in their courses but also to include rigorous theoretical training in both empirical processes and semiparametric inference to ensure a sufficient supply of qualified researchers and practitioners.

For the remainder of the paper, we will discuss in more detail the key issues in semiparametric research and practice raised above: scientific utility, theoretical issues, empirical processes, high dimensional data, availability of software, and graduate education. The paper will end with a brief summary discussion.

2 Scientific Utility of Semiparametric Models

One of the most provocative yet useful sources of tension in statistical science is the tension between overfitting and underfitting data. This issue is closely related to the tension between data mining and statistical inference. One facet of this tension is the fact that data mining, the art of discovering and understanding structure in data, is both a philosophically and practically distinct endeavor from statistical inference, the art of assessing reproducibility and generalizability of conclusions from data (see Tukey, 1962, and Tukey and Mosteller, 1968). Another facet of this tension is the fact that the ability to discover and the ability to generalize are both essential phases of scientific research. The more flexible the model, the more likely it will fit the data; and the more statistical tests we conduct when looking for structure in data, the more likely we will find something interesting. Of course, over-flexible models tend to fit new data poorly and too many tests easily lead to false conclusions. Nevertheless, if we are too rigid in our modeling and data analyses, we may miss important structure.

Semiparametric models appear to moderate this tension by providing a low-dimensional parametric component which can be easy to interpret scientifically and a high-dimensional nuisance parameter which enables flexibility in the model fitting. A surprising result is that often the parametric part of these models can be estimated with a root-n precision close to that obtained from a fully parametric model, even if the nuisance parameter has a much slower convergence rate. An important and classic example of this is the Cox model for current status data (Huang, 1996), in which the regression parameter β ∈ R^k is root-n consistent while the baseline hazard is cube-root-n consistent. What is perhaps surprising is how little is actually lost in terms of the strength of the inference when so much is gained in model flexibility.

Thus one would expect that semiparametric models should allow us to ask better scientific questions of data with greater reproducibility. This means that important future breakthroughs in semiparametrics should revolve around the nature and scope of the scientific questions we can probe with semiparametric models. An important class of such questions consists of those motivated by specific applications where the scientific goals dictate the needed structure and flexibility in the model. These questions require familiarity with the relevant scientific context, and the number of such questions is large and will continue to increase as long as the scope of science continues to increase. We will not attempt to enumerate such questions here but simply point out the importance of this facet of semiparametric modeling. Another important class of questions is concerned with general properties of semiparametric models, such as ascertaining the degree to which increased flexibility in models can lead to increased ability to detect important relationships with fewer overall assumptions. One specific example of such a question that is both very important and very difficult involves determining the consequences of model misspecification. We will spend the remainder of this section examining this question in more detail. We reiterate, however, that there are many other important questions associated with the scientific aspects of semiparametric models, which we will not pursue further here.

A somewhat classic example of a semiparametric model with resilience to certain kinds of misspecification is the log-rank statistic for right-censored survival data. Specifically, assume that we observe n i.i.d. triples (X1, δ1, Z1), …, (Xn, δn, Zn), where the (Xi, δi) terms are the usual right censored failure time measurements and the Zi ∈ R^d terms are observed covariates. Assume that the model we are interested in specifies that the hazard function given the covariate Z = z has the form:

$$\lambda(t \mid Z = z) = \exp[\beta' z \, w(t)] \, \lambda_0(t),$$

where β ∈ R^d are the regression parameters, w is nonnegative with positivity for some values of t and may be otherwise known or unknown, and λ0 is the baseline hazard function. When w(t) is known to be a constant, this reduces to the classic proportional hazards function.

Suppose we are interested in testing H0 : β = 0. The log-rank test in this setting is actually a chi-square test based on the estimated semiparametric efficient score

$$\sum_{i=1}^{n} \int_0^{\infty} \left[ Z_i - \frac{\sum_{j=1}^{n} Y_j(t) Z_j}{\sum_{j=1}^{n} Y_j(t)} \right] dN_i(t),$$

where (N1, Y1), …, (Nn, Yn) are the usual counting and at-risk processes for observations i = 1, …, n (see Chapter 1 and Section 4.2 of Kosorok, 2008a, for this particular statistic and Fleming and Harrington, 1991, for an overview of counting process notation). Hereafter, Kosorok (2008a) will be denoted by K. Because this statistic is equivalent to the semiparametric efficient score for β, the test based on this statistic, in the case where d = 1, has optimal local asymptotic power for the model with constant w according to Theorem 25.44 of van der Vaart (1998). It is not hard to show that this statistical test is still valid and has good local asymptotic power for general w. Thus, not only is this statistic optimal under the proportional hazards semiparametric model, it is also valid and powerful under many kinds of departures from the specified model.
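The score statistic above can be computed directly from the data when d = 1. The following is a minimal sketch rather than a reference implementation: it assumes distinct observation times (no ties), uses the information of the Cox partial likelihood at β = 0 as the variance estimate, and runs on hypothetical toy data that are not taken from any study.

```python
def logrank_score(times, events, z):
    """Score statistic U, variance estimate V and chi-square U^2 / V for H0: beta = 0.

    times  -- observed times X_i (assumed distinct, i.e. no ties)
    events -- 1 if the failure was observed, 0 if right censored
    z      -- scalar covariate Z_i (the d = 1 case)
    """
    u = 0.0
    v = 0.0
    for x_i, d_i, z_i in zip(times, events, z):
        if not d_i:
            continue  # dN_i(t) only jumps at observed failures
        # Risk set at time X_i: subjects with Y_j(X_i) = 1{X_j >= X_i} = 1.
        risk = [z_j for x_j, z_j in zip(times, z) if x_j >= x_i]
        z_bar = sum(risk) / len(risk)
        u += z_i - z_bar  # contribution Z_i minus the risk-set average of Z
        v += sum(z_j * z_j for z_j in risk) / len(risk) - z_bar ** 2
    chi2 = u * u / v if v > 0 else float("nan")
    return u, v, chi2


# Hypothetical toy data: a binary group indicator plays the role of Z.
times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 0, 1, 1, 0]
group = [1, 1, 1, 0, 0, 0]
u, v, chi2 = logrank_score(times, events, group)
print(u, v, chi2)
```

Under H0 the chi-square value would be compared with a chi-square distribution with one degree of freedom. With tied failure times a tie-corrected variance would be needed, which this sketch does not attempt.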

This result is a hypothesis testing version of the notion of a “locally efficient estimator”. A semiparametric estimator is locally efficient if it produces root-n consistent estimators for a large class of models while also producing efficient estimators for a submodel of that class (see Section 1.2.6 of van der Laan and Robins, 2003). These are examples of semiparametric procedures that maintain good properties under certain kinds of model misspecification. Another class of semiparametric procedures that perform well under certain model misspecifications are doubly robust methods (see Section 1.6 of van der Laan and Robins, 2003). For both procedures, the estimators are not biased and the hypothesis tests have the correct size in spite of some degree of potential model misspecification.

A perhaps more challenging kind of model misspecification concerns misspecifications that lead to biased or misleading estimators of the intended parameters of interest. This can happen in both parametric and semiparametric settings, but the issue is not well studied in the semiparametric setting. One such issue is the presence of omitted variable bias that can happen when important confounders have been omitted from the model. See Clarke (2005, 2009) for a clear and provocative review of this issue for the parametric setting, with applications in econometrics and political science. One way to address this is to focus more on studying semiparametric models with confounders, although just adding more confounders to a regression model can lead to increased bias in the estimator of the regression parameter of interest (Clarke, 2005, 2009). Of course biased estimation can arise from other kinds of model misspecification besides omitted variables. Moreover, auxiliary covariates that are not in the model can sometimes be utilized to actually increase precision of regression parameter estimators (see, e.g., Chapter 5 of Tsiatis, 2006). In general, the question of interpretability of estimators is very distinct from the quest to determine the distribution of estimators and appropriate methods of inference under misspecification as studied for parametric models in Kent (1982) and White (1982); for the Cox model in Lin and Wei (1989), Sasieni (1993), and Fine (2002); and for the more general proportional hazards frailty regression models in Kosorok, et al. (2004). A related issue is inference for semiparametric models that is not likelihood based, including certain kinds of least-squares and deliberately misspecified likelihood inference (see Ma and Kosorok, 2005a). 
This distribution and inference issue is of significant independent research interest and is very much worth pursuing, but it is not quite as philosophically deep as the question about the scientific interpretation of regression coefficients estimated under model misspecification.

Given that the estimated regression coefficient under the misspecified model is consistent for some vector β*, the question of interpretability is essentially equivalent to the question of whether β* is in the same direction (or, equivalently, lies in the same quadrant) as the coefficient in the unknown true regression model. Li and Duan (1989) showed that estimation of the regression parameter based on the Cox model under no censoring can be consistent for the correct direction under certain kinds of model misspecifications which have correctly specified regression covariates. Kosorok, et al. (2004) extended this result to allow for censoring and model misspecification that includes the frailty regression family of nonproportional hazards models. This study of the consequences of model misspecification for semiparametric models is in its infancy and is a potentially fruitful and wide open area for future research endeavors.

3 Theoretical Issues in Semiparametrics

A very comprehensive review up through 2005 of theoretical developments in semiparametrics is given in WKR, and we will not reproduce those developments here. Nevertheless we will utilize some of the broad categories discussed in WKR and attempt to highlight a few key developments since 2005. We will not attempt to be comprehensive in our list of highlights, but simply present those components that seem most novel and with which the author is most familiar. We will first discuss briefly some higher order inference results based on the profile likelihood for semiparametric models. We will then discuss a few recent developments in Bayesian methods, followed by an update on transformation, frailty and change-point models. A number of other recent developments will be discussed under the Empirical Processes and High Dimensional Data sections later on.

There are also a number of important developments we will not be discussing because of space constraints and the limits of this author’s expertise. These include developments in survival analysis outside of transformation, frailty and change-point models, as well as the significant recent developments in semiparametric model selection. One example of an interesting recent result with both of these ingredients is Cai, et al. (2005), who develop model selection and inference for semiparametric regression for multivariate failure times when the number of covariates is allowed to increase with the sample size n. Another example is the model selection results in Cheng and Zhang (2009) for semiparametric partly linear splines. Elegant theory is developed to show that their proposed model selection procedures have the oracle property, that the resulting parametric estimators are efficient, and that the nonparametric estimators achieve the optimal rate. The optimal rate is also verified for the setting where the dimension of the covariates increases with the sample size. One key assumption that they make is that the degree of smoothness of the Sobolev space for the nonparametric terms is known. A very challenging open problem is how to develop fully data adaptive procedures for partly linear models when the degree of smoothness is not known in advance. Some interesting progress in this direction for the Cox model under current status data when the baseline hazard is smooth but with an unknown degree of smoothness can be found in Ma and Kosorok (2006). A very interesting open question is whether and how fully adaptive nonparametric methods such as those described in Brown and Low (1996) and elsewhere can be extended to semiparametric model settings.

Another interesting issue we will not fully explore here concerns new developments in the use of asymptotically pivotal statistics for semiparametric inference. The basic idea has been around for a while and is not new per se. Examples include empirical likelihood ratio tests (see, e.g., Owen, 1988; and Qin and Lawless, 1994) and profile likelihood ratio tests (see, e.g., Murphy and van der Vaart, 2000). In these settings, the asymptotic limiting distribution is generally chi-square with known degrees of freedom. More recently, in a series of papers, Banerjee and Wellner (2001, 2005) demonstrate that inference for the integrated hazard function estimated from current status data can be based on a likelihood ratio statistic with a novel, asymptotically pivotal limiting distribution which is a parameter-free, albeit complicated, functional of Brownian motion. The method is also applicable to other kinds of constrained nonparametric estimation, including monotone response function estimation. This work is extended in Banerjee, et al. (2006, 2009) to semiparametric models with both parametric and nonparametric components, although the nonparametric component has the somewhat restrictive requirement that it must be monotone.

3.1 Higher Order Inference

There are two very interesting developments in increasing precision for estimation and inference for the parametric component θ in a semiparametric model. The first result is the development of second order efficient estimators. The basic idea and an interesting application of the concept are introduced in Dalalyan, et al. (2006). The essence of the idea is that a local minimax bound is precisely achieved for certain second order efficient estimators and that this criterion distinguishes among a number of first order efficient estimators. The theory is successfully applied to the shift parameter in a semiparametric Gaussian white noise model through a carefully tuned penalized maximum likelihood estimation process. The technique appears to be promising for other semiparametric models as well, although there remains much work to be done before achieving this. The general research area of second order semiparametric efficiency has many interesting open questions.

The second result involves increasing the accuracy of inference. The profile sampler results presented in Lee, et al. (2005) are extended in Cheng and Kosorok (2008a, 2008b, 2009) to yield higher order accurate confidence intervals for the parametric component θ in a semiparametric model. The basic idea is to treat the profile semiparametric likelihood as though it were a parametric likelihood, add a prior, and then use the 1 − α credible sets from the resulting “posterior” to conduct frequentist inference for θ. The main result is that this approach is second order accurate in the presence of a nuisance parameter and also in the presence of regularization. At the time of publication of this work, parallel results were not yet available for the bootstrap, and thus it seemed that a Bayesian approach enjoyed higher order frequentist properties than any known frequentist alternatives. Results for the bootstrap, however, have since then started to move forward, and it appears that the bootstrap may also enjoy second order accuracy of inference. Nevertheless, there remain yet a number of open questions about second order accuracy of inference, including, for example, how to adaptively select the smoothing parameter in penalized semiparametric maximum likelihood in order to lead to optimal accuracy of inference.
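The mechanics of the profile sampler idea (treat the profile likelihood as a likelihood, add a prior, and read off credible sets) can be illustrated on a toy problem. The sketch below is only an illustration under strong simplifying assumptions: the "nuisance parameter" is the variance of a normal sample, which is finite-dimensional, the prior on θ is flat, the sampler is a plain random-walk Metropolis chain, and the data are simulated. It is not the algorithm of Lee, et al. (2005) as applied to genuinely semiparametric models, and all function names are hypothetical.

```python
import math
import random


def profile_loglik(theta, x):
    """Profile log-likelihood of a normal mean, with the variance maximized out."""
    n = len(x)
    return -0.5 * n * math.log(sum((xi - theta) ** 2 for xi in x) / n)


def profile_sampler(x, n_iter=5000, burn_in=500, step=0.3, seed=1):
    """Random-walk Metropolis chain targeting exp(profile log-likelihood) x flat prior."""
    rng = random.Random(seed)
    theta = sum(x) / len(x)  # start at the maximizer of the profile likelihood
    current = profile_loglik(theta, x)
    chain = []
    for it in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        candidate = profile_loglik(proposal, x)
        # With a flat prior the acceptance ratio involves the profile likelihood only.
        if math.log(1.0 - rng.random()) < candidate - current:
            theta, current = proposal, candidate
        if it >= burn_in:
            chain.append(theta)
    return chain


def credible_interval(chain, alpha=0.05):
    """Equal-tailed 1 - alpha interval from the empirical quantiles of the chain."""
    s = sorted(chain)
    m = len(s)
    return s[int(m * alpha / 2)], s[int(m * (1 - alpha / 2)) - 1]


# Simulated data (assumption: normal with mean 1 and standard deviation 2).
data_rng = random.Random(0)
x = [data_rng.gauss(1.0, 2.0) for _ in range(50)]
x_bar = sum(x) / len(x)

chain = profile_sampler(x)
lo, hi = credible_interval(chain)
print(x_bar, lo, hi)
```

The interval (lo, hi) is then used as a frequentist confidence interval for θ. In this toy case the target density is proportional to a Student t kernel centered at the sample mean, so the interval should be close to the classical t interval, which gives a simple sanity check on the chain.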

3.2 Bayesian Methods

Research activity in Bayesian semiparametric methods has continued actively since publication of WKR. There are too many contributions to completely enumerate here, but some highlights include interesting applications in random-effects based meta-analysis (Burr and Doss, 2005), isotonic regression (Dunson, 2005), reconstruction of past environment based on fossil data (Bhattacharya, 2006), and in high-dimensional additive models with applications to electrical utility analysis (Panagiotelis and Smith, 2008). There have also been interesting developments in the analysis of recurrent events with dependent termination (see Sinha, et al., 2008, for a survey) and in estimating false discovery rates in high dimensional data (Tang, et al., 2007). While most of these contributions are more practical than theoretical, several of them involve the creation of very clever computational methods, and a few include theoretical considerations such as establishing key non-asymptotic properties of the posterior distributions (see, e.g., Burr and Doss, 2005; Panagiotelis and Smith, 2007).

Unfortunately, very little new asymptotic theory for Bayesian semiparametric methods has been developed beyond Shen (2002). Specifically, a Bernstein-von Mises theorem under reasonably general conditions is still lacking, although there are rumors that such a result is forthcoming. The main difficulty, of course, is that the nonparametric components of the models are very challenging to work with in Bayesian settings, and even consistency can be difficult to establish or verifiably impossible in some cases (see, e.g., Freedman, 1999). However, some progress has been made on methods for establishing consistency and rates of convergence for purely nonparametric models. An important result in this direction is the establishment of consistency of Bayesian density estimates using Dirichlet priors (Ghosal and van der Vaart, 2007). An important component of this result is that Bayesian methods can achieve the same rate of convergence as the best possible frequentist approaches. A key extension of this result is the discovery of easily verifiable sufficient conditions under which the Kullback-Leibler property for Dirichlet priors holds for Bayesian density estimation (Wu and Ghosal, 2008). Conditions for achieving minimaxity in density estimation based on adaptive nonparametric beta mixture priors are given in Rousseau (2009). While these results are encouraging, progress is slow and much work remains to be done.

An alternative to purely Bayesian approaches is to use approximate Bayes methods. If interest is restricted to the parametric component θ, the profile sampler of Lee, et al. (2005) is one such method mentioned in WKR. The recent establishment of higher order accuracy of the profile sampler was mentioned previously. On the other hand, when joint inference for both the parametric and nonparametric component is desired (as in Kim and Lee, 2004, for example), the piggyback bootstrap of Dixon, et al. (2005) is generally applicable, provided the nuisance parameter is root-n consistent and takes the form of a measure. The main idea is to use the profile sampler (Lee, et al., 2005) to first generate a sequence of approximate posterior realizations of the parametric component {θ(j), j = 1, 2, …}, and then to use this sequence to generate a sequence of realizations of the nuisance parameter {η(j), j = 1, 2, …} by maximizing a randomly weighted nonparametric likelihood evaluated at each θ(j). The random weights are re-drawn for each j. The joint sequence {(θ(j), η(j)), j = 1, 2, …} can be shown to have the Bernstein-von Mises property, i.e., that the conditional distribution matches the asymptotic distribution in the limit. It is unclear to what extent these approximations can be utilized for Bayesian inference, and more work, both philosophically and technically, is needed.

3.3 Transformation, Frailty and Change-point Models

There has been significant progress in the theory of semiparametric transformation and frailty models beyond the results presented in WKR. For transformation models, many of the recent developments have focused on the setting where the residual distribution is known, or known up to a one-dimensional family. This leads to a specification of the survival function conditional on the possibly time-dependent covariates Z(·) of the form

$$S(t \mid Z) = G_\alpha\left( \int_0^t e^{\beta' Z(s)} \, d\Lambda(s) \right),$$

where Gα is known from the residual distribution and Λ is an unspecified measure similar to the integrated hazard function. Efficient estimation and inference for the regression parameter β when α is fixed has been developed in significant generality in Slud and Vonta (2005) and further generalized in Dabrowska (2007) to allow for more general regression functions beyond e^{β′Z}. General results for the setting where α is not fixed are given in Kosorok, et al. (2004) who also discuss inference under model misspecification as mentioned previously.
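To make the role of Gα concrete, the sketch below evaluates the conditional survival function for one familiar one-parameter family. The choice Gα(x) = (1 + αx)^(−1/α), with the limit exp(−x) at α = 0, is an illustrative assumption corresponding to a gamma frailty and is not prescribed by the text: α = 0 gives the proportional hazards model and α = 1 gives the proportional odds model. The covariates are taken to be time-independent, so the integral reduces to e^{β′Z} Λ(t), and the linear baseline Λ(t) = 0.5 t is hypothetical.

```python
import math


def g_alpha(x, alpha):
    """Illustrative G_alpha (assumed gamma-frailty family); alpha = 0 is the Cox case."""
    if alpha == 0.0:
        return math.exp(-x)
    return (1.0 + alpha * x) ** (-1.0 / alpha)


def survival(t, z, beta, alpha, cum_baseline):
    """S(t | Z = z) for time-independent covariates.

    With Z(s) = z for all s, the integral of exp(beta'Z(s)) dLambda(s) over [0, t]
    equals exp(beta'z) * Lambda(t).
    """
    linear_predictor = sum(b * z_j for b, z_j in zip(beta, z))
    return g_alpha(math.exp(linear_predictor) * cum_baseline(t), alpha)


def baseline(t):
    """Hypothetical cumulative baseline function Lambda(t) = 0.5 * t."""
    return 0.5 * t


s_cox = survival(2.0, [1.0], [0.0], 0.0, baseline)          # proportional hazards
s_odds = survival(2.0, [1.0], [0.0], 1.0, baseline)         # proportional odds
s_odds_z = survival(2.0, [1.0], [math.log(2.0)], 1.0, baseline)
print(s_cox, s_odds, s_odds_z)
```

In the proportional odds case the odds of failure, (1 − S)/S, equal e^{β′z} Λ(t), so doubling e^{β′z} doubles the odds; the last two values above can be used to check this.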

Fairly general results for efficient inference for both univariate and multivariate transformation models based on frailties are given in Zeng and Lin (2007). This article also contains a very interesting and extensive series of discussion articles from many experts in semiparametrics as well as a rejoinder by the authors. Asymptotically valid inference and efficiency for normal transformation models for bivariate survival data are developed in Li, et al. (2008). Asymptotic normality, but without efficiency, for semiparametric shared frailty survival analysis in case-control studies is given in Gorfine, et al. (2009). These papers represent only a sampling of the many results in semiparametric transformation and frailty models for right-censored data. There has been significantly less work done for interval censored data. The author is only aware of one paper on transformation models for interval censored data: Ma and Kosorok (2005b), who develop efficient inference for partly linear transformation models for current status data. An interesting issue is that the slow convergence rate for the baseline hazard seems to interfere with the convergence rate for the nonparametric additive regression functions. This is connected to the general issue of how to handle multiple nuisance parameters with potentially different convergence rates. There remain many interesting open questions concerning transformation and frailty models for interval censored data.

The Cox model for right-censored data with a change-point in the regression was studied first by Pons (2003) who was able to utilize the partial-profile likelihood to avoid nuisance parameter (hazard function) estimation. A change-point regression model posits that a regression parameter β has two different values depending on whether a continuous variable Y is above or below an unknown threshold value ξ. For the Cox model, this means there are three parameters in the model: β1 (for when Y ≤ ξ), β2 (for when Y > ξ), and ξ. Interestingly, the nonparametric maximum likelihood procedure is fully adaptive in the sense that the asymptotic variance of the regression parameters achieves the semiparametric efficiency bound as though ξ were known in advance. The asymptotic theory for this setting is excruciatingly difficult because the rate of convergence of the maximum likelihood estimator for ξ is n and is thus faster than root-n. A recent discovery is that the standard argmax theorem for M-estimators is not valid for the change-point parameter and a different approach is needed (see Section 14.5.1 of K and also see Lan, et al., 2009).
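One way to write the change-point structure described above, as a sketch only, is the following hazard specification. The covariate vector Z, the indicator notation, and the assumption that the entire regression vector switches at the threshold are illustrative choices; the exact parameterization in Pons (2003) may differ, for example by keeping some coefficients fixed across the threshold.

$$\lambda(t \mid Z, Y) = \exp\left[ \beta_1' Z \, 1\{Y \le \xi\} + \beta_2' Z \, 1\{Y > \xi\} \right] \lambda_0(t).$$

For a fixed value of ξ this is an ordinary Cox model in the covariates Z 1{Y ≤ ξ} and Z 1{Y > ξ}, which is why the regression parameters behave regularly; the difficulty comes from the fact that the likelihood is discontinuous in ξ.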

In Kosorok and Song (2007), the results in Pons (2003) are extended for general transformation models, and a test of the existence of a change-point was also developed. A major difficulty for general transformation models is that the nuisance parameter can no longer be ignored and thus extreme care must be taken in the convergence rate calculations. There are a number of other interesting results for semiparametric change-point models which we will not fully enumerate here. A particularly interesting example is given in Lee and Seo (2008) who develop efficient estimation for a semiparametric binary response model with econometric applications. They also develop a test of the change-point similar to the test developed in Kosorok and Song. Many open theoretical questions for semiparametric change-point models remain, including what happens under model misspecification. In the fully parametric setting, we know that the rate of convergence for the change-point parameter changes dramatically from n to n^{1/3} under misspecification (Banerjee and McKeague, 2007). It remains to be seen how this situation changes in the presence of infinite-dimensional nuisance parameters.

4 Empirical Processes

Empirical process techniques are necessary for most non-trivial theoretical research questions in semiparametrics. This includes the development and use of Glivenko-Cantelli theorems to establish uniform consistency of suitably standardized objective functions, which in turn yields both existence and consistency of estimators. Establishing this can be quite challenging: see, for example, the proofs of Theorems 1 and 3 of Kosorok, et al. (2004). Empirical processes are also needed for determining rates of convergence of estimators. This is often quite challenging, especially when more than one rate is involved. See both Chapter 3.4 of van der Vaart and Wellner (1996) (denoted VW hereafter) and Chapter 4 of K for discussions of techniques for rate determination, and both Ma and Kosorok (2005b) and Kosorok and Song (2007) for challenging examples involving multiple rates. Donsker theorems are usually needed for establishing weak convergence of both finite and infinite dimensional parameters. Both Z-estimation (see Chapter 3.3 of VW and Chapter 13 of K) and M-estimation (see Chapter 3.2 of VW and Chapter 14 of K) theory for empirical processes, along with functional delta methods (see Chapter 3.9 of VW and Chapter 12 of K) and general functional analysis (see Chapters 6 and 17 of K), can be quite helpful here. For a challenging example that uses most of these techniques, see the previously mentioned Kosorok and Song (2007) paper.

Another very powerful tool from empirical processes that is quite useful for semiparametric models is the empirical process bootstrap and related results. For a discussion of general empirical process bootstrap theory, see Chapter 3.6 of VW and Chapter 10 of K. Monte Carlo approaches such as the bootstrap are especially crucial for inference for infinite-dimensional parameters, such as the baseline integrated hazard in right-censored Cox regression and similar settings where there exists a tight limiting distribution. In these settings, the infinite-dimensional aspects prevent using the estimated covariance function directly for inference as can be done in finite-dimensional settings, and confidence bands need to be constructed from Monte Carlo realizations of the approximate limiting distribution. Note that the weighted bootstrap (see Section 2.2.3 of K) sometimes has better moderate sample size properties than the nonparametric bootstrap, especially for right-censored data, where a nonparametric bootstrap sample can contain no uncensored observations. For semiparametric models in which all parameters are root-n consistent, the validity of the bootstrap follows fairly directly from consistency and weak convergence validation. Establishing such validity for the parametric root-n consistent parameter in a semiparametric model with slower-than-root-n consistent nuisance parameters is much more difficult. An early version of such theory, which requires arduous entropy calculations, is given in Ma and Kosorok (2005a). More recently, Cheng and Huang (2009) have developed very general bootstrap theory for this setting that is more straightforward. For parametric components that are not root-n consistent, the bootstrap is known not to be valid in some settings (see, for example, Kosorok, 2008b, and Sen, et al., 2009).
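
To make the weighted bootstrap idea concrete, the following is a minimal sketch for the Nelson-Aalen estimator of an integrated hazard under right censoring, without covariates. It is not taken from K or any of the cited papers: the function names, the choice of i.i.d. standard exponential weights, and the crude constant-width sup-norm band are all assumptions made for illustration, and tied observation times are assumed absent.

```python
import numpy as np


def weighted_nelson_aalen(time, event, weights, grid):
    """Weighted Nelson-Aalen estimate of the cumulative hazard on `grid`.

    Assumes no tied observation times (illustrative simplification).
    """
    order = np.argsort(time)
    t, d, w = time[order], event[order], weights[order]
    # Weighted number at risk at each ordered time: reverse cumulative sum.
    at_risk = np.cumsum(w[::-1])[::-1]
    # Only uncensored observations (d == 1) contribute a jump.
    cumhaz = np.cumsum(w * d / at_risk)
    # Evaluate the right-continuous step function at the grid points.
    idx = np.searchsorted(t, grid, side="right")
    return np.concatenate(([0.0], cumhaz))[idx]


def weighted_bootstrap_band(time, event, grid, n_boot=500, level=0.95, seed=0):
    """Constant-width sup-norm band from exponential-weight bootstrap draws."""
    rng = np.random.default_rng(seed)
    n = len(time)
    estimate = weighted_nelson_aalen(time, event, np.ones(n), grid)
    sup_dev = np.empty(n_boot)
    for b in range(n_boot):
        # Strictly positive weights: every observation stays in the sample,
        # so a draw with no uncensored observations cannot occur.
        w = rng.exponential(1.0, size=n)
        boot = weighted_nelson_aalen(time, event, w, grid)
        sup_dev[b] = np.max(np.abs(boot - estimate))
    c = np.quantile(sup_dev, level)
    return estimate, np.maximum(estimate - c, 0.0), estimate + c


if __name__ == "__main__":
    # Hypothetical simulated data: unit-rate exponential event times,
    # independent exponential censoring.
    rng = np.random.default_rng(123)
    event_time = rng.exponential(1.0, size=200)
    censor_time = rng.exponential(2.0, size=200)
    time = np.minimum(event_time, censor_time)
    event = (event_time <= censor_time).astype(float)
    grid = np.linspace(0.1, 1.5, 8)
    est, lower, upper = weighted_bootstrap_band(time, event, grid)
    for g, e, lo, hi in zip(grid, est, lower, upper):
        print(f"t={g:.2f}  estimate={e:.3f}  band=({lo:.3f}, {hi:.3f})")
```

Because the weights are strictly positive, each bootstrap replicate reweights the full sample rather than resampling it, which is the moderate-sample advantage over the nonparametric bootstrap noted above. A practical band would typically be variance-standardized rather than constant-width.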

The foregoing is only a subset of the many empirical process tools involved in semiparametric research, and, moreover, many of the most challenging questions in semiparametric inference will require the development of new theory and results in empirical processes. Expertise in empirical processes is thus absolutely essential for most of the many open research problems in semiparametrics.

5 High Dimensional Data

Statistical applications in high dimensional data are quite challenging for a number of reasons, including the fact that assumptions generally need to be stronger than in lower dimensional settings before statistical procedures will work. For high dimensional data, the statistical trade-offs between data mining and inference are particularly acute. A trivial example happens in linear regression when the number of predictors exceeds the number of observations: in this setting, model prediction can easily be perfect for the data used for estimation but disastrous for predicting future observations. The importance of the scientific questions raised earlier concerning the hope of semiparametrics for relaxing assumptions is both acute and complex when the dimension increases with the sample size. Empirical process methods, such as maximal inequalities (see Chapter 8.1 of K) and concentration inequalities (see Massart, 2007), for example, can provide almost miraculous assistance in developing theoretical properties of statistical methods in high dimensional data. A very interesting paper which discusses many of these issues is Donoho (2000).
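
The linear regression example above is easy to reproduce numerically. The sketch below is an illustration only: the sample sizes, the sparse true coefficient vector, the Gaussian design and the helper name `overfit_demo` are all hypothetical choices. It fits the minimum-norm least squares solution when the number of predictors exceeds the number of observations and compares the in-sample and out-of-sample mean squared errors.

```python
import numpy as np


def overfit_demo(n=20, p=100, n_test=1000, seed=0):
    """Return (train_mse, test_mse) for least squares with p > n predictors."""
    rng = np.random.default_rng(seed)
    # Hypothetical sparse truth: only the first three predictors matter.
    beta = np.zeros(p)
    beta[:3] = 1.0
    x_train = rng.standard_normal((n, p))
    y_train = x_train @ beta + rng.standard_normal(n)
    x_test = rng.standard_normal((n_test, p))
    y_test = x_test @ beta + rng.standard_normal(n_test)
    # With p > n the system is underdetermined; lstsq returns the
    # minimum-norm solution, which interpolates the training data.
    beta_hat, *_ = np.linalg.lstsq(x_train, y_train, rcond=None)
    train_mse = float(np.mean((y_train - x_train @ beta_hat) ** 2))
    test_mse = float(np.mean((y_test - x_test @ beta_hat) ** 2))
    return train_mse, test_mse


if __name__ == "__main__":
    train_mse, test_mse = overfit_demo()
    print(f"training MSE: {train_mse:.3e}")
    print(f"test MSE:     {test_mse:.3f}")
```

The training error is zero up to rounding because the fit interpolates the data, while the test error stays at or above the noise level, which is the trade-off between fit and inference that the text describes.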

There have been a number of interesting recent developments in the use of semiparametrics for high dimensional data, especially for microarrays. The earliest such examples the author is aware of are Huang, et al. (2003) and Huang and Zhang (2005), who develop a semiparametric linear (semi-linear) approach for microarrays and include some theoretical justification. Additional developments for the semi-linear approach can be found in Huang, et al. (2005) and Fan, et al. (2005), who allow the dimension to increase rapidly with the sample size. A robust semiparametric approach based on least-absolute deviation regression is developed in Ma, et al. (2006), who base their theoretical justification on the theory developed in Kosorok and Ma (2007). Another interesting example is the univariate shrinkage approach based on the Lasso for Cox regression when the number of covariates p exceeds the sample size n (Tibshirani, 2009). Because the Lasso is used, model selection is accomplished along with estimation. There are a number of other related results, but the applications and theoretical results for semiparametric models in high dimensions are in their infancy, and much work remains to be done. This area could rapidly become one of the most active and vibrant research areas in all of statistics.

6 Availability of Software

A crucial gap exists between semiparametric methodology and the routine use of semiparametric methods in practice. This gap is largely due to the limited availability of practical software for semiparametrics. Right-censored survival analysis is a fortunate exception to this since semiparametric survival analysis is widely used and revered in medical research. This broad use is in part attributable to the availability of semiparametric survival analysis software in widely available packages such as SAS (Cary, North Carolina, U.S.A.) and R (http://www.r-project.org). The author is not aware of any other examples where semiparametric methods are so broadly used and accepted in practice. The challenges that need to be overcome before practical software becomes more widely available for semiparametric methodology are threefold:

  1. The importance of developing effective software needs to be dramatically better recognized among semiparametric methods researchers, and there needs to be greater encouragement given to researchers interested in software implementation.
  2. Often the presence of tuning parameters in semiparametric inference algorithms precludes the development of usefully general algorithms which are capable of automatically selecting the tuning parameters, without requiring human intervention for each new setting. This is true, for example, for those methods involving complex optimization (see, for example, Murphy and van der Vaart, 2000) and/or Markov chain Monte Carlo (see, for example, Lee, et al., 2005).
  3. In an increasing number of cases, the computational complexity is so great that expertise from computer science or operations research is needed before real-time computation is feasible. Thus, collaborations with computing specialists should be more frequently sought out and encouraged in semiparametric research.

The greatest tragedy of this “computational gap,” if it is not heeded and successfully addressed, is that semiparametric methods will fall far short of their tremendous potential to exert a positive influence on scientific developments.

7 Graduate Education

Hopefully it is clear by this point in the paper that semiparametrics is an interesting and important research area with tremendous growth potential in statistics, biostatistics and econometrics. Thus it is important to ensure that graduate education at universities is strengthened to accommodate this need. There are several important points to consider in going forward with this strengthening process: First, semiparametrics is by now a classical area of statistics and should have a prominent place in the core graduate curriculum in statistics, biostatistics and econometrics. Second, it is not enough to just present a few examples in a few courses as is currently often done. Third, empirical process techniques are absolutely necessary for almost all cutting-edge research in this area and thus training should be available to all doctoral students in statistics, biostatistics and econometrics. Fourth, some programs may need to increase their analytical theory training in the kind of analysis useful in empirical process and semiparametric research, especially functional analysis, although much of this could be accomplished as part of a course on empirical processes.

Both the University of Wisconsin at Madison (hereafter denoted UW) and the University of North Carolina at Chapel Hill (hereafter denoted UNC) recently underwent an adjustment and modernization of the statistical theory training in the Departments of Statistics (at UW) and Biostatistics (at UNC). Part of this included an upgrading of the two-semester core theory sequence of the Ph.D. program that is now being taught at the level of Shao (1999) or slightly below that level. For students interested in more technically advanced statistical research, such as semiparametrics and empirical processes, the core theory sequence is followed by a one semester advanced asymptotic statistics course at the level of van der Vaart (1998) followed by a one semester course on empirical processes and semiparametric inference at the level of K. This two-course advanced sequence is now being jointly taught in the Department of Biostatistics and the Department of Statistics and Operations Research at UNC. A similar two-course advanced sequence was implemented at UW in the recent past. The experience to date is that this advanced sequence is very successful and students emerge from this training very well prepared to do technically challenging research in semiparametrics.

The van der Vaart (1998) text introduces a number of advanced topics in statistical theory, including Le Cam’s local asymptotic normality theory, U-statistics, rank statistics, other nonparametric topics, and a number of other important technical areas in asymptotic statistics. Empirical processes and some functional analysis are introduced in Chapters 18–20. Semiparametric inference and some additional functional analysis are presented in Chapter 25. The K text is significantly more developed in the empirical processes and semiparametrics areas and will enable students to prepare for research in semiparametrics and also enable utilization of the classic texts in modern empirical processes, VW, and in modern semiparametric theory, BKRW. Both VW and BKRW are generally much too advanced for graduate students who have not first worked through K. Two other reference texts that can be useful for students interested in empirical process work are Pollard (1990) and van de Geer (2000), although these books are somewhat narrow for use as course texts. There are also many other useful reference texts in empirical processes that this author is neglecting to mention which are valuable for both research and training, although many of these are geared more toward probabilists and mathematicians than statisticians. Both van der Laan and Robins (2003) and Tsiatis (2006) are useful to students as specialty reference texts in semiparametrics. Of course, everything that has been suggested for graduate students is also useful for post-graduate researchers in statistics, biostatistics and econometrics.

Each graduate program should seriously consider these suggestions, along with other ideas and texts, and develop its own advanced theory training according to its own goals and priorities. Not considering such improvements may lead to decreased competitiveness of its graduates.

8 Discussion

There are a few important messages that hopefully are by now clear. First, semiparametrics is a special and vibrant area of statistics, and the rate of research activity and interest in semiparametrics is continuing to increase. Second, there are many challenging, interesting and important open research questions in semiparametrics. Third, some of the most interesting aspects of semiparametrics are the scientific issues, including the scope of questions that can be asked and the degree of freedom from assumptions. Fourth, generally speaking, the most important theoretical questions that remain to be answered are very difficult and require expertise in empirical processes and in other advanced technical areas. Fifth, implementation of semiparametric methods in practical software needs to become a higher priority. Sixth and finally, the area of semiparametrics should be considered classical at this point and should be more deeply and broadly integrated into graduate education in order to better prepare the next generation of statisticians.

Acknowledgments

The author thanks Editors Pranab K. Sen and Soumendra N. Lahiri and an anonymous referee for helpful comments that led to a significantly improved paper. This research was supported in part by U.S. National Institutes of Health grant CA075142.

References

  • Banerjee M, Biswas P, Ghosh D. A semiparametric binary regression model involving monotonicity constraints. Scandinavian Journal of Statistics. 2006;33:673–697.
  • Banerjee M, McKeague IW. Confidence sets for split points in decision trees. Annals of Statistics. 2007;35:543–574.
  • Banerjee M, Mukherjee D, Mishra S. Semiparametric binary regression models under shape constraints with an application to Indian schooling data. Journal of Econometrics. 2009;149:101–117.
  • Banerjee M, Wellner JA. Likelihood ratio tests for monotone functions. Annals of Statistics. 2001;29:1699–1731.
  • Banerjee M, Wellner JA. Confidence intervals for current status data. Scandinavian Journal of Statistics. 2005;32:405–424.
  • Bhattacharya S. A Bayesian semiparametric model for organism based environmental reconstruction. Environmetrics. 2006;17:763–776.
  • Bickel PJ, Klaassen CAJ, Ritov Y, Wellner J. Efficient and Adaptive Estimation for Semiparametric Models. Baltimore: Johns Hopkins University Press; 1993. Reprinted (1998), Springer: New York.
  • Brown LD, Low MG. Asymptotic equivalence of nonparametric regression and white noise. Annals of Statistics. 1996;24:2384–2398.
  • Burr D, Doss H. A Bayesian semiparametric model for random-effects meta-analysis. Journal of the American Statistical Association. 2005;100:242–251.
  • Cai J, Fan J, Li R, Zhou H. Variable selection for multivariate failure time data. Biometrika. 2005;92:303–316.
  • Cheng G, Huang J. Bootstrap consistency for general semiparametric M-estimate. Unpublished manuscript 2009
  • Cheng G, Kosorok MR. Higher order semiparametric frequentist inference with the profile sampler. Annals of Statistics. 2008a;36:1786–1818.
  • Cheng G, Kosorok MR. General frequentist properties of the posterior profile distribution. Annals of Statistics. 2008b;36:1819–1853.
  • Cheng G, Kosorok MR. The penalized profile sampler. Journal of Multivariate Analysis. 2009;100:345–362.
  • Cheng G, Zhang HH. Sparse and efficient estimation for partial spline models with increasing dimension. Unpublished manuscript 2009
  • Clarke KA. The phantom menace: Omitted variable bias in econometric research. Conflict Management and Peace Science. 2005;22:341–352.
  • Clarke KA. Return of the phantom menace: Omitted variable bias in political research. Conflict Management and Peace Science. 2009;26:46–66.
  • Dabrowska DM. Information bounds and efficient estimation in a class of censored transformation models. Acta Applicandae Mathematicae. 2007;96:177–201.
  • Dalalyan AS, Golubev GK, Tsybakov AB. Penalized maximum likelihood and semiparametric second-order efficiency. Annals of Statistics. 2006;34:169–201.
  • Dixon JR, Kosorok MR, Lee BL. Functional inference in semiparametric models using the piggyback bootstrap. Annals of the Institute of Statistical Mathematics. 2005;57:255–277.
  • Donoho DL. High-dimensional data analysis: The curses and blessings of dimensionality. Unpublished manuscript 2000
  • Dunson DB. Bayesian semiparametric isotonic regression for count data. Journal of the American Statistical Association. 2005;100:618–627.
  • Fan J, Peng H, Huang T. Semilinear high-dimensional model for normalization of microarray data: A theoretical analysis and partial consistency (with discussion). Journal of the American Statistical Association. 2005;100:718–813.
  • Fine JP. Comparing nonnested Cox models. Biometrika. 2002;89:635–647.
  • Fleming TR, Harrington DP. Counting Processes and Survival Analysis. Wiley; New York: 1991.
  • Freedman D. Wald lecture: On the Bernstein-von Mises theorem with infinite-dimensional parameters. Annals of Statistics. 1999;27:1119–1140.
  • Ghosal S, van der Vaart A. Posterior convergence rates of Dirichlet mixtures at smooth densities. Annals of Statistics. 2007;35:697–723.
  • Gorfine M, Zucker DM, Hsu L. Case-control survival analysis with a general semiparametric shared frailty model: A pseudo full likelihood approach. Annals of Statistics. 2009;37:1489–1517.
  • Huang J. Efficient estimation for the proportional hazard model with interval censoring. Annals of Statistics. 1996;24:540–568.
  • Huang J, Kuo H-C, Koroleva I, Zhang C-H, Bento Soares M. Technical Report 321. Department of Statistics and Actuarial Science, University of Iowa; Iowa City, Iowa: 2003. A semilinear model for normalization and analysis of cDNA microarray data.
  • Huang J, Wang D, Zhang CH. A two-way semilinear model for normalization and analysis of cDNA microarray data. Journal of the American Statistical Association. 2005;100:814–829.
  • Huang J, Zhang CH. Asymptotic analysis of a two-way semilinear model for microarray data. Statistica Sinica. 2005;15:597–618.
  • Kent JT. Robust properties of likelihood ratio tests. Biometrika. 1982;69:19–27.
  • Kim Y, Lee J. A Bernstein-von Mises theorem in the nonparametric right-censoring model. Annals of Statistics. 2004;32:1492–1512.
  • Kosorok MR. Introduction to Empirical Processes and Semiparametric Inference. Springer; New York: 2008a.
  • Kosorok MR. Bootstrapping the Grenander estimator. In: Balakrishnan N, Peña EA, Silvapulle MJ, editors. Beyond Parametrics in Interdisciplinary Research: Festschrift in Honor of Professor Pranab K. Sen. Institute of Mathematical Statistics; Hayward, CA: 2008b. pp. 282–292.
  • Kosorok MR, Lee BL, Fine JP. Robust inference for univariate proportional hazards frailty regression models. Annals of Statistics. 2004;32:1448–1491.
  • Kosorok MR, Ma S. Marginal asymptotics for the “large p, small n” paradigm: With applications to microarray data. Annals of Statistics. 2007;35:1456–1486.
  • Kosorok MR, Song R. Inference under right censoring for transformation models with a change-point based on a covariate threshold. Annals of Statistics. 2007;35:957–989.
  • Lan Y, Banerjee M, Michailidis G. Change-point estimation under adaptive sampling. Annals of Statistics. 2009 In press.
  • Lee BL, Kosorok MR, Fine JP. The profile sampler. Journal of the American Statistical Association. 2005;100:960–969.
  • Lee S, Seo MH. Semiparametric estimation of a binary response model with a change-point due to a covariate threshold. Journal of Econometrics. 2008;144:492–499.
  • Li KC, Duan N. Regression analysis under link violation. Annals of Statistics. 1989;17:1009–1052.
  • Li Y, Prentice RL, Lin X. Semiparametric maximum likelihood estimation in normal transformation models for bivariate survival data. Biometrika. 2008;95:947–960.
  • Lin DY, Wei LJ. The robust inference for the Cox proportional hazards model. Journal of the American Statistical Association. 1989;84:1074–1078.
  • Ma S, Kosorok MR. Robust semiparametric M-estimation and the weighted bootstrap. Journal of Multivariate Analysis. 2005a;96:190–217.
  • Ma S, Kosorok MR. Penalized log-likelihood estimation for partly linear transformation models with current status data. Annals of Statistics. 2005b;33:2256–2290.
  • Ma S, Kosorok MR. Adaptive penalized M-estimation with current status data. Annals of the Institute of Statistical Mathematics. 2006;58:511–526.
  • Ma S, Kosorok MR, Huang J, Xie H, Manzella L, Bento Soares M. Robust semiparametric cDNA microarray normalization and significance analysis. Biometrics. 2006;62:555–561.
  • Massart P. Concentration Inequalities and Model Selection. Springer; New York: 2007.
  • Murphy SA, van der Vaart AW. On profile likelihood (with comments and a rejoinder by the authors). Journal of the American Statistical Association. 2000;95:449–485.
  • Owen AB. Empirical likelihood ratio confidence intervals for a single functional. Biometrika. 1988;75:237–249.
  • Panagiotelis A, Smith M. Bayesian identification, selection and estimation of semiparametric functions in high-dimensional additive models. Journal of Econometrics. 2008;143:291–316.
  • Pollard D. NSF-CBMS Regional Conference Series in Probability and Statistics. Vol. 2. Institute of Mathematical Statistics and American Statistical Association; Hayward, CA: 1990. Empirical Processes: Theory and Applications.
  • Pons O. Estimation in a Cox regression model with a change-point according to a threshold in a covariate. Annals of Statistics. 2003;31:442–463.
  • Qin J, Lawless JF. Empirical likelihood and general estimating equations. Annals of Statistics. 1994;22:300–325.
  • Rousseau J. Rates of convergence for the posterior distributions of mixtures of betas and adaptive nonparametric estimation of the density. Annals of Statistics. 2009 In press.
  • Sasieni P. Some new estimators for Cox regression. Annals of Statistics. 1993;21:1721–1759.
  • Sen B, Banerjee M, Woodroofe MB. Inconsistency of bootstrap: the Grenander estimator. Annals of Statistics. 2009 Provisionally accepted.
  • Shao J. Mathematical Statistics. Springer; New York: 1999.
  • Shen X. Asymptotic normality of semiparametric and nonparametric posterior distributions. Journal of the American Statistical Association. 2002;97:222–235.
  • Sinha D, Maiti T, Ibrahim JG, Ouyang B. Current methods for recurrent events data with dependent termination: A Bayesian perspective. Journal of the American Statistical Association. 2008;103:866–878.
  • Slud EV, Vonta F. Efficient semiparametric estimators via modified profile likelihood. Journal of Statistical Planning and Inference. 2005;129:339–367.
  • Tang Y, Ghosal S, Roy A. Nonparametric Bayesian estimation of positive false discovery rates. Biometrics. 2007;63:1126–1134.
  • Tibshirani R. Univariate shrinkage in the Cox model for high dimensional data. Unpublished manuscript 2009
  • Tsiatis AA. Semiparametric Theory and Missing Data. Springer; New York: 2006.
  • Tukey JW. The future of data analysis. Annals of Mathematical Statistics. 1962;33:1–67.
  • Tukey JW, Mosteller F. Data analysis, including statistics. In: Lindzey G, Aronson E, editors. Handbook of Social Psychology. Addison-Wesley; Reading, Massachusetts: 1968. pp. 80–112.
  • van de Geer S. Empirical Processes in M-Estimation. Cambridge University Press; New York: 2000.
  • van der Laan MJ, Robins JM. Unified Methods for Censored Longitudinal Data and Causality. Springer; New York: 2003.
  • van der Vaart AW. Asymptotic Statistics. Cambridge University Press; New York: 1998.
  • van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer; New York: 1996.
  • Wellner JA, Klaassen CAJ, Ritov Y. Semiparametric models: a review of progress since BKRW (1993). In: Fan J, Koul HL, editors. Frontiers in Statistics. Imperial College Press; London: 2006. pp. 25–44.
  • White H. Maximum likelihood estimation of misspecified models. Econometrica. 1982;50:143–161.
  • Wu Y, Ghosal S. Kullback Leibler property of kernel mixture priors in Bayesian density estimation. Electronic Journal of Statistics. 2008;2:298–331.
  • Zeng D, Lin DY. Maximum likelihood estimation in semiparametric regression models with censored data (with discussion). Journal of the Royal Statistical Society, Series B. 2007;69:507–564.