Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC2875362

Formats

Article sections

- Summary
- 1. Introduction
- 2. The Method
- 3. Numerical Studies
- 4. Discussion
- 5. Supplementary Materials
- Supplementary Material
- References

Authors

Related links

Biometrics. Author manuscript; available in PMC 2010 May 25.

Published in final edited form as:

Published online 2009 May 4. doi: 10.1111/j.1541-0420.2009.01260.x

PMCID: PMC2875362

NIHMSID: NIHMS141944

Ying Qing Chen, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109, U.S.A;

Ying Qing Chen: gro.prahcs@nehcqy

The publisher's final edited version of this article is available at Biometrics

See other articles in PMC that cite the published article.

Size-biased sampling arises when a positive-valued outcome variable is sampled with selection probability proportional to its size. In this article, we propose a semiparametric linear regression model to analyze size-biased outcomes. In our proposed model, the regression parameters of the covariates are of major interest, while the distribution of random errors is unspecified. Under the proposed model, we discover that the regression parameters are invariant regardless of size-biased sampling. Following this invariance property, we develop a simple estimation procedure for inferences. Our proposed methods are evaluated in simulation studies and applied to two real data analyses.

Muttlak (1988) presented a study to estimate vegetation coverage in an area of Laramie, Wyoming. This area was an old limestone quarry dominated by regrowth of mountain mahogany (*Cercocarpus Montanus*). As described in the study, a line-intercept sampling method (Canfield, 1941) was used: a straight baseline was first established, and then parallel transect lines were drawn perpendicular to the baseline. Those mountain mahogany shrubs intercepted by the transect lines were measured for their widths. See Figure 1 for an illustration. As shown in Figure 1, a shrub width is defined by the maximum distance between the shrub tangents that are parallel to the transect line. Apparently, shrub widths collected this way are not random samples, but subject to size-biased sampling, i.e., their probabilities of being sampled are proportional to the widths themselves. In statistical literature, this study has been a classical example to motivate methods development for size-biased sampling (Muttlak and McDonald, 1990; Jones, 1991; Wang, 1996).

Size-biased sampling arises frequently in other studies as well, for example, when tumor size is measured in cancer screening trials to study tumor biology and progression (Kimmel and Flehinger, 1991). Because of lead-time bias in cancer screening trials, a tumor may be detected according to its individual size (Ghosh, 2008). More examples of size-biased sampling have been observed in industrial fiber testing (Cox, 1969), family studies of rare genetic diseases (Patil and Rao, 1978; Davidov and Zelen; 2001), etiological studies (Simon, 1980), and chronic/early disease modeling (Zelen, 2005).

In general, consider an outcome variable *X* > 0 with distribution function *F _{X}* (

$${G}_{Y}(y)=\frac{1}{{\mu}_{X}}{\int}_{0}^{y}x{dF}_{X}(x),$$

where
$0<{\mu}_{X}=EX={\int}_{0}^{\infty}xd{F}_{X}(x)<\infty $. Cox (1969) proposed a one-sample empirical estimator of *F _{X}*(·) based on observed

Usually, covariates, say *Z*, are also collected to study potential predictors or risk factors in association with *X*. In Muttlak (1988), information was collected on maximum shrub height and total number of shrub stems, both of which are important predictors to estimate the percentage of an area’s vegetation coverage. In Kimmel and Flehinger (1991), because tumors of local lesions may grow to shed cancer cells into the lymphatic system and/or blood stream, secondary cancers known as lymph-node or distant metastases may develop and lead to rapid disease progression or death. Therefore it is important to capture the association between tumor size and the metastasis status to understand the natural history of cancer progression. In fact, information was collected on tumor metastasis status and cancer types in Kimmel and Flehinger (1991). In either example, regression becomes an important tool to measure the association.

Parametric models can be used in regression. It is however important that the assumed parametric distributions are correctly specified, otherwise serious bias may arise in inferences (Duan, 1983). Nonparametric methods have been developed to estimate regression functions of size-biased outcomes. For example, the kernel estimator developed by Jones (1991) was extended to multivariate setting (Ahmad, 1995); Wu (2000) developed a class of local polynomial estimators; and Cristóbal and Alcalá (2000) proposed several regression function estimators by way of modified local polynomials. Since *X* tends to be positive, semiparametric proportional hazards models (Cox, 1972) have also been developed, for example, to model shrub width and tumor size in Wang (1996) and Ghosh (2008), respectively. In these models, regression parameters are based on hazard functions. When *X* is not time-to-event, such as shrub width or tumor size, regression models of *X*’s hazard functions are yet to be seen practical.

For general positive outcomes, a useful alternative to the semiparametric proportional hazards model is the log-linear regression model (Kalbfleisch and Prentice, 2002, p. 218). When *X* is time-to-event, it is the so-called accelerated failure time model (Wei, Ying and Lin, 1990). It is also a transformation model in Tsiatis (1990). In this article, we aim to develop a semiparametric version of this model in size-biased sampling. In the rest of this article, we first introduce a semiparametric linear model and its invariance property in size-biased sampling. Then we develop a simple estimation procedure for inferences on regression parameters. We assess our methods validity and performance by Monte-Carlo simulations, and further demonstrate it in actual data analysis.

Let *Z* * ^{p}* be the covariates vector associated with outcome variable

$$logX=-{\beta}^{\text{T}}Z+\epsilon ,$$

(1)

where *β* * ^{p}* is the regression parameter. Here,

$${f}_{X\mid Z}(x\mid z)={f}_{\xi}(x{e}^{{\beta}^{\text{T}}z}){e}^{{\beta}^{\text{T}}z},$$

(2)

where *f _{ξ}*(·) is nuisance parameter and

Now consider a size-biased sampling indicator *S*, being 1 when *X* is selected and 0 otherwise, such that pr{*S* = 1 | *X* = *x*, *Z* = *z*} = pr{*S* = 1 | *X* = *x*} *x*. Note that this selection probability depends on *X* only. Then the size-biased density function becomes

$${f}_{X,Z\mid S}(x,z\mid S=1)=\frac{\text{pr}\{S=1\mid X=x,Z=z\}{f}_{X,Z}(x,z)}{\text{pr}\{S=1\}}=\frac{x{f}_{X,Z}(x,z)}{{\mu}_{X}},$$

(3)

which is identical to those in Cristóbal and Alcalá (2000) and Wu (2000). Furthermore, under the assumed log-linear regression model (1), we have

$${f}_{X\mid Z,S}(x\mid z,S=1)=\frac{x{f}_{X\mid Z}(x\mid z)}{\mu (X\mid z)}=\frac{x{f}_{\xi}(x{e}^{{\beta}^{\text{T}}z}){e}^{{\beta}^{\text{T}}z}}{{\mu}_{\xi}{e}^{-{\beta}^{\text{T}}z}},$$

where
$\mu (X\mid z)={\int}_{0}^{\infty}x{f}_{X\mid Z}(x\mid z)dx$ and *μ _{ξ}* =

Assume that there exists a random variable η = exp(ε) with the density function f_{η}(·), where f_{η}(y) = yf_{ξ}(y)/μ_{ξ}. Under size-biased sampling, model (1) satisfies that

$${f}_{X\mid Z,S}(x\mid z,S=1)={f}_{\eta}(x{e}^{{\beta}^{\text{T}}z}){e}^{{\beta}^{\text{T}}z}.$$

(4)

According to (4), the size-biased (*X*, *Z*), say (*Y*, *Z*), satisfies log *Y* = −*β*^{T}*Z* + *ε*. Comparing it with (2), we find that (*Y*, *Z*) would follow a model almost identical to that of (*X*, *Z*) for the same *β*, except for the nuisance parameters. Property 1 is hence called an “invariance property” of model (1) in presence of size-biased sampling. An illustrative example is provided in the web supplementary materials.

In random sampling, a collected dataset would usually consist of *n* independent and identically distributed (iid) copies of (*X*, *Z*), (*X _{i}*,

$${L}_{1}(\beta ;X,Z)=\prod _{i=1}^{n}{f}_{X,Z}({X}_{i},{Z}_{i})\propto \prod _{i=1}^{n}{f}_{\xi}({X}_{i}{e}^{{\beta}^{\text{T}}{Z}_{i}}){e}^{{\beta}^{\text{T}}{Z}_{i}}.$$

(5)

Now suppose that data are instead collected subject to size-biased sampling (3). That is, the actual collected data consist of *n* iid copies of {(*X*, *Z*) | *S* = 1}, {(*X _{i}*,

where and are the supports of *X* and *Z*, respectively, then by Property 1, the likelihood function becomes proportional to

$$\begin{array}{l}{L}_{2}(\beta ;X,Z\mid S=1)=\prod _{i=1}^{n}{f}_{X\mid Z}({X}_{i},{Z}_{i}\mid {S}_{i}=1)=\prod _{i=1}^{n}{f}_{X\mid Z,S}({X}_{i}\mid {Z}_{i},{S}_{i}=1){f}_{Z\mid S}({Z}_{i}\mid {S}_{i}=1)\\ \propto \prod _{i=1}^{n}{f}_{X\mid Z,S}({X}_{i}\mid {Z}_{i},{S}_{i}=1)=\prod _{i=1}^{n}\frac{{X}_{i}{f}_{\xi}({X}_{i}{e}^{{\beta}^{\text{T}}{Z}_{i}}){e}^{{\beta}^{\text{T}}{Z}_{i}}}{{\mu}_{\xi}{e}^{-{\beta}^{\text{T}}{Z}_{i}}}=\prod _{i=1}^{n}{f}_{\eta}({X}_{i}{e}^{{\beta}^{\text{T}}{Z}_{i}}){e}^{{\beta}^{\text{T}}{Z}_{i}}.\end{array}$$

If *f _{ξ}*(·) is parametric, the usual maximum likelihood estimation (MLE) can be used to estimate

To simplify our notations, we drop the notation *S* and denote the size-biased data by (*Y _{i}*,

$$\lambda (y\mid {Z}_{i})={\lambda}_{\eta}(y{e}^{{\beta}^{\text{T}}{Z}_{i}}){e}^{{\beta}^{\text{T}}{Z}_{i}}.$$

(6)

Let *N _{i}*(

$$\begin{array}{l}{l}_{\beta}(\beta )=\frac{\partial {L}_{2}(\beta )}{\partial \beta}=\sum _{i=1}^{n}\frac{\partial}{\partial \beta}\left\{log\lambda ({Y}_{i}\mid {Z}_{i};\beta )-{\int}_{0}^{{Y}_{i}}\lambda (y\mid {Z}_{i};\beta )dy\right\}\\ =\sum _{i=1}^{n}{\int}_{0}^{\infty}\left\{\frac{\partial log\lambda (y\mid {Z}_{i};\beta )}{\partial \beta}\right\}\{{dN}_{i}(y)-{\mathrm{\Delta}}_{i}(y)\lambda (y\mid {Z}_{i};\beta )dy\}.\end{array}$$

Apparently, because *λ _{η}*(·) is not specified and hence unknown,

For model (1), however, *λ _{η}*(·) itself is an infinite-dimensional nuisance parameter. To estimate

Suppose that *β*_{0} is the true value of *β*. Since *EN _{i}*(

$$\begin{array}{l}E\{{dN}_{i}(y{e}^{-{\beta}_{0}^{\text{T}}{Z}_{i}})\}=dF(y{e}^{-{\beta}_{0}^{\text{T}}{Z}_{i}}\mid {Z}_{i})=f(y{e}^{-{\beta}_{0}^{\text{T}}{Z}_{i}}\mid {Z}_{i}){e}^{-{\beta}_{0}^{\text{T}}{Z}_{i}}dy\\ =\overline{F}(y{e}^{-{\beta}_{0}^{\text{T}}{Z}_{i}}\mid {Z}_{i})\lambda (y{e}^{-{\beta}_{0}^{\text{T}}{Z}_{i}}\mid {Z}_{i}){e}^{-{\beta}_{0}^{\text{T}}{Z}_{i}}dy=\overline{F}(y{e}^{-{\beta}_{0}^{\text{T}}{Z}_{i}}\mid {Z}_{i})d{\mathrm{\Lambda}}_{\eta}(y),\end{array}$$

(7)

where (*y* | *Z _{i}*) = 1 −

$$\sum _{i=1}^{n}\left\{{dN}_{i}(y{e}^{-{\beta}_{0}^{\text{T}}{Z}_{i}})-{\mathrm{\Delta}}_{i}(y{e}^{-{\beta}_{0}^{\text{T}}{Z}_{i}})d{\mathrm{\Lambda}}_{\eta}(y)\right\}=0.$$

(8)

By solving (8) we obtain a nonparametric estimator for Λ* _{η}*( ·),

$${\widehat{\mathrm{\Lambda}}}_{\eta}(y;{\beta}_{0})={\int}_{0}^{y}\frac{{\sum}_{i}{dN}_{i}(u{e}^{-{\beta}_{0}^{\text{T}}{Z}_{i}})}{{\sum}_{i}{\mathrm{\Delta}}_{i}(u{e}^{-{\beta}_{0}^{\text{T}}{Z}_{i}})}.$$

(9)

Let
${M}_{i}(y)={N}_{i}(y)-{\int}_{0}^{y}{\mathrm{\Delta}}_{i}(u)d\mathrm{\Lambda}(u\mid {Z}_{i})$, *i* = 1, 2, …, *n*. Then {
${M}_{i}(y{e}^{-{\beta}_{0}^{\text{T}}{Z}_{i}})$} are martingales with respect to the filtration defined by
${\mathcal{F}}_{y}=\sigma \{{N}_{i}(y{e}^{-{\beta}_{0}^{\text{T}}{Z}_{i}}),{\mathrm{\Delta}}_{i}(y{e}^{-{\beta}_{0}^{\text{T}}{Z}_{i}}),{Z}_{i}\}$. As a result, * _{η}*(

Similar to (7), we also notice that
$E\{{Z}_{i}{dN}_{i}(y{e}^{-{\beta}_{0}^{\text{T}}{Z}_{i}})\}={Z}_{i}\overline{F}(y{e}^{-{\beta}_{0}^{\text{T}}{Z}_{i}}\mid {Z}_{i})d{\mathrm{\Lambda}}_{\eta}(y)$. Thus with the consistent estimator of * _{η}*( ·) obtained in (9), we propose to use the following estimating equations to estimate the regression parameter

$$\sum _{i=1}^{n}{\int}_{0}^{\infty}{Z}_{i}\left\{{dN}_{i}(y{e}^{-{\beta}^{\text{T}}{Z}_{i}})-{\mathrm{\Delta}}_{i}(y{e}^{-{\beta}^{\text{T}}{Z}_{i}})d{\widehat{\mathrm{\Lambda}}}_{\eta}(y;\beta )\right\}=0,$$

(10)

to solve for *β*. By some algebraic manipulation, the above equations become equivalent to *U _{n}*(

$${U}_{n}(y;\beta )={n}^{-1/2}\sum _{i=1}^{n}{\int}_{0}^{y}\{{Z}_{i}-\overline{Z}(y;\beta )\}{dN}_{i}(y{e}^{-{\beta}^{\text{T}}{Z}_{i}}).$$

(11)

Here, (*y; β*) = ^{(1)}(*y; β*)/^{(0)}(*y; β*) with
${\mathcal{E}}^{(k)}(y;\beta )={n}^{-1}{\sum}_{i}{Z}_{i}^{\otimes k}{\mathrm{\Delta}}_{i}(y{e}^{-{\beta}^{\text{T}}{Z}_{i}})$, *k* = 0, 1, 2, where is the Kronecker matrix product that defines *v*^{0} = 1, *v*^{1} = *v* and *v*^{2} = *vv*^{T} for a vector *v*. In general, because *U _{n}*(

Assume that lim_{n}_{→ ∞} ^{(}^{k}^{)}(*y; β*) = *e*^{(}^{k}^{)}(*y; β*). We have the following asymptotic results:

Under the regularity conditions specified in the Appendix, _{n} is consistent, and

where

$$D={\int}_{0}^{\infty}\frac{{\lambda}_{\eta}^{\prime}(y)}{{\lambda}_{\eta}(y)}\left\{{e}^{(2)}(y)-\frac{{e}^{(1)}{(y)}^{\otimes 2}}{{e}^{(0)}(y)}\right\}{f}_{\eta}(y)dy,\mathit{and}\phantom{\rule{0.16667em}{0ex}}V={\int}_{0}^{\infty}\left\{{e}^{(2)}(y)-\frac{{e}^{(1)}{(y)}^{\otimes 2}}{{e}^{(0)}(y)}\right\}{f}_{\eta}(y)dy.$$

Details of proof can be found in the web supplementary materials.

To apply the asymptotic results, we need to find consistent estimators for *D*^{−1}*V* (*D*^{−1})^{T}. A straightforward way is to find consistent estimators for *V* and *D*, respectively. For example,
$\widehat{V}={n}^{-1}{\sum}_{i}{\int}_{0}^{\infty}\{{\mathcal{E}}^{(2)}(y;{\widehat{\beta}}_{n})-{\mathcal{E}}^{(1)}{(y;{\widehat{\beta}}_{n})}^{\otimes 2}/{\mathcal{E}}^{(0)}(y;{\widehat{\beta}}_{n})\}{dN}_{i}(y{e}^{-{\widehat{\beta}}_{n}^{\text{T}}{Z}_{i}})$. For *D*, it is less straight-forward because of the unknown *λ _{η}*(·). One approach was suggested in Tsiatis (1990) by a smoothing kernel of

In addition to the estimation of regression parameter *β*, we may also be interested in estimating the distribution function of *F _{η}*(·) in model (6), and further the distribution function of

$${\widehat{F}}_{\eta}(y;\beta )={n}^{-1}\sum _{i=1}^{n}I\{{\widehat{\eta}}_{i}(\beta )\le y\}={n}^{-1}\sum _{i=1}^{n}{N}_{i}(y{e}^{-{\beta}^{\text{T}}{Z}_{i}}).$$

Moreover, according to Property 1, we know that
${F}_{\xi}(x)={\mu}_{\xi}{\int}_{0}^{x}{y}^{-1}d{F}_{\eta}(y)$. To estimate *μ _{ξ}*, we consider the harmonic mean of

$${\widehat{\mu}}_{\xi}(\beta )=\frac{n}{{\sum}_{i}{\widehat{\eta}}_{i}{(\beta )}^{-1}}=\frac{n}{{\sum}_{i}{Y}_{i}^{-1}{e}^{-{\beta}^{\text{T}}{Z}_{i}}}.$$

Therefore, an estimate of *F _{ξ}*(

$$\begin{array}{l}{\widehat{F}}_{\xi}(x,\beta )={\widehat{\mu}}_{\xi}(\beta ){\int}_{0}^{x}{y}^{-1}d{\widehat{F}}_{\eta}(y;\beta )=\frac{n}{{\sum}_{i}{Y}_{i}^{-1}{e}^{-{\beta}^{\text{T}}{Z}_{i}}}\xb7{\int}_{0}^{x}{y}^{-1}\left\{\frac{{\sum}_{i}{dN}_{i}(y{e}^{-{\beta}^{\text{T}}{Z}_{i}})}{n}\right\}\\ =\frac{n}{{\sum}_{i}{Y}_{i}^{-1}{e}^{-{\beta}^{\text{T}}{Z}_{i}}}\xb7{n}^{-1}\sum _{i=1}^{n}\frac{{N}_{i}(x{e}^{-{\beta}^{\text{T}}{Z}_{i}})}{{Y}_{i}{e}^{{\beta}^{\text{T}}{Z}_{i}}}=\sum _{i=1}^{n}\frac{{N}_{i}(x{e}^{-{\beta}^{\text{T}}{Z}_{i}})}{{Y}_{i}{e}^{{\beta}^{\text{T}}{Z}_{i}}}/\sum _{i=1}^{n}\frac{1}{{Y}_{i}{e}^{{\beta}^{\text{T}}{Z}_{i}}},\end{array}$$

which is a weighted average of rescaled *N _{i}*(·)’s. Thus,

Let *G _{η}*(

$$\left(\frac{{\phi}_{l}(y)}{1+{\phi}_{l}(y)},\frac{{\phi}_{u}(y)}{1+{\phi}_{u}(y)}\right),$$

where

$${\phi}_{l}(y)={\widehat{F}}_{\eta}(y)/{\widehat{\overline{F}}}_{\eta}(y)exp[-1.96\widehat{\sigma}(y,y)/\{{\widehat{F}}_{\eta}(y){\widehat{\overline{F}}}_{\eta}(y)\}]$$

and

$${\phi}_{u}(y)={\widehat{F}}_{\eta}(y)/{\widehat{\overline{F}}}_{\eta}(y)\xb7exp[1.96\widehat{\sigma}(y,y)/\{{\widehat{F}}_{\eta}(y){\widehat{\overline{F}}}_{\eta}(y)\}].$$

Here, (*y*, *y*) is an estimated standard error of *G*(*y*) and
${\widehat{\overline{F}}}_{\eta}(y)=1-{\widehat{F}}_{\eta}(y)$. Similar procedure can be applied to *G _{ξ}*(

Our proposed estimating functions can be extended to include more general weight functions, *w _{n}*(

$${U}_{n}^{w}(\beta )={n}^{-1/2}\sum _{i=1}^{n}{\int}_{0}^{\infty}{w}_{n}(y;\beta )\phantom{\rule{0.16667em}{0ex}}\{{Z}_{i}-\overline{Z}(y;\beta )\}{dN}_{i}(y{e}^{-{\beta}^{\text{T}}{Z}_{i}})=0.$$

(12)

Let
${D}_{w}={\int}_{0}^{\infty}w(y){\lambda}_{\eta}^{\prime}(y)/{\lambda}_{\eta}(y)\{{e}^{(2)}(y)-{e}^{(1)}{(y)}^{\otimes 2}/{e}^{(0)}(y)\}{f}_{\eta}(y)dy$ and
${V}_{w}={\int}_{0}^{\infty}w{(y)}^{2}\{{e}^{(2)}(y)-{e}^{(1)}{(y)}^{\otimes 2}/{e}^{(0)}(y)\}{f}_{\eta}(y)dy$. Denote
${\widehat{\beta}}_{n}^{w}$ the solution in (12). Under the assumed regularity conditions,
${\widehat{\beta}}_{n}^{w}$ is consistent and
${n}^{1/2}({\widehat{\beta}}_{n}^{w}\u2010{\beta}_{0})$ →
$\{0,{D}_{w}^{\u20101}{V}_{w}{({D}_{w}^{\u20101})}^{\text{T}}\}$. The sample-based method can be similarly used to estimate the variance of
${n}^{1/2}({\widehat{\beta}}_{n}^{w}-{\beta}_{0})$. In addition, by an application of Cauchy-Schwarz inequality, the optimal weight function for the weighted estimating functions in (12) should be proportional to
${w}_{\text{opt}}(y)={\lambda}_{\eta}^{\prime}(y)/{\lambda}_{\eta}(y)$. It is indeed shown in Ritov (1990) that
${\widehat{\beta}}_{n}^{w}$ of *w*_{opt}(*y*) is the most efficient estimator among all of the semiparametric estimators under model (6). It is however yet to see a broader application of *w*_{opt}(*y*) in practice given the challenge in its estimation.

Another prominent choice of weight function is *w _{g}*(

$${U}_{n}^{g}(\beta )={n}^{-3/2}\sum _{i=1}^{n}\sum _{j=1}^{n}({Z}_{i}-{Z}_{j})I\{log{Y}_{i}-log{Y}_{j}\le -{\beta}^{\text{T}}({Z}_{i}-{Z}_{j})\}.$$

Therefore, solving
${U}_{n}^{g}(\beta )=0$ amounts to minimizing *n*^{−1} Σ_{i}_{,}* _{j}* min{log

$${\widehat{\beta}}_{(k)}=arg\underset{\beta}{min}\sum _{i,j}\overline{w}\{log{Y}_{i}+{\widehat{\beta}}_{(k-1)}^{\text{T}}{Z}_{i};{\widehat{\beta}}_{(k-1)}\}min\{log{Y}_{i}-log{Y}_{j}+{\beta}^{\text{T}}({Z}_{i}-{Z}_{j}),0\},$$

where (*β*) = *w*(*y; β*)/*w _{g}*(

One use of the weighted estimating equations is in model adequacy assessment. As proposed in Gill and Schumacher (1987) for the proportional hazards model, the rationale is that the estimates based on different weighted estimating equations should be reasonably close to each other if the assumptions of the proportional hazards model do hold, and should differ otherwise. This same rationale can be applied to the adequacy assessment of model (1). Specifically, consider the estimating equations in (12) for two different weight functions, *w _{n}*

Due to the difficulty in estimating *λ _{η}*(·) and

$$\left[\begin{array}{cc}{V}_{11}({\beta}_{0})& {V}_{12}({\beta}_{0})\\ {V}_{21}({\beta}_{0})& {V}_{22}({\beta}_{0})\end{array}\right],$$

where

$$\begin{array}{l}{V}_{11}(\beta )={n}^{-1}\sum _{i}{\int}_{0}^{\infty}{w}_{n,1}^{2}(y;\beta ){\{{Z}_{i}-\overline{Z}(y;\beta )\}}^{\otimes 2}{dN}_{i}(y{e}^{-{\beta}^{\text{T}}{Z}_{i}}),\\ {V}_{12}(\beta )={V}_{21}(\beta )={n}^{-1}\sum _{i}{\int}_{0}^{\infty}{w}_{n,1}(y;\beta ){w}_{n,2}(y;\beta ){\{{Z}_{i}-\overline{Z}(y;\beta )\}}^{\otimes 2}{dN}_{i}(y{e}^{-{\beta}^{\text{T}}{Z}_{i}}),\end{array}$$

and

$${V}_{22}(\beta )={n}^{-1}\sum _{i}{\int}_{0}^{\infty}{w}_{n,2}^{2}(y;\beta ){\{{Z}_{i}-\overline{Z}(y;\beta )\}}^{\otimes 2}{dN}_{i}(y{e}^{-{\beta}^{\text{T}}{Z}_{i}}).$$

Thus, the following statistic can be used in the goodness-of-fit assessment of model adequacy:

$$T=\underset{\beta \in U(\xb7{\widehat{\beta}}_{n,1}^{w})}{min}\left\{{\left[\begin{array}{c}{U}_{n,1}^{w}(\beta )\\ {U}_{n,2}^{w}(\beta +{\widehat{\beta}}_{n,1}^{w}-{\widehat{\beta}}_{n,2}^{w})\end{array}\right]}^{\text{T}}{\left[\begin{array}{cc}{V}_{11}({\beta}_{0})& {V}_{12}({\beta}_{0})\\ {V}_{21}({\beta}_{0})& {V}_{22}({\beta}_{0})\end{array}\right]}^{-1}\left[\begin{array}{c}{U}_{n,1}^{w}(\beta )\\ {U}_{n,2}^{w}(\beta +{\widehat{\beta}}_{n,1}^{w}-{\widehat{\beta}}_{n,2}^{w})\end{array}\right]\right\},$$

(13)

which is asymptotically ${\chi}_{p}^{2}$-distributed as shown in Wei, Ying and Lin (1990).

Simulation studies are conducted to assess both semiparametric methods and the MLE methods for the proposed linear regression model. In our simulations, we consider the linear regression model (1) of log *X* = −*β*^{T}*Z* + *ε*, where two distributions are chosen for the random variable *ε*, i.e.,

- a standard Normal distribution with the density function of (2
*π*)^{−1/2}exp(−*s*^{2}/2), and - an Extreme-value distribution with the density function of 2 exp(2
*s*−*e*^{2}).^{s}

Hence for *ξ* = exp(*ε*), the density function *f _{ξ}*(·)’s are Log-normal (2

In each simulation, we generate a sample of *n* iid copies of size-biased (*Y*, *Z*) according to the model of
$logX=-{\beta}_{0}^{\text{T}}Z+\epsilon $. Here, *β*_{0} is the true value of *β*, and *Z* are simulated according to a uniform distribution *U*[0, 1] and hence continuous. We use four methods to estimate *β* in model (1):

- Para-L: MLE method with underlying Log-normal distribution,
- Para-W: MLE method with underlying Weibull distribution,
- Semi-L: semiparametric method with
*w*(*y*) 1, and - Semi-G: semiparametric method with the Gehan weight function
*w*(·)._{g}

In particular, when the underlying distribution is Log-normal, Para-W represents an incorrect MLE method using the Weibull distribution. Similarly, when the underlying distribution is Weibull, Para-L represents an incorrect MLE method using the Log-normal distribution.

Simulation results are tabulated in Table 1. In this table, *n* is chosen to be 50, 200, and 500, representing small, moderate and large sample sizes, respectively. The true value *β*_{0} is chosen to be 0 and 1, representing the null and a specific alternative hypotheses, respectively. Each cell in the table is calculated with 10,000 simulated samples. A bias is calculated as the average of 10,000 ( − *β*_{0})’s. A coverage probability is calculated as the percentage of 10,000 95% nominal confidence intervals (CI) containing *β*_{0}. The sample standard error (SSE) of 10,000 ’s and the mean of 10,000 estimated standard errors (MSE) are also calculated.

As shown in Table 1, the bias of for each method in this simulation setup is mostly close to zero, even though the MLE methods may specify incorrect underlying distributions for *f _{ξ}*( ·). The coverage probabilities for a correct Para-L or Para-W, i.e., an MLE method with correctly assumed underlying distributions, and Semi-L or Semi-W are generally close to the nominal level of 95%, while an incorrect Para-L or Para-W deviates from 95% notably. Similarly, the SSE and MSE are generally close to each other when the correct MLE methods and the semiparametric methods are used, but otherwise for the incorrect MLE methods.

As far as the semiparametric methods and the correct MLE methods are compared, the correct MLE methods generally yield smaller SSE and MSE, which means the semiparametric methods are yet to be fully efficient. As shown in the table, however, the efficiency of semiparametric methods may be potentially improved by choosing different weight functions. For example in the current simulation setup, the Gehan weight function may yield smaller variances, compared with those of *w _{g}*(·) 1. As a summary, the correct MLE methods generally outperform the semiparametric methods. The validity of MLE methods however depend on whether or not the assumptions are correct. The semiparametric methods are generally more reliable with respect to different underlying distributions. Their efficiencies may be improved by choosing appropriate weight functions.

In this section, we use our proposed methods to analyze the real data examples that are introduced in §1: one is the classical dataset of shrub widths in Muttlak (1988), and the other is the tumor size dataset in Kimmel and Flehinger (1991). To save space, we present major analysis results in this article.

The original dataset of shrub widths contains data on 89 shrub samples. In addition to the measurement of shrub widths (Width), two more attributes of mountain mahogany, maximum height (Height) and number of stems (Stem), were measured. Both attributes are important predictors of shrub width in estimating vegetation coverage of mountain mahogany. To ensure uniform coverage over the study area, two independent replications (Replicate), each with three systematically placed parallel transects (Transect), were established (Muttlak, 1988).

As shown in the scatter plots of Figure 2, when either attribute of Height or Stem increases, Width tends to increase. There appears, however, a quadratic relationship between Width and Stem. It is plausible that there may exist systematic deviation in Replicate and Transect. As shown in the boxplots of Figure 2, there appears some distributional difference in Width between different Replicates, although this difference is not seemingly prominent. There also appear distributional differences in Width between Transects I and II, and between Transects I and III. Some univariate summary statistics of the outcome variable and four covariates are in Table 2.

Exploratory plots of shrub data: (a) boxplot of Width by Replicate; (b) boxplot of Width by Transect; (c) scatter plot of Width versus Height; (d) scatter plot of Width versus Stem

To examine the association between Width and the two attributes adjusting for Replicate and Transect, we consider a linear regression model log *X* = −*β*^{T}*Z* + *ε*, where *X* is the outcome variable Width, and Z is the covariate vector including Height, Stem, Replicate and Transect, in regression analysis. Estimates of regression parameters are tabulated in Table 3. As shown in the table, all four methods, i.e., Para-L, Para-W, Semi-L and Semi-W, yield that Height and Stem are significant predictors, adjusting for Replicate and Transect, by their 95% confidence intervals (CI). This implies that taller shrubs with more stems are associated with greater width spread and hence more vegetation coverage, and per unit increase in Height and Stem would lead to about 118% and 2.5% increase in Width, respectively.

For the seemingly quadratic association with Stem, we include an additional squared Stem in the model, and find that Width is not significantly associated with the added quadratic term. It is however still significantly associated with Height and Stem, with or without Replicate and Transect. Applying the goodness-of-fit test in (13) to the model with all covariates, we obtain *T* = 9.13 ~ *χ*^{2}(*p* = 0.10, *df* = 5), which does not reject the models adequacy.

In Kimmel and Flehinger (1991), a lung cancer dataset was used to examine the relationship between the occurrence of metastases and the size of primary cancers. In the dataset, lung cancers were diagnosed in a population of male smokers over 45 years old who enrolled voluntarily in a randomly controlled trial to evaluate the use of sputum cytology. Two types of lung cancer were detected: adenocarcinomas by radiologic screening or patients symptoms, and epidermoid cancers by sputum cytology, chest X-ray or patients symptoms. The diagnosis of metastases was based on the then best available staging, clinical, surgical, or pathological readings. Among the total 228 patients, there were 141 adenocarcinomas and 87 epidermoid cancers. Tumor sizes were determined by the geometric means of recorded dimensions: resected cancers were measured directly, and nonresected cancers were mainly measured on radiography. More descriptive details can be found in Ghosh (2008).

As pointed out in Ghosh (2008), a complication in analyzing tumor size in this dataset is the presence of size-biased sampling, i.e., tumors detected in the screening program tend to depend on their growth and hence the sizes themselves. In Ghosh (2008), tumor sizes were treated as time-to-event variables. Their hazard functions were modeled by the proportional hazards model. Therefore, the regression parameters would be interpreted in relative risk of tumor sizes. Additional summary statistics and exploratory analysis results of this dataset can be found in Ghosh (2008).

To examine the association between tumor size and the occurrence of metastases in size-biased sampling, we consider the proposed linear regression model. We first fit a linear regression model with the presence/absence of metastases only. The regression parameter estimate of equals 0.45 with a 95% CI (0.12, 0.78). This means that tumor size is about 56% significantly greater in presence of metastases. However, when including the cancer types of adenocarcinomas and epidermoid in the model, becomes 0.24 with a 95% CI (−0.07, 0.57), which is not statistically significant. On the other hand, the regression parameter estimate for cancer types is 0.65 with a 95% CI (0.45, 0.86). This implies that tumor size tends to be 91% greater in epidermoid cancers than that in adenocarcinomas. Our finding is consistent with the relative risk calculation in Ghosh (2008), although larger sample size may be needed to confirm a significant positive association in relative risk. Applying the goodness-of-fit test in (13) to the model with both covariates, we obtain *T* = 4.97 ~ *χ*^{2}(*p* = 0.08, *df* = 2), which does not reject the models adequacy, either.

Our proposed regression model is essentially a semiparametric linear model. In the statistical literature, semiparametric linear models have been extensively studied. For example, the accelerated failure time model is a classical example of semiparametric linear model in time-to-event analysis (Kalbfleisch and Prentice, 2002, p. 218). Additional examples include semiparametric linear models in curve regression analysis (Hardle and Marron, 1990), generalized linear regression (Chen, 1995) and partial linear regression (Bhattacharya and Zhao, 1997).

One prominent assumption in our proposed model is the choice of log-transformation. Because outcomes in size-biased sampling are mostly positive, a log-transformation would relax the restriction on *β* to allow a wide range of covariates *Z*. In addition, for log-transformed *X*, our proposed model essentially assumes a multiplicative association between *X* and *Z*. It is the multiplicative association that leads to the critical invariance property, which greatly simplifies our estimation. Other transformations, such as identity, are yet to be seen to have these advantages.

However, our proposed model can be generalized to, for example, log *X* = *h*(*X*, *β*) + *ε* for some function *h*(*X*, *β*). Under this generalized model, the invariance property still holds in size-biased sampling. This shall lead to further work. Specifically, an alternative class of regression models can assume that

$$logX=h({\beta}^{\text{T}}X)+\epsilon ,$$

where *h*(·) is unknown, but *ε* follows a known distribution, such as a standard Normal distribution, to avoid potential identifiability issue. This model shall further relax model assumptions and assist with model adequacy assessment.

As pointed by an Associate Editor, the use of hazard functions for outcomes other than censored time-to-event seems unnatural. This is in fact why semiparametric linear regression model may be more appealing to model outcomes, such as shrub width or tumor size, than the usual proportional hazards model. Nevertheless, we extensively use (cumulative) hazards functions in this manuscript to develop our estimation procedure, because of the simple representation of baseline cumulative hazard function in (9). This simple representation ultimately facilitates a straightforward estimation of *β*. The advantage of hazard functions is also seen in developing the asymptotic theory of Theorem 2. Otherwise the expression of *D* would be more complex.

When *X* is time-to-event, regression models based on hazard functions are appealing. When time-to-event is size-biased, it is usually called length-biased. Researchers have been studying censored length-biased time-to-event in the statistical literature. For example, Vardi (1989) estimated the lifetime distribution under multiplicative censorship; Asgharian, M’Lan and Wolfson (2002) developed an unconditional approach to studying length-biased lifetimes; and a recent work by Cristóbal, Alcalá and Ojeda (2007) proposed some nonparametric estimation from backward recurrence times. More work is needed to extend the proposed linear regression model to censored time-to-event.

Details of those referenced in Section 2 are available in the web supplementary materials under the Paper Information link at the *Biometrics* web site http://www.tibs.org/biometrics.

The author wishes to thank an Associate Editor and two reviewers for their constructive comments, and also thank Professor Debashis Ghosh for his references to an earlier work and the tumor size dataset. This research is supported in part by the National Institutes of Health grants.

- Ahmad IA. On multivariate kernel estimation for samples from weighted distributions. Statistics & Probability Letters. 1995;22:121–129.
- Asgharian M, M’Lan CE, Wolfson DB. Length-biased sampling with right censoring: An unconditional approach. Journal of the American Statistical Association. 2002;97:201–209.
- Bhattacharya PK, Zhao PL. Semiparametric inference in a partial linear model. Annals of Statistics. 1997;25:244–262.
- Canfield RH. Application of the line intercept method in sampling range vegetation. Journal of Forestry. 1941;39:388–394.
- Chen H. Asymptotically efficient estimation in semiparametric generalized linear models. Annals of Statistics. 1995;23:1102–1129.
- Chen YQ, Jewell NP. On a general class of hazards regression models. Biometrika. 2001;88:687–702.
- Cox DR. Some sampling problems in technology. In: Johnson NL, Smith H H, editors. New Developments in Survey Sampling. New York: Wiley-Interscience; 1969. pp. 506–527.
- Cox DR. Regression models and life-tables (with discussion) Journal of Royal Statistical Society Series B. 1972;34:187–220.
- Cristóbal JA, Alcalá JT. Nonparametric regression estimators for length biased data. Journal of Statistical Planning and Inferences. 2000;89:145–168.
- Cristóbal JA, Alcalá JT, Ojeda JL. Nonparametric estimation of a regression function from backward recurrence times in a cross-sectional sampling. Lifetime Data Analysis. 2007;13:273–293. [PubMed]
- Davidov O, Zelen M. Referent sampling, family history and relative risk: the role of length-biased sampling. Biostatistics. 2001;2:173–181. [PubMed]
- Duan N. Smearing estimate: A nonparametric retransformation methods. Journal of the American Statistical Association. 1983;78:605–610.
- Fleming TR, Harrington DP. Counting Processes and Survival Analysis. New York: Wiley; 1991.
- Ghosh D. Proportional hazards regression for cancer studies. Biometrics. 2008;64:141–148. [PubMed]
- Gill RD, Schumacher M. A simple test of the proportional hazards assumption. Biometrika. 1987;74:289–300.
- Hardle W, Marron JS. Semiparametric comparison of regression curves. Annals of Statistics. 1990;18:63–89.
- Jin Z, Lin DY, Wei LJ, Ying Z. Rank-based inference for the accelerated failure time model. Biometrika. 2003;90:341–353.
- Jin Z, Ying Z, Wei LJ. A simple resampling method by perturbing the minimand. Biometrika. 2001;88:381–390.
- Jones MC. Kernel density estimation for length biased data. Biometrika. 1991;78:511–519.
- Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. 2. New York: Wiley; 2002.
- Kimmel M, Flehinger BJ. Nonparametric estimation of size-metastasis relationship in solid cancers. Biometrics. 1991;47:987–1004. [PubMed]
- Koenker R, Bassett GS. Regression quantiles. Econometrica. 1978;46:33–50.
- Lin DY. Goodness of fit for the Cox regression model based on a class of parameter estimators. Journal of the American Statistical Association. 1991;86:725–728.
- Muttlak HA. Ph.D. Dissertation. Laramie: University of Wyoming; 1988. Some Aspects of Ranked Set Sampling with Size Biased Probability of Selection.
- Muttlak HA, McDonald LL. Ranked set sampling with size-based probability of selection. Biometrics. 1990;46:435–446.
- Parzen MI, Wei LJ, Ying Z. A resampling method based on pivotal estimating functions. Biometrika. 1994;81:341–350.
- Patil GP, Rao CR. Weighted distributions and size-based sampling with applications to wildlife populations and human families. Biometrics. 1978;34:179–189.
- Rao CR, Zhao LC. Approximation to the distribution of
*M*-estimates in linear models by randomly weighted bootstrap. Sankhyâ A. 1992;54:323–331. - Ritov Y. Estimation in a linear regression model with censored data. Annals of Statistics. 1990;18:303–328.
- Simon R. Length biased sampling in etiologic studies. American Journal of Epidemiology. 1980;111:444–452. [PubMed]
- Tsiatis AA. Estimating regression parameter using linear rank tests for censored data. Annals of Statistics. 1990;18:354–372.
- Vardi Y. Nonparametric estimation in the presence of length bias. Annals of Statistics. 1982;10:616–620.
- Vardi Y. Empirical distributions in selection bias models. Annals of Statistics. 1985;13:178–203.
- Vardi Y. Multiplicative censoring, renewal processes, deconvolution and decreasing density: Nonparametric estimation. Biometrika. 1989;76:751–761.
- Wang MC. Hazards regression analysis for length-biased data. Biometrika. 1996;83:343–354.
- Wei LJ, Ying Z, Lin DY. Linear regression analysis of censored survival data based on rank tests. Biometrika. 1990;77:845–851.
- Wu CO. Local polynomial regression with selection biased data. Statistica Sinica. 2000;10:789–817.
- Zelen M. Forward and backward recurrence times and length biased sampling: age specific models. Lifetime Data Analysis. 2005;10:325–334. [PubMed]

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |