Stat Med. Author manuscript; available in PMC 2010 October 15.

Published in final edited form as: Stat Med. 2009 October 15; 28(23): 2912–2928. doi: 10.1002/sim.3678

PMCID: PMC2760016. NIHMSID: NIHMS147356

## SUMMARY

In a variety of biomedical applications, particularly those involving screening for infectious diseases, testing individuals (e.g., blood/urine samples, etc.) in pools has become a standard method of data collection. This experimental design, known as group testing (or pooled testing), can provide a large reduction in testing costs and can offer nearly the same precision as individual testing. To account for covariate information on individual subjects, regression models for group testing data have been proposed recently. However, there are currently no tools available to check the adequacy of these models. In this paper, we present various global goodness-of-fit tests for regression models with group testing data. We use simulation to examine small-sample size and power properties of the tests for different pool composition strategies. We illustrate our methods using two infectious disease data sets, one from an HIV study in Kenya and one from the Infertility Prevention Project.

## 1. INTRODUCTION

When screening subjects for sexually transmitted diseases such as HIV, it is often more practical, at least initially, to test subjects in pools rather than to test them individually. Whether the goal is to estimate the prevalence of infection or to identify those who are infected, the argument for testing subjects in pools through group testing has been made repeatedly in the statistical literature [1]–[5], and the infectious disease literature is replete with empirical evidence showing the profound advantages of pooling; e.g., [6]–[9]. The advantages of group testing for estimation and identification in drug discovery are also well known [10, 11].

In this paper, we consider regression models for a pooled binary response from group testing when covariate information is available for each subject in the pool. This type of model falls outside the usual binary regression model framework because the individual binary responses (e.g., infection statuses, etc.) are not necessarily observed. Motivated by a Kenyan HIV study [6], Vansteelandt et al. [1] propose a likelihood-based approach to model the probability of infection, as a function of fixed subject-specific covariates, using the observed initial pooled testing results. Xie [2] presents a conceptually different approach to fit fixed effects group testing regression models by treating individual statuses as latent and using the EM algorithm. This approach is more flexible than that in [1] because, among other things, one can incorporate testing results from retests; e.g., additional results obtained from retesting subsets of positive pools. More recently, Chen et al. [12] propose a generalization of the approach in [1] to include effects which are best regarded as random.

Modeling group testing data has been an important advance, because it allows one to take advantage of risk factors through the covariates while acknowledging population heterogeneity. This sharply contrasts with previous research in group testing which has largely required the individual probability of positivity to be the same for each subject. However, despite this methodological advance, there are currently no analytical tools one can use to assess the fit of the model. With this in mind, we propose global goodness-of-fit (GOF) tests tailored for use with the fixed effects group testing regression model proposed by Vansteelandt et al. [1]. These tests can provide needed guidance to assess the fit of a model, because graphical assessments of adequacy for this type of model are difficult to make. Our overarching approach in this paper is to consider analytical procedures which test whether or not a particular model is appropriate by detecting general departures from it.

Subsequent sections of this paper are organized as follows. In Section 2, we review the group testing regression modeling approach of Vansteelandt et al. [1]. In Section 3, we discuss the components of model adequacy and present four global GOF tests for group testing regression models. In Section 4, we consider different pool composition strategies and use simulation to characterize the performance of the tests. In Section 5, we apply our GOF procedures to two infectious disease data sets, one from the Kenyan HIV study in [1] and one from the Infertility Prevention Project. In Section 6, we summarize our findings, describe other approaches, and discuss future research in this area. Additional mathematical details are catalogued in the appendices.

## 2. NOTATION AND ASSUMPTIONS

Suppose that *N* individuals (e.g., blood/urine specimens, etc.) are drawn from a large population and that each individual is assigned to exactly one of *n* pools. Denote by *c_{i}* the size of the *i*th pool, for *i* = 1, 2, …, *n*, so that $N={\sum}_{i=1}^{n}{c}_{i}$. Let *Y_{ij}* denote the true (unobserved) binary status of the *j*th individual in the *i*th pool, and let **x**_{ij} denote the corresponding (*p* + 1) × 1 covariate vector. The individual response probabilities are modeled as

$${p}_{ij}\equiv {p}_{ij}(\mathit{\beta},{\mathbf{x}}_{ij})=E({Y}_{ij}\mid {\mathbf{x}}_{ij};\mathit{\beta})=\frac{exp\phantom{\rule{0.16667em}{0ex}}({\mathit{\beta}}^{\prime}{\mathbf{x}}_{ij})}{1+exp\phantom{\rule{0.16667em}{0ex}}({\mathit{\beta}}^{\prime}{\mathbf{x}}_{ij})},$$

(1)

where ***β*** is the (*p* + 1) × 1 vector of regression parameters. Let *Z_{i}* denote the observed binary testing outcome for the *i*th pool, and let *γ*_{1} and *γ*_{2} denote the (assumed known) sensitivity and specificity, respectively, of the testing assay. Vansteelandt et al. [1] model the mean of *Z_{i}* as

$${p}_{i}(\mathit{\beta})=E({Z}_{i}\mid {\mathbf{x}}_{i};\mathit{\beta})={\gamma}_{1}+{\gamma}_{12}\prod _{j=1}^{{c}_{i}}\{1-{p}_{ij}(\mathit{\beta},{\mathbf{x}}_{ij})\},$$

(2)

where
${\mathbf{x}}_{i}={({\mathbf{x}}_{i1}^{\prime},{\mathbf{x}}_{i2}^{\prime},\dots ,{\mathbf{x}}_{i{c}_{i}}^{\prime})}^{\prime}$ and *γ*_{12} = 1 − *γ*_{1} − *γ*_{2}. Because the latent statuses *Y_{ij}* are independent, the log-likelihood function of ***β***, given the observed pooled responses **Z** = (*Z*_{1}, *Z*_{2}, …, *Z_{n}*)′, is

$$l(\mathit{\beta}\mid \mathbf{Z})=\sum _{i=1}^{n}\left[{Z}_{i}log\left\{\frac{{p}_{i}(\mathit{\beta})}{1-{p}_{i}(\mathit{\beta})}\right\}+log\{1-{p}_{i}(\mathit{\beta})\}\right].$$

(3)

One can maximize the log-likelihood in (3) to obtain the maximum likelihood estimate (MLE) of ***β***, denoted throughout by $\widehat{\mathit{\beta}}$.
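The model in (1)–(3) can be prototyped directly. The following is a minimal sketch (not the authors' software), assuming `gamma1` and `gamma2` play the roles of the assay sensitivity and specificity in (2); all data and parameter values are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def pool_prob(beta, X_pool, gamma1=0.99, gamma2=0.99):
    """E(Z_i) under model (2): gamma1 + gamma12 * prod_j {1 - p_ij}."""
    p_ij = 1.0 / (1.0 + np.exp(-(X_pool @ beta)))   # individual probabilities, model (1)
    gamma12 = 1.0 - gamma1 - gamma2
    return gamma1 + gamma12 * np.prod(1.0 - p_ij)

def neg_loglik(beta, pools, Z):
    """Negative of the log-likelihood (3) for the pooled responses Z."""
    ll = 0.0
    for X_pool, z in zip(pools, Z):
        p_i = pool_prob(beta, X_pool)
        ll += z * np.log(p_i) + (1.0 - z) * np.log(1.0 - p_i)
    return -ll

# Hypothetical data: 40 pools of size 5, one intercept and one covariate.
rng = np.random.default_rng(1)
pools = [np.column_stack((np.ones(5), rng.uniform(-3, 3, 5))) for _ in range(40)]
true_beta = np.array([-3.0, 0.7])
Z = np.array([float(rng.random() < pool_prob(true_beta, Xp)) for Xp in pools])

fit = minimize(neg_loglik, x0=np.zeros(2), args=(pools, Z), method="BFGS")
beta_hat = fit.x   # MLE of beta
```

In practice one would supply the assay's actual sensitivity and specificity and verify convergence; standard errors then follow from the inverse information matrix.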

## 3. GLOBAL GOODNESS-OF-FIT TESTS

Because the latent individual statuses *Y_{ij}* are independent, the group testing regression model in (2) is induced by the logistic regression model in (1), so we focus on assessing the validity of (1). The essential components of fit for the logistic model can be specified by the following assumptions [13]:

- A1. The logit transformation is the correct function linking the covariates with the conditional mean.
- A2. The linear predictor ***β***′**x**_{ij} is correct; that is, one does not need to include additional covariates, transformations of variables, interactions, etc.
- A3. The variance of *Y_{ij}*, conditional on **x**_{ij}, is var(*Y_{ij}* | **x**_{ij}) = *p_{ij}*(***β***, **x**_{ij}){1 − *p_{ij}*(***β***, **x**_{ij})}.

Instead of checking A1–A3 individually (assumptions which are obviously confounded), we focus on the assessment of overall GOF with pooled testing response; that is, we aim to detect violations of any of the assumptions listed above. In doing so, we consider only the situation wherein there is no “natural grouping” of the individuals. This is likely the case when at least one of the covariates is continuous, so that the number of covariate patterns is close to or equal to *N*, the number of individuals.

We now describe four global GOF tests for (1) with pooled testing response. As the individual statuses *Y_{ij}* are not observed, GOF is assessed using the available pooled testing results *Z*_{1}, *Z*_{2}, …, *Z_{n}*.

The first GOF test statistic we consider is a Pearson *χ*^{2}-type statistic, defined by

$$G=G(\widehat{\mathit{\beta}})=\sum _{i=1}^{n}\frac{{\{{Z}_{i}-{p}_{i}(\widehat{\mathit{\beta}})\}}^{2}}{{p}_{i}(\widehat{\mathit{\beta}})\{1-{p}_{i}(\widehat{\mathit{\beta}})\}},$$

(4)

where ${p}_{i}(\widehat{\mathit{\beta}})$ is obtained by evaluating (2) at the MLE $\widehat{\mathit{\beta}}$.

One could mimic the approach of [14] and establish the asymptotic normality of our statistic defined in (4). However, we instead use the scaled chi-square distribution
$b{\chi}_{\nu}^{2}$ as a reference, rather than a normal distribution, to achieve better distributional properties with finite samples [13, 15]. The
$b{\chi}_{\nu}^{2}$ distribution is commonly used to approximate the distribution of nonnegative random variables [16]. The values *b* > 0 and *ν* > 0 are estimated by equating the first two (approximate) moments of *G* and those of the scaled *χ*^{2} distribution. The approach we use to obtain the moments of *G* follows the same spirit as the approach in [14], although the calculations are more complicated when considering group testing responses. It is understood that asymptotic properties apply when the number of pools *n* is large.

We sketch the derivation of the first two moments of $G=G(\widehat{\mathit{\beta}})$ here and provide full details in Appendix A. Suppose that the true value *β*_{0} is not on the boundary of the parameter space and denote the score function by *S*(***β***) = ∂*l*(***β*** | **Z**)/∂***β***. A first-order Taylor expansion of $G(\widehat{\mathit{\beta}})$ about *β*_{0} gives

$$G(\widehat{\mathit{\beta}})\approx G({\mathit{\beta}}_{0})+\left\{{\frac{\partial G(\mathit{\beta})}{\partial {\mathit{\beta}}^{\prime}}|}_{\mathit{\beta}={\mathit{\beta}}_{0}}\right\}(\widehat{\mathit{\beta}}-{\mathit{\beta}}_{0}).$$

(5)

Under mild regularity conditions, it can be shown that

$${\widehat{\mu}}_{G}\equiv E\{G(\widehat{\mathit{\beta}})\}\approx E\{G({\mathit{\beta}}_{0})\}=\sum _{i=1}^{n}E\left[\frac{{\{{Z}_{i}-{p}_{i}({\mathit{\beta}}_{0})\}}^{2}}{{p}_{i}({\mathit{\beta}}_{0})\phantom{\rule{0.16667em}{0ex}}\{1-{p}_{i}({\mathit{\beta}}_{0})\}}\right]=n$$

and that var{*G*($\widehat{\mathit{\beta}}$)} ≈ var{*G*(*β*_{0})} − *C*(*β*_{0})′$\mathcal{I}^{-1}$(*β*_{0})*C*(*β*_{0}), where $\mathcal{I}$(*β*_{0}) = *E*{*S*(*β*_{0})*S*(*β*_{0})′} is the Fisher information matrix and

$$C({\mathit{\beta}}_{0})=E\left\{{\frac{\partial G(\mathit{\beta})}{\partial \mathit{\beta}}|}_{\mathit{\beta}={\mathit{\beta}}_{0}}\right\}.$$

We provide closed-form expressions for var{*G*(*β*_{0})}, $\mathcal{I}$(*β*_{0}), and *C*(*β*_{0}) in Appendix A. Denote by ${\widehat{\sigma}}_{G}^{2}$ the estimate of var{*G*($\widehat{\mathit{\beta}}$)}, obtained by substituting $\widehat{\mathit{\beta}}$ for *β*_{0} in these expressions. The testing procedure we propose is outlined as follows:

- Calculate the Pearson statistic *G* = *G*($\widehat{\mathit{\beta}}$) and the approximate variance ${\widehat{\sigma}}_{G}^{2}$.
- Estimate the scale parameter *b* and degrees of freedom *ν* for the scaled *χ*^{2} distribution by equating the first two moments of $b{\chi}_{\nu}^{2}$ and *G*; that is, set ${\widehat{\mu}}_{G}=b\nu$ and ${\widehat{\sigma}}_{G}^{2}=2{b}^{2}\nu$. Solving for *b* and *ν*, we get $b={\widehat{\sigma}}_{G}^{2}/2{\widehat{\mu}}_{G}$ and $\nu =2{\widehat{\mu}}_{G}^{2}/{\widehat{\sigma}}_{G}^{2}$.
- Because *G*/*b* follows an approximate *χ*^{2} distribution with *ν* degrees of freedom, the p-value for the test of GOF is $\text{p}=\text{pr}({\chi}_{\nu}^{2}\ge G/b)$.

For individual testing data, Stukel [17] extends the logistic regression model by adding two additional parameters *α*_{1} and *α*_{2}, on the logit scale, where *α*_{1} (*α*_{2}) controls the shape of the mean function for positive (negative) values of the linear predictor *η_{ij}* = ***β***′**x**_{ij}. We adapt this idea to the group testing setting using a one-parameter version with shape parameter *α*_{0}, writing ***θ*** = (***β***′, *α*_{0})′; the null hypothesis *H*_{0}: *α*_{0} = 0 corresponds to the logistic model in (1). Under the extended model,

$${p}_{ij}(\mathit{\theta})={p}_{ij}(\mathit{\theta},{\mathbf{x}}_{ij})=E({Y}_{ij}\mid {\mathbf{x}}_{ij};\mathit{\theta})=\frac{exp\phantom{\rule{0.16667em}{0ex}}\{{h}_{{\alpha}_{0}}({\eta}_{ij})\}}{1+exp\phantom{\rule{0.16667em}{0ex}}\{{h}_{{\alpha}_{0}}({\eta}_{ij})\}},$$

(6)

where ${h}_{{\alpha}_{0}}({\eta}_{ij})$, with *η_{ij}* = ***β***′**x**_{ij}, is defined by

$${h}_{{\alpha}_{0}}({\eta}_{ij})=\{\begin{array}{cc}-{\alpha}_{0}^{-1}\{exp\phantom{\rule{0.16667em}{0ex}}({\alpha}_{0}\mid {\eta}_{ij}\mid )-1\},& {\alpha}_{0}>0\\ {\eta}_{ij},& {\alpha}_{0}=0\\ {{\alpha}_{0}}^{-1}\{log\phantom{\rule{0.16667em}{0ex}}(1-{\alpha}_{0}\mid {\eta}_{ij}\mid )\}& {\alpha}_{0}<0.\end{array}$$

It is worth emphasizing that *p_{ij}*(***θ***) in (6) reduces to *p_{ij}*(***β***, **x**_{ij}) in (1) when *α*_{0} = 0. Therefore, testing *H*_{0}: *α*_{0} = 0 within the extended model provides an assessment of the fit of the logistic model.
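The transform ${h}_{{\alpha}_{0}}$ can be coded directly from its piecewise definition. In the sketch below we take *h* to be the identity for *η* ≥ 0, an assumption consistent with the indicator $I({\widehat{\eta}}_{ij}<0)$ appearing in the score statistic *T_{S}*; the transform reduces to the logistic link at *α*_{0} = 0.

```python
import numpy as np

def h(alpha0, eta):
    """Piecewise link transform h_{alpha0}(eta) from (6).
    Treating eta >= 0 as unmodified is an assumption, consistent
    with the indicator I(eta < 0) in the score statistic T_S."""
    if eta >= 0.0 or alpha0 == 0.0:
        return eta
    a = abs(eta)
    if alpha0 > 0.0:
        return -(np.exp(alpha0 * a) - 1.0) / alpha0   # -alpha0^{-1}{e^{alpha0|eta|} - 1}
    return np.log(1.0 - alpha0 * a) / alpha0          # alpha0^{-1} log(1 - alpha0|eta|)
```

Continuity in *α*_{0} at zero can be checked numerically, which is what makes the score test at *α*_{0} = 0 well defined.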

Similar to Stukel [17], we formulate a score statistic to test *H*_{0} for the group testing regression model in [1]. Specifically, we rewrite the mean of *Z_{i}* as

$${p}_{i}(\mathit{\theta})=E({Z}_{i}\mid \mathit{\theta})={\gamma}_{1}+{\gamma}_{12}\prod _{j=1}^{{c}_{i}}\{1-{p}_{ij}(\mathit{\theta})\},$$

(7)

so that the log-likelihood based on the extended model is

$$l(\mathit{\theta}\mid \mathbf{Z})=l(\mathit{\beta},{\alpha}_{0}\mid \mathbf{Z})=\sum _{i=1}^{n}\left[{Z}_{i}log\left\{\frac{{p}_{i}(\mathit{\theta})}{1-{p}_{i}(\mathit{\theta})}\right\}+log\{1-{p}_{i}(\mathit{\theta})\}\right].$$

The score function *S*(** θ**) is given by

$$S(\mathit{\theta})=-{\gamma}_{12}\sum _{i=1}^{n}\left[\frac{\{{Z}_{i}-{p}_{i}(\mathit{\theta})\}}{{p}_{i}(\mathit{\theta})\phantom{\rule{0.16667em}{0ex}}\{1-{p}_{i}(\mathit{\theta})\}}\prod _{j=1}^{{c}_{i}}\{1-{p}_{ij}(\mathit{\theta})\}\sum _{j=1}^{{c}_{i}}\frac{\partial {p}_{ij}(\mathit{\theta})/\partial \mathit{\theta}}{1-{p}_{ij}(\mathit{\theta})}\right].$$

The statistic we propose is *T_{S}* = *S*(*α*_{0}), the *α*_{0}-component of the score function, evaluated at $\widetilde{\mathit{\theta}}={({\widehat{\mathit{\beta}}}^{\prime},0)}^{\prime}$; after simplification,

$${T}_{S}=\frac{{\gamma}_{12}}{2}\sum _{i=1}^{n}\left[\left\{\frac{{Z}_{i}-{\widehat{p}}_{i}}{{\widehat{p}}_{i}(1-{\widehat{p}}_{i})}\right\}\prod _{j=1}^{{c}_{i}}(1-{\widehat{p}}_{ij})\sum _{j=1}^{{c}_{i}}{\widehat{p}}_{ij}{\widehat{\eta}}_{ij}^{2}I({\widehat{\eta}}_{ij}<0)\right],$$

where ${\widehat{p}}_{ij}={p}_{ij}(\widehat{\mathit{\beta}},{\mathbf{x}}_{ij})$, ${\widehat{p}}_{i}={p}_{i}(\widehat{\mathit{\beta}})$, and ${\widehat{\eta}}_{ij}={\widehat{\mathit{\beta}}}^{\prime}{\mathbf{x}}_{ij}$. To standardize *T_{S}*, we partition the score function and information matrix as

$$S(\mathit{\theta})=\frac{\partial l(\mathit{\theta}\mid \mathbf{Z})}{\partial \mathit{\theta}}=\left(\begin{array}{c}\partial l(\mathit{\theta}\mid \mathbf{Z})/\partial \mathit{\beta}\\ \partial l(\mathit{\theta}\mid \mathbf{Z})/\partial {\alpha}_{0}\end{array}\right)=\left(\begin{array}{c}S(\mathit{\beta})\\ S({\alpha}_{0})\end{array}\right)$$

and

$$\mathcal{I}(\mathit{\theta})=\left(\begin{array}{cc}{\mathbf{I}}_{\mathit{\beta}\mathit{\beta}}& {\mathbf{I}}_{\mathit{\beta}{\alpha}_{0}}\\ {\mathbf{I}}_{{\alpha}_{0}\mathit{\beta}}& {I}_{{\alpha}_{0}{\alpha}_{0}}\end{array}\right)=\left(\begin{array}{cc}E\{S(\mathit{\beta})S{(\mathit{\beta})}^{\prime}\}& E\{S(\mathit{\beta})S({\alpha}_{0})\}\\ E\{S({\alpha}_{0})S{(\mathit{\beta})}^{\prime}\}& E\{S{({\alpha}_{0})}^{2}\}\end{array}\right),$$

so that
${\sigma}_{{T}_{S}}^{2}={I}_{{\alpha}_{0}{\alpha}_{0}}-{\mathbf{I}}_{{\alpha}_{0}\mathit{\beta}}{\mathbf{I}}_{\mathit{\beta}\mathit{\beta}}^{-1}{\mathbf{I}}_{\mathit{\beta}{\alpha}_{0}}$. To obtain an estimate of
${\sigma}_{{T}_{S}}^{2}$, we substitute $\widetilde{\mathit{\theta}}={({\widehat{\mathit{\beta}}}^{\prime},0)}^{\prime}$ for ***θ*** in the expressions above. Under *H*_{0}, the standardized statistic ${T}_{S}/{\widehat{\sigma}}_{{T}_{S}}$ is approximately standard normal for large *n*.
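Putting the pieces together, *T_{S}* itself is inexpensive to evaluate. A minimal sketch with hypothetical data, where `gamma1` and `gamma2` again stand in for the assay sensitivity and specificity:

```python
import numpy as np

def score_stat(beta_hat, pools, Z, gamma1=0.99, gamma2=0.99):
    """T_S = (gamma12/2) * sum_i [ resid_i * prod_j(1 - p_ij)
             * sum_j p_ij * eta_ij^2 * I(eta_ij < 0) ]."""
    gamma12 = 1.0 - gamma1 - gamma2
    total = 0.0
    for X_pool, z in zip(pools, Z):
        eta = X_pool @ beta_hat
        p_ij = 1.0 / (1.0 + np.exp(-eta))
        prod_neg = np.prod(1.0 - p_ij)
        p_i = gamma1 + gamma12 * prod_neg
        resid = (z - p_i) / (p_i * (1.0 - p_i))
        total += resid * prod_neg * np.sum(p_ij * eta**2 * (eta < 0.0))
    return 0.5 * gamma12 * total

# Illustrative pools and responses; beta_hat would come from fitting (3).
rng = np.random.default_rng(2)
pools = [np.column_stack((np.ones(4), rng.uniform(-3, 3, 4))) for _ in range(30)]
Z = rng.integers(0, 2, 30).astype(float)
T_S = score_stat(np.array([-2.5, 0.6]), pools, Z)
```

The statistic would then be divided by ${\widehat{\sigma}}_{{T}_{S}}$ and referred to the standard normal distribution.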

In logistic regression with individual data, the distribution of the Pearson *χ*^{2} statistic can be approximated by a *χ*^{2} distribution when the expected number of observations for each covariate pattern is not small. Unfortunately, this requirement is violated when the number of covariate patterns is close to or equal to the number of observations, the situation in which we are interested. One way to address this problem is to “group” the observed individual data so that asymptotic arguments can be applied to the “grouped data” [18]–[20]. We extend this idea to applications involving pooled binary responses.

When compared to other available GOF tests for individual testing data, the Hosmer-Lemeshow (HL) test is known to have lower power in detecting certain departures from the logistic model [13, 15]. However, the HL statistic remains widely used, so we reevaluate its potential use and include it for comparison purposes with pooled response data. For clarity, when we henceforth refer to the HL test, we exclusively use the term “pool” to identify an amalgamate of individuals for group testing. We use the term “group” to represent a set of the observed pools.

The HL statistic is computed as follows. First, we fit the pooled testing regression model defined in (1) and (2), obtaining the MLE $\widehat{\mathit{\beta}}$ and the model-predicted pool probabilities ${\widehat{p}}_{i}={p}_{i}(\widehat{\mathit{\beta}})$, for *i* = 1, 2, …, *n*. We then sort the pools according to the ${\widehat{p}}_{i}$ and partition them into *m* groups *g*_{1}, *g*_{2}, …, *g_{m}* of sizes *s*_{1}, *s*_{2}, …, *s_{m}*, respectively. The HL statistic is given by

$${T}_{HL}=\sum _{k=1}^{m}\frac{{\left({\sum}_{i\in {g}_{k}}{Z}_{i}-{\sum}_{i\in {g}_{k}}{\widehat{p}}_{i}\right)}^{2}}{\left({\sum}_{i\in {g}_{k}}{\widehat{p}}_{i}\right)\phantom{\rule{0.16667em}{0ex}}\left(1-{\sum}_{i\in {g}_{k}}{\widehat{p}}_{i}/{s}_{k}\right)}.$$

In general, we choose the subgroup sizes *s_{k}* to be equal. If *n* is not a multiple of *m*, we choose the *s_{k}* as close to equal as possible. By analogy with the individual testing case [13], *T_{HL}* is compared to a *χ*^{2} distribution with *m* − 2 degrees of freedom.
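A sketch of the grouping and the statistic, assuming the fitted pool probabilities `p_hat` are already available (the data here are synthetic):

```python
import numpy as np

def hosmer_lemeshow(Z, p_hat, m=10):
    """T_HL: sort pools by fitted probability, split into m near-equal
    groups g_k, and accumulate (O_k - E_k)^2 / {E_k (1 - E_k / s_k)}."""
    order = np.argsort(p_hat)
    T = 0.0
    for g in np.array_split(order, m):
        s_k = len(g)
        obs = Z[g].sum()            # observed positives in group k
        exp = p_hat[g].sum()        # expected positives in group k
        T += (obs - exp) ** 2 / (exp * (1.0 - exp / s_k))
    return T

rng = np.random.default_rng(3)
p_hat = rng.uniform(0.05, 0.4, 100)          # hypothetical fitted pool means
Z = (rng.random(100) < p_hat).astype(float)  # hypothetical pooled responses
T_HL = hosmer_lemeshow(Z, p_hat)
```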

Presnell and Boos [21] propose a general test for misspecification in parametric models by comparing the maximized “in-sample” likelihood with the maximized “out-of-sample” likelihood. We now describe how their methodology can be applied to diagnose GOF for the group testing regression model in [1]. Let *f*(*Z_{i}*; ***β***) denote the probability mass function of *Z_{i}* and let ${\widehat{\mathit{\beta}}}_{(i)}$ denote the MLE of ***β*** computed with the *i*th pool removed. The in-and-out-of-sample (IOS) statistic is

$$\text{IOS}=log\left\{\frac{{\prod}_{i=1}^{n}f({Z}_{i};\widehat{\mathit{\beta}})}{{\prod}_{i=1}^{n}f({Z}_{i};{\widehat{\mathit{\beta}}}_{(i)})}\right\}=\sum _{i=1}^{n}\{l({Z}_{i};\widehat{\mathit{\beta}})-l({Z}_{i};{\widehat{\mathit{\beta}}}_{(i)})\},$$

where *l*(*Z_{i}*; ***β***) denotes the log-likelihood contribution of the *i*th pool. Because computing the *n* leave-one-out estimates ${\widehat{\mathit{\beta}}}_{(i)}$ can be burdensome, Presnell and Boos [21] derive the asymptotic approximation

$${\text{IOS}}_{A}=tr\{\widehat{I}{(\widehat{\mathit{\beta}})}^{-1}\widehat{B}(\widehat{\mathit{\beta}})\},$$

where
$\widehat{I}(\widehat{\mathit{\beta}})={n}^{-1}{\sum}_{i=1}^{n}-\{\ddot{l}({Z}_{i};\widehat{\mathit{\beta}})\}$ is the average observed information matrix and
$\widehat{B}(\widehat{\mathit{\beta}})={n}^{-1}{\sum}_{i=1}^{n}\stackrel{.}{l}({Z}_{i};\widehat{\mathit{\beta}})\stackrel{.}{l}{({Z}_{i};\widehat{\mathit{\beta}})}^{\prime}$ is the sample covariance of the score function. Note that IOS_{A} does not require the leave-one-out estimates, so it is far simpler to compute than IOS when *n* is large.

Presnell and Boos [21] show that IOS_{A} is approximately normal, with mean equal to the number of estimated parameters when the model is correctly specified.
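The trace form of IOS_{A} is easy to assemble from per-pool score and curvature contributions. The sketch below is generic: the contributions are simulated stand-ins rather than the output of a real model fit, chosen so that the information identity holds and the statistic lands near the number of parameters.

```python
import numpy as np

def ios_a(score_contribs, hess_contribs):
    """IOS_A = tr{I_hat^{-1} B_hat}: I_hat is the average of the per-pool
    negative Hessians, B_hat the average outer product of the per-pool
    scores (which have mean zero at the MLE)."""
    n = len(score_contribs)
    I_hat = -np.mean(np.stack(hess_contribs), axis=0)
    B_hat = sum(np.outer(s, s) for s in score_contribs) / n
    return float(np.trace(np.linalg.solve(I_hat, B_hat)))

# Stand-in contributions for a one-parameter model: scores ~ N(0, 1),
# curvature -1 per pool, so IOS_A should be near 1.
rng = np.random.default_rng(4)
scores = [np.array([s]) for s in rng.normal(0.0, 1.0, 200)]
hessians = [np.array([[-1.0]]) for _ in range(200)]
stat = ios_a(scores, hessians)
```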

## 4. SIMULATION EVIDENCE

We consider two strategies to form pools for group testing. The first strategy is random pooling; i.e., individuals are assigned to pools at random, regardless of their covariate values in **x**_{ij}. The second strategy we consider is to form the pools homogeneously; i.e., individuals are assigned to pools based on their covariate values, so that individuals with similar covariates are placed in the same pool.

We consider three types of model misspecification. For each of the simulation models we consider, we aim to emulate the low-prevalence settings where group testing is commonly used. To be specific, we choose parameter settings in each model so that the proportion of positive individuals is between 0.05 and 0.10. The first type of departure we consider is the omission of a quadratic term. We generate individual responses *Y _{ij}* from the model

$$\text{logit}({p}_{ij})={\beta}_{0}+{\beta}_{1}{x}_{ij}+{\beta}_{2}{x}_{ij}^{2},$$

(8)

where *x_{ij}* is a continuous covariate generated from a uniform (−3, 3) distribution. The parameters *β*_{0}, *β*_{1}, and *β*_{2} are chosen so that the proportion of positive individuals falls between 0.05 and 0.10; the fitted (null) model omits the quadratic term.
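As an illustration of this first departure, individual data can be generated from (8) as follows; the parameter values below are our own choices, tuned only to land in the stated low-prevalence range.

```python
import numpy as np

def simulate_quadratic(N, beta, rng):
    """Draw x ~ Uniform(-3, 3) and Y ~ Bernoulli(p) with
    logit(p) = beta0 + beta1*x + beta2*x^2, as in model (8)."""
    x = rng.uniform(-3.0, 3.0, N)
    eta = beta[0] + beta[1] * x + beta[2] * x**2
    p = 1.0 / (1.0 + np.exp(-eta))
    return (rng.random(N) < p).astype(float), x

rng = np.random.default_rng(5)
Y, x = simulate_quadratic(1000, (-3.5, 0.2, 0.15), rng)  # illustrative betas
prevalence = Y.mean()  # roughly in the low-prevalence range
```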

The second type of departure we consider is the omission of an interaction term between a binary and a continuous covariate. The true model is

$$\text{logit}({p}_{ij})={\beta}_{0}+{\beta}_{1}{x}_{ij1}+{\beta}_{2}{x}_{ij2}+{\beta}_{3}{x}_{ij1}{x}_{ij2},$$

(9)

where *x_{ij1}* is a continuous covariate and *x_{ij2}* is a binary covariate; the fitted (null) model omits the interaction term.

The third type of departure we consider is a misspecification of the link function. We specify models with different link functions using the modified Stukel’s generalized logistic model in (6). Individual samples are generated from

$$\text{logit}({p}_{ij})={h}_{{\alpha}_{0}}(\beta {x}_{ij}),$$

(10)

where ${h}_{{\alpha}_{0}}(\cdot)$ is the transformation defined in Section 3.2 and *α*_{0} controls the departure from the logistic link; the fitted (null) model takes *α*_{0} = 0.

For each model in Section 4.2, we first simulate *N* = 1000 individual observations (*Y_{ij}*, **x**_{ij}), which are then assigned to pools to obtain the pooled responses *Z_{i}*.

When we form the pools for group testing, we use both random pooling (RP) and homogeneous pooling (HP), as described in Section 4.1. Random pools are created by assigning individual observations to pools in the order in which they were generated; homogeneous pools are created by sorting the covariates first and assigning individuals to pools based on this covariate ordering. HP is straightforward to implement when the simulation model has only one continuous covariate. For the model in (9), which includes the dichotomous covariate *x_{ij2}*, we form homogeneous pools by first stratifying on *x_{ij2}* and then sorting on *x_{ij1}* within each stratum.
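The two pool-composition strategies differ only in whether individuals are sorted before assignment; a minimal sketch, with illustrative pool size and covariate values:

```python
import numpy as np

def form_pools(x, c, homogeneous=False):
    """Split N individuals into pools of size c, either in the order
    generated (random pooling) or after sorting on the covariate
    (homogeneous pooling). Returns index arrays, one per pool."""
    idx = np.argsort(x) if homogeneous else np.arange(len(x))
    return [idx[k:k + c] for k in range(0, len(x), c)]

x = np.array([2.1, -0.3, 1.4, -2.2, 0.5, 1.9])
rp = form_pools(x, 3)                    # random (generation-order) pools
hp = form_pools(x, 3, homogeneous=True)  # covariate-sorted pools
```

For a model with a binary covariate as in (9), one would first split the indices by the binary covariate and apply the same sorting within each stratum.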

For the HL test, we take *m* = 10 (i.e., the decile of risk statistic). For the IOS_{A} test, we calculate p-values using a parametric bootstrap.

Tables 1, 2, and 3 contain estimated sizes and powers from our simulations under models (8), (9), and (10), respectively. First, consider the estimated sizes of the four GOF statistics *G*, *T_{S}*, *T_{HL}*, and IOS_{A}.

Among the four GOF statistics, each has excellent power to detect the omission of a quadratic term (Table 1) when HP is used. The same can be said for detecting a missing interaction term (Table 2), except for IOS_{A}. None of the four tests does particularly well at detecting a misspecified link function (Table 3). Our findings are not incongruous with those in [13, 22], where the performance of similar GOF tests is investigated for logistic regression with individual testing data. For the misspecified link function in (10), this difficulty persists across the values of *α*_{0} we considered.

## 5. APPLICATIONS

Verstraeten et al. [6] and Vansteelandt et al. [1] discuss a public-health study investigating HIV prevalence in rural Kenya. In this study, blood samples were collected from 740 pregnant women at four locations, and risk covariates were measured on each individual, including age, marital status, parity, and education level. Samples were first tested by an enzyme-linked immunosorbent assay (ELISA) test and positives were confirmed by a rapid assay test. Individual samples with a positive ELISA test result and a negative rapid assay test result were tested again with an ELISA to classify the sample as positive or negative. This sequential testing procedure was also implemented on pools of samples created at a central laboratory. Verstraeten et al. [6] report a significant cost reduction from using group testing while producing very similar prevalence estimates to those obtained from individual testing. To illustrate their regression methodology, Vansteelandt et al. [1] use the Kenya data and relate the individual prevalence to age through the logistic model

$$\text{logit}({p}_{ij})={\beta}_{0}+{\beta}_{1}{\text{Age}}_{ij}.$$

(11)

Our goal is to assess GOF for this model using pooled responses from the study.

After removing observations with missing values, as in [1], there are *N* = 706 individual observations available for analysis. We consider both “chronological” and “ordered” pool compositions. Chronological pools are constructed by the date of collection whereas ordered pools are constructed based on the values of the age covariate. As pointed out by the authors in [1], these compositions likely emulate random and homogeneous pooling, respectively. For both pooling strategies, we use *n* = 101 pools; 100 of size *c* = 7 and 1 of size *c* = 6, as in [1]. The sequential testing strategy used to elicit the pooled responses *Z_{i}* provides highly sensitive and specific assessments; we choose the values of *γ*_{1} and *γ*_{2} in (2) accordingly.

The state of Nebraska takes part in the nationwide Infertility Prevention Project (IPP) through its Sexually Transmitted Diseases and Infertility Control Program. At a number of clinic sites throughout the state, urine or swab specimens are collected on individuals and are then transported to the Nebraska Public Health Laboratory in Omaha for chlamydia and gonorrhea testing. Each year, about 30,000 tests are performed in the state; we use the individual testing data from the first quarter of 2006 for illustration purposes here. Specifically, the data set consists of chlamydia and gonorrhea infection statuses for 6,138 individual subjects as well as many potential risk covariates. Chen et al. [12] analyze this same data set, treating the site effect as random. However, because we are interested in fixed effects models, we do not take this perspective and ignore the site effect. For each infection (chlamydia and gonorrhea, treated separately), we consider the fixed effects part of the model in [12] and relate the infection statuses *Y_{ij}* to four covariates via

$$\text{logit}({p}_{ij})={\beta}_{0}+{\beta}_{1}{\text{Age}}_{ij}+{\beta}_{2}{\text{Gender}}_{ij}+{\beta}_{3}{\text{Urethritis}}_{ij}+{\beta}_{4}{\text{Symptoms}}_{ij}.$$

(12)

Our goal is to assess GOF of this model for both infections using pooled responses.

For each infection, we construct artificial pooled testing responses *Z_{i}*, as in [12], by grouping individuals into pools and determining each *Z_{i}* from the observed individual testing results of the pool members.

## 6. DISCUSSION

We have developed four global GOF tests for group testing regression models and have illustrated the usefulness of these tests using data from two infectious disease applications. Among the four tests, the Pearson *χ*^{2} and modified score tests have the highest power, and we recommend them for use as a global check for model adequacy. Of course, due to the information loss from pooling, these tests may not have sufficient power to detect certain types of misspecification. This is especially true when random pooling is used. The computation time involved for all tests is not overly prohibitive, possibly with the exception of the IOS_{A} test. Programs for data analysis are available by request.

Other GOF tests could be formulated for this problem; in fact, the four tests presented in this paper are only a subset of those we investigated. We initially considered formulating an unweighted sum of squares statistic, similar to the *Ŝ* statistic proposed in [13] for individual testing data. However, the statistic, when suitably normalized under group testing, did not provide a test with the correct size, and the test did not have higher power than the best tests herein. We also investigated a collection of bootstrapped statistics obtained from the predicted individual latent responses and the smoothed residual statistics discussed in [13, 15]. However, the bootstrap tests were not powerful, and there were technical difficulties in formulating the smoothed residuals tests with latent binary responses.

In this paper, we have taken a “global departure approach” to diagnose GOF with group testing models. It would be worthwhile to examine directed tests which target specific departures from A1–A3, although we leave this to future research. Future work could also include the generalization of our GOF tests to incorporate information from retesting subsets of positive pools or from other pooling algorithms [4]. Depending on the specific testing strategy used, some of the GOF tests discussed herein could be modified accordingly. However, the mathematical details could prove to be markedly more formidable. Detecting lack of fit in mixed effects group testing regression models [12] also remains as an open problem.

The authors would like to thank the Associate Editor and two anonymous referees for helpful comments on earlier versions of this article. We are grateful to Dr. Stijn Vansteelandt and his colleagues for providing us with the Kenya HIV data. We also thank Dr. Peter Iwen, Dr. Steven Hinrichs, and Philip Medina for their consultation on the Infertility Prevention Project. This research was supported by Grant R01 AI067373 from the National Institutes of Health.

Contract/grant sponsor: National Institutes of Health; contract/grant number: R01 AI067373

## APPENDIX A

We provide additional details on the derivation of the first two approximate moments of *G* in Section 3.1. Suppose that the true value *β*_{0} is not on the boundary of the parameter space and denote the score function by *S*(***β***) = ∂*l*(***β*** | **Z**)/∂***β***. Because $\widehat{\mathit{\beta}}$ solves the score equation, a first-order expansion of *S*($\widehat{\mathit{\beta}}$) about *β*_{0} gives

$$\mathbf{0}=S(\widehat{\mathit{\beta}})\approx S({\mathit{\beta}}_{0})-\left\{{-\frac{\partial S(\mathit{\beta})}{\partial {\mathit{\beta}}^{\prime}}|}_{\mathit{\beta}={\mathit{\beta}}_{0}}\right\}(\widehat{\mathit{\beta}}-{\mathit{\beta}}_{0}),$$

(A.1)

where **0** is a (*p* + 1) × 1 vector of zeros. Under suitable regularity conditions, it follows from the Law of Large Numbers that

$${\frac{\partial G(\mathit{\beta})}{\partial \mathit{\beta}}|}_{\mathit{\beta}={\mathit{\beta}}_{0}}\approx E\left\{{\frac{\partial G(\mathit{\beta})}{\partial \mathit{\beta}}|}_{\mathit{\beta}={\mathit{\beta}}_{0}}\right\}=C({\mathit{\beta}}_{0}),$$

(A.2)

for large *n*. Approximating the observed information matrix −∂*S*(***β***)/∂***β***′, evaluated at *β*_{0}, by its expectation $\mathcal{I}$(*β*_{0}), we solve (A.1) to obtain $\widehat{\mathit{\beta}}-{\mathit{\beta}}_{0}\approx {\mathcal{I}}^{-1}({\mathit{\beta}}_{0})S({\mathit{\beta}}_{0})$. Substituting this expression and (A.2) into (5) yields

$$G(\widehat{\mathit{\beta}})\approx G({\mathit{\beta}}_{0})+C{({\mathit{\beta}}_{0})}^{\prime}{\mathcal{I}}^{-1}({\mathit{\beta}}_{0})S({\mathit{\beta}}_{0}).$$

(A.3)

Our approximation of the first two moments of *G* is based on (A.3). Note that in (A.3), the only random quantities are *G*(*β*_{0}) and *S*(*β*_{0}), and all other terms are constants which depend on the true parameter *β*_{0}. Let *f*(**z**; ***β***) denote the joint probability mass function of **Z** = (*Z*_{1}, *Z*_{2}, …, *Z_{n}*)′. As shown in Section 3.1,

$$E\{G(\widehat{\mathit{\beta}})\}\approx E\{G({\mathit{\beta}}_{0})\}=\sum _{i=1}^{n}E\left[\frac{{\{{Z}_{i}-{p}_{i}({\mathit{\beta}}_{0})\}}^{2}}{{p}_{i}({\mathit{\beta}}_{0})\phantom{\rule{0.16667em}{0ex}}\{1-{p}_{i}({\mathit{\beta}}_{0})\}}\right]=n.$$

The fact that *E*{*G*(*β*_{0})} does not depend on *β*_{0} implies that *E*_{β}{*G*(***β***)} = *n* for all ***β***, so that

$$\frac{\partial}{\partial \mathit{\beta}}{E}_{\mathit{\beta}}\{G(\mathit{\beta})\}=\frac{\partial}{\partial \mathit{\beta}}{\int}_{{\mathcal{R}}^{n}}G(\mathit{\beta})f(\mathbf{z};\mathit{\beta})\phantom{\rule{0.16667em}{0ex}}d\mathbf{z}=\mathbf{0}.$$

(A.4)

Interchanging integration and differentiation, the right-hand side of (A.4) can be written as

$$\begin{array}{l}\frac{\partial}{\partial \mathit{\beta}}{\int}_{{\mathcal{R}}^{n}}G(\mathit{\beta})f(\mathbf{z};\mathit{\beta})\phantom{\rule{0.16667em}{0ex}}d\mathbf{z}={\int}_{{\mathcal{R}}^{n}}\frac{\partial G(\mathit{\beta})}{\partial \mathit{\beta}}f(\mathbf{z};\mathit{\beta})\phantom{\rule{0.16667em}{0ex}}d\mathbf{z}+{\int}_{{\mathcal{R}}^{n}}G(\mathit{\beta})\frac{\partial f(\mathbf{z};\mathit{\beta})}{\partial \mathit{\beta}}d\mathbf{z}\\ ={E}_{\mathit{\beta}}\left\{\frac{\partial G(\mathit{\beta})}{\partial \mathit{\beta}}\right\}+{\int}_{{\mathcal{R}}^{n}}G(\mathit{\beta})\frac{\partial log\phantom{\rule{0.16667em}{0ex}}\{f(\mathbf{z};\mathit{\beta})\}}{\partial \mathit{\beta}}f(\mathbf{z};\mathit{\beta})\phantom{\rule{0.16667em}{0ex}}d\mathbf{z}\\ ={E}_{\mathit{\beta}}\left\{\frac{\partial G(\mathit{\beta})}{\partial \mathit{\beta}}\right\}+{E}_{\mathit{\beta}}\{G(\mathit{\beta})S(\mathit{\beta})\},\end{array}$$

so that *E*_{β}{∂*G*(***β***)/∂***β***} = −*E*_{β}{*G*(***β***)*S*(***β***)}. Evaluating at ***β*** = *β*_{0} gives

$$E\{G({\mathit{\beta}}_{0})S({\mathit{\beta}}_{0})\}=-E\left\{{\frac{\partial G(\mathit{\beta})}{\partial \mathit{\beta}}|}_{\mathit{\beta}={\mathit{\beta}}_{0}}\right\}=-C({\mathit{\beta}}_{0}).$$

Thus, it follows that

$$\begin{array}{l}cov\{G({\mathit{\beta}}_{0}),S({\mathit{\beta}}_{0})\}=E\{G({\mathit{\beta}}_{0})S{({\mathit{\beta}}_{0})}^{\prime}\}-E\{G({\mathit{\beta}}_{0})\}E{\{S({\mathit{\beta}}_{0})\}}^{\prime}\\ =E\{G({\mathit{\beta}}_{0})S{({\mathit{\beta}}_{0})}^{\prime}\}=-C{({\mathit{\beta}}_{0})}^{\prime},\end{array}$$

(A.5)

since *E* {*S*(*β*_{0})} = **0**. From (A.3), the approximate variance of *G* is given by

$$\begin{array}{l}var\{G(\widehat{\mathit{\beta}})\}\approx var\{G({\mathit{\beta}}_{0})\}+2cov\{G({\mathit{\beta}}_{0}),C{({\mathit{\beta}}_{0})}^{\prime}{\mathcal{I}}^{-1}({\mathit{\beta}}_{0})S({\mathit{\beta}}_{0})\}\\ +var\{C{({\mathit{\beta}}_{0})}^{\prime}{\mathcal{I}}^{-1}({\mathit{\beta}}_{0})S({\mathit{\beta}}_{0})\}\\ =var\{G({\mathit{\beta}}_{0})\}-2C{({\mathit{\beta}}_{0})}^{\prime}{\mathcal{I}}^{-1}({\mathit{\beta}}_{0})C({\mathit{\beta}}_{0})+C{({\mathit{\beta}}_{0})}^{\prime}{\mathcal{I}}^{-1}({\mathit{\beta}}_{0})C({\mathit{\beta}}_{0})\\ =var\{G({\mathit{\beta}}_{0})\}-C{({\mathit{\beta}}_{0})}^{\prime}{\mathcal{I}}^{-1}({\mathit{\beta}}_{0})C({\mathit{\beta}}_{0}),\end{array}$$

where the first equality follows from (A.5) and the fact that cov{*S*(*β*_{0})} = $\mathcal{I}$(*β*_{0}). To derive an explicit (approximate) expression for var{*G*($\widehat{\mathit{\beta}}$)}, we need to calculate var{*G*(*β*_{0})}, *C*(*β*_{0}), and the information matrix $\mathcal{I}$(*β*_{0}). Define the *n* × 1 vector

$$\mathbf{a}(\mathit{\beta})={\left(\frac{1-2{p}_{1}(\mathit{\beta})}{{p}_{1}(\mathit{\beta})\{1-{p}_{1}(\mathit{\beta})\}},\frac{1-2{p}_{2}(\mathit{\beta})}{{p}_{2}(\mathit{\beta})\{1-{p}_{2}(\mathit{\beta})\}},\dots ,\frac{1-2{p}_{n}(\mathit{\beta})}{{p}_{n}(\mathit{\beta})\{1-{p}_{n}(\mathit{\beta})\}}\right)}^{\prime},$$

the (*p* + 1) × *n* matrix

$$\mathbf{Q}(\mathit{\beta})=\left(\frac{\partial {p}_{1}(\mathit{\beta})}{\partial \mathit{\beta}}\frac{\partial {p}_{2}(\mathit{\beta})}{\partial \mathit{\beta}}\cdots \frac{\partial {p}_{n}(\mathit{\beta})}{\partial \mathit{\beta}}\right),$$

and the *n* × *n* matrix

$$\mathbf{D}(\mathit{\beta})=\text{diag}\left[\frac{1}{{p}_{1}(\mathit{\beta})\phantom{\rule{0.16667em}{0ex}}\{1-{p}_{1}(\mathit{\beta})\}},\frac{1}{{p}_{2}(\mathit{\beta})\phantom{\rule{0.16667em}{0ex}}\{1-{p}_{2}(\mathit{\beta})\}},\dots ,\frac{1}{{p}_{n}(\mathit{\beta})\phantom{\rule{0.16667em}{0ex}}\{1-{p}_{n}(\mathit{\beta})\}}\right].$$

Using this notation, it is straightforward to show that var{*G*(*β*_{0})} = **1**′**D**(*β*_{0})**1** − 4*n*, *C*(*β*_{0}) = −**Q**(*β*_{0})**a**(*β*_{0}), and ℐ(*β*_{0}) = **Q**(*β*_{0})**D**(*β*_{0})**Q**(*β*_{0})′, where **1** is an *n* × 1 vector of ones.
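The variance identity above can be verified by direct enumeration: taking *G*(*β*_{0}) to be the Pearson-type sum Σ_{i} {*Z*_{i} − *p*_{i}(*β*_{0})}²/[*p*_{i}(*β*_{0}){1 − *p*_{i}(*β*_{0})}] with independent *Z*_{i} ~ Bernoulli(*p*_{i}(*β*_{0})) (the form consistent with var{*G*(*β*_{0})} = **1**′**D**(*β*_{0})**1** − 4*n*), each summand has exact variance 1/[*p*_{i}(1 − *p*_{i})] − 4. A minimal pure-Python sketch, using arbitrary illustrative probabilities:

```python
# Exact check of var{G(beta0)} = 1'D(beta0)1 - 4n for the Pearson-type
# statistic G = sum_i (Z_i - p_i)^2 / [p_i (1 - p_i)], Z_i ~ Bernoulli(p_i).
# The probabilities below are hypothetical illustrative values.

def term_variance(p):
    """Exact variance of (Z - p)^2 / [p(1-p)], enumerating Z in {0, 1}."""
    q = 1.0 - p
    v0 = (0.0 - p) ** 2 / (p * q)        # value when Z = 0
    v1 = (1.0 - p) ** 2 / (p * q)        # value when Z = 1
    mean = q * v0 + p * v1               # equals 1 for every p
    return q * (v0 - mean) ** 2 + p * (v1 - mean) ** 2

probs = [0.05, 0.10, 0.30, 0.50, 0.80]   # hypothetical p_i(beta0)
n = len(probs)

# Independence of the Z_i: var{G} is the sum of per-term variances.
var_G = sum(term_variance(p) for p in probs)

# Closed form: 1'D1 - 4n, with D = diag{1 / [p_i (1 - p_i)]}.
closed_form = sum(1.0 / (p * (1.0 - p)) for p in probs) - 4.0 * n

assert abs(var_G - closed_form) < 1e-9
```

The algebra behind the match: *E*{(*Z* − *p*)²} = *pq* and *E*{(*Z* − *p*)⁴} = *pq*(1 − 3*pq*), so each standardized term has variance (1 − 3*pq*)/(*pq*) − 1 = 1/(*pq*) − 4.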

The Fisher information matrix in Section 3.2 is given by

$$\begin{array}{l}\mathcal{I}(\mathit{\theta})=E\{S(\mathit{\theta})S{(\mathit{\theta})}^{\prime}\}\\ =E\left(\left[\sum _{i=1}^{n}\frac{{Z}_{i}-{p}_{i}(\mathit{\theta})}{{p}_{i}(\mathit{\theta})\phantom{\rule{0.16667em}{0ex}}\{1-{p}_{i}(\mathit{\theta})\}}\left\{\frac{\partial {p}_{i}(\mathit{\theta})}{\partial \mathit{\theta}}\right\}\right]\phantom{\rule{0.16667em}{0ex}}{\left[\sum _{i=1}^{n}\frac{{Z}_{i}-{p}_{i}(\mathit{\theta})}{{p}_{i}(\mathit{\theta})\phantom{\rule{0.16667em}{0ex}}\{1-{p}_{i}(\mathit{\theta})\}}\left\{\frac{\partial {p}_{i}(\mathit{\theta})}{\partial \mathit{\theta}}\right\}\right]}^{\prime}\right)\\ =\sum _{i=1}^{n}E\left[\frac{{\{{Z}_{i}-{p}_{i}(\mathit{\theta})\}}^{2}}{{p}_{i}{(\mathit{\theta})}^{2}{\{1-{p}_{i}(\mathit{\theta})\}}^{2}}\left\{\frac{\partial {p}_{i}(\mathit{\theta})}{\partial \mathit{\theta}}\frac{\partial {p}_{i}(\mathit{\theta})}{\partial {\mathit{\theta}}^{\prime}}\right\}\right]\\ =\sum _{i=1}^{n}\left[\frac{1}{{p}_{i}(\mathit{\theta})\phantom{\rule{0.16667em}{0ex}}\{1-{p}_{i}(\mathit{\theta})\}}\left\{\frac{\partial {p}_{i}(\mathit{\theta})}{\partial \mathit{\theta}}\frac{\partial {p}_{i}(\mathit{\theta})}{\partial {\mathit{\theta}}^{\prime}}\right\}\right],\end{array}$$

where the penultimate equality holds because the *Z*_{i} are independent, and the final equality holds because *E*[{*Z*_{i} − *p*_{i}(*θ*)}²] = var(*Z*_{i}) = *p*_{i}(*θ*){1 − *p*_{i}(*θ*)}. Evaluated at the maximum likelihood estimate *θ̂* for the model in Section 3.2, the information matrix becomes

$$\mathcal{I}(\widehat{\mathit{\theta}})={\gamma}_{12}^{2}\sum _{i=1}^{n}\left[{\left\{\prod _{j=1}^{{c}_{i}}(1-{\widehat{p}}_{ij})\right\}}^{2}\frac{{\mathbf{w}}_{i}{\mathbf{w}}_{i}^{\prime}}{{\widehat{p}}_{i}(1-{\widehat{p}}_{i})}\right],$$

where ${\mathbf{w}}_{i}={\sum}_{j=1}^{{c}_{i}}{\widehat{p}}_{ij}{({\mathbf{x}}_{ij}^{\prime},{\widehat{\eta}}_{ij}^{2}I({\widehat{\eta}}_{ij}<0)/2)}^{\prime}$.
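The structure of **w**_{i} reflects the gradient of the pool-level probability. Under an individual-level logistic model with perfect tests, the pool probability is *p*_{i} = 1 − Π_{j}(1 − *p*_{ij}), and differentiation gives ∂*p*_{i}/∂*β* = {Π_{j}(1 − *p*_{ij})} Σ_{j} *p*_{ij}**x**_{ij}, which is the covariate block of **w**_{i} (with γ_{12} = 1 and the shape-parameter component omitted). A finite-difference sketch of this identity, with hypothetical covariate values:

```python
# Numerical check of the gradient identity behind w_i: for a pool of size c
# with individual logistic probabilities p_j = expit(x_j' beta), the pool
# probability is p = 1 - prod_j (1 - p_j), and
#     dp/dbeta = [prod_j (1 - p_j)] * sum_j p_j x_j.
# beta and the covariates below are arbitrary illustrative values.
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def pool_prob(beta, xs):
    """P(pool tests positive) = 1 - prod_j (1 - p_j)."""
    prod = 1.0
    for x in xs:
        eta = sum(b * xk for b, xk in zip(beta, x))
        prod *= 1.0 - expit(eta)
    return 1.0 - prod

beta = [-1.0, 0.5]                             # hypothetical (intercept, slope)
xs = [[1.0, 0.2], [1.0, -1.1], [1.0, 0.7]]     # pool of c = 3 individuals

# Closed form: prod_j (1 - p_j) * sum_j p_j x_j
p_j = [expit(sum(b * xk for b, xk in zip(beta, x))) for x in xs]
prod_q = 1.0
for p in p_j:
    prod_q *= 1.0 - p
grad_closed = [prod_q * sum(p * x[k] for p, x in zip(p_j, xs))
               for k in range(len(beta))]

# Central finite-difference gradient of the pool probability
h = 1e-6
grad_fd = []
for k in range(len(beta)):
    bp = list(beta); bp[k] += h
    bm = list(beta); bm[k] -= h
    grad_fd.append((pool_prob(bp, xs) - pool_prob(bm, xs)) / (2 * h))

assert all(abs(a - b) < 1e-6 for a, b in zip(grad_closed, grad_fd))
```

Squaring the leading factor Π_{j}(1 − *p̂*_{ij}) and dividing by *p̂*_{i}(1 − *p̂*_{i}) then reproduces the per-pool contribution in the displayed ℐ(*θ̂*).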

1. Vansteelandt S, Goetghebeur E, Verstraeten T. Regression models for disease prevalence with diagnostic tests on pools of serum samples. Biometrics. 2000;56:1126–1133. doi: 10.1111/j.0006-341X.2000.01126.x.

2. Xie M. Regression analysis of group testing samples. Statistics in Medicine. 2001;20:1957–1969. doi: 10.1002/sim.817.

3. Gastwirth J, Johnson W. Screening with cost effective quality control: Potential applications to HIV and drug testing. Journal of the American Statistical Association. 1994;89:972–981.

4. Kim H, Hudgens M, Dreyfuss J, Westreich D, Pilcher C. Comparison of group testing algorithms for case identification in the presence of testing error. Biometrics. 2007;63:1152–1163. doi: 10.1111/j.1541-0420.2007.00817.x.

5. Hung M, Swallow W. Use of binomial group testing in tests of hypotheses for classification or quantitative covariables. Biometrics. 2000;56:204–212. doi: 10.1111/j.0006-341X.2000.00204.x.

6. Verstraeten T, Farah B, Duchateau L, Matu R. Pooling sera to reduce the cost of HIV surveillance: A feasibility study in a rural Kenyan district. Tropical Medicine and International Health. 1998;3:747–750. doi: 10.1046/j.1365-3156.1998.00293.x.

7. Pilcher C, Fiscus S, Nguyen T, Foust E, Wolf L, Williams D, Ashby R, O’Dowd J, McPherson J, Stalzer B, Hightow L, Miller W, Eron J, Cohen M, Leone P. Detection of acute infections during HIV testing in North Carolina. New England Journal of Medicine. 2005;352:1873–1883.

8. Kacena K, Quinn S, Hartman S, Quinn T, Gaydos C. Pooling of urine samples for screening for *Neisseria gonorrhoeae* by ligase chain reaction: Accuracy and application. Journal of Clinical Microbiology. 1998;36:3624–3628.

9. Kacena K, Quinn S, Howell M, Madico G, Quinn T, Gaydos C. Pooling urine samples for ligase chain reaction screening for genital *Chlamydia trachomatis* infection in asymptomatic women. Journal of Clinical Microbiology. 1998;36:481–485.

10. Remlinger K, Hughes-Oliver J, Young S, Lam R. Statistical design of pools using optimal coverage and minimal collision. Technometrics. 2006;48:133–143. doi: 10.1198/004017005000000481.

11. Hughes-Oliver J. Pooling experiments for blood screening and drug discovery. In: Dean A, Lewis S, editors. Screening: Methods for Experimentation in Industry, Drug Discovery, and Genetics. Springer; New York: 2006.

12. Chen P, Tebbs J, Bilder C. Group testing regression models with fixed and random effects. Biometrics. 2009. doi: 10.1111/j.1541-0420.2008.01183.x. In press.

13. Hosmer D, Hosmer T, le Cessie S, Lemeshow S. A comparison of goodness-of-fit tests for the logistic regression model. Statistics in Medicine. 1997;16:965–980.

14. Osius G, Rojek D. Normal goodness-of-fit tests for multinomial models with large degrees of freedom. Journal of the American Statistical Association. 1992;87:1145–1152.

15. le Cessie S, van Houwelingen J. A goodness-of-fit test for binary regression models, based on smoothing methods. Biometrics. 1991;47:1267–1282.

16. Cox D, Hinkley D. Theoretical Statistics. Chapman and Hall/CRC; Boca Raton: 1974.

17. Stukel T. Generalized logistic models. Journal of the American Statistical Association. 1988;83:426–431.

18. Hosmer D, Lemeshow S. Goodness of fit tests for the multiple logistic regression model. Communications in Statistics: Theory and Methods. 1980;9:1043–1069.

19. Lemeshow S, Hosmer D. A review of goodness of fit statistics for use in the development of logistic regression models. American Journal of Epidemiology. 1982;115:92–106.

20. Hosmer D, Lemeshow S, Klar J. Goodness-of-fit testing for the multiple logistic regression analysis when the estimated probabilities are small. Biometrical Journal. 1988;30:911–924.

21. Presnell B, Boos D. The IOS test for model misspecification. Journal of the American Statistical Association. 2004;99:216–227. doi: 10.1198/016214504000000214.

22. Capanu M, Presnell B. Misspecification tests for binomial and beta-binomial models. Statistics in Medicine. 2008;27:2536–2554. doi: 10.1002/sim.3049.

23. Bilder C, Tebbs J. Bias, efficiency, and agreement in group-testing regression models. Journal of Statistical Computation and Simulation. 2009;79:67–80. doi: 10.1080/00949650701608990.
