
Stat Med. Author manuscript; available in PMC 2014 February 17.

Published in final edited form as:

Stat Med. 2012 August 15; 31(18): 2010–2026.

Published online 2012 February 23. doi: 10.1002/sim.4475. PMCID: PMC3926129

NIHMSID: NIHMS531669


The US Food and Drug Administration recently announced the final guidelines on the development and validation of Patient-Reported Outcomes (PROs) assessments in drug labeling and clinical trials. This guidance paper may boost the demand for new PRO survey questionnaires. Henceforth biostatisticians may encounter psychometric methods more frequently, particularly Item Response Theory (IRT) models to guide the shortening of a PRO assessment instrument. This article aims to provide an introduction to the theory and practical analytic skills of fitting a Generalized Partial Credit Model (GPCM) in IRT. GPCM theory is explained first, with special attention to a clearer exposition of the formal mathematics than what is typically available in the psychometric literature. Then a worked example is presented, using self-reported responses taken from the International Personality Item Pool. The worked example contains step-by-step guides on using the statistical languages R and WinBUGS in fitting the GPCM. Finally, the Fisher information function of the GPCM is derived and used to evaluate, as an illustrative example, the usefulness of assessment items by their information contents. This article aims to encourage biostatisticians to apply IRT models in the re-analysis of existing data and in future research.

Creating and validating a new PRO questionnaire typically requires a team of experts, among them behavioral scientists who analyze the questionnaire data. However, the analytic tasks may become the responsibility of the biostatistician if a behavioral scientist is not available, or if the behavioral scientist has limited training in the more sophisticated Item Response Theory (IRT) methods [1, 2, 3, 4, 5]. Recent publications on HRQOL assessments indicate the increasing use of IRT as the primary statistical method in developing and evaluating PRO instruments [6, 7, 8]. IRT models have also been used to address novel research questions in medicine and bioinformatics (e.g., [9] and [10]). Biostatisticians can easily reach new research territories by incorporating IRT into their data-analytic repertoire. This tutorial aims to provide a step-by-step guide to carrying out basic IRT analyses using the freely available software programs R [11] and WinBUGS [12].

In December 2009, the US Food and Drug Administration (FDA) announced the final guidance on using PROs as part of clinical trials and for drug labeling [13], three years after the announcement of a draft guidance [14]. A European guidance document on the use of PROs in the evaluation of medical products in cancer was published in 2005 by the European Medicines Agency (EMEA) [15]. The FDA guidelines are the regulatory agency's attempt to standardize the procedures in the creation, refinement, validation, and clinical use of psychometric instruments in clinical trials and drug labeling. These guidelines aim to provide practical advice to researchers, including how to define the PRO domain(s) to be measured (e.g., section III.C on conceptualizing the PRO constructs of interest), how to write survey items, how to decide what response format is most appropriate, and how to evaluate patient understanding. As part of “item analysis”, survey items may be deleted or modified in response to patient understanding and preliminary data analysis. This tutorial focuses primarily on item analysis from a Bayesian IRT perspective. Ultimately, each respondent's HRQOL is represented by one numeric value, derived from the respondent's responses to questionnaire items. Item analysis is typically iterative: several revisions may be required before the draft survey instrument is finally deemed valid (as per section III.E), reliable, and responsive to changes in well-being, to name a few of the FDA's recommendations. However, the FDA guidelines offer no explicit recommendations on how to carry out these analyses, although the conceptual diagram hints at a Factor Analysis framework (Figure 4, [13]). Most of our biostatistician colleagues are familiar with Factor Analysis but not with IRT, which motivated our writing of this article to cover item analysis using IRT.

IRT modeling is also useful in its own right. For example, IRT has been applied in analyzing inter-rater agreement data in rating the severity of hip fractures [9], in microarray gene expression analysis to identify clusters of genes that are related to drug response in acute leukemias [10], and extensions of the classical Rasch model [16] have been applied in identifying clusters of students with discrete levels of latent academic achievement. More generally, the Rasch model is closely related to the conditional logit model [17] and the conditional logistic regression model for binary matched pairs [18, chapter 10]. Despite their versatility, IRT models have yet to gain wider use in biostatistics. This is in part because the command syntax of popular IRT software programs can be arcane for new users (see [2] for a list of packages). The occasional user of IRT may be hesitant to invest the time and effort in learning it. We hope this tutorial facilitates the use of IRT models: a distillation of the cited sources into a practical guide using freely available statistical programming languages, so that readers can immediately apply these analytic skills in their own research.

Our primary goal is to guide readers in applying their Bayesian analytic skills to a previously unfamiliar area of statistics (thus we provide details on how an IRT model is derived). We also hope that this article is equally useful to statisticians who are quite familiar with IRT and/or psychometrics but are new to a Bayesian analytic approach to IRT modeling (thus we provide details on Bayesian computation). The overall plan is to provide enough mathematics in both IRT model derivations and Bayesian computing so that they can be quickly deployed in practice. A worked example is provided, on the Generalized Partial Credit Model (GPCM) by Muraki [19]. The GPCM is among several commonly used models for analyzing items with polytomous response categories [6]. This article is not about choosing one model from among alternative IRT models; interested readers can find such comparisons elsewhere [20, 21]. The deviance information criterion (DIC) by Spiegelhalter, Best, Carlin, and van der Linde [22], which is calculated as part of the default output from R2WinBUGS, is useful in model selection. What we lack in breadth we hope to compensate for in depth.

This paper is organized as follows. Section 2 covers the theories behind widely used IRT models, including the Rasch Model (RM) [16] for binary responses and the Partial Credit Model (PCM) [23] and the Generalized Partial Credit Model (GPCM) [19] for polytomous item responses. Section 3 develops the GPCM from a Bayesian perspective. We do not go into the details of how to manually carry out Gibbs sampling in IRT, but provide a list of references for interested readers. Section 4 translates the GPCM mathematics into WinBUGS syntax. We assume that readers have successfully installed R, the R2WinBUGS package in R, and WinBUGS on a computer platform of their choice. Section 5 illustrates how to diagnose the convergence of iterative sampling. Sections 3 – 5 are the main focus of this paper. They cover in detail how to fit the GPCM using R and WinBUGS. Section 6 focuses on the practical aspects of item analysis, on how to decide which questionnaire items should be modified or deleted. Our overall pedagogical plan is to provide enough mathematical rigor on IRT so that readers can acquire a working knowledge of IRT without the need to review the vast psychometric literature spanning several decades. Finally, we discuss in section 7 how these steps can be used to address the statistical considerations outlined in the FDA guidelines.

One of the simplest IRT models is the Rasch Model (RM) for dichotomized response data, developed by the Danish mathematician Georg Rasch [16]. The RM handles data coded as ‘correct’/‘incorrect’ or ‘yes’/‘no’ with a value of 1 coding a correct answer or a ‘yes’ response. The log odds of answering an item correctly is a function of two parameters:

$$\mathrm{ln}\left[\frac{\mathrm{Pr}({y}_{ij}=1\mid {\theta}_{i},{\beta}_{j})}{1-\mathrm{Pr}({y}_{ij}=1\mid {\theta}_{i},{\beta}_{j})}\right]={\theta}_{i}-{\beta}_{j},$$

(1)

where Pr(*y _{ij}* = 1 | *θ _{i}*, *β _{j}*) is the probability that person *i* gives a correct or ‘yes’ response (*y _{ij}* = 1) to item *j*, *θ _{i}* is the latent trait of person *i*, and *β _{j}* is the difficulty of item *j*. Solving Eq. (1) for this probability gives

$$\begin{array}{cc}\mathrm{Pr}({y}_{ij}=1\mid {\theta}_{i},{\beta}_{j})\hfill & \hfill ={\mathrm{logit}}^{-1}({\theta}_{i}-{\beta}_{j})\\ \hfill & \hfill =\frac{\mathrm{exp}({\theta}_{i}-{\beta}_{j})}{1+\mathrm{exp}({\theta}_{i}-{\beta}_{j})}\end{array}$$

(2)

$$\begin{array}{cc}\hfill & \hfill =\frac{1}{1+\mathrm{exp}(-({\theta}_{i}-{\beta}_{j}))}.\end{array}$$

(3)

Eq. (2) is more commonly used in the literature than the simpler Eq. (3) (see [24]).
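
As a quick numeric check of Eqs. (2) and (3), the two algebraic forms of the inverse logit can be compared directly. The following Python sketch is ours, independent of the article's R and WinBUGS listings; the function names are illustrative only:

```python
import math

def rasch_prob(theta, beta):
    """Pr(y = 1 | theta, beta) under the Rasch model, the form of Eq. (3)."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

def rasch_prob_eq2(theta, beta):
    """The same probability written as in Eq. (2)."""
    e = math.exp(theta - beta)
    return e / (1.0 + e)

# A person slightly above average (theta = 0.5) facing an easy item (beta = -0.3)
p = rasch_prob(0.5, -0.3)
```

When *θ _{i}* = *β _{j}*, both forms give a probability of exactly 0.5, the defining property of the item difficulty.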

At approximately the same time in the US, Birnbaum [25] and Lord [3] developed similar models independently of the RM (Ch 1, [2]). Their models contain a slope parameter *α _{j}* addressing the discriminating power of test items:

$$\mathrm{Pr}({y}_{ij}=1\mid {\theta}_{i},{\beta}_{j},{\alpha}_{j})=\frac{\mathrm{exp}\left[{\alpha}_{j}({\theta}_{i}-{\beta}_{j})\right]}{1+\mathrm{exp}\left[{\alpha}_{j}({\theta}_{i}-{\beta}_{j})\right]}.$$

(4)

The *α _{j}* parameter is interpreted as an index of how strongly item *j* indicates the underlying latent trait *θ*; Eq. (4) is commonly known as the two-parameter logistic (2PL) model.

The RM and 2PL models are not designed to handle multiple response
categories or a mixture of yes-no and multiple responses. Masters [23] proposed a Partial Credit Model (PCM)
to handle polytomous items by extending the dichotomous RM to *K*
response categories. Master's PCM treats polytomous responses as ordered
performance levels, assuming that the probability of selecting the
*k*th category over the [*k* – 1]th
category is governed by the dichotomous RM. For example, *K* = 4
if the responses are “Strongly Disagree”,
“Disagree”, “Agree”, and “Strongly
Agree” and are scored as 0, 1, 2, and 3, respectively. A person who
chooses “Agree” is considered to have chosen
“Disagree” over “Strongly Disagree” and
“Agree” over “Disagree”, but to have failed to
choose “Strongly Agree” over “Agree”.

Below we follow Masters’ derivation of the PCM [23, p.158]. For each of the successive
response categories, the probability of endorsing response category
*k* over *k* – 1 for item
*j* follows a conditional probability governed by the RM:

$${\Phi}_{jk}=\frac{{\pi}_{jk}}{{\pi}_{j[k-1]}+{\pi}_{jk}}=\frac{\mathrm{exp}({\theta}_{i}-{\beta}_{jk})}{1+\mathrm{exp}({\theta}_{i}-{\beta}_{jk})}.$$

Thus, *π _{jk}* = *π _{j[k−1]}* exp(*θ _{i}* − *β _{jk}*). Applying this recursion across the response categories and normalizing so that the category probabilities sum to one yields the following.

$$\begin{array}{cc}\hfill \mathrm{Let}\phantom{\rule{thickmathspace}{0ex}}G=& \sum _{k=0}^{{m}_{j}}\mathrm{exp}\left[\sum _{h=0}^{k}({\theta}_{i}-{\beta}_{jh})\right],\phantom{\rule{thickmathspace}{0ex}}\text{we get},\phantom{\rule{thickmathspace}{0ex}}\text{in the}\phantom{\rule{thickmathspace}{0ex}}K=4\phantom{\rule{thickmathspace}{0ex}}\text{example}\hfill \\ \hfill {\pi}_{j0}=& \frac{1}{G},\hfill \\ \hfill {\pi}_{j1}=& \frac{\mathrm{exp}\left(0\right)\times \mathrm{exp}({\theta}_{i}-{\beta}_{j1})}{G},\hfill \\ \hfill {\pi}_{j2}=& \frac{\mathrm{exp}\left(0\right)\times \mathrm{exp}({\theta}_{i}-{\beta}_{j1})\times \mathrm{exp}({\theta}_{i}-{\beta}_{j2})}{G},\phantom{\rule{thickmathspace}{0ex}}\text{and}\hfill \\ \hfill {\pi}_{j3}=& \frac{\mathrm{exp}\left(0\right)\times \mathrm{exp}({\theta}_{i}-{\beta}_{j1})\times \mathrm{exp}({\theta}_{i}-{\beta}_{j2})\times \mathrm{exp}({\theta}_{i}-{\beta}_{j3})}{G}.\hfill \end{array}$$

(5)

These expressions can be interpreted intuitively as though the
person “passes through” each of the preceding response categories
before finally stopping at a response [1,
p.165] that, presumably, most accurately reflects that person's standing on the
latent variable continuum. The adjacent *β _{jk}* parameters represent the incremental item “difficulties” that the person has to step through in order to reach the next response category. In educational testing, the incremental item difficulties assign credit to partially correct answers, and thus the name *Partial Credit Model*.

The PCM can accommodate items with different response scales in one
HRQOL instrument, such as a combination of items assessed as
‘present/absent’, ‘persistent/intermittent/none’, or
on a 4-category rating scale as described above. To obtain a general notation,
we define that item *j* is scored *y* = 0, 1, 2, . . . , *m _{j}*, so that item *j* has *m _{j}* + 1 response categories. The probability that person *i* obtains score *y* = *k* on item *j* is

$$\mathrm{Pr}({y}_{ij}=y\mid {\theta}_{i},{\beta}_{jh})=\frac{\mathrm{exp}\sum _{h=0}^{k}({\theta}_{i}-{\beta}_{jh})}{\sum _{k=0}^{{m}_{j}}\mathrm{exp}\sum _{h=0}^{k}({\theta}_{i}-{\beta}_{jh})},$$

(6)

where the numerator corresponds to the person's observed response and the denominator sums over all possible response outcomes; *i* = 1, 2, . . . , *N* indexes individual respondents, *N* is the total number of respondents in the sample, *j* = 1, 2, . . . , *J* indexes items, and *h* = 0, 1, . . . , *k* indexes the response thresholds, with the *h* = 0 term of each sum defined as zero (so the *k* = 0 numerator is exp(0) = 1).

Muraki [19] further extended the
PCM into the Generalized PCM (GPCM) by introducing a discrimination parameter
*α _{j}* for each item.

$$\mathrm{Pr}({y}_{ij}=y\mid {\theta}_{i},{\alpha}_{j},{\beta}_{jh})=\frac{\mathrm{exp}\sum _{h=0}^{k}{\alpha}_{j}({\theta}_{i}-{\beta}_{jh})}{\sum _{k=0}^{{m}_{j}}\mathrm{exp}\sum _{h=0}^{k}{\alpha}_{j}({\theta}_{i}-{\beta}_{jh})}$$

(7)

The PCM is a special case of the GPCM in which all *α _{j}* = 1. The parameters *θ _{i}* and *β _{jh}* retain their interpretations from the PCM, and *α _{j}* indexes the discriminating power of item *j*.

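A small Python sketch of Eq. (7) makes the role of *α _{j}* concrete (ours, with illustrative names; setting *α _{j}* = 1 recovers the PCM):

```python
import math

def gpcm_probs(theta, alpha, betas):
    """Category probabilities under the GPCM, Eq. (7).

    betas = [beta_j1, ..., beta_jm]; alpha = 1 reduces to the PCM of Eq. (6).
    """
    numers = [math.exp(0.0)]                  # k = 0 category
    cum = 0.0
    for b in betas:
        cum += alpha * (theta - b)            # running sum of alpha*(theta - beta_jh)
        numers.append(math.exp(cum))
    G = sum(numers)
    return [n / G for n in numers]

# A more discriminating item (larger alpha) concentrates probability mass
# more sharply in the category implied by theta.
p_low_alpha = gpcm_probs(2.0, 1.0, [0.0, 0.0, 0.0])
p_high_alpha = gpcm_probs(2.0, 2.0, [0.0, 0.0, 0.0])
```

For a person well above the item thresholds (here *θ* = 2 with all *β _{jh}* = 0), doubling *α _{j}* raises the probability of the top category, illustrating why *α _{j}* is read as discriminating power.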
To further reduce notation, we denote the observed item responses collectively as **y**, and the person and item parameters collectively as **θ**, **α**, and **β**. The likelihood is

$$\ell (\mathit{y}\mid \theta ,\alpha ,\beta )=\prod _{i}\prod _{j}\prod _{h}\mathrm{Pr}({y}_{ij}\mid {\theta}_{i},{\alpha}_{j},{\beta}_{jh}),$$

(8)

where **θ** is treated as a random sample from a population distribution, here assumed to be normal, *N*(*μ*, *σ*^{2}).

Bafumi et al. [30] proposed an unconventional method to identify the model parameters: the estimated parameters are normalized as follows after estimation is complete. This method has the advantage that a sample-specific distribution of **θ** can be derived for future reference (e.g., to compare cohorts of respondents).

- The *θ _{i}*'s are normalized to have mean 0 and standard deviation 1. The normalized *θ _{i}*'s are named ${\theta}_{i}^{adj}$,
- The *β _{jk}*'s are normalized by the mean and standard deviation of the *θ _{i}*'s to retain a common scale for all parameters. The normalized *β _{jk}*'s are named ${\beta}_{jk}^{adj}$, and
- The multiplicative *α _{j}*'s are multiplied by the standard deviation of the *θ _{i}*'s to retain a common scale. The normalized *α _{j}*'s are named ${\alpha}_{j}^{adj}$,

hence retaining $\mathrm{Pr}(x\mid {\theta}^{adj},{\beta}^{adj})={\mathrm{logit}}^{-1}\left[{\sigma}_{\theta}\,\alpha \left(\frac{\theta -{\mu}_{\theta}}{{\sigma}_{\theta}}-\frac{\beta -{\mu}_{\theta}}{{\sigma}_{\theta}}\right)\right]={\mathrm{logit}}^{-1}\left[\alpha (\theta -\beta )\right]=\mathrm{Pr}(x\mid \theta ,\beta )$ (subscripts omitted to simplify notation).
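
The invariance that motivates this normalization can be checked numerically. Below is a Python sketch (ours; it assumes the 2PL form displayed above, with hypothetical parameter values) showing that the adjusted parameters leave the response probabilities unchanged:

```python
import math
import statistics

def logistic(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical estimates on the original scale
thetas = [0.3, -1.2, 0.8, 2.0, -0.4]
alpha, beta = 1.3, 0.5

mu, sd = statistics.mean(thetas), statistics.pstdev(thetas)
theta_adj = [(t - mu) / sd for t in thetas]   # mean 0, standard deviation 1
beta_adj = (beta - mu) / sd                   # same shift and scale as theta
alpha_adj = alpha * sd                        # absorb the scale into alpha

# Response probabilities are unchanged by the normalization
probs_orig = [logistic(alpha * (t - beta)) for t in thetas]
probs_adj = [logistic(alpha_adj * (t - beta_adj)) for t in theta_adj]
```

The shift cancels inside the difference and the scale is absorbed by the multiplicative *α*, exactly as in the identity above.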

The prior distributions are:

$$\begin{array}{cc}\hfill \alpha \sim & logN({\mu}_{\alpha}=0.0,{\sigma}_{\alpha}^{2}=1.4),\hfill \\ \hfill \beta \sim & N({\mu}_{\beta}=0,{\sigma}_{\beta}^{2}=6.25),\hfill \\ \hfill \theta \sim & N(\mu ,{\sigma}^{2}),\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\text{with hyperpriors}\phantom{\rule{thickmathspace}{0ex}}\mu \sim N({\mu}_{0}=0.0,{\sigma}_{0}^{2}=100);\phantom{\rule{1em}{0ex}}{\sigma}^{2}\sim \text{Inv-Gamma(shape = 0.5, rate = 0.5 )},\hfill \end{array}$$

where *N* and *logN* represent the normal and the log-normal distributions, respectively. Our prior knowledge about the unknown **θ**, **α**, and **β** is limited, hence the deliberately diffuse priors and hyperpriors.

The joint posterior distribution is:

$$p(\theta ,\alpha ,\beta \mid \mathit{y})\propto p(\alpha ,\beta )p(\theta \mid \alpha ,\beta )p(\mathit{y}\mid \theta ,\alpha ,\beta )=p\left(\alpha \right)p\left(\beta \right)p(\theta \mid \alpha ,\beta )\ell (\mathit{y}\mid \theta ,\alpha ,\beta ),$$

where **α** and **β** are assumed to be independent a priori, so that *p*(**α**, **β**) = *p*(**α**)*p*(**β**). The conditional posterior density of **θ** is:

$$p(\theta \mid \mathit{y},\alpha ,\beta )\propto p(\mathit{y}\mid \theta ,\alpha ,\beta )p\left(\theta \right)p\left(\alpha \right)p\left(\beta \right)\propto \ell (\mathit{y}\mid \theta ,\alpha ,\beta )p\left(\theta \right),$$

because the marginal *p*(**α**) and *p*(**β**) do not involve **θ** and can be absorbed into the proportionality constant. In summary, the full conditional posterior densities are:

$$p(\theta \mid \mathit{y},\alpha ,\beta )\propto p\left(\theta \right)\ell (\mathit{y}\mid \theta ,\alpha ,\beta ),$$

(9)

$$p(\alpha \mid \mathit{y},\theta ,\beta )\propto p\left(\alpha \right)\ell (\mathit{y}\mid \theta ,\alpha ,\beta ),$$

(10)

$$p(\beta \mid \mathit{y},\theta ,\alpha )\propto p\left(\beta \right)\ell (\mathit{y}\mid \theta ,\alpha ,\beta ).$$

(11)

Sampling from the conditional distributions in Eqs. (9) – (11) may be carried out using a rejection-sampling-within-Gibbs approach, which is the method implemented in the WinBUGS software program [12], or a Metropolis-Hastings-within-Gibbs algorithm (e.g., [32, 33]). Readers interested in further technical details may follow up with other published work (e.g., [32, 29, 33, 34, 20] and [35, p.444]). The next section provides a step-by-step guide for using R and WinBUGS to estimate the GPCM parameters.
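
WinBUGS handles this sampling automatically. For intuition about what happens inside one such update, the following Python sketch (ours, deliberately simplified: a random-walk Metropolis step for a single *θ _{i}* under the binary 2PL likelihood of Eq. (4), with **α** and **β** held fixed and an *N*(0, 1) prior on *θ*) mimics sampling from the conditional in Eq. (9):

```python
import math
import random

def loglik_theta(theta, y_i, alphas, betas):
    """Log-likelihood for one respondent under the binary 2PL model, Eq. (4)."""
    ll = 0.0
    for y, a, b in zip(y_i, alphas, betas):
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        ll += math.log(p) if y == 1 else math.log(1.0 - p)
    return ll

def sample_theta(y_i, alphas, betas, n_iter=2000, step=0.8, seed=1):
    """Random-walk Metropolis draws from p(theta | y, alpha, beta), Eq. (9),
    with a standard normal prior on theta; alpha and beta are held fixed."""
    rng = random.Random(seed)
    logpost = lambda t: -0.5 * t * t + loglik_theta(t, y_i, alphas, betas)
    theta, lp, draws = 0.0, None, []
    lp = logpost(theta)
    for _ in range(n_iter):
        prop = theta + rng.gauss(0.0, step)       # propose a nearby value
        lp_prop = logpost(prop)
        if math.log(rng.random()) < lp_prop - lp:  # Metropolis accept/reject
            theta, lp = prop, lp_prop
        draws.append(theta)
    return draws

# Respondent who endorsed 4 of 5 hypothetical items
draws = sample_theta([1, 1, 1, 0, 1], alphas=[1.0] * 5,
                     betas=[-0.5, 0.0, 0.2, 0.7, 1.0])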

The dataset is the bfi dataset in the R package psych [36], which contains the responses of 2,800 subjects to 25 personality self-report items. The items map onto five putative psychological dimensions, the “Big-Five” personality traits [37]: Agreeableness, Conscientiousness, Extraversion, Neuroticism, and Openness. This section focuses on the 5 items assessing Neuroticism, a tendency to easily experience anger (‘1. Get angry easily’), unpleasant affect/emotion (‘2. Get irritated easily’ and ‘3. Have frequent mood swings’), depression (‘4. Often feel blue’), and anxiety (‘5. Panic easily’).

The response categories were “1: Very Inaccurate”,
“2: Moderately Inaccurate”, “3: Slightly
Inaccurate”, “4: Slightly Accurate”, “5: Moderately
Accurate”, and “6: Very Accurate”. The bfi dataset was
chosen because it is suitable to examine key aspects of IRT modeling. IRT makes
inferences on an unobservable, underlying psychological construct, in this case
a Neuroticism personality trait, based on each person's responses to a few
straightforward items. The response categories describe what the person is like
generally and do not depend on environmental, social and interpersonal contexts.
The observed responses may be viewed as a self-reported symptom
severity—a higher summation score representing more severe Neuroticism.
IRT goes beyond simple summary scores and takes into consideration, for each item, its symptom severity thresholds (*β _{jh}*) and its sensitivity to changes in Neuroticism (*α _{j}*).

Two syntax files are needed, one for R and one for WinBUGS. The R syntax file prepares the data for fitting the model in the WinBUGS syntax file. The R syntax is in Listing 1. Lines 1 – 3 load the required packages into R. Additional required packages are loaded automatically (e.g., the coda package is required by R2WinBUGS). Lines 6 – 7 extract only the neuroticism items. Line 9 shows that we use data from the first 500 subjects and convert them into a matrix. Lines 10 – 12 specify the number of subjects, the total number of items, and the number of response categories for each item. Lines 13 – 18 set up the data and names of parameters to be fitted by WinBUGS. Then bugs() is called to pass the data and the Gibbs sampler parameters to WinBUGS for analysis. Lines 24 and 31 calculate the elapsed time. The bugs() function needs to know the name of the WinBUGS syntax file and settings for the iterative chains. Random initial values are generated by WinBUGS if the init option is set to NULL. Here we set codaPkg = TRUE to save the iterative chains for further analyses by the coda package. Line 28 sets bugs.seed = 7 for reproducibility. The results are returned back to R as neuro.bugs.

A total of 152,000 iterative simulations were done per chain (default 3 chains), the first 2,000 iterations discarded, and 10,000 iterations saved after thinning. This long chain was guided by convergence diagnostics (described below in section 5). One advantage of using the R2WinBUGS package is that it automatically prints out two useful statistics for each parameter estimate: 1) the effective sample size (sample size adjusted for autocorrelation across simulations); and 2) the Gelman and Rubin convergence diagnostic [38], shown as Rhat. Values of Rhat close to 1 indicate convergence [39, 40, 41].

Listing 2 shows the WinBUGS
syntax. Lines 1 – 15 are adapted from Curtis [21]. Line 4 shows that the item responses can be one of
K[j] possible values, with the probability of each response separately
specified. Line 6 samples each person's latent characteristic from
*N*(*μ, σ*^{2}) with
hyperpriors $\mu \sim N({\mu}_{0}=0.0,{\sigma}_{0}^{2}=100)$ and *σ*^{2} ~
Inv-Gamma(shape = 0.5, rate = 0.5), specified by lines 25 and 26,
respectively.

Lines 11 – 13 calculate the numerators in the GPCM model in Eq. (7). The denominator is harder
to track. Recall that in Eq. (5),
the denominator *G* = 1 + *δ*_{1} + *δ*_{2}*δ*_{1} + *δ*_{3}*δ*_{2}*δ*_{1} + . . . + *δ _{m_j}* · · · *δ*_{2}*δ*_{1}, where *δ _{k}* denotes exp(*θ _{i}* − *β _{jk}*); the WinBUGS code accumulates these cumulative products to form the denominator.

Table 1 shows that the mean parameter estimates by WinBUGS agree well with those calculated by maximum likelihood estimation.

Figure 1 is an empirical
quantile-quantile plot of the *β _{jh}* estimates
derived from the maximum likelihood method in gpcm() against the Bayesian
estimates. The two distributions agree well, except that, for values outside
[–1.5, +1.5], the Bayesian estimates are distributed slightly closer to
the mean than the maximum likelihood estimates. The small differences between the two sets of estimates disappear if we set more diffuse priors for the item parameters (not shown).

Several diagnostics for the MCMC chains can be calculated by the coda package (or boa). The analyst is advised to consult several diagnostics because each has strengths and weaknesses, as discussed in the review papers by Brooks and Roberts [42] and Cowles and Carlin [43].

Listing 3 shows how to use the menu-driven tool codamenu() to calculate some of them, including the diagnostics by Gelman-Rubin [38], Geweke [44], Heidelberger-Welch [45], and the Raftery-Lewis [46] run-length estimate.

Example steps in the use of the interactive codamenu () function to calculate the
convergence diagnostics. User's inputs are bolded.

The Gelman-Rubin diagnostic requires parallel chains from dispersed initial
values. The idea is to compare the between-chain and within-chain variabilities. If
all chains converge to a similar posterior distribution, then the between-chain
variability would be small relative to the within-chain variability. A ratio is
calculated from a combined variability estimate (a weighted average of between-chain
and within-chain variabilities) and the within variability alone. The square root of
this ratio is the *potential scale reduction factor*; values substantially greater than 1.0 indicate that a longer simulation is needed. The Geweke diagnostic tests whether the early and latter iterative sequences (defaults are the first 10% and the latter 50%) are comparable in their averages. A *z*-statistic is calculated; a large value (e.g., outside the ±2 bounds) indicates the need for a longer chain.
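
The arithmetic behind the potential scale reduction factor is compact enough to write out. Below is a minimal Python implementation (ours; a textbook version of the statistic, without the degrees-of-freedom corrections that coda's gelman.diag applies):

```python
import random
import statistics

def gelman_rubin(chains):
    """Potential scale reduction factor from m parallel chains of length n."""
    m = len(chains)                     # number of chains
    n = len(chains[0])                  # draws per chain
    means = [statistics.mean(c) for c in chains]
    B = n * statistics.variance(means)  # between-chain variability
    W = statistics.mean([statistics.variance(c) for c in chains])  # within-chain
    var_hat = (n - 1) / n * W + B / n   # pooled (weighted) variance estimate
    return (var_hat / W) ** 0.5         # square root of the variance ratio

# Three chains drawn from the same distribution: Rhat should be near 1
rng = random.Random(0)
chains = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(3)]
rhat = gelman_rubin(chains)
```

If one chain is shifted away from the others (i.e., the chains have not mixed), the between-chain term dominates and the statistic rises well above 1.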

The Heidelberger-Welch diagnostics include a *stationarity* test and a *half-width* test. The stationarity test uses the Cramér-von-Mises statistic to successively test the null hypothesis that the sampled values come from a covariance stationary process. The whole chain is first used to calculate the Cramér-von-Mises statistic. If it passes the test (null hypothesis not rejected), then the whole chain is considered stationary. If it fails the test, then the initial 10% of the chain is dropped and the remainder tested again. Then 20% is dropped, and so on, until either the chain passes the test (e.g., at a 30% reduction) or the remaining data are no longer sufficient (default 50%).

The half-width test is then applied to the part of the chain that is deemed stationary. It tests whether or not the chain provides enough data to determine the confidence interval of the mean estimate to within a specified accuracy. The accuracy measure is the ratio of half the width of the 95% confidence interval to the mean estimate. The default accuracy is 0.1; that is, the half-width must be within 10% of the mean estimate.

The Raftery-Lewis estimate focuses on achieving a prespecified precision of
specific quantiles of a chain. The default is that the 2.5th percentile of a
parameter estimate must be within a precision of 0.005 quantile units with 0.95
probability. The output reports the minimum length of the burn-in period (*M*), the estimated number of post burn-in iterations required to meet the criteria (*N*), the minimum number of post burn-in iterations (*N _{min}*, assuming zero autocorrelation), and the *dependence factor* *I*, which quantifies the inflation in required run length due to autocorrelation.

Our WinBUGS chains have generally exceeded the Raftery-Lewis burn-in length and post burn-in iterations, with the obvious exceptions of mu0 and var0. The estimated total run lengths are markedly reduced to 10,350 and 12,714, respectively, by lowering the default 0.025 quantile to 0.02 (not shown). Moreover, mu0 and var0 pass the Gelman-Rubin, the Geweke, and the Heidelberger-Welch diagnostics. We suspect that the present run length is satisfactory after all, given Jackman's observation [40, sec 6.2] on the problems with the Raftery-Lewis, the passing of the three other diagnostics, and the markedly reduced run length estimates when the default quantile is slightly lowered.

An item analysis helps determine the final length and response format of a
new PRO instrument. The process can be rather detailed [47]. The FDA guidance provides no specific recommendations on
how to carry it out. An important goal, however, is to reduce the draft
questionnaire to a shorter version with carefully crafted items that are selectively
sensitive to a wide range of latent *θ*. This goal can often
be adequately addressed by visual inspections of IRT model properties. Two such
properties are covered in this section: 1) fitted response probability curves, and
2) Fisher information curves with respect to the latent
** θ**. These basic visual explanations are
often the first steps after estimating the IRT parameters, and are part of popular
introductory texts [2, 1, 48]. More rigorous
methods are usually found in journal articles [49, 50].

Figure 2 shows an example of
marginal posterior density curves for all items, across a range of
** θ** values. The R commands used to
plot the curves are shown in Listing
4.

These “category response curves” [2] are often used to evaluate the need for item and scale reduction. For example, there is a visible overlap between response categories 2, 3, and 4 for all items (“2: Moderately Inaccurate”, “3: Slightly Inaccurate”, and “4: Slightly Accurate”, respectively). The parameter estimates are also very similar in value (see Table 1). These response categories may be merged into one in the next version of the instrument. Another noteworthy pattern is that the probability of endorsing item 4 ‘feeling blue’, an indication of depressive symptoms, is lower than in other items. Low endorsement probability is not necessarily a problem, because in this case we know that depression symptoms are not prevalent.

Generally, the category response curves would preferably cover a range
of values on the latent characteristics. Low peaks indicate low probability of
endorsement and relatively poorer *α _{j}*
discrimination parameters; they need to be further explored. For example, there is an approximately 1 in 5 probability of responding a 3 ‘Slightly Inaccurate’ in all items for persons at average levels of Neuroticism (i.e., at *θ* = 0).

Another frequently asked question concerns how certain we are about a
person's estimated level on the latent *θ* continuum. This
question can be addressed by visually inspecting the Fisher information of an
item over a range of *θ* values. A higher Fisher
information means lower uncertainty for the *θ* estimate
[1, 3] and vice versa.
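
For the GPCM, one convenient closed form follows from Fisher's definition: the item information at a given *θ* equals *α _{j}*^{2} times the variance of the (integer) score *k* under the category probabilities at that *θ* (a derivation from first principles is given in Appendix B). A Python sketch (ours; names are illustrative):

```python
import math

def gpcm_probs(theta, alpha, betas):
    """Category probabilities under the GPCM, Eq. (7)."""
    numers, cum = [math.exp(0.0)], 0.0
    for b in betas:
        cum += alpha * (theta - b)
        numers.append(math.exp(cum))
    G = sum(numers)
    return [n / G for n in numers]

def item_information(theta, alpha, betas):
    """Fisher information of one GPCM item at theta:
    alpha^2 times the variance of the score k."""
    p = gpcm_probs(theta, alpha, betas)
    mean_k = sum(k * pk for k, pk in enumerate(p))
    var_k = sum((k - mean_k) ** 2 * pk for k, pk in enumerate(p))
    return alpha ** 2 * var_k

# Information peaks where the response categories are in contention and
# falls off where one category dominates.
info_center = item_information(0.0, 1.0, [-1.0, 0.0, 1.0])
info_extreme = item_information(6.0, 1.0, [-1.0, 0.0, 1.0])
```

Plotting item_information over a grid of *θ* values reproduces the kind of item information curves discussed next.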

Figure 3 plots the item information
curves across a range of *θ* values for all items. Item 2
(‘irritated’) is visibly the most informative item. The next most
informative items capture ‘anger’ and ‘mood swings’;
and ‘anger’ tends to provide the most information at very intense
levels of neuroticism (i.e., *θ* > 1.5). These
three items do not appear to provide much information for
*θ* values outside the [–1.5, 2.5] range. New
items may be needed if measuring very low levels of neuroticism is of interest.
Items 4 and 5, on the other hand, share a similarly low information profile. These items may need to be kept on scientific grounds, but from the perspective of item information they do not contribute much to the information about *θ*, and neither wins over the other in pinpointing a person's underlying neuroticism. They may be revised or replaced if alternate items are available.

This 5-item example shows that redundant items can be useful in a draft survey. Some seemingly redundant items may provide information over a range of the latent characteristic not covered by other, theoretically more promising items. This example lacks items that are sensitive to latent characteristics more than 1.5 standard deviations below the norm. Additional items would be necessary if detecting such extreme latent values were of interest (although it is hard to conceive of a need to diagnose extremely low neuroticism levels).

Proprietary software programs can generate Fisher information curves. However, except for the simplest cases [5, pp. 70 – 71], the computation formulae are not easily accessible to non-experts. This can be seen in the somewhat technical re-parameterization in Muraki's [51] derivations for the GPCM. These complications may be attributed in part to the long time elapsed since Birnbaum's [25, Chapters 17 & 20] and Samejima's [52] pioneering works more than 40 years ago. There ought to be an accessible approach to the information function, one based on Fisher's original definition [53], so that anyone comfortable with that definition can follow the computation easily. That is exactly what we have attempted in Appendix B. We aim to provide enough detail to save readers from ploughing through other sources and yet not necessarily finding the needed answers [3, 5, 52, 54, 55, 1, 2, 25].

An illustrative example is provided in this article to show how to fit a GPCM using R and WinBUGS. The primary pedagogical goal is to include sufficient practical analytic techniques in one guide so that readers unfamiliar with IRT can immediately apply these skills in their own work. Also covered are a few basics of the Gibbs sampler and of diagnosing the convergence of iterative simulation by Markov chain Monte Carlo methods. The parameter estimates agree well with those calculated by the maximum likelihood method implemented in the ltm package. This leads us to encourage the further use of R and WinBUGS to reduce the reliance on proprietary computer software programs in IRT analysis.

We focus primarily on item analysis, which is among the most technically challenging steps towards claiming a PRO instrument reliable, valid, and responsive to changes in well-being (as per section III.E of the FDA guidance). This tutorial does not cover practical aspects of carrying out analyses of reliability, validity, and responsiveness. However, these key concepts involve statistics that are comparatively more straightforward. For example, the concept of responsiveness can be tested by comparing the PRO scores of individuals whose symptoms have changed substantially [56]. Reliability can be established by Cronbach's alpha statistic [47], and by correlating scores between repeated assessments. Validity can be established via “concurrent validity”, by correlating the scores of the target instrument and scores derived from other instruments of the same or similar construct that have previously been validated. Another important concept is the “construct validity”, which is commonly established by a high degree of correlation between the target assessment and other assessments of similar construct (“convergent validity”) and the low correlation between the target assessment with other instruments of dissimilar construct (“discriminant validity”) [57].

Readers interested in other IRT models can find them in Curtis [21], who recently published a collection of WinBUGS code for many more IRT models, including the GPCM. However, Curtis did not provide a step-by-step link, as we have, between the mathematics and the WinBUGS code. The learner is encouraged to tackle the mathematics before plugging in the WinBUGS code; in our opinion this is the recommended learning approach for biostatisticians who are already familiar with Bayesian statistics. A Bayesian approach to IRT has its own advantages. Baldwin, Bernstein, and Wainer [9] recently reported in this journal that a Bayesian approach to IRT requires much smaller sample sizes than conventional likelihood methods do. The GPCM in Eq. (7) may appear complex, but a clear understanding of the mathematics reduces the model to three nested loops in the WinBUGS code. We believe that this resemblance between the mathematics and the computer syntax is an important reason to prefer WinBUGS over special-purpose computer packages. It helps a new learner develop a deeper understanding of IRT theory as well as of the statistical computation. WinBUGS forces the learner to acquire a clearer understanding of the statistical model, which hopefully discourages the indiscriminate application of existing data analysis “recipes”. Similarly, experienced biostatisticians who are already familiar with the R language no longer need to learn the sometimes eccentric programming languages of proprietary IRT software. It opens up new possibilities as well, as seen in analyses of gene expression data [10], explanatory IRT models [58], and recent psychometric work on cognitive models in IRT [59, 60].

Iterative simulation is time-consuming; this is the main disadvantage of the Bayesian IRT approach. The required time and effort may diminish enthusiasm for WinBUGS and its open-source counterpart OpenBUGS (and also JAGS [61]) in IRT analysis. Analysts can save time by using existing macros in their preferred statistical package, for example the macros in SAS [62, 63, 64] and Stata [65, 66, 67], and solutions based on the SAS NLMIXED procedure [68], if all they want are the parameter estimates and other statistics supported by these macros. Another limitation is the somewhat steep learning curve for beginners, compounded by practical complications. For example, problems in model convergence are hard to diagnose and rectify, and mistakes in WinBUGS syntax are hard to debug because the error messages are often non-specific. These are some of the reasons why beginners are intimidated by WinBUGS. These limitations notwithstanding, we believe that circumstances will improve over time, with the publication of WinBUGS code collections and tutorials similar to this one. Psychometric models and methods in R are rapidly emerging. In 2007, the Journal of Statistical Software dedicated volume 20 to psychometric methods in R. Another collaborative effort is found in the online “Task View” maintained by Mair and Hatzinger (http://cran.r-project.org/web/views/Psychometrics.html, last accessed July, 2011). New development efforts in the OpenBUGS program include improvements in the documentation as well as cross-platform compatibility. We are optimistic that, in time, as new and accessible knowledge bases accrue, R and WinBUGS will become the preferred tools for fitting IRT models.

Contract/grant sponsor: 2007 Prevention Control and Population Research Program (PCPR) Goldstein Award (Li).

NIH Training grant T32CA009461

NIH CTSC Award to Weill Cornell Medical College, NIH UL1-RR024996

The *G* term in Eq. (5) is crucial in understanding how the PCM
equation is derived. However, a few algebraic steps were omitted in
Masters’ original derivations [23, p. 158] and Muraki's [19] subsequent definition of the *G* term. They
are restored here. These details are also helpful in tracking the model
syntax in WinBUGS. We begin by restating the conditional probability of
endorsing each successive response category *k* over
*k* – 1 for item *j*:

$${\Phi}_{jk}=\frac{{\pi}_{jk}}{{\pi}_{j[k-1]}+{\pi}_{jk}}=\frac{\mathrm{exp}({\theta}_{i}-{\beta}_{jk})}{1+\mathrm{exp}({\theta}_{i}-{\beta}_{jk})}.$$

In the *K* = 4 example above,

$$\begin{array}{cc}\hfill {\Phi}_{j1}=& \frac{{\pi}_{j1}}{{\pi}_{j0}+{\pi}_{j1}}=\frac{\mathrm{exp}({\theta}_{i}-{\beta}_{j1})}{1+\mathrm{exp}({\theta}_{i}-{\beta}_{j1})},\phantom{\rule{1em}{0ex}}\left[\text{Disagree over Strongly Disagree}\right]\hfill \\ \hfill {\Phi}_{j2}=& \frac{{\pi}_{j2}}{{\pi}_{j1}+{\pi}_{j2}}=\frac{\mathrm{exp}({\theta}_{i}-{\beta}_{j2})}{1+\mathrm{exp}({\theta}_{i}-{\beta}_{j2})},\phantom{\rule{1em}{0ex}}\left[\text{Agree over Disagree}\right]\hfill \\ \hfill {\Phi}_{j3}=& \frac{{\pi}_{j3}}{{\pi}_{j2}+{\pi}_{j3}}=\frac{\mathrm{exp}({\theta}_{i}-{\beta}_{j3})}{1+\mathrm{exp}({\theta}_{i}-{\beta}_{j3})}.\phantom{\rule{1em}{0ex}}\left[\text{Strongly Agree over Agree}\right]\hfill \end{array}$$

Thus, *π*_{j1} = [(*π*_{j0} + *π*_{j1}) exp(*θ*_{i} – *β*_{j1})]/[1 + exp(*θ*_{i} – *β*_{j1})], so that

$$\begin{array}{cc}\hfill {\pi}_{j1}[1+\mathrm{exp}({\theta}_{i}-{\beta}_{j1})]=& ({\pi}_{j0}+{\pi}_{j1})\mathrm{exp}({\theta}_{i}-{\beta}_{j1})\hfill \\ \hfill {\pi}_{j1}+\mathrm{exp}({\theta}_{i}-{\beta}_{j1})\cdot {\pi}_{j1}=& \mathrm{exp}({\theta}_{i}-{\beta}_{j1})\cdot {\pi}_{j0}+\mathrm{exp}({\theta}_{i}-{\beta}_{j1})\cdot {\pi}_{j1},\hfill \end{array}$$

so *π*_{j1} = exp(*θ*_{i} – *β*_{j1}) · *π*_{j0}, because the exp(*θ*_{i} – *β*_{j1}) · *π*_{j1} terms on the two sides cancel out. Repeating the same steps gives *π*_{j2} = exp(*θ*_{i} – *β*_{j2}) · *π*_{j1} and *π*_{j3} = exp(*θ*_{i} – *β*_{j3}) · *π*_{j2}. Writing *δ*_{k} = exp(*θ*_{i} – *β*_{jk}) for brevity and using the fact that the four probabilities sum to one, we obtain

$$\begin{array}{cc}\hfill {\pi}_{j0}=& \frac{1}{1+{\delta}_{1}+{\delta}_{2}{\delta}_{1}+{\delta}_{3}{\delta}_{2}{\delta}_{1}},\phantom{\rule{1em}{0ex}}\text{and because}\phantom{\rule{thickmathspace}{0ex}}\mathrm{exp}\left(0\right)=1,\hfill \\ \hfill {\delta}_{1}=& 1\times {\delta}_{1}=\mathrm{exp}\left(0\right)\times \mathrm{exp}({\theta}_{i}-{\beta}_{j1}),\hfill \\ \hfill {\delta}_{2}{\delta}_{1}=& \mathrm{exp}\left(0\right)\times \mathrm{exp}({\theta}_{i}-{\beta}_{j1})\times \mathrm{exp}({\theta}_{i}-{\beta}_{j2}),\hfill \\ \hfill {\delta}_{3}{\delta}_{2}{\delta}_{1}=& \mathrm{exp}\left(0\right)\times \mathrm{exp}({\theta}_{i}-{\beta}_{j1})\times \mathrm{exp}({\theta}_{i}-{\beta}_{j2})\times \mathrm{exp}({\theta}_{i}-{\beta}_{j3}).\hfill \end{array}$$

Let *G* = 1 +
*δ*_{1} +
*δ*_{2}*δ*_{1}
+
*δ*_{3}*δ*_{2}*δ*_{1},
we have

$$\begin{array}{cc}\hfill {\pi}_{j0}=& \frac{1}{G},\hfill \\ \hfill {\pi}_{j1}=& \frac{1\times {\delta}_{1}}{G}=\frac{\mathrm{exp}\left(0\right)\times \mathrm{exp}({\theta}_{i}-{\beta}_{j1})}{G},\hfill \\ \hfill {\pi}_{j2}=& \frac{1\times {\delta}_{2}{\delta}_{1}}{G}=\frac{\mathrm{exp}\left(0\right)\times \mathrm{exp}({\theta}_{i}-{\beta}_{j1})\times \mathrm{exp}({\theta}_{i}-{\beta}_{j2})}{G},\hfill \\ \hfill {\pi}_{j3}=& \frac{1\times {\delta}_{3}{\delta}_{2}{\delta}_{1}}{G}=\frac{\mathrm{exp}\left(0\right)\times \mathrm{exp}({\theta}_{i}-{\beta}_{j1})\times \mathrm{exp}({\theta}_{i}-{\beta}_{j2})\times \mathrm{exp}({\theta}_{i}-{\beta}_{j3})}{G}.\hfill \end{array}$$
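The algebra above is easy to check numerically. The following Python sketch uses arbitrary illustrative values of *θ*_{i} and *β*_{jk} (not taken from the worked example) to build *π*_{j0}, . . . , *π*_{j3} from the *δ* terms, and verifies both that the probabilities sum to one and that they reproduce the adjacent-category logistic form of Φ_{jk}:

```python
import math

# Illustrative (hypothetical) person and item parameters for a K = 4 item.
theta = 0.5
beta = [0.0, -1.0, 0.2, 1.3]   # beta[0] is a placeholder: exp(0) = 1 by convention

# delta_k = exp(theta - beta_jk) for k = 1, 2, 3
delta = [math.exp(theta - b) for b in beta[1:]]

# G = 1 + d1 + d2*d1 + d3*d2*d1
G = 1 + delta[0] + delta[1] * delta[0] + delta[2] * delta[1] * delta[0]

# pi_jk = (product of the first k deltas) / G
pi = [1 / G,
      delta[0] / G,
      delta[1] * delta[0] / G,
      delta[2] * delta[1] * delta[0] / G]

# The four probabilities form a proper distribution:
assert abs(sum(pi) - 1.0) < 1e-12

# The adjacent-category conditional probabilities recover the logistic form:
for k in range(1, 4):
    phi = pi[k] / (pi[k - 1] + pi[k])
    logistic = math.exp(theta - beta[k]) / (1 + math.exp(theta - beta[k]))
    assert abs(phi - logistic) < 1e-12
```

The second assertion holds because *π*_{jk}/*π*_{j[k−1]} = *δ*_{k}, so the ratio *π*_{jk}/(*π*_{j[k−1]} + *π*_{jk}) simplifies to *δ*_{k}/(1 + *δ*_{k}), exactly the logistic expression for Φ_{jk}.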

To obtain a general notation, suppose that item *j* is scored *x* = 0, 1, 2, . . . , *m*_{j}, with the normalizing constant

$$G=\sum _{k=0}^{{m}_{j}}\mathrm{exp}\left[\sum _{h=0}^{k}({\theta}_{i}-{\beta}_{jh})\right],$$

(12)

where ${\sum}_{h=0}^{0}({\theta}_{i}-{\beta}_{jh})\equiv 0$ for convenience, so that the *k* = 0 term of *G* equals exp(0) = 1. Finally, a general model expression follows for an item with *K*_{j} = *m*_{j} + 1 response categories.
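Eq. (12) can also be checked in code for an arbitrary number of categories. The Python sketch below (again with hypothetical parameter values) implements the double sum, treating the *k* = 0 inner sum as zero, and confirms that for *m*_{j} = 3 it reproduces the expanded form 1 + *δ*_{1} + *δ*_{2}*δ*_{1} + *δ*_{3}*δ*_{2}*δ*_{1}:

```python
import math

def G(theta, beta):
    """Eq. (12): G = sum_{k=0}^{m_j} exp[ sum_{h=0}^{k} (theta - beta_jh) ],
    with the k = 0 inner sum defined as 0, so the first term is exp(0) = 1.
    beta is indexed 1..m_j (beta[0] is a placeholder)."""
    m = len(beta) - 1
    total = 0.0
    for k in range(m + 1):
        inner = sum(theta - beta[h] for h in range(1, k + 1))  # empty sum = 0
        total += math.exp(inner)
    return total

# Hypothetical parameters for an m_j = 3 (four-category) item:
theta = 0.5
beta = [0.0, -1.0, 0.2, 1.3]
d = [math.exp(theta - b) for b in beta[1:]]
expanded = 1 + d[0] + d[1] * d[0] + d[2] * d[1] * d[0]
assert abs(G(theta, beta) - expanded) < 1e-12
```

The same function covers a dichotomous item as the special case *m*_{j} = 1, where *G* reduces to 1 + exp(*θ*_{i} – *β*_{j1}).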

This appendix covers the derivations of the item information function, which assesses the amount of information that can be expected from an item. The derivations are pedagogically useful, especially for biostatisticians who are familiar with Fisher's original definition [69] but are not familiar with psychometrics. Samejima [52, Chapter 6] was the first to introduce Fisher's information into IRT in her 1969 monograph, according to Baker & Kim [5, p.222]. The detailed derivations in Samejima's work seem to have disappeared from the literature since then (e.g., [5, p.222]; [70, p.374]; [1, p.208]), including a 1994 paper by Samejima [54, p.308]. Here we summarize our independent derivations starting from Fisher's definition.

Based on Fisher's original definition, the information function for
the latent characteristic *θ* given data
*y* is:

$$I(\theta \mid y)=\int {\left[\frac{\partial}{\partial \theta}\mathrm{log}\phantom{\rule{thickmathspace}{0ex}}\mathrm{Pr}(y\mid \theta )\right]}^{2}\mathrm{Pr}(y\mid \theta )dy,$$

(13)

where for the GPCM the log likelihood of the probability
density function is defined in Eqs.
(7) – (8).
There is no need to include ** α** and ** β** in the notation because the item parameters are held fixed. Under standard regularity conditions, Eq. (13) is equivalent to

$$I(\theta \mid y)=-\int \left[\frac{{\partial}^{2}}{\partial {\theta}^{2}}\mathrm{log}\phantom{\rule{thickmathspace}{0ex}}\mathrm{Pr}(y\mid \theta )\right]\mathrm{Pr}(y\mid \theta )dy,$$

(14)

and thus to the following definition given in texts on Bayesian statistics such as Lee [31, p.82], Berger [71, section 3.3.3], and Gelman et al. [72, section 2.9]:

$$I(\theta \mid y)=-\mathsf{E}\left[\frac{{\partial}^{2}}{\partial {\theta}^{2}}\mathrm{log}\phantom{\rule{thickmathspace}{0ex}}\mathrm{Pr}(y\mid \theta )\right],$$

(15)

the expectation being taken over all possible values of *y* for fixed item parameters ** α** and ** β**.

Muraki's [19] derivations for the item information function begin with Eq. (14). The integration is replaced with summation across the *m*_{j} + 1 discrete response categories of item *j*:

$${I}_{j}(\theta \mid y)=\sum _{k=0}^{{m}_{j}}{p}_{jk}\left[-\frac{{\partial}^{2}}{\partial {\theta}^{2}}\mathrm{log}\phantom{\rule{thickmathspace}{0ex}}{p}_{jk}\right],$$

(16)

where the information from each possible response category is weighted by its corresponding response probability. Up to this point the definition of item information should be familiar to biostatisticians. However, an alternate but less familiar definition is used in the psychometric literature (e.g., [54, equation (2)]):

$${I}_{j}(\theta \mid y)=\sum _{k=0}^{{m}_{j}}\frac{{\left[{\scriptstyle \frac{\partial}{\partial \theta}}{p}_{jk}\right]}^{2}}{{p}_{jk}}.$$

(17)

Lee [31, pp. 82 – 83] provides a general proof that Eqs. (16) and (17) are indeed equivalent. However, the seemingly different equations may be a source of confusion and frustration when learning Fisher information in IRT. The intermediate derivations from Eq. (16) to Eq. (17) are not found in recently published texts (e.g., [5, p.222]; [70, p.374]; [1, p.208]; [54, p.308]). The missing derivations are summarized below.
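Before working through the algebra, the equivalence of Eqs. (16) and (17) can be confirmed numerically. The Python sketch below evaluates both forms for a PCM item with hypothetical step parameters, approximating the derivatives of the category probabilities by central finite differences:

```python
import math

beta = [0.0, -1.0, 0.2, 1.3]  # hypothetical step parameters (beta[0] unused)

def probs(theta):
    """PCM category probabilities p_j0..p_j3 for one item."""
    terms = [math.exp(sum(theta - b for b in beta[1:k + 1])) for k in range(4)]
    G = sum(terms)
    return [t / G for t in terms]

h = 1e-4
theta = 0.7
p0, p_plus, p_minus = probs(theta), probs(theta + h), probs(theta - h)

# Central finite differences for dp/dtheta and d2p/dtheta2, per category.
dp = [(pp - pm) / (2 * h) for pp, pm in zip(p_plus, p_minus)]
d2p = [(pp - 2 * p + pm) / h ** 2 for pp, p, pm in zip(p_plus, p0, p_minus)]

# Eq. (16): sum_k p_k * [-(d2/dtheta2) log p_k],
# using d2 log p / dtheta2 = d2p/p - (dp/p)^2.
info_16 = sum(p * ((dpk / p) ** 2 - d2pk / p)
              for p, dpk, d2pk in zip(p0, dp, d2p))

# Eq. (17): sum_k (dp_k)^2 / p_k
info_17 = sum(dpk ** 2 / p for dpk, p in zip(dp, p0))

assert abs(info_16 - info_17) < 1e-4
```

The two quantities differ only by the sum of second derivatives, which vanishes because the probabilities sum to one at every *θ*; the derivation below makes this explicit.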

In Eq. (16), we need
to find the second derivative of log *p* with respect to
*θ*. We take the first derivative and obtain

$$\frac{\partial}{\partial \theta}\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}p=\frac{1}{p}\cdot \frac{\partial}{\partial \theta}p$$

by the Chain Rule in calculus. We then take the negative of the second derivative, $-{\scriptstyle \frac{\partial}{\partial \theta}}[{\scriptstyle \frac{1}{p}}\cdot {\scriptstyle \frac{\partial}{\partial \theta}}p]$, and apply the Product Rule to get:

$$-\frac{\partial}{\partial \theta}\left[\frac{1}{p}\cdot \frac{\partial}{\partial \theta}p\right]=-\left[\frac{1}{p}\cdot \frac{\partial}{\partial \theta}\left[\frac{\partial}{\partial \theta}p\right]+\frac{\partial}{\partial \theta}p\cdot \frac{\partial}{\partial \theta}\frac{1}{p}\right]=-\left[\frac{1}{p}\cdot \frac{{\partial}^{2}}{\partial {\theta}^{2}}p+\frac{\partial}{\partial \theta}p\cdot \frac{\partial}{\partial \theta}{p}^{-1}\right]=-\left[\frac{1}{p}\cdot \frac{{\partial}^{2}}{\partial {\theta}^{2}}p+\frac{\partial}{\partial \theta}p\cdot \left[-1\cdot {p}^{-2}\cdot \frac{\partial}{\partial \theta}p\right]\right]=-\left[\frac{1}{p}\cdot \frac{{\partial}^{2}}{\partial {\theta}^{2}}p-{\left[\frac{\frac{\partial}{\partial \theta}p}{p}\right]}^{2}\right]={\left[\frac{\frac{\partial}{\partial \theta}p}{p}\right]}^{2}-\frac{1}{p}\cdot \frac{{\partial}^{2}}{\partial {\theta}^{2}}p.$$

Plugging the previous result back into Eq. (16), we get

$$\begin{array}{cc}\hfill {I}_{j}\left(\theta \right)=& \sum _{k=0}^{{m}_{j}}{p}_{jk}\left[{\left[\frac{\frac{\partial}{\partial \theta}{p}_{jk}}{{p}_{jk}}\right]}^{2}-\frac{\frac{{\partial}^{2}}{\partial {\theta}^{2}}{p}_{jk}}{{p}_{jk}}\right]\hfill \\ \hfill =& \sum _{k=0}^{{m}_{j}}\frac{{\left[\frac{\partial}{\partial \theta}{p}_{jk}\right]}^{2}}{{p}_{jk}}-\sum _{k=0}^{{m}_{j}}\frac{{\partial}^{2}}{\partial {\theta}^{2}}{p}_{jk}.\hfill \end{array}$$

(18)

The second term of Eq. (18), ${\sum}_{k=0}^{{m}_{j}}{\scriptstyle \frac{{\partial}^{2}}{\partial {\theta}^{2}}}{p}_{jk}$, must sum to zero. It is easier to see why this must be the case using the 4-category example above. When *θ* is fixed, *p*_{0} + *p*_{1} + *p*_{2} + *p*_{3} = 1 because each person must choose one of the *K*_{j} = *m*_{j} + 1 possible responses for item *j*. Differentiating both sides of this identity twice with respect to *θ* shows that the second derivatives sum to zero, leaving

$${I}_{j}\left(\theta \right)=\sum _{k=0}^{{m}_{j}}\frac{{\left[\frac{\partial}{\partial \theta}{p}_{jk}\right]}^{2}}{{p}_{jk}}.$$
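This cancellation is also easy to confirm numerically: because the *p*_{jk} sum to one at every *θ*, their second derivatives must sum to zero. The sketch below approximates the second derivatives by central differences, again with hypothetical step parameters:

```python
import math

beta = [0.0, -1.0, 0.2, 1.3]   # hypothetical step parameters (beta[0] unused)

def probs(theta):
    """PCM category probabilities p_j0..p_j3 for one item."""
    terms = [math.exp(sum(theta - b for b in beta[1:k + 1])) for k in range(4)]
    G = sum(terms)
    return [t / G for t in terms]

h = 1e-4
for theta in (-2.0, 0.0, 1.5):
    p_plus, p0, p_minus = probs(theta + h), probs(theta), probs(theta - h)
    second = [(pp - 2 * p + pm) / h ** 2
              for pp, p, pm in zip(p_plus, p0, p_minus)]
    # p_j0 + p_j1 + p_j2 + p_j3 = 1 for every theta, so the second
    # derivatives cancel (up to finite-difference rounding error):
    assert abs(sum(second)) < 1e-6
```

The same cancellation holds at any *θ*, since the numerator of the summed difference quotient is (1 – 2 + 1)/h² regardless of where the probabilities are evaluated.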

1. de Ayala RJ. The Theory and Practice of Item Response Theory. The Guilford Press; New York, NY: 2009.

2. Embretson SE, Reise SP. Item response theory for psychologists. LEA; Mahwah, NJ: 2000.

3. Lord FM. Application of item response theory to practical testing
problems. Erlbaum; Hillsdale, NJ: 1980.

4. van der Linden W, Hambleton RK. Handbook of modern item response theory. Springer-Verlag; New York: 1997.

5. Baker FB, Kim SH. Item Response Theory: Parameter Estimation Techniques. Marcel Dekker, Inc.; New York: 2004.

6. Bjorner JB, Chang CH, Thissen D, Reeve BB. Developing tailored instruments: item banking and computerized
adaptive assessment. Qual Life Res. 2007;16(Suppl 1):95–108. [PubMed]

7. Reeve BB, Hays RD, Bjorner JB, Cook KF, Crane PK, Teresi JA, Thissen D, Revicki DA, Weiss DJ, Hambleton RK, et al. Psychometric evaluation and calibration of health-related quality of life item banks: plans for the patient-reported outcomes measurement information system (PROMIS). Med Care. 2007;45:S22–31. [PubMed]

8. Cella D, Riley W, Stone A, Rothrock N, Reeve B, Yount S, Amtmann D, Bode R, Buysse D, Choi S, et al. The patient-reported outcomes measurement information system (PROMIS) developed and tested its first wave of adult self-reported health outcome item banks: 2005-2008. J Clin Epidemiol. 2010;63(11):1179–94. [PMC free article] [PubMed]

9. Baldwin P, Bernstein J, Wainer H. Hip psychometrics. Statistics in Medicine. 2009;28:2277–2292. [PubMed]

10. Li H, Hong F. Cluster-rasch models for microarray gene expression
data. Genome Biol. 2001;2(8) RESEARCH0031. [PMC free article] [PubMed]

11. R Development Core Team . R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; Vienna, Austria: 2010. URL http://www.R-project.org/, ISBN 3-900051-07-0.

12. Lunn DJ, Thomas A, Best N, Spiegelhalter D. WinBUGS – a Bayesian modelling framework: concepts, structure, and extensibility. Statistics and Computing. 2000;10:325–337.

13. Food and Drug Administration [December, 2009];Guidance for industry. Patient-reported outcome measures: Use in medical product development to support labeling claims. http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/UCM193282.pdf 2009.

14. Food and Drug Administration [December, 2009];Guidance for industry. Patient-reported outcome measures: Use in medical product development to support labeling claims. http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/UCM155480.pdf 2006.

15. European Medicines Agency Committee for medicinal products for human
use (CHMP) [December, 2009];Reflection paper on the regulatory guidance for the use of
health-related quality of life (hrql) measures in the evaluation of
medicinal products. http://www.emea.europa.eu/pdfs/human/ewp/13939104en.pdf
2005.

16. Rasch G. Probabilistic models for some intelligence and attainment tests. The University of Chicago Press; Chicago: 1980.

17. Weesie J. [December, 2009];The Rasch model in Stata. http://www.stata.com/support/faqs/stat/rasch.html 1999.

18. Agresti A. Categorical data analysis. Wiley; Hoboken, NJ: 2002.

19. Muraki E. A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement. 1992;16:159–176.

20. Fox JP. Bayesian Item Response Modeling: Theory and Applications. Springer; New York: 2010.

21. Curtis SM. Bugs code for item response theory. Journal of Statistical Software. 2010;36:1–34.

22. Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A. Bayesian measures of complexity and fit. Journal of the Royal Statistical Society, SeriesB. 2002;64:583–639.

23. Masters GN. A Rasch model for partial credit scoring. Psychometrika. 1982;47(2):149–174.

24. Thissen D. Marginal maximum likelihood estimation for the one-parameter
logistic model. Psychometrika. 1982;47(2):175–186.

25. Birnbaum A. Some latent trait models and their use in inferring an examinee's
ability. In: Lord FM, Novick MR, editors. Statistical theories of mental test scores. Addison-Wesley; Reading, MA: 1968. pp. 397–479.

26. Thissen D, Steinberg L. A taxonomy of item response models. Psychometrika. 1986;51(4):567–577.

27. Lord FM, Novick MR, editors. Statistical Theories of Mental Test Scores. Addison-Wesley; Reading, MA: 1968.

28. Bock RD, Aitkin M. Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika. 1981;46(4):443–459.

29. Albert JH. Bayesian estimation of normal ogive item response curves using Gibbs sampling. Journal of Educational Statistics. 1992;17:251–269.

30. Bafumi J, Gelman A, Park DK, Kaplan N. Practical issues in implementing and understanding bayesian ideal
point estimation. Political Analysis. 2005;13(2):171–187.

31. Lee PM. Bayesian Statistics: An Introduction. 3rd edn. Arnold Publishers; London: 2004.

32. Patz RJ, Junker BW. Applications and extensions of mcmc in irt: Multiple item types,
missing data, and rated responses. Journal of Educational and Behavioral Statistics. 1999;24(4):342–366.

33. de la Torre J, Stark S, Chernyshenko OS. Markov chain monte carlo estimation of item parameters for the
generalized graded unfolding model. Applied Psychological Measurement. 2006;30(3):216–232.

34. Martin AD, Quinn KM, Park JH. MCMCpack: Markov chain Monte Carlo (MCMC) Package. 2010 URL http://CRAN.R-project.org/package=MCMCpack, r package
version 1.0-8.

35. Congdon P. Bayesian Statistical Modeling. John Wiley & Sons Ltd.; Chichester, England: 2006.

36. Revelle W. psych: Procedures for Psychological, Psychometric, and Personality
Research. Northwestern University; Evanston, Illinois: 2010. URL http://personality-project.org/r/psych.manual.pdf, r package
version 1.0-90.

37. Goldberg L. The development of markers for the big-five factor
structure. Psychological Assessment. 1992;4:26–42.

38. Gelman A, Rubin DB. Inference from iterative simulation using multiple sequences. Statistical Science. 1992;7(4):457–511.

39. Gelman A, Hill J. Data analysis using regression and multilevel/hierarchical
models. Cambridge University Press; New York: 2007.

40. Jackman S. Bayesian Analysis for the Social Sciences. John Wiley & Sons, Ltd.; Chichester, United Kingdom: 2009.

41. Ntzoufras I. Bayesian Modeling Using WinBUGS. John Wiley & Sons, Inc.; Hoboken, NJ: 2009.

42. Brooks SP, Roberts GO. Assessing convergence of markov chain monte carlo
algorithms. Statistics and Computing. 1998;8:319–335.

43. Cowles MK, Carlin BP. Markov chain monte carlo convergence diagnostics: A comparative
review. Journal of the American Statistical Association. 1996;91:883–904.

44. Geweke J. Evaluating the accuracy of sampling-based approaches to
calculating posterior moments. In: Bernardo J, Berger J, Dawid A, Smith A, editors. Bayesian Statistics. Vol. 4. Claredon Press; Oxford, UK: 1992. pp. 169–194.

45. Heidelberger P, Welch PD. A spectral method for confidence interval generation and run
length control in simulations. Communication of the ACM. 1981;24:233–245.

46. Raftery A, Lewis S. How many iterations in the gibbs sampler? In: Bernardo J, Berger J, Dawid A, Smith A, editors. Bayesian Statistics. Vol. 4. Claredon Press; Oxford, UK: 1992. pp. 763–774.

47. Nunnally JC, Bernstein IH. Psychometric Theory. 3rd edn. McGraw-Hill; New York: 1994.

48. Bond TG, Fox CM. Applying the rasch model: Fundamental measurement in the human
sciences. Lawrence Erlbaum Associates; Mahwah, NJ: 2001.

49. Karabatsos G. Comparing the aberrant response detection performance of
thirty-six person-fit statistics. Applied Measurement in Education. 2003;16:277–298.

50. Meijer RR, Sijtsma K. Methodology review: Evaluating person fit. Applied Psychological Measurement. 2001;25:107–135.

51. Muraki E. Information functions of the generalized partial credit
model. Applied Psychological Measurement. 1993;17:351–363.

52. Samejima F. Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph, No. 17. Psychometric Society; Richmond, VA: 1969. Retrieved in August, 2011 from http://www.psychometrika.org/journal/online/MN17.pdf.

53. Hogg RV, Craig AT. Introduction to Mathematical Statistics. Prentice Hall; Englewood Cliffs, NJ: 1995.

54. Samejima F. Some critical observations of the test information function as a
measure of local accuracy in ability. Psychometrika. 1994;59(3):307–329.

55. Samejima F. A general model for free-response data. Psychometrika Monograph, No. 18. 1972;37(1, Pt. 2)

56. Juniper EF, Guyatt GH, Feeny DH, Griffith LE, Ferrie PJ. Minimum skills required by children to complete health-related
quality of life instruments for asthma: Comparison of measurement
properties. Eur Respir J. 1997;10(10):2285–2294. English. [PubMed]

57. Downing SM. Validity: on meaningful interpretation of assessment
data. Medical Education. 2003;37(9):830–837. [PubMed]

58. De Boeck P, Wilson M. Explanatory Item Response Models: A Generalized Linear and Nonlinear
Approach. Spinger-Verlag; New York: 2004.

59. de la Torre J, Douglas JA. Model evaluation and multiple strategies in cognitive diagnosis:
An analysis of fraction subtraction data. Psychometrika. 2008;73:595–624.

60. Junker BW, Sijtsma K. Cognitive assessment models with few assumptions, and connections
with nonparametric item response theory. Applied Psychological Measurement. 2001;25:258–272.

61. Plummer M. JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003); Vienna, Austria. March 20 – 22, 2003.

62. Lee SH, Terry R. IRT-FIT: SAS macros for fitting item response theory (IRT) models. [August 31, 2011];2005 from http://www2.sas.com/proceedings/sugi30/204-30.pdf.

63. Lee SH, Terry R. [August 31, 2011];MDIRT-FIT: SAS macros for fitting multidimensional item response theory models. 2011 at http://www2.sas.com/proceedings/sugi31/199-31.pdf.

64. Pan T, Chen Y. [August 31, 2011];Using PROC LOGISTIC to estimate the Rasch model. 2011 from http://support.sas.com/resources/papers/proceedings11/342-2011.pdf.

65. Hardouin JB. Rasch analysis: Estimation and tests with
raschtest. Stata Journal. 2007;7(1):22–44.

66. Weesie J. [August 31, 2011];The Rasch model in Stata. 1999 from http://www.stata.com/support/faqs/stat/rasch.html.

67. Zajonc T. [August 31, 2011];OpenIRT – Bayesian and maximum likelihood estimation of item response theory (IRT) models in Stata. 2011 at http://www.people.fas.harvard.edu/tzajonc/openirt.html.

68. Sheu CF, Chen CT, Su YH, Wang WC. Using sas proc nlmixed to fit item response theory
models. Behavioral Research Methods. 2005;37:202–218. [PubMed]

69. Fisher RA. Theory of statistical estimation. Mathematical Proceedings of the Cambridge Philosophical Society. 1925;22:700–725.

70. Dodd B, Koch W. Effects of variations in item step values on item and test
information in the partial credit model. Applied Psychological Measurement. 1987;11(4):371–384.

71. Berger JO. Statistical Decision Theory and Bayesian Analysis. Springer-Verlag; New York: 1985.

72. Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. 2 edn. Chapman & Hall; New York: 2003.

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |