Struct Equ Modeling. Author manuscript; available in PMC 2017 September 19.
PMCID: PMC5603211; NIHMSID: NIHMS835935

Test Reliability at the Individual Level

Abstract

Reliability has a long history as one of the key psychometric properties of a test. However, a given test might not measure all people equally reliably; test scores from some individuals may contain considerably more error than others. This study proposed two approaches that use intraindividual variation to estimate test reliability for each person. A simulation study suggested that both the parallel-tests approach and the structural equation modeling approach recovered the simulated reliability coefficients. Then, in an empirical study in which forty-five females were measured daily on the Positive and Negative Affect Schedule (PANAS) for 45 consecutive days, separate estimates of reliability were generated for each person. Results showed that reliability estimates of the PANAS varied substantially from person to person. The methods provided in this article apply to tests measuring changeable attributes and require repeated measures across time for each individual. This article also provides a set of parallel forms for the PANAS.

Keywords: Individual Reliability, PANAS, Parallel Tests, SEM

Introduction

Reliability is a generic concept referring to the precision of measurement (Lord & Novick, 1968, p.139). In classical test theory (CTT, see Gulliksen, 1950; Lord & Novick, 1968; Allen & Yen, 1979), an observed score xpt for person p at time t is a sum of a true score τpt and a measurement error εpt (see Equation 1), and the reliability coefficient ρ, a quantitative index of reliability, is defined as the ratio of the variance of true scores to the variance of observed scores (see Equation 2).

x_{pt} = \tau_{pt} + \varepsilon_{pt}
(1)
\rho = \frac{\sigma^2_{\tau}}{\sigma^2_{x}}
(2)

Conventional approaches to estimating reliability (e.g., test-retest reliability) typically treat true scores as time invariant, so the variance of true scores is usually obtained not from one person but across a group of people. If we assume that error variances are equal across persons, Equation 3 shows one way to estimate the reliability coefficient ρt: compute the variance of true scores and the variance of observed scores across persons at a single time t.

\hat{\rho}_t = \frac{\hat{\sigma}^2_{\tau_{pt}}}{\hat{\sigma}^2_{x_{pt}}} = \frac{\sum_{p=1}^{P} (\tau_{pt} - \bar{\tau}_{.t})^2 / P}{\sum_{p=1}^{P} (x_{pt} - \bar{x}_{.t})^2 / P} \qquad (p = 1, \ldots, P)
(3)

However, this conventional approach, which calculates reliability from variation across persons, assumes that error variances are equal from person to person, i.e., that the reliability coefficient is the same for everyone, which in reality may not be the case. What if error variance is not homogeneous across people? Assume instead that the true score of a person varies with time and that error variances are equal across time, but not necessarily across people. Then, if we measure one person multiple times, we can use Equation 4 to estimate the reliability coefficient ρp for person p.

\hat{\rho}_p = \frac{\hat{\sigma}^2_{\tau_{pt}}}{\hat{\sigma}^2_{x_{pt}}} = \frac{\sum_{t=1}^{T} (\tau_{pt} - \bar{\tau}_{.p})^2 / T}{\sum_{t=1}^{T} (x_{pt} - \bar{x}_{.p})^2 / T} \qquad (t = 1, \ldots, T)
(4)

This equation allows us to ask a new question: Do the values of the individual reliability coefficient ρp differ from person to person? If they are meaningfully different, is it still appropriate to estimate reliability only at the population level? The present study attempts to answer these questions.
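To make the contrast concrete, here is a minimal R sketch (all names and parameter values are hypothetical) that simulates true and observed scores with person-specific error variances and computes both estimators; var uses the sample denominator rather than the population denominator in Equations 3 and 4, which is immaterial for the illustration.

    set.seed(42)
    P <- 100; T <- 100
    tau <- matrix(rnorm(P * T), P, T)                 # true scores: rows = persons, columns = times
    err_sd <- runif(P, 0.3, 1.5)                      # error SD differs from person to person
    x <- tau + matrix(rnorm(P * T, sd = rep(err_sd, T)), P, T)
    rho_t <- apply(tau, 2, var) / apply(x, 2, var)    # Equation 3: across persons at each time t
    rho_p <- apply(tau, 1, var) / apply(x, 1, var)    # Equation 4: across time for each person p
    range(rho_p)                                      # wide spread: reliability differs by person

Because err_sd varies by person, rho_p spreads widely, whereas each rho_t averages over everyone and varies little.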

In the past three decades, reliability theory has developed substantially (Traub, 1994). Many modern psychometric texts emphasize that reliability is a property of the scores from a particular test in a particular population, or is "class-specific" (Raykov & Marcoulides, 2015). However, estimating reliability for a single person is still very rare. Lord and Novick (1968) considered reliability for the time series of a single subject under the assumption that the true score does not vary. In their method, the true score variance is obtained across N persons, and the error variance is obtained across r repeated measures of one person, assuming the true score is constant across those r repeated measures. Later, Lumsden (1977) proposed using person characteristic curves from item response theory (IRT) to estimate person reliability. In practice, however, reliability is still estimated for a population rather than for individual population members. This study is another attempt to advocate the estimation of reliability at the individual level. We argue that when measuring attributes that change meaningfully (i.e., for person p, the true score τpt is not constant across time), reliability could, and perhaps should, be estimated at the individual level.

Interindividual Variation vs. Intraindividual Variation

We suggest estimating reliability at an individual level not only because many psychological tests are designed to measure individuals, but also because using reliability obtained from a pooled sample to infer how reliably a test measures within-individual changes is mathematically incorrect. The reasoning is as follows. In Equation 3, the variances of τ and x are interindividual; in Equation 4, they are intraindividual. Molenaar (2004) pointed out that generalizing about intraindividual variation from interindividual variation is a common mistake among psychologists, who often analyze data by pooling scores across individuals and then use the conclusions to explain within-individual changes, such as development, learning performance, and emotion fluctuation. According to mathematical and statistical theorems from ergodic theory (Petersen, 1983), an ergodic process is one in which the structures of interindividual variation and intraindividual variation are equivalent. However, most psychological processes are nonergodic, and the structures of interindividual and intraindividual variation can differ arbitrarily or be completely dissimilar (Q. Zhang & Wang, 2014). Figure 1 illustrates a nonergodic process in which variable X and variable Y are positively correlated interindividually but negatively correlated intraindividually; in this case, the conclusion drawn from interindividual variation cannot be generalized to intraindividual variation. Hence, Molenaar (2004, p.209) asserted, "claims based on classical test theory that a test is valid and reliable cannot be generalized to individual assessments of development, learning, or any other non-stationary process". Thus, to evaluate how reliably a test measures within-individual changes, perhaps we should use intraindividual variation (Equation 4) rather than interindividual variation (Equation 3) for the estimation.
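The following hypothetical R sketch generates data of the kind depicted in Figure 1: person means push X and Y in the same direction, while within-person deviations push them in opposite directions.

    set.seed(1)
    P <- 5; T <- 50
    mu <- sort(rnorm(P))                              # person means, shared by X and Y
    dat <- do.call(rbind, lapply(1:P, function(p) {
      e <- rnorm(T)
      data.frame(person = p, x = mu[p] + e, y = mu[p] - e)  # within a person, x and y move oppositely
    }))
    cor(tapply(dat$x, dat$person, mean),
        tapply(dat$y, dat$person, mean))              # interindividual correlation: near +1
    mean(sapply(split(dat, dat$person),
                function(d) cor(d$x, d$y)))           # intraindividual correlation: near -1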

Figure 1
Illustration of a nonergodic process.

Trait vs. State

Using Equation 4 to estimate reliability requires that the true score of an individual on the attribute being measured be a variable that can take different values at different time points; otherwise, if τpt for person p is a constant, the variance of τpt across time is always zero. Note that τpt being a variable does not mean it must take different values at different time points: a variable can take any value, including the same value at every time point. After a review of the literature, we found that the distinction between trait and state provides some insight into the nature of the true score.

Traits are characterized by temporal and cross-situational stability, whereas states manifest intraindividual variability or genuine changes in behavior (Nesselroade, 1988). For example, irascibility is a trait, indicating how easily a person is provoked to anger, whereas anger is a state of strong displeasure and belligerence aroused by a wrong. Researchers began to notice the state aspect of behavior when low correlations were obtained by measuring the same behavior in different situations, and the "trait-state" discussion has never stopped since. Some differentiate traits from states by whether they are influenced by genes or by environment (Falconer, 1960). However, Kraemer, Gullion, Rush, Frank, and Kupfer (1993) argued that traits can be either genetic or environmental in origin, such as scars of life experiences or prior episodes of illness. Thus the sources of traits and states may not be that different. The real distinction between state and trait lies in the nature of intraindividual variability: when a test measures a state, the true score of an individual is a random variable that can take different values, whereas when a test measures a trait, the true score of an individual is constant, at least in the short term. Because we study intraindividual variability, this article is concerned only with tests that measure states. Specifically, the state variables of interest should exhibit real change over relatively short spans of time, and any systematic change (such as developmental growth or decline) or non-systematic change (such as short-term fluctuations), whether linear or nonlinear (Wang & Grimm, 2012), is regarded as real change.

Redefining the True Score

Until now, we have mentioned the true score several times without defining it. A person's true score is commonly defined as the expectation over an infinite number of independent administrations of the test; under this definition, all variation in repeated observed scores is due to measurement error, which does not apply when a person experiences real change. Zimmerman (1975) defined the true score as the conditional expectation of a test score when the conditioning is taken with respect to the subject. Informed by this definition, true scores in this article represent properties not of a person but of "a person-in-a-situation" (Steyer, Mayer, Geiser, & Cole, 2015). If we could obtain infinite observations from one individual p at time t, the true score τpt would be the mathematical expectation of those observations. So for repeated measures administered at different time points, the variation in observed scores includes not only measurement error but also long-term developments and short-term fluctuations. In this article, both long-term and short-term changes are considered true score changes.

Approaches to Estimating Reliability

To estimate reliability, we need to disentangle the true score variability from the measurement error variability. One classical device is the parallel test. According to Lord and Novick (1968), two tests are parallel if they have identical true scores (τ-equivalence), their error variables are uncorrelated, and they have identical error variances. They also showed that the correlation between two parallel forms of a test equals the reliability coefficient ρ. In Equation 5, ρpar stands for the reliability coefficient estimated with the parallel approach, and X and X′ stand for the observed-score variables from the two parallel tests.

\hat{\rho}_{par} = \mathrm{cor}(X, X')
(5)

Traditionally, the correlation is calculated between two tests among multiple persons at one time. To estimate individual reliability, we calculate the correlation between two tests administered at multiple times for one person. Using parallel tests to estimate reliability is not an innovation. However, using parallel tests with longitudinal data is a novel practice. By this method, test reliability can be estimated at the individual level.
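In R, given a long-format data frame of one row per person-occasion with total scores on the two forms (the layout and column names here are hypothetical), Equation 5 becomes one correlation per person:

    # 'scores' has columns person, form1, form2; form1 and form2 are total scores per occasion
    r_par <- sapply(split(scores, scores$person),
                    function(d) cor(d$form1, d$form2))   # one reliability estimate per person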

Besides parallel-form reliability, if a test has multiple items, we can also estimate reliability using an SEM method (e.g., Raykov & Shrout, 2002; Green & Yang, 2009; Raykov & Marcoulides, 2015). Suppose a test uses multiple items to measure one latent state. The observed score of each item comprises a true score and a measurement error. Assume the measurement error of each item is uncorrelated with that of the other items. The common latent factor represents the part of the true score shared with the other items; the remaining part of the true score is specific to the item and, together with the measurement error, constitutes the unique part of each item. If we are interested in measuring the latent common factor, the larger the common part is relative to the unique part, the more reliable the test. According to Raykov and Marcoulides (2015), the reliability coefficient can be expressed as Equation 6, where λk is the factor loading of item k (k = 1, ..., K), σ²F is the variance of the factor, and σ²Uk is the variance of the unique part of item k.

\rho_{SEM} = \frac{\sigma^2_F \left( \sum_{k=1}^{K} \lambda_k \right)^2}{\sigma^2_F \left( \sum_{k=1}^{K} \lambda_k \right)^2 + \sum_{k=1}^{K} \sigma^2_{U_k}}
(6)

We can use SEM to estimate individual reliability. For person p with T measurement occasions on K items, we can fit a confirmatory factor analysis to these items using his or her T occasions of measurement and then substitute the sample estimates of the population parameters into Equation 6 to obtain the test reliability for this person.
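As a minimal base-R sketch of this procedure, factanal fits a standardized one-factor model (so the factor variance is 1) to a hypothetical T x K matrix items holding one person's repeated item responses; the authors' own analyses use OpenMx (see the simulation section and Appendix 1).

    fa <- factanal(items, factors = 1)                # one-factor CFA on one person's repeated measures
    l  <- fa$loadings[, 1]                            # standardized factor loadings
    u  <- fa$uniquenesses                             # unique variances
    r_sem <- sum(l)^2 / (sum(l)^2 + sum(u))           # Equation 6 with the factor variance fixed at 1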

The Present Study

The principal goal of this study is to put forth the idea of “individual reliability” by analyzing intraindividual variance, and apply it to tests known to measure state rather than trait attributes. We used two approaches to estimate individual reliability: the parallel form approach and the SEM approach.

Compared with the conventional way of estimating reliability, this study makes the following requirements and assumptions. (1) We study a sample of responses across many times for one individual (intraindividual variation) instead of a sample of responses across many individuals at a single time (interindividual variation). (2) We assume that the true score of one individual across time is a random variable, so that the person's true score can take different values across time. (3) We assume that error variances are equal across time for a given individual but may vary across individuals. The conventional approach, calculating reliability from variation across persons, therefore cannot identify individual differences in reliability, but our approach can.

The remainder of this article is organized as follows. First, several aspects of individual differences in test responses are simulated, including the variance of measurement error, the factor loadings, and the distribution of the latent factor. Then the influences of these sources of individual differences on reliability are investigated separately. Next, a set of empirical data is analyzed from both an interindividual variation perspective and an intraindividual variation perspective. The differences between interindividual variation models and intraindividual variation models are discussed, and, as the core focus of this article, the individual differences in reliability in the intraindividual models are highlighted.

Individual Reliability of Simulated Data

Throughout the simulation section, we use k to denote items, p to denote persons, and t to denote time. Five persons' responses to a one-factor, 3-item test across 100 measurement occasions were simulated by the following procedure. First, we simulated 100 factor scores Fpt for person p at times t = 1, ..., 100. Then the observed score xkpt for person p at time t on item k was generated by Equation 7, where λkp is the factor loading of item k for person p, Fpt is the factor score of person p at time t, and Ukpt is the unique part of item k for person p at time t.

x_{kpt} = \lambda_{kp} F_{pt} + U_{kpt}
(7)

We also simulated a parallel form of this test by using the same factor scores and factor loadings to generate another set of item scores. Thus the two parallel forms share the same factor scores but have different item scores.
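A sketch of this generating scheme for a single person follows; the loading and unique-variance values are hypothetical, and both forms reuse the same factor scores while drawing fresh unique parts.

    set.seed(7)
    T <- 100; K <- 3
    lambda <- c(0.8, 0.7, 0.9)                        # hypothetical loadings for this person
    u_sd   <- rep(0.6, K)                             # hypothetical unique-part SDs
    F_pt   <- rnorm(T)                                # factor scores F_pt
    gen_form <- function()                            # Equation 7, one row per occasion
      t(sapply(F_pt, function(f) lambda * f + rnorm(K, sd = u_sd)))
    form_A <- gen_form()
    form_B <- gen_form()                              # parallel form: same F_pt and lambda, new uniques
    cor(rowSums(form_A), rowSums(form_B))             # sample parallel-form reliability r_par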

Simulation Conditions

Depending on the constraints on λkp and the distributions of Fpt and Ukpt, we had four simulation conditions, as shown in Table 1.

Table 1
Simulated Parameters in Four Cases

Case 1: Individual differences in unique variances

The true score Fpt follows a normal distribution. We used the R function rnorm with mean 0 and standard deviation 1 to simulate 100 true scores Fpt for each person. We then constrained the factor loadings λkp to be equal across individuals and let the unique variances σ²Ukp differ from person to person. A confirmatory factor analysis (CFA) was then conducted on the simulated data separately for each person using the OpenMx package (Boker et al., 2011) in R. The data structure is the same as for a regular CFA, where the columns are items and the rows are observations; the observations usually come from different persons, but in this study they come from different time points of one person. Users can construct separate data sets for each person or write a loop to fit the CFA model separately for each person. Within the CFA model, we also estimated the reliability coefficient ρSEM based on Equation 6 and its confidence interval using the functions mxAlgebra and mxCI. The R code for the CFA model can be found in Appendix 1.
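Appendix 1 contains the authors' full code; as a rough sketch under assumed starting values, a per-person OpenMx model of this kind estimates Equation 6 and its confidence interval, where person_data (columns x1 to x3, one row per occasion) is a hypothetical data frame.

    library(OpenMx)
    items <- c("x1", "x2", "x3")
    cfa <- mxModel("cfa", type = "RAM",
      manifestVars = items, latentVars = "F",
      mxPath(from = "F", to = items, arrows = 1, free = TRUE,
             values = 0.8, labels = c("l1", "l2", "l3")),        # factor loadings
      mxPath(from = items, arrows = 2, free = TRUE,
             values = 0.5, labels = c("u1", "u2", "u3")),        # unique variances
      mxPath(from = "F", arrows = 2, free = FALSE, values = 1),  # factor variance fixed at 1
      mxPath(from = "one", to = items, arrows = 1, free = TRUE), # item means (needed for raw data)
      mxData(observed = person_data, type = "raw"),
      mxAlgebra((l1 + l2 + l3)^2 /
                ((l1 + l2 + l3)^2 + u1 + u2 + u3), name = "rel"),  # Equation 6 with factor variance 1
      mxCI("rel"))
    fit <- mxRun(cfa, intervals = TRUE)               # intervals = TRUE computes the mxCI
    summary(fit)$CI                                   # point estimate and 95% CI of the reliability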

Case 2: Individual differences in factor loadings

The second set of simulated data used the same procedure to produce true scores, but constrained unique variances σ2Ukp to be invariant across individuals, and allowed factor loadings λkp to vary by individual. Then, a CFA model was conducted for each person.

Case 3: Unique variances and factor loadings invariant across individuals

The third set of simulated data also used the same procedure to produce the true score, and constrained unique variances σ2Ukp and factor loadings λkp to be invariant across persons. Again, a CFA model was conducted for each person.

Case 4: Violation of the Independence Assumption

Since we conducted CFA on repeated measures from one person, it is natural to suspect a violation of the assumption that observations are independent. To investigate the influence of this violation on the estimation of the reliability coefficient, we simulated a time series of 100 observations for one person. The true scores of this person were generated with the R function arima.sim as a first-order autoregressive time series. The unique variances and factor loadings were constrained to be the same as in case 3. The only difference between case 3 and case 4 is thus the distribution of the true scores: in case 3 they follow a normal distribution, whereas in case 4 they follow an ARIMA(1,0,0) first-order autoregressive process. We compared the estimates from case 3 and case 4 to investigate the influence of violating the independence assumption.
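For reference, AR(1) true scores of the kind used in case 4 can be generated as below; the autoregressive coefficient of 0.5 is an assumed value, not necessarily the one used in the study.

    set.seed(4)
    F_ar <- as.numeric(arima.sim(model = list(ar = 0.5), n = 100))  # ARIMA(1,0,0) true scores
    # item scores then follow Equation 7 exactly as in case 3, with F_ar in place of normal draws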

Reliability under the Four Cases

Three types of reliability coefficients were obtained for each person separately under each case. The population SEM reliability (ρSEM) was obtained by substituting the simulated population parameters into Equation 6. The sample SEM reliability (rSEM) was obtained from the CFA model. The sample parallel form reliability (rpar) was obtained by calculating the correlation between sum scores of two parallel forms. We used three methods to obtain the reliability coefficient because when analyzing empirical data, there is no way to know the simulated parameters (i.e., population parameters), but the estimated parameters and the correlation between observed scores of parallel forms can still be obtained. We were interested in seeing how well using SEM to estimate reliability or using parallel forms to estimate reliability would approximate the results of using simulated parameters to calculate reliability. The latter method was possible only because of our simulation work.

Table 2 shows the population and sample reliability coefficients, and Figure 2 shows the confidence intervals of the two sample reliability coefficients. According to Table 2, when individuals had different-sized factor loadings or unique variances, substantial differences in reliability estimates were found: larger unique variances and lower factor loadings resulted in lower reliability. In case 1, when the mean unique variance increased from 0.04 to 1.00, ρSEM decreased from 0.96 to 0.52. In case 2, when the mean factor loading increased from 0.40 to 1.00, ρSEM increased from 0.57 to around 0.84. In Figure 2, the confidence intervals of the reliability coefficient for one person do not always overlap with those for another in cases 1 and 2, indicating significant individual differences. These substantial individual differences in the magnitude of reliability estimates lend considerable support to estimating reliability at the individual level. The simulated person with only moderate reliability is not measured well by the simulated test; if these were real data, practitioners might wish to interpret that individual's scores with greater caution.

Figure 2
Confidence intervals of the sample reliability coefficients for each person.
Table 2
Estimates and 95% CIs of Individual Reliability under Four Simulation Conditions

Table 2 also suggests that the sample reliabilities estimated by both the SEM approach and the parallel approach were close to the population reliabilities. Although the values of the estimated reliability were not exactly the same, the ranking of reliability of the five persons is the same for the three methods. The Spearman’s correlation between ρSEM and rSEM is 0.955, p < .001, and the Spearman’s correlation between ρSEM and rpar is 0.932, p < .001. This result implies that with real data parallel form reliability estimates and SEM reliability estimates are quite acceptable.

In addition, the estimates of the reliability coefficient in case 4 (rSEM = 0.77 and rpar = 0.78) were similar to the estimates in case 3 and fell within the confidence intervals of every person in case 3. This result showed little sign of bias due to the violation of the independence assumption. It may therefore be safe to conduct CFA on repeated measures and still obtain unbiased parameter estimates. In fact, for factor analysis the order of the data does not matter: we could shuffle the observations from different time points and obtain exactly the same parameter estimates, just as the order of participants does not matter when we conduct a regular CFA on a group of people.

In sum, our simulation work reinforced the use of SEM and parallel forms to estimate reliability, and lent support to the justification of estimating reliability at the individual level using repeated measures. It remains important to provide evidence from real data that individual reliabilities indeed differ as we simulated, to bolster the case for estimating individual reliabilities in practice.

Individual Reliability of PANAS

Data Source

Individual differences in the reliability of a widely used psychological scale, the PANAS (Positive and Negative Affect Schedule; Watson, Clark, & Tellegen, 1988), were investigated. The PANAS includes two subscales: positive affect (PA) and negative affect (NA). Data came from the Twin Study of Hormones and Behavior across the Menstrual Cycle project (Klump et al., 2012) of the Michigan State University Twin Registry (Klump & Burt, 2006). To date, data collection has been completed on 418 female twins between the ages of 16 and 25. Each participant completed the PANAS once a day for 45 consecutive days.

Parallel Forms for PANAS

It would be ideal if a test had a parallel form ready to use; in reality, most tests do not. In that case, we can split the whole test into two parallel parts, calculate the correlation between the two parts, and then use the Spearman-Brown prophecy formula in Equation 8 (Spearman, 1910; Brown, 1910) to obtain the reliability of the whole test. In Equation 8, K is the number of items in the whole test, rK is the correlation between two parallel tests, and rK/2 is the correlation between the two parallel parts.

\rho_{par} = r_K = \frac{2 r_{K/2}}{1 + r_{K/2}}
(8)
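In R, Equation 8 is a one-liner; the input value of .70 below is purely illustrative.

    spearman_brown <- function(r_half) 2 * r_half / (1 + r_half)
    spearman_brown(0.70)                              # a half-test correlation of .70 implies about .82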

In the case of the PANAS, we had to split the items of each subscale (PA and NA) into two parallel parts. How could we accomplish this?

According to Gulliksen (1950), two tests are parallel when "it makes no difference which test you use." Gulliksen (1950, chap. 2, p. 11) also states that the means and standard deviations of two parallel tests should be equal. Thus, our target was to make the items in the two parts as similar as possible, and then to test whether the means and standard deviations of the total scores of the two sets of items differed. Methodologists have suggested several criteria for parallel items, such as equal factor loadings and equal correlations between item scores and sum scores. While these approaches are certainly effective, an item response theory approach allows us to estimate parameters that are particularly relevant to the present investigation. Using an item response model (IRM), we can estimate difficulty and discrimination parameters for the items on each subscale of the PANAS separately. Item difficulty and discrimination are estimated on the same scale as the person parameters in an IRM, so the item parameters provide a meaningful quantification of how difficult the items are for the present sample and how well they discriminate between participants with different levels of a given attribute, here PA and NA. The graded response model (GRM; Samejima, 1969), an IRM available in the R package ltm (Rizopoulos, 2006), was used because it is appropriate for Likert-type response scales and allows the estimation of both discrimination and difficulty parameters. Using the item parameters from the GRM, a test can be divided into two halves that are, on average, equally difficult for and equally discriminating of the participants in the present sample.
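A minimal sketch of this step, assuming pa_items is a data frame of Likert-type responses to the PA items (the name is hypothetical): because all PANAS items share the same five response categories, coef returns a matrix with the extremity (difficulty) parameters followed by the discrimination.

    library(ltm)
    fit  <- grm(pa_items)                             # graded response model
    pars <- coef(fit)                                 # per item: Extrmt1...Extrmt4, then Dscrmn
    discrim <- pars[, ncol(pars)]                     # per-item discrimination
    diffic  <- rowMeans(pars[, -ncol(pars)])          # average extremity as a summary difficulty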

Our data were obtained from a twin study, so we inherently have two sets of comparable data: the data of twin 1 and the data of twin 2. We took advantage of this and estimated the item properties separately in each set of data. Ideally, the properties of each item should hold constant across the two sets. Figure 3 illustrates the spatial relationship of the items in the two-dimensional space of difficulty and discrimination. The relative positions of the items remained essentially the same in the two sets of data, though the exact difficulty and discrimination values varied slightly.

Figure 3
Difficulty and discrimination parameters of PANAS items.

Hierarchical cluster analysis (via the R function hclust) using Ward’s method was used to cluster the items based on their difficulty and discrimination values. Figure 4 shows the hierarchical structure of PA items and NA items in both sets of data. Items under the same branch are more similar than items belonging to different branches. As shown in these tree structures, Alert and Attentive are close in both sets of data, which argues for treating them as parallel items. Likewise, Interested and Excited, Enthusiastic and Determined, Afraid and Scared, Ashamed and Guilty, Upset and Distressed were identified as parallel pairs.
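A sketch of the clustering step, assuming item_pars is an items x 2 matrix of the difficulty and discrimination values (a hypothetical name); hclust implements Ward's method as "ward.D" or "ward.D2".

    hc <- hclust(dist(scale(item_pars)), method = "ward.D2")
    plot(hc)                                          # dendrogram: nearby leaves suggest parallel pairs
    cutree(hc, k = 5)                                 # e.g., five clusters to pair ten PA items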

Figure 4
Cluster analysis of PANAS items based on their difficulty and discrimination.

Table 3 shows the items of the two parallel parts. We then tested whether the means and standard deviations of the test scores from the two parallel parts remained similar in both the twin-1 and twin-2 data. As shown in Table 3, the means and standard deviations of the test scores for the two parallel parts are approximately the same in both data sets, with all p's > .05. The parallel parts of the PANAS were therefore considered successfully constructed and were then used to estimate parallel-form reliability.
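Such equality checks can be run as below, where total1 and total2 are hypothetical vectors of total scores on the two parallel parts; since the scores come from the same person-occasions, a paired t test suits the means, while var.test is only a rough check for the variances because it assumes independent samples.

    t.test(total1, total2, paired = TRUE)             # equality of means
    var.test(total1, total2)                          # equality of variances (hence SDs), approximate here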

Table 3
Means and Standard Deviations of Parallel Forms for PANAS in Twin Data

Interindividual Model and Intraindividual Model of PANAS

We randomly selected a subset of the data consisting of 45 participants, all from different twin pairs. We purposely trimmed the data to make the numbers of individuals and measurement occasions comparable. Then, for each of the 45 days, a confirmatory factor analysis was conducted separately for the PA subscale and the NA subscale measured on the 45 persons. In both the PA model and the NA model, the variance of the factor was fixed at 1, and the factor loadings and error variances were freely estimated. We named these models "Interindividual Models"; since there are 45 days of measurement, there are 45 PA and 45 NA Interindividual Models in total. For each person separately, we fit the same factor model to the PA subscale and the NA subscale measured on 45 days. We named these models "Intraindividual Models"; since there are 45 participants, there are 45 PA and 45 NA Intraindividual Models.

Table 4 presents the sample parameter estimates of the Interindividual Models and the Intraindividual Models. We list the means and standard deviations of the factor loadings for Interested on PA and Nervous on NA as examples of typical factor loading results from these analyses. Similarly, the means and standard deviations of the unique variance estimates for Interested and Nervous are presented. As Table 4 shows, the factor loading estimates and unique variance estimates vary more across the Intraindividual Models than across the Interindividual Models. The parameters estimated in the Interindividual Models and Intraindividual Models were used to estimate SEM reliability.

Table 4
Examples of Means and Standard Deviations of Estimated Parameters in Interindividual Models and Intraindividual Models (n=45)

Interindividual Reliability vs. Intraindividual Reliability

Parallel forms and SEM methods were used to estimate reliability based on interindividual variation and intraindividual variation.

(1) Parallel form reliability

To estimate interindividual reliability, we summed the item scores of positive affect subscale part 1 and the item scores of positive affect subscale part 2, and then calculated the correlation between the total scores on the two parallel parts based on the 45 participants' observations from one day. Because we used two parallel parts rather than two parallel forms of the full PANAS, this correlation is not yet the reliability coefficient; the correlation coefficient rK/2 (K = 10) was therefore transformed into a reliability coefficient (rK) using Equation 8. We did the same for the negative affect subscale. To estimate intraindividual reliability, we used the same procedure except that the correlation was calculated based on one individual's 45 observations.

(2) SEM reliability

To estimate interindividual reliability, we substituted the parameters estimated in the Interindividual Models into Equation 6; to estimate intraindividual reliability, we substituted the parameters estimated in the Intraindividual Models into Equation 6.

The means and standard deviations of the interindividual reliability and the intraindividual reliability are shown in Table 5. The SEM approach and the parallel-form approach produced similar results, and there is greater variation in intraindividual reliability than in interindividual reliability. For example, for the positive affect subscale, the parallel-form reliabilities for different days have a mean of 0.87 with SD = 0.03, implying that reliability does not vary much across time when it is estimated from a pooled sample. In comparison, reliability has a much larger standard deviation across people: for the positive affect subscale, the parallel-form reliabilities for different persons have a mean of 0.75 with SD = 0.17. The PANAS is thus a reliable test for some people (one standard deviation above the mean, the reliability is 0.75 + 0.17 = 0.92 for the PA subscale and 0.76 + 0.13 = 0.89 for the NA subscale) but unreliable for others (one standard deviation below the mean, the reliability is 0.75 - 0.17 = 0.58 for the PA subscale and 0.76 - 0.13 = 0.63 for the NA subscale).

Table 5
Means and Standard Deviations of Estimated Reliabilities and the 95% CIs (n=45)

To visualize the individual differences in intraindividual reliability, we plotted the point estimates and CIs of the SEM reliability of the PA subscale for each person and compared them with the point estimates and CIs for each day. As shown in Figure 5, the values of the interindividual reliability are very similar and their CIs all overlap. In comparison, the values of the intraindividual reliability are quite different, and some of their CIs do not overlap at all. The interindividual model averages over individuals' measurement quality, so reliability across interindividual models shows a certain constancy with very small deviation. This illustrates why we are usually unaware of much variation in reliability across people. Had we not explored intraindividual variation, we would not have discovered the substantial individual differences present in test reliability, differences that have major consequences for the conclusions we draw from individuals' test scores.

Discussion

Restoring the individual as the unit of analysis in behavioral research has been the focus of much attention for many decades (see, e.g., Carlson, 1971). More recently, strong pleas have been made to realize this re-emphasis, ranging from arguments for more person-centered research approaches (Magnusson, 2003) to the compelling urging to make the individual the primary unit of analysis for studying human behavior (Molenaar, 2004). The work reported here is a response to these pleas. Compared with previous research on test reliability, this study is unique in two respects: 1) we obtained and interpreted reliability at the individual level; 2) we focused on tests measuring states rather than traits.

The primary reason we suggest examining test reliability as an individual-level phenomenon is that each individual is special in some sense. Although test users typically reevaluate reliability when a test is translated into another language or used with a distinctly different sample, as long as it remains in the same language and is used to measure members of the same population, a test's reliability is generally taken for granted. However, even within the same language, people can differ greatly in their understanding of words, their skill at using numbers to report feelings, their sensitivity to changes in psychological state, their working memory capacity, their boredom, their lack of interest in responding accurately, and so on. Moreover, whether a given individual is internally consistent in his or her definitions of words can also influence test reliability. For people whose word meanings are inconsistent over time, for whatever reason, changes in test scores are not an accurate estimate of real state change; rather, they may reflect changes in the understanding and usage of words, or any other construct-irrelevant variance. Test reliability for such people will likely be lower than for those with consistent usage and understanding of the items. We emphasize reliability at the individual level not only because people can differ in so many ways, but also because of the non-negligible size of the individual differences in reliability estimates our analyses revealed. A well-established test such as the PANAS can range from highly unreliable to highly reliable across people (see Figure 5). Thus modeling at the individual level seems a task well worth undertaking. We are not saying that all tests need to be viewed at the individual level, but at a minimum the user of a given test should be aware of the existence of large interindividual differences in reliability estimates. Treating reliability as a uniform property of tests can greatly oversimplify the complexity of psychological measures.

Targeting tests that measure changeable attributes is the major difference between this study and other methods of estimating individual reliability. Lord and Novick (1968) first provided a way to calculate true score variance and individual error variance. Although both their method and ours require repeated measures, the assumptions clearly differ: their method requires that the repeated measures share the same true score, whereas ours requires that the true score be changeable. Consequently, they used interindividual variance to obtain the true score variance, whereas we used intraindividual variance. Developments in item response theory have also promoted the idea of estimating reliability at the individual level in a variety of respects. For example, Lumsden (1977) suggested that measures of person fit and of the slope of the person response curve might be interpreted as indices of person reliability. The IRT method does not require repeated measures, but it does require many items distributed evenly across the range of the ability under consideration, which is distinctly different from our method. If we weight person, item, and time equally, conventional approaches such as Cronbach's alpha analyze variation across persons, IRT analyzes variation across items, and our method analyzes variation across time to estimate reliability. In sum, the uniqueness of this study is that we exploit the changeable true scores inherent in state measurement and use intraindividual variation to estimate reliability; our method may therefore have particular practical implications for the measurement of change and development.

Limitations and Implications

The parallel approach to estimating individual reliability has a major limitation: we used the same pair of parallel forms for all participants, which may seem contrary to the spirit of this article. However, it is impractical to find a reliable individualized pair of parallel forms for each participant based on a limited number of repeated measures. Besides the parallel approach, we therefore also provided the SEM approach. The two approaches achieved similar results: the correlation between the parallel reliability and the SEM reliability is 0.815 for the positive affect subscale and 0.822 for the negative affect subscale, p < .001. Yet the exact values of the parallel reliability and the SEM reliability differ. The un-personalized parallel forms might account for the difference; in addition, the parallel approach uses the sum score, which does not weight the items, whereas the SEM approach takes the different item loadings into account. In practice, if a test already has a well-established parallel form, test users can obtain individual reliability simply by calculating the correlation between the total scores of the two parallel forms, so individual reliabilities can be estimated even without knowledge of SEM. On the other hand, the SEM approach is very useful for advanced users when a parallel test is not available.

Estimating reliability at the individual level opens more possibilities for future research. The estimated idiographic reliability can be saved for further use; for example, it can serve as a dependent variable. Test developers can explore factors influencing the reliability of a test and, based on the results, refine the test or clarify its scope. The estimated individual reliability can also be used as a screening variable, such that a longitudinal study could exclude those whose changeable attributes cannot be measured well by a given test. Moreover, estimating reliability at the individual level may have important implications for diagnostics, in which the individual is the center of concern, so that an optimal matching of instrument to individual may lead to more meaningful recommendations for the individual. For some time now, we have witnessed the rapid development of statistical techniques for analyzing intraindividual change. In the future, these developments will no doubt be accompanied by the rapid development of portable devices and by more and more data being collected within individuals, which will further reinforce the estimation of psychometric properties at the individual level.

Another area we would like to investigate further is the nature of individual differences in test reliability. The example data we used came from a twin study. Twin data were not essential for constructing the parallel parts of the PANAS, yet they naturally provided us with a training set and a testing set. In this study, we used the twin data for cross-validation but did not conduct any behavioral genetic analysis, which is beyond the scope of this study. In the future, monozygotic and dizygotic twins could be compared to explore the factors that drive individual differences in test reliability.

This study provides additional evidence that psychological processes are nonergodic. We showed that models based on interindividual variance differ substantially from models based on intraindividual variance; in the work reported here, the Intraindividual Models show more variation in factor loadings and unique variances than do the Interindividual Models. When we enter the world of intraindividual variation, matters change, and much of what we "know" from the world of interindividual variation no longer applies. Researchers should therefore not readily generalize results from interindividual approaches to intraindividual approaches, because the two are, in some sense, incommensurable. The interindividual approach is very useful for explaining individual differences, whereas the intraindividual approach deserves more attention in the study of change and development. As a science targeting the behavior of the individual, psychology should devote considerable effort to exploring the intraindividual world. This study is one attempt to shift us in that direction.

Figure 5
Interindividual reliability and intraindividual reliability with 95% CIs

Supplementary Material

Appendix 1

Acknowledgments

The project described was supported by Award Number R21 AG034284 from the National Institute On Aging, and the funding for the data collection was provided by NIH Grant No. 5R01MH082054. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute On Aging or the National Institutes of Health.

References

  • Allen MJ, Yen WM. Introduction to Measurement Theory. Monterey, CA: Brooks/Cole; 1979.
  • Boker SM, Neale MC, Maes HH, Wilde MJ, Spiegel M, Brick TR, … Fox J. OpenMx: An open source extended structural equation modeling framework. Psychometrika. 2011;76:306–317.
  • Brown W. Some experimental results in the correlation of mental abilities. British Journal of Psychology. 1910;3:296–322.
  • Carlson R. Where is the person in personality research? Psychological Bulletin. 1971;75:203–219.
  • Falconer D. Introduction to Quantitative Genetics. New York: The Ronald Press Company; 1960.
  • Green SB, Yang Y. Reliability of summed item scores using structural equation modeling: An alternative to coefficient alpha. Psychometrika. 2009;74:155–167.
  • Gulliksen H. Theory of Mental Tests. New York: John Wiley and Sons; 1950.
  • Klump K, Burt SA. The Michigan State University Twin Registry (MSUTR): Genetic, environmental, and neurobiological influences on behavior across development. Twin Research and Human Genetics. 2006;9:971–977.
  • Klump K, Keel PK, Racine S, Burt SA, Neal M, Sisk CL, Boker S, Hu Y. The interactive effects of estrogen and progesterone on changes in binge eating across the menstrual cycle. Journal of Abnormal Psychology. 2012;122:131–137.
  • Kraemer HC, Gullion CM, Rush AJ, Frank E, Kupfer DJ. Can state and trait variables be disentangled? A methodological framework for psychiatric disorders. Psychiatry Research. 1993;52:55–69.
  • Lord FM, Novick MR. Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley; 1968.
  • Lumsden J. Person reliability. Applied Psychological Measurement. 1977;1:477–482.
  • Magnusson D. The person approach: Concepts, measurement models, and research strategy. New Directions for Child and Adolescent Development. 2003;101:3–23.
  • Molenaar PCM. A manifesto on psychology as idiographic science: Bringing the person back into scientific psychology, this time forever. Measurement. 2004;2:201–218.
  • Nesselroade JR. Sampling and generalizability: Adult development and aging research issues examined within the general framework of selection. In: Schaie K, Campbell R, Meredith WM, Rawlings S, editors. Methodological Issues in Aging Research. New York: Springer-Verlag; 1988.
  • Petersen K. Ergodic Theory. Cambridge, England: Cambridge University Press; 1983.
  • Raykov T, Shrout PE. Reliability of scales with general structure: Point and interval estimation using a structural equation modeling approach. Structural Equation Modeling. 2002;9:195–212.
  • Raykov T, Marcoulides GA. Scale reliability evaluation with heterogeneous populations. Educational and Psychological Measurement. 2015;75(5):875–892.
  • Rizopoulos D. ltm: An R package for latent variable modeling and item response theory analyses. Journal of Statistical Software. 2006;17:1–25.
  • Samejima F. Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement. 1969;34:100–114.
  • Spearman C. Correlation calculated from faulty data. British Journal of Psychology. 1910;3:271–295.
  • Steyer R, Mayer A, Geiser C, Cole DA. A theory of states and traits-revised. Annual Review of Clinical Psychology. 2015;11:71–98.
  • Traub RE. Reliability for the Social Sciences. Sage Publications; 1994.
  • Wang L, Grimm KJ. Investigating reliabilities of intraindividual variability indicators. Multivariate Behavioral Research. 2012;47:1–31.
  • Watson D, Clark LA, Tellegen A. Development and validation of brief measures of positive and negative affect: The PANAS scales. Journal of Personality and Social Psychology. 1988;54:1063–1070.
  • Zhang Q, Wang L. Aggregating and testing intra-individual correlations: Methods and comparisons. Multivariate Behavioral Research. 2014;49:130–148.
  • Zimmerman DW. Probability spaces, Hilbert spaces, and the axioms of test theory. Psychometrika. 1975;40:395–412.