In the 3rd edition of their classic reference, Psychometric Theory, Nunnally and Bernstein (1994) suggest that the use of informant-based measures is necessarily focused on measuring latent constructs rather than on observable phenomena [14]. They point out that many biological, psychological and social states are, by nature, unobservable and must be inferred (e.g., physical pain, emotional well-being, satisfaction). As a result, investigators rely on observable indications to infer individuals' standing on these unobservable latent constructs. The term 'latent' is used to emphasize that any set of measured observations, no matter how precise and elegant, is only an indirect approximation of an unobservable construct, and that all relevant observations are necessarily one step removed from the construct they are designed to measure.
Due to the indirect and inferential nature of PRO constructs, the validity of a measure is never demonstrated simply by reliable observation; it must also be shown through confirmation of a measure's theoretical relationships to other established constructs or objective criteria. Two founders of psychometric theory, Cronbach and Meehl [15], coined the term 'nomological net' to describe the theory-building effects of construct validation activities, which continually expand a theoretical network of inter-related concepts [16]. For example, evidence for a construct of social function could include demonstration of its association with established measures of both physical impairment and social support. Similarly, the associations observed between patients' responses to different items may provide evidence for the existence of an underlying and organizing construct. Specifying the relationships between items and the latent construct(s) defines a measurement model, and these models are in turn used to help demonstrate the structural validity of new PRO scales.
Thus, two characteristics are hallmarks of well-designed measures: valid observational (i.e., item) content and confirmation of a structural relationship between items and the measurement construct. The first, content validity, is demonstrated by qualitative and quantitative evidence that items assess content patients perceive as relevant to the construct of interest. In turn, structural validation efforts demonstrate how patients' ratings on these items are statistically related to each other so as to estimate the underlying construct. Modeling of the relationships between items and measurement constructs is most clearly depicted using structural equation notation.
Although more complex forms of latent construct models are used in the social science and psychological literatures, to simplify the discussion we present the four basic elements common to all measurement models in Figure . We begin by describing these four families of measurement models and provide a practical example of a PRO scale based on each.
Causes and Effects of Single or Multiple Observations.
Multiple Effect Indicator models
Starting with the top left quadrant of Figure , Model A, the Multiple Effect Indicator (MEI) model, represents what most people commonly refer to as the latent construct or factorial model. In this model, the presumed relationship between the latent construct and the measured items is explicit: the factor loading for each item (λ) represents the extent to which the variance in that item is explained by a common factor, a factor defined by the common variance across the entire set of items. The explicit relationship between the latent construct and the covariance among observed items can be expressed as in equation :
Yi = λi1η1 + εi
Where Yi represents the ith measured item, η1 represents the underlying latent construct, which indexes the statistical intersection of all indicators, λi1 indexes the statistical relationship between the latent construct and the ith measured item, and εi represents the random measurement error in the ith measured item.
The Convenience scale of the Treatment Satisfaction Questionnaire for Medication (TSQM v. 1.4) is an example of a PRO scale that is based on an MEI model [18]. Shown below, three items comprise the Convenience scale and are intended to measure patients' satisfaction or dissatisfaction with the convenience-inconvenience of medication use (see Figure ).
Items comprising the TSQM Convenience Scale.
The defining characteristic of MEI models is that all items measure related aspects of the same satisfaction-dissatisfaction construct (Convenience). As a result, all three of these TSQM items are essentially interchangeable, and item measurements are expected to be highly intercorrelated. Moreover, when computing a scale score, the precision of the construct estimate is increased by averaging out the random measurement error associated with any single item rating, theoretically providing an error-free estimate. Conversely, if any one of the item measurements provided a perfect construct estimate (i.e., contained no measurement error), there would be no additional benefit gained by including any of the remaining scale items.
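The error-averaging logic of the MEI model can be illustrated with a small simulation. This is a sketch only: the loadings and error variances below are hypothetical values, not estimates from the TSQM or any other instrument.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Latent construct (unobserved in practice), e.g. perceived convenience
eta = rng.normal(size=n)

# Three effect indicators: each item reflects the construct plus random error
loadings = np.array([0.8, 0.7, 0.75])       # hypothetical lambdas
errors = rng.normal(size=(n, 3)) * 0.5      # item-specific random error
items = eta[:, None] * loadings + errors    # Yi = lambda_i1 * eta + eps_i

# Items intercorrelate strongly because they share the common factor
print(np.corrcoef(items, rowvar=False).round(2))

# Averaging items cancels random error: the scale score tracks the
# construct better than any single item does
scale_score = items.mean(axis=1)
for i in range(3):
    print(f"item {i+1} vs eta: r = {np.corrcoef(items[:, i], eta)[0, 1]:.2f}")
print(f"scale score vs eta: r = {np.corrcoef(scale_score, eta)[0, 1]:.2f}")
```

With these (assumed) loadings, the averaged scale score correlates more strongly with the simulated construct than any individual item, which is the practical payoff of using multiple interchangeable indicators.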
Single Effect Indicator models
Model A1 depicts a Single Effect Indicator (SEI) model being used to measure a latent construct. As mentioned, the SEI model is a special case of the MEI model in which a single item is assumed to be the best and least error-prone measure of the latent construct; the addition of more items would not contribute significantly to the precision of the construct estimate. Some constructs are better suited than others to assessment with an SEI model and a single item indicator. Pain severity assessment is one such example, where experience tells us that it is difficult to design different measures of pain severity, especially ones that have uncorrelated errors.
The SEI model can be expressed in structural equation terms as in equation :
Y1 = λ11η1 + ε1
Where Y1 is the single symptom or general measure, η1 is the underlying latent construct, λ11 is the factor loading, which is fixed at "1.0", and ε1 is the error term, which is fixed at "0". As the equation implies, the item is assumed to measure the construct precisely. The sole use of a single item, however, provides no way to evaluate the relative impact of measurement error on either the reliability or the precision of a construct estimate; typically, this sort of information is only available from previous validation studies.
Many PRO measures include a generally worded single item indicator instead of, or in addition to, a multiple item scale. An example of a generally worded SEI indicator of current 'Health' is included in the Health Assessment Questionnaire (see Figure ). Whether used knowingly or unknowingly, general wording is one way to reduce the unexplained variance across heterogeneous samples, since ratings of more specifically worded items are influenced to a greater extent by individual differences and situational characteristics than ratings of generally worded content. Generally worded items allow respondents the freedom to interpret the meaning of a question and to provide ratings based on their own unique experiences and life circumstances. This permits estimation of a general construct over very different observational contexts and respondent groups, since much of the conditional variance remains essentially unaddressed or inferred [19]. We will return to this point when discussing the activities associated with item design and content validation using various measurement models in Part II of this commentary.
A general rating scale of 'Current Health' used in the Health Assessment Questionnaire.
Multiple Cause Indicator models
A very different family of models is used to estimate a construct when its various indicators are not expected to be highly correlated, but instead ask about somewhat unique aspects of the construct. Application of a Multiple Cause Indicator (MCI) model is based on the premise that each observation uniquely contributes to the precision of the overall estimate of the latent trait or condition. The most distinctive aspect of this model is that items are not interchangeable or even necessarily similar to one another. This differs from the statistical relationship between items in an MEI model, where the latent construct itself is defined by the covariance of related observations across a group of respondents. The psychometrically astute reader may recognize the distinction between MEI and MCI measurement models as similar to the statistical distinction between factor analytic and regression approaches when using respondent-based measures.
Model B in Figure visually depicts an MCI model using structural equation notation. Compared with MEI and SEI models, the direction of the arrows is reversed and the model lacks an error term for each of the observed indicators. A disturbance term (ς1) on the latent construct indicates the proportion of variance in the latent construct that is not accounted for by the weighted linear combination of the measured items. The MCI model is presented mathematically in Equation :
ξ1 = λ11x1 + λ12x2 + λ13x3 + ... + λ1nxn + ς1
Where: ξ1 is the latent construct, xi represents the measured causal indicators, λ1i is the coefficient weight linking the ith causal indicator to the latent construct, and ς1 is the disturbance in the latent construct described above.
Since the construct is not identified internally by the (error free) statistical intersection of observed ratings, as it is in the MEI model, the importance and weight associated with each indicator must be established against a criterion variable used as a proxy for the latent construct. Such proxy criteria are typically considered 'gold standards' in the field and may include diagnostic clinical interviews, laboratory classification, or very well established self-report measures of the construct of interest.
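Under these assumptions, estimating the MCI weights amounts to regressing the proxy criterion on the causal indicators. The sketch below simulates this with hypothetical weights and an arbitrary disturbance variance; it is illustrative only, not a description of any published instrument.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Three causal indicators (e.g., difficulty with distinct activities);
# unlike MEI items, they need not be correlated with one another
x = rng.normal(size=(n, 3))

# Proxy criterion standing in for the latent construct (e.g., a
# clinician-rated score); the weights below are hypothetical
true_weights = np.array([0.6, 0.3, 0.5])
proxy = x @ true_weights + rng.normal(size=n) * 0.4   # + disturbance

# Estimate the lambda weights by ordinary least squares against the proxy
X = np.column_stack([np.ones(n), x])   # add an intercept column
coefs, *_ = np.linalg.lstsq(X, proxy, rcond=None)
print("estimated weights:", coefs[1:].round(2))
```

The recovered coefficients approximate the weights used to generate the proxy, which is the regression logic the MCI model formalizes.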
The Disability Index of the Health Assessment Questionnaire [20] has the look of a scale based on an MCI model (see Table ). First, the items are clearly distinct and are not interchangeable indicators of the underlying construct. For example, being unable to tie one's shoe is not the same as being unable to stand up; performing these activities relies on different sets of motor skills associated with differences in dexterity, balance and strength. Second, given the differences between items, the score estimate of total disability is based on a cumulative index of the number of different types of physical skills that a patient has difficulty performing. Thus the combination of individual MCI items is thought to assess the summative effect of unique aspects of disability rather than the true score for a specific type of skill deficit.
A Multiple Cause Indicator Measurement Model in the HAQ Disability Index
Symptom checklists and symptom severity measures are other examples of PROs that are often based on an MCI measurement model. Like the items used in the Disability Index, symptom severity items are often discrete, owing to differences across individuals in both the symptomatic expression of illness and the relative impact of particular symptoms on overall ratings of symptom severity. As a result, one would expect such ratings to be more weakly correlated, to exhibit more statistical independence, and to have more skewed distributions (e.g., floor effects) than the interchangeable items used within an MEI scale. Moreover, the differential impact of certain types of symptoms on overall symptom severity may suggest that the item ratings need to be 'impact' weighted in order to provide the best estimate of the measurement construct.
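A toy example of impact weighting is sketched below. The ratings and weights are invented for illustration and are not drawn from any published symptom measure; note the many zeros (floor effects) typical of symptom items.

```python
import numpy as np

# Hypothetical ratings (0 = absent .. 4 = severe) for four symptoms from
# five respondents; floor effects appear as rows full of zeros
ratings = np.array([
    [0, 3, 0, 1],
    [2, 0, 0, 0],
    [0, 4, 1, 0],
    [0, 0, 0, 2],
    [1, 2, 0, 0],
])

# Impact weights, e.g. as might be derived by regression against a proxy
# criterion; the values here are illustrative only
impact_weights = np.array([1.0, 1.5, 0.5, 1.2])

# Weighted cumulative index: each symptom contributes its own unique share
severity_index = ratings @ impact_weights
print(severity_index)
```

Each respondent's index is a weighted sum of distinct symptoms rather than an average of interchangeable ratings, which is the defining contrast with an MEI scale score.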
Single Cause Indicator models
Model B1 in Figure illustrates the Single Cause Indicator (SCI) model, a less common variant of the MCI model in which a single indicator is thought to be the primary cause of a latent construct. The addition of other items assessing different causal determinants should not dramatically improve the predictive power of a measure appropriately based on an SCI model.
Equation  describes the SCI model in structural equation terms:
ξ1 = λ11x1 + ς1 
Where ξ1 is the latent construct, x1 is the single causal indicator, λ11 is the coefficient connecting the single indicator to the latent construct, and ς1 is the variance not explained in the latent construct. As in the MCI model, the strength of the causal relationship between the item and the construct is defined using a proxy measure of the latent construct. The weight estimate of the indicator is essentially the amount of variance in the latent construct that the indicator explains.
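The explained-variance point can be sketched numerically. The 0.7 slope and the disturbance variance below are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# Single causal indicator and a proxy criterion for the latent construct
x1 = rng.normal(size=n)
proxy = 0.7 * x1 + rng.normal(size=n) * 0.5   # hypothetical relationship

# The indicator's weight (regression slope) and the share of construct
# variance it explains (squared correlation)
slope = np.cov(x1, proxy)[0, 1] / np.var(x1, ddof=1)
r2 = np.corrcoef(x1, proxy)[0, 1] ** 2
print(f"lambda ~= {slope:.2f}, variance explained ~= {r2:.2f}")
```

The estimated slope recovers the assumed coefficient, and the squared correlation quantifies how much of the proxy's variance the single indicator accounts for.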
A general SCI summary rating is sometimes used as a substitute for a longer MCI measure. An example is a summary judgment scale used in the Health Assessment Questionnaire, the Health Status Visual Analog Scale (see Figure ). This single item asks respondents to consider 'all the ways' the disease affects them, and the rating presumably reflects the mental combination of a number of different (perceived) arthritic causes of their overall health status. As discussed earlier, details about the causes that affect respondents' overall rating are unspecified and allowed to differ across individuals. As a result, such items tend to provide more normally distributed scores than content-specific MCI items, which may be relevant to only a proportion of respondents.
A 'Health Status' visual analog scale used in the Health Assessment Questionnaire.
Multiple Mixed Indicator models
One final family of measurement models is what Bollen and Lennox [21] refer to as the Multiple Mixed Indicator (MMI) model. As the name implies, such models contain a combination of items measuring the causes and effects of two or more latent constructs and are diagrammatically represented by combining two or more of the four basic measurement models. In some cases, these latent structures are hierarchical, for example, when a general construct is thought to be the effect of a series of more specifically defined latent constructs and content-specific items (see Figure ) [22]. As will be discussed further in Part II, MMI models can be used to provide support for the structural validity of MCI, SCI or SEI measures, which require a criterion measure to estimate the latent construct.
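A hierarchical structure of this kind can be sketched by simulation. Everything below is hypothetical: two specific constructs, each measured by its own effect indicators, feed a general construct as causes.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3000

# Two specific latent constructs, each with its own effect indicators
f1 = rng.normal(size=n)   # e.g., physical function (assumed construct)
f2 = rng.normal(size=n)   # e.g., emotional well-being (assumed construct)
items_f1 = f1[:, None] * 0.8 + rng.normal(size=(n, 2)) * 0.5
items_f2 = f2[:, None] * 0.8 + rng.normal(size=(n, 2)) * 0.5

# General construct modeled as an effect of the two specific constructs
general = 0.6 * f1 + 0.5 * f2 + rng.normal(size=n) * 0.4

# Within-construct items correlate strongly; across constructs they do not,
# yet both item sets relate to the general construct
r_within = np.corrcoef(items_f1[:, 0], items_f1[:, 1])[0, 1]
r_across = np.corrcoef(items_f1[:, 0], items_f2[:, 0])[0, 1]
r_gen = np.corrcoef(items_f1.mean(axis=1), general)[0, 1]
print(f"within r = {r_within:.2f}, across r = {r_across:.2f}, "
      f"scale vs general r = {r_gen:.2f}")
```

The pattern, strong within-construct correlations, near-zero across-construct correlations, and a moderate link from each specific scale to the general construct, is the signature an MMI structure leaves in the data.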
Unfortunately, explicit MMI modeling does not often occur by design but usually arises when scale constructors fail to distinguish between the different types of relationships between observed measures and latent constructs. Bollen and Lennox [21] used the Center for Epidemiologic Studies Depression Scale (CES-D) as such an example, pointing out that the item measuring feelings of sadness or of being depressed appears to be an effect of depression, whereas items assessing loneliness may be caused by depression and items measuring perceptions of attractiveness may be reciprocally related to depression. Such problems are very common, and the astute reader may be able to identify a minor area of model mis-specification in Figure .