In assessing criminality, researchers have used counts of crimes, arrests, etc., because interval measures were not available. Additionally, crime seriousness varies depending on demographic factors. This study examined the Crime and Violence Scale (CVS) in terms of (1) psychometric quality, using item response theory (IRT), and (2) the invariance of the crime seriousness hierarchy across gender, age, and racial/ethnic groups in 7,435 respondents. The CVS is a useful measure of criminality, though some items could be improved or dropped. Differential item functioning analysis revealed that crime seriousness varies by age and gender. IRT shows promise in assessing and adjusting for demographic variations in crime seriousness.
Although it may seem obvious that there is a hierarchy of seriousness in the construct of crime and violence, most measures consist of counts, such as the number of reported crimes, arrests, and convictions because hierarchical, linear, interval measures have not been available. Many authors have noted that merely counting the number of crimes committed gives a faulty estimate of criminality because this method weights all crimes equally, e.g., larceny treated as equal in seriousness to murder (Anderson and Newman 1998; Kwan et al. 2000; Wilkins 1980). This problem is further compounded in counts of the numbers of crimes or arrests since less serious crimes (e.g. shoplifting, prostitution) are often much more common than more serious crimes (e.g., assault, murder).
However, hierarchical linear measures are beginning to be developed. For example, Piquero et al. (2002) recently used item response theory (IRT) measurement, which provides an empirically derived hierarchy, to analyze one of the most commonly used delinquency measures, i.e., the Self-Reported Delinquency Scale (Elliott et al. 1985). The Cronbach’s alpha for the original 24-item scale was .76, whereas the reliability of the 9-item version that fit the IRT/Rasch model was .58. Piquero et al. (2002) noted that more remains to be done in developing self-reported delinquency measures. In the 2003 workshop on “Measurement Problems in Criminal Justice Research,” one of the key issues of concern was “improving the reliability and validity of self-report surveys, rather than simply assessing these characteristics” (Pepper and Petrie 2003, p. 8). Thornberry and Krohn (2003) argued for the need to better understand self-administration, e.g., response errors across self-report and administrative surveys. They also called for the development of instruments to better measure serious offenses, since early self-report scales tended to ignore serious criminal and delinquent events and concentrated almost exclusively on minor forms of delinquency. Perhaps the major concern in assessing crime seriousness is its variability depending on demographic factors, since there is consensus that it is socially defined (Sellin and Wolfgang 1964; Rossi, Waite, Bose, and Berk 1974; Rossi, Simpson, and Miller 1985).
The purpose of this study was to examine the Crime and Violence Scale (CVS) in terms of: (1) psychometric quality using the IRT/Rasch measurement model, which provides a hierarchical linear measure (Bond and Fox 2007; Rasch 1960), as the standard; and (2) the invariance of the crime seriousness hierarchy for demographic factors, specifically: males vs. females, youth vs. adults, and racial/ethnic groups. If the items of the CVS were not invariant, then changes to the measures, e.g., dropping items and developing new ones, would be considered. The CVS is part of the Global Appraisal of Individual Needs (GAIN) long and short forms (Dennis et al. 2006) and, by the end of 2008, was already in use by over 850 agencies in 47 states. Thus, understanding its properties, and potentially improving it, could have a substantial impact on the field.
One of the earliest attempts to create a linear, interval psychological measure was by Louis Thurstone on the construct of crime (Thurstone 1927). Using his method of paired comparisons, Thurstone administered a list of 19 crimes to 266 students at the University of Chicago. Respondents ranked the crimes in seriousness by comparing all possible pairs (n=171 pairs). Thurstone’s scaling method produced a linear, interval hierarchy ranging from the least serious, such as vagrancy, to more serious, such as larceny and assault-battery, to even more serious, such as arson and kidnapping, to the most serious, such as rape and homicide. Although the method was never widely used because of its heavy burden on respondents, Thurstone’s crime study was recently replicated twice with remarkably similar results (Kwan et al. 2002; Stone 2000).
Thurstone scaling provides a linear interval ruler of the crimes themselves (Kwan et al. 2002), but not of the persons, i.e., no mathematical relationship between the crimes and persons is possible. In other words, Thurstone scaling is less than ideal since it provides a quantitative estimate of the relative seriousness of the crimes, but not the relative amount of criminality of the persons who committed them.
While replications of Thurstone’s work have yielded quite similar results, cultural differences in ratings of the seriousness of particular crimes have been found. An example of this cultural sensitivity is the fact that prostitution is not a crime in some cultures whereas in others it is one of the most serious crimes. Rossi, Waite, Bose, and Berk (1974) used correlation and regression methods to analyze differences in perceived crime seriousness by age, educational attainment, gender and race. They concluded that while norms concerning crime seriousness were widely diffused throughout subgroups of society, educational attainment was the best predictor of agreement vs. disagreement with common norms.
As another example, Kwan et al. (2000) found a large difference in the perceived high seriousness of “drug offense” by Hong Kong residents compared to the low severity of the legal penalty, which was based in British law. Using Thurstone scaling, these authors found that women rated violent crimes against persons, such as rape and assault, as more serious than did men. They also found large differences in ratings based on age and educational attainment. They concluded that “crime seriousness is an evaluation mediated by the social structural context in which it is embedded” (p. 630).
While methods like Thurstone’s, sometimes referred to as normative rankings (Sellin and Wolfgang 1964), are most common, Cohen (1988) reported on an alternative method of ranking the seriousness of crime where actual victim injury rates were combined with jury awards in personal injury accident cases to estimate pain, suffering, and fear. Crime-related death rates were combined with estimates of the value of life to arrive at monetary values for the risk of death. These estimates were combined with out-of-pocket costs (such as medical costs and lost wages) to arrive at total dollar estimates of the cost of individual crimes to victims. These dollar estimates were then used to rank the seriousness of crimes. Such econometric methods are useful in understanding seriousness of crimes, but their psychometric application in assessing individuals’ criminality has not been demonstrated.
Some ways to measure criminality are to count the number of crimes in public records, or count arrests, or the amount of time spent in jail. Of course, there are problems with these methods. Most violent and criminal behaviors are not prosecuted or adjudicated so official records are lacking in assessing this construct. The number of crimes committed does not take into account the seriousness of those crimes. The arrest record does not take into account how well one is able to avoid arrest. Time spent in jail does not take the ability to avoid jail time into account. All of these may vary depending on local and regional statutes, sociodemographic characteristics, and so on. See Pepper and Petrie, eds. (2003) for a thorough discussion of problems measuring crime and violence.
This leads unavoidably to estimating crime and violence through self-report. Philosophically, this is consistent with the notion that crime seriousness is not an objective attribute but is subjectively perceived among citizens (Black 1979). In fact, as Piquero et al. (2002) noted, so much youth crime escapes official detection that self-report delinquency scales have formed the basis of much of our understanding of delinquency today.
Of course, crime is a sensitive issue, and some may be reluctant to report some behaviors out of fear of self-incrimination. Rather than directly ask about crimes, the principal method of assessing adults’ criminality has been to measure psychological and other characteristics of persons that predict crime. For example, the Psychological Inventory of Criminal Thinking Styles (PICTS) is designed to assess eight thinking styles hypothesized to support and maintain a criminal lifestyle (Walters 2002), but it is not a measure of criminality itself. Another personality assessment, the Hare Psychopathy Checklist (Hare 2003), obtains professional ratings, e.g., from a psychologist or social worker, on two factors. Factor 1 is labeled “selfish, callous and remorseless use of others.” Factor 2 is labeled as “chronically unstable, antisocial and socially deviant lifestyle.” Walters et al. (1991) developed the Lifestyle Criminality Screening Form (LCSF), a 14-item screening instrument designed to identify lifestyle criminality that is divided into four primary sections (irresponsibility, self-indulgence, interpersonal intrusiveness, and social rule breaking). It does not assess types of crimes, but asks for their number, thereby treating all crimes as equal in weight.
Item response theory (IRT) measurement models enable the placement of individuals on the ruler in relation to the crimes (Piquero et al. 2002). The relationship between the person’s ability and an item’s difficulty is estimated mathematically using probability estimators (Rasch 1960; Wright and Stone 1979). This is called the Rasch or 1-parameter IRT model. Where applicable, additional parameters can be added to measure item slope (2-parameter IRT model) and guessing (3-parameter IRT model). Unlike Thurstone scaling, which only scales crimes, the IRT/Rasch method places both persons and crimes on the same ruler, and it is the only method that provides a linear, interval ruler like those used in the physical sciences (Embretson and Reise 2000; Wright and Stone 1979). These methods can also focus on the assessment of differential item functioning (DIF) on the linear measure between subgroups of individuals that may be the result of real differences in prevalence, cultural perception or measurement bias (Conrad et al. 2007). Therefore, this method employs a similar key assumption to that proposed by Ramchand et al. (2009): if for some group an offense, A, is less severe than another, B, then members of that group will be more likely to engage in A before they engage in B, rather than the reverse. While the Ramchand et al. (2009) method examines temporal precedence, the IRT/Rasch method uses self-reported prevalence to estimate likelihood/probability.
Frequency or endorsement probability works very well to define difficulty in education. Likewise, after seeing hundreds of articles in health and related areas using the Rasch model (Conrad & Smith, 2004), we can say that it also works well to define illness severity. Our proposition was that it would work well to define crime seriousness. The logic is that things that are more valued, e.g., human life, are more protected and more punished when taken or violated. Therefore, crimes form a hierarchy of difficulty based on estimated risk. Less risky crimes are done more often and more risky crimes less often. Therefore, crime seriousness is determined by the value of the object and subsequently the risk in taking it. For example, if you asked thieves (stealing being frequent) why they refrain from killing people (murder being more rare), we think they would say that it is because murder is a much more serious crime (though they may use other words that connote seriousness, e.g., capital crime, life in prison, etc.).
We recognize that this relationship is not simple. For example, Rossi, Simpson, and Miller (1985) pointed out that crime seriousness is so fine-grained in its complexity as to make endorsement probability seem “improbable” as a basis for assessing seriousness. For example, how can you take “cold-bloodedness” into account? We also noted earlier that there are clearly cultural, educational, gender, age, etc. differences that may challenge unidimensionality. While the Rasch model may not address all such issues, our analysis illustrates the capability of examining some such differences using differential item functioning (DIF) analysis.
While it is clear that females commit fewer crimes than males and that the crimes that females tend to commit are less violent and less serious, the issue in this study was not which gender commits more crimes or more serious crimes, but rather how we can measure someone’s criminality without bias. Bias may occur when there is differential item functioning (DIF). DIF refers to the observation that persons in different demographic groups score differently on an item even though they are at the same level on the underlying trait. DIF can produce “bias” if not corrected. Regarding such differential crime (item) seriousness per group, the study of sociodemographic variables such as age, gender, and race, using linear interval measurement as provided by the Rasch IRT model (Embretson and Reise 2000), could be informative of the sizes of group differences as measured on the Rasch ruler. These estimates could be used to adjust measures where they were perceived to be unfair, e.g., male standards unfairly applied to crimes such as rape where females are overwhelmingly the victims, or prostitution and commercialized vice that are most prevalent in females. Therefore, unlike other methods, IRT/Rasch measurement provides the capability to estimate the DIF and adjust for it so that measures may be more equally and fairly weighted across demographic groups using endorsement probabilities for persons and items rather than subjective opinions.
Employing the Self-Reported Delinquency Scale, Piquero et al. (2002) reported that gender DIF tended to follow gender role expectations, with males more likely to endorse theft, carrying a concealed weapon, hitting other students, sexual assault, and breaking and entering. Females were more likely to endorse running away from home, hitting a parent, and being “loud, rude or unruly in a public place.” These findings were consistent with Kwan et al. (2000) discussed above.
To give an idea of raw prevalence by gender, U.S. population statistics indicated that, in 2005, females were most highly represented in the following offenses: prostitution and commercialized vice (74% females), runaways (58%), embezzlement (44%), and larceny-theft (42%), none of which are violent crimes (OJJDP Statistical Briefing Book 2007). Females were least represented in: forcible rape (only 2% females), gambling (2%), robbery (9%), and murder/manslaughter (10%), three of which are violent crimes (OJJDP Statistical Briefing Book, 2007).
Again, Kwan et al. (2000) found what they interpreted as substantial differences in crime seriousness among the age groups 18–30 and 46+ in Hong Kong. For example, 37% of the older group thought that “possession of arms” was even more serious than “rape,” while only 5% of the younger group thought so. Many of the authors’ interpretations of their findings had to do with local history that was experienced by the older group but not by the younger group. Piquero et al. (2002) found among 11 to 17 year olds that DIF concerned the younger, i.e., 11 and 12 year olds being more likely to endorse violent crimes, e.g., gangs and strong arming, and sexual offenses while the older adolescents were more likely to endorse crimes involving property.
Institutional anomie theory would expect violence to be higher in areas facing greater poverty and change (Kim and Pridemore 2005; Messner and Rosenfeld 1997). It posits that lower-class African-American youth, especially males, are at most risk of selecting violent and criminal responses (Tatum 2000). Piquero et al. (2002) found racial/ethnic differences among whites, blacks, and Hispanics where stealing something <$5, hitting a parent, disorderly conduct, selling hard drugs, prostitution and strong arming teachers had large significant DIF coefficients (p<.001).
In summary, we might expect greater differences by gender and age than by race. In their study of the effects of race, gender, age, and social status on crime seriousness, Rossi, Simpson, and Miller (1985) concluded: “Perhaps the most outstanding feature of these findings concerning social characteristics of offenders was how slight were their effects. The largest and most consistent effect was that of gender” (p. 77).
The data comprised 7,435 cases from 77 studies involving persons being screened for substance abuse in three dozen locations around the United States that used the GAIN (described below). Over two thirds of these studies were conducted by independent investigators. They were funded by a wide range of organizations (e.g., the Center for Substance Abuse Treatment, National Institute on Alcohol Abuse and Alcoholism, National Institute on Drug Abuse, Robert Wood Johnson Foundation and Interventions Foundation) and conducted in a variety of institutional settings in screening for potential substance abuse treatment, including across adolescent and adult levels of care, student assistance programs, criminal and juvenile justice agencies, mental health agencies, and child protective service and family service agencies. All data were collected as part of general clinical practice or specific research studies under their respective voluntary consent procedures and were subsequently de-identified.
Research studies were conducted under the supervision of Chestnut’s Institutional Review Boards with general consents under federal guidelines (42 CFR Part 2) that explicitly allow record abstraction for program evaluation and development as long as the data are de-identified and kept confidential. Data obtained since the implementation of the Health Insurance Portability and Accountability Act of 1996 (HIPAA) were covered by formal data sharing agreements between Chestnut Health Systems and each of the participating agencies. All interviews were conducted by interviewers with three to four days of training followed by rigorous field-based certification procedures. Full details about the GAIN in general and a working paper on the CVS specifically may be obtained at the following: http://www.chestnut.org/LI/gain/
The data for this analysis came from 7,435 respondents who completed the CVS. As shown in Table 1, the sample was predominantly under 18 years of age (73%) and male (67%). Almost half were Caucasian (45%), a quarter were African American (26%), and the remainder Hispanic or mixed race. Of the top five primary drugs reported, marijuana was reported by 36% of the sample. The drug least often reported was opioids at 5%. Other drugs reported included amphetamines (8%), cocaine (10%), and alcohol (17%). Twenty-three percent of the sample reported other drugs. Valid measures were obtained on 7,424 persons, 99.9%.
The Crime and Violence Scale (CVS) is a count of increasingly violent strategies used for resolving interpersonal conflict in the past year and the types of drug-related, property, and interpersonal crimes the respondent has committed. It includes serious crimes such as homicide and rape. It is based on the Conflict Tactic Scale introduced in the Family Violence Survey (Strauss 1990) and lay versions of the Federal Bureau of Investigation’s (1993) uniform crime report categories introduced in the 1995 National Household Survey on Drug Abuse (Office of Applied Studies 1996) and predicts future crime and violence (White et al. 2003; White 2005).
The CVS consists of four conceptually distinct subscales with a total of 31 dichotomous items (Table 2). Its subscales are the 12-item General Conflict Tactic Scale (GCTS), the 7-item Property Crime Scale (PCS), the 7-item Interpersonal Crime Scale (ICS), and the 5-item Drug Crime Scale (DCS). The item stem for the GCTS reads: “During the past 12 months, have you done the following things?” The response format is Yes/No (coded: no=0, yes=1). The item stem for the other scales reads: “During the past 12 months, how many times have you…” While the response set is in “times”, it is dichotomized as 0 for none and 1 for one or more times for this scale. This analysis focused on the CVS taken as a single 31-item measure of the construct of crime and violence (a.k.a., criminality); so scored, it measures the breadth of the types of violence and crime a person has engaged in during the past year.
As noted earlier, the CVS content was derived from the Conflict Tactic Scale and the FBI Uniform Crime Report categories. To examine how this compared to the content of another commonly used self-report measure representing a similar construct, a table was composed consisting of the 31 CVS items along with the 24 items of the Self-Reported Delinquency Scale or SRDS (Elliott, Ageton, and Huizinga 1985; Huizinga and Elliott 1986). The SRDS was designed to include items representative of the full range of acts for which juveniles could be arrested and consisted of 24 items with a nine-point Likert-type response scale ranging in frequency from 1 = never, 2 = once or twice a year, and so on, to 9 = 2–3 times a week. While we recognize that delinquency is somewhat different from criminality, the literature has noted that self-report measures have lacked items concerning the more serious crimes (Thornberry and Krohn 2003) and, as noted above, they focused on personality indicators, arrest statistics, etc. for adults instead.
There was no way for us to equate or compare the CVS and the SRDS items mathematically in terms of their calibrations or measures because we had no sample that used both measures. Instead, we performed a content analysis of both scales to get some idea of how well the measures covered the spectrum of crimes for which adolescents and adults can be arrested. Though the measures were not directly comparable psychometrically since there were no common persons, the qualitative comparison may be enlightening in gaining perspective on their narrative content. We calculated the CVS measures for adolescents (<18) alone, as well as the measures for the full sample including adults. The adolescent calibrations were included to enable examination of age comparability to the Piquero et al. (2002) study, which only included adolescents under 18 (n = 1,719) using data from the first wave of the National Youth Survey, a longitudinal study of delinquent behavior among American youth (Huizinga and Elliott 1986). As a concurrent validity indicator, we included cost of crime data on 14 of the CVS indicators (McCollister, French, and Fang, in review).
In Rasch analysis the item hierarchy that is created by the item difficulty estimates provides an indication of construct validity (Smith 2001). The items should form a ladder of low seriousness symptoms on the bottom to high seriousness symptoms on the top. The SRDS and cost hierarchies were referenced to examine how well the Rasch-generated CVS hierarchy conformed, i.e., testing the validity of using frequency as a criterion for seriousness. Finally, we used this table to provide an item by item annotated summary of the results of all analyses conducted in this study.
As described by Embretson and Reise (2000) and others, the Rasch model (Rasch 1960) is the only IRT model that provides linear, interval measurement (Bond and Fox 2007; Wright and Stone 1979). This informed our choice of the Rasch model since the intervals involved in scaling crimes and persons were a major consideration in depicting crime seriousness.
The CVS was analyzed with a Rasch dichotomous model (Rasch 1960; Wright and Stone 1979) with Winsteps statistical software (Linacre 2007). The dichotomous model estimates the probability that a respondent will choose a particular response category for an item as:

ln[Pni / (1 − Pni)] = Bn − Di

where ln is the natural logarithm, Pni is the probability of respondent n endorsing item i, 1 − Pni is the probability of respondent n not endorsing item i, Bn is the person measure of respondent n, and Di is the difficulty of item i. These endorsement probabilities are concatenated over all person responses to all items to place persons and items on the Rasch ruler.
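To make the model concrete, the endorsement probability it implies can be computed directly. This is an illustrative sketch only (the study itself used Winsteps); the function and variable names are ours:

```python
import math

def endorsement_probability(person_measure, item_difficulty):
    """Rasch dichotomous model: ln(P / (1 - P)) = B - D,
    so P = 1 / (1 + exp(-(B - D)))."""
    return 1.0 / (1.0 + math.exp(-(person_measure - item_difficulty)))

# A person whose measure equals the item's difficulty has a .5 probability
# of endorsing it; more serious (higher-difficulty) items are endorsed
# with lower probability.
p_equal = endorsement_probability(0.0, 0.0)  # 0.5
p_hard = endorsement_probability(0.0, 2.0)   # about 0.12
```

Note that only the difference B − D matters, which is what places persons and items on the same logit ruler.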
The Rasch model requires unidimensionality; that is, all items must be indicators of a single latent variable, in this case, criminality. While other small dimensions may be present, they must not be substantial enough to distort the measure of the construct of interest. The criteria for unidimensionality that we used were as follows. First, we required a high percentage of variance explained by the principal measurement dimension, e.g., > 40%. Second, we required a low percentage of residual variance explained by subsequent factors, e.g., < 15% explained by the first factor of residuals. As a comparison Reckase (1979) used 20% as substantial variance, so that our criteria were more conservative, i.e., requiring more variance explained by the measurement dimension and less of the rival factor. To further ensure unidimensionality, there must be good fit (described below) of items to the model, i.e., less than 1.33 on both infit and outfit statistics for each item (Linacre 1998; Smith 2002).
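The residual-variance criterion above can be illustrated in miniature. The sketch below (our own numpy illustration, not the Winsteps procedure) computes the share of residual variance falling on the first principal component of a matrix of standardized residuals; for pure noise, with no secondary dimension, this share is small and falls well under the 15% criterion:

```python
import numpy as np

def first_contrast_share(std_residuals):
    """Share of residual variance captured by the first principal
    component of the standardized residuals (rows = persons,
    columns = items)."""
    corr = np.corrcoef(std_residuals, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)  # ascending order
    return eigvals[-1] / eigvals.sum()

# Pure-noise "residuals" for 31 items and 500 persons: no rival dimension,
# so the first contrast explains only a small share of residual variance.
rng = np.random.default_rng(0)
share = first_contrast_share(rng.standard_normal((500, 31)))
```

A genuine second dimension would concentrate residual variance on the first contrast and push this share upward.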
Person reliability, also known as internal consistency, is typically estimated with Cronbach’s alpha. We obtained alpha, as well as Rasch person reliability, where our criterion for acceptable quality was .80. The estimate is based on the same concept as Cronbach’s alpha. That is, it is the fraction of observed response variance that is reproducible:

Person reliability = (total person variance − error variance) / total person variance

The denominator represents the total person variability. The numerator represents the reproducible part of this variability, i.e., the amount of variance that can be reproduced by the Rasch model, called the adjusted person variability, which is obtained by subtracting the error variance from the total variance. This reproducible part is then divided by the total person variability to obtain a reliability estimate for persons with values ranging between 0 and 1 (Wright and Masters 1982).
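The person reliability estimate described above can be sketched as follows; this is our own illustration with hypothetical person measures and standard errors, not the Winsteps computation:

```python
import statistics

def rasch_person_reliability(measures, standard_errors):
    """(total person variance - mean error variance) / total person variance."""
    total_var = statistics.pvariance(measures)
    error_var = sum(se ** 2 for se in standard_errors) / len(standard_errors)
    adjusted_var = max(total_var - error_var, 0.0)  # "adjusted person variability"
    return adjusted_var / total_var

# Hypothetical person measures (in logits) and their standard errors:
rel = rasch_person_reliability([-2.0, -1.0, 0.0, 1.0, 2.0], [0.5 ** 0.5] * 5)  # 0.75
```

Because the error variance is subtracted from the numerator, this estimate cannot exceed alpha when extreme scores are treated as perfectly measured.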
Person reliability is more conservative than alpha since it either estimates extreme scores as having high error or deletes the extreme scores from the estimate. Alpha is typically higher since it estimates extreme scores as perfectly measured. This can be particularly misleading when there are large subgroups with 0 on all measures (i.e., a floor effect) as seen in most measures of criminality.
Rasch analysis provides fit statistics to test assumptions of fundamental measurement (Wright and Stone 1979). “Fitting the model” simply means meeting basic assumptions of measurement, e.g., high scorers should endorse or get right almost all of the easy items. Understanding poor fit can lead to dropping items, improving them, or (when differences are real population differences), taking subgroup norms into account through calibration.
The fit of the data to the model is evaluated by fit statistics that are calculated for both persons and items. The Rasch model provides two indicators of misfit: infit and outfit. Calculation of Rasch fit statistics begins with the response residuals (yni), which estimate how far the actual response (xni) deviates from the Rasch model expectation (Eni), i.e., yni = xni − Eni.
To standardize the residuals, we divide them by the model standard deviation, by analogy with a z-score, which is a score minus its mean divided by its standard deviation, yielding a distribution with mean 0 and standard deviation 1 (for a complete explanation of this abbreviated discussion with examples, please see Wright and Stone 1979; Wright and Masters 1982). That is, zni = yni / √Wni, where Wni is the model variance of the response. If we sum the squared standardized residuals for an item and divide by N, we get the outfit statistic, i.e., the mean square (MNSQ) standardized response residual with an expected value of 1.0. Outfit statistics are sensitive to unexpected responses farther from the person’s measure:

Outfit i = (1/N) Σn zni²
For the infit statistic, the squared standardized response residuals are weighted by the individual variances (Wni) to lessen the impact of unexpected responses far from the person’s measure:

Infit i = Σn Wni zni² / Σn Wni
The infit statistic is sensitive to unexpected or excessively random behavior in responses to items near the person’s ability level or the item’s difficulty level, whereas the outfit statistic is outlier sensitive. Mean square fit statistics are defined such that the model-specified uniform value of randomness is 1.0 (Wright and Stone 1979). Person fit indicates the extent to which the person’s performance is consistent with the way the items are used by the other respondents. Items with high infit mean squares show a confused or random pattern that is more serious than outfit and reflects that these items are poor indicators of the construct (Bond and Fox 2007). Items with high outfit mean squares are items with more unexpected responses than are consistent with the model at the tails (i.e., endorsing high severity items but not low severity items). Using Wilson’s (2005) criteria for both infit and outfit, an item in this study was regarded as problematic if its misfit was higher than 1.33 mean square.
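For a dichotomous item, the infit and outfit computations described above can be sketched as follows. This is our own illustration of the standard formulas (the function name and the tiny two-person example are ours, not the study's software):

```python
import math

def dichotomous_fit(responses, person_measures, item_difficulty):
    """Infit and outfit mean squares for one item under the
    dichotomous Rasch model."""
    z_squares, weighted_sum, weight_sum = [], 0.0, 0.0
    for x, b in zip(responses, person_measures):
        e = 1.0 / (1.0 + math.exp(-(b - item_difficulty)))  # expectation E_ni
        w = e * (1.0 - e)                                   # model variance W_ni
        z2 = (x - e) ** 2 / w                               # squared standardized residual
        z_squares.append(z2)
        weighted_sum += w * z2                              # equals (x - e)**2
        weight_sum += w
    outfit = sum(z_squares) / len(z_squares)  # unweighted mean square
    infit = weighted_sum / weight_sum         # information-weighted mean square
    return infit, outfit

# Expected (Guttman-consistent) responses fit well, staying under 1.33 ...
infit_ok, outfit_ok = dichotomous_fit([1, 0], [1.0, -1.0], 0.0)
# ... while reversed, unexpected responses misfit.
infit_bad, outfit_bad = dichotomous_fit([0, 1], [1.0, -1.0], 0.0)
```

Because outfit is an unweighted average of squared standardized residuals, a few highly improbable responses inflate it; infit down-weights those same responses by their low model variance.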
As Bond and Fox (2007) noted, the Rasch model requires that relative item estimates, i.e., item difficulty estimates, remain invariant across subgroups of persons, e.g., females and males. A DIF contrast is simply the estimate of the difference between groups among group members who are at the same level on the construct. It allows us to examine whether items have significantly different meanings for different groups. Bond and Fox suggest that items that show DIF should be investigated to determine what may be inferred about the underlying construct and what that implies about the subsamples of persons detected. In other words, it is an important validity criterion concerning the appropriateness of items and their calibrations. A significant DIF contrast was based on ≥ .9 logit difference for all comparisons which is approximately half a standard deviation (SD = 1.87) for the items. Half a standard deviation is a common criterion for clinical significance (Conrad et al. 2007; Norman et al. 2003).
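A DIF contrast of this kind can be sketched as a simple comparison of group-specific item calibrations. The function below and the example values are entirely hypothetical, for illustration of the ≥ .9 logit criterion only:

```python
def dif_contrasts(calibration_a, calibration_b, threshold=0.9):
    """Flag items whose difficulty estimates differ between two
    group-specific calibrations by at least `threshold` logits."""
    return {
        item: calibration_a[item] - calibration_b[item]
        for item in calibration_a
        if abs(calibration_a[item] - calibration_b[item]) >= threshold
    }

# Hypothetical item difficulties (logits) from separate male and female runs:
males = {"TradedSex": 2.9, "Theft": 0.1, "Arson": 2.2}
females = {"TradedSex": 1.6, "Theft": 0.3, "Arson": 2.3}
flagged = dif_contrasts(males, females)  # flags only TradedSex (1.3 logits)
```

In practice the contrast is taken between persons at the same level on the construct, which separate within-group calibrations approximate.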
For a complete treatment of Rasch analysis, we recommend Bond and Fox (2007) which includes a glossary of Rasch measurement terminology and Conrad and Smith (2004) for a brief summary with useful references. Terminology may also be accessed online via Rasch Measurement Transactions located at http://www.rasch.org/rmt/.
The items of both the CVS and the SRDS are displayed in Table 3 in descending order of seriousness with their respective Rasch measures, annotated evaluation comments, and cost where available. The CVS includes crimes of higher seriousness such as 24.ForcedSex, 25.Homicide, and 26.Arson (item numbers keyed to Table 2) that are not in the SRDS. As a result the CVS covers a greater range. Most of the SRDS items are in the mid-range. In general, the CVS item hierarchy seems appropriate except for 24.ForcedSex, 29.TradedSex, 15.Forgery/BadChecks, 1.DiscussedItCalmlySettledIt, and 2.LeftRoomRatherThanArgue. 24.ForcedSex is higher than 25.Homicide, which is counter-intuitive and counter to the cost data where the cost of rape was $245,032 and the cost for homicide was over 30 times higher at $8,635,611. The calibration for 29.TradedSex seems too high since it is higher in seriousness than violent crimes such as 20.ArmedTheft, 26.Arson, and 12.UseGunOrKnifeOnSomeone. 15.Forgery/BadChecks is also much higher than supported by the cost data at $5,133 since this is lower than most of the costed items below it. Two other items, in the middle range, may be unsupported by the costs. Specifically, 31.IllegalGambling is only costed at $8 so it is too high, and 22.HurtOtherNeedMedicalAttn is similar to aggravated assault at $127,573 so it may be too low. The items 1.DiscussedItCalmlySettledIt and 2.LeftRoomRatherThanArgue are not crimes.
For the SRDS, Sexual Assault was the sixth highest item in the hierarchy. Shouldn't it be higher than items above it such as Stole >$50 and Prostitution? Aggravated Assault also appears too low, ranking 16th of 24 items, and GangFights was the 3rd lowest. Regarding the intent of the SRDS to measure crimes for which adolescents can be arrested, is Sexual Intercourse such a crime? Is Runaway? Moreover, the dollar values embedded in items lose real worth as time goes by, so the meaning of those items is changing as well. How would this affect pretest/posttest assessments? Should the SRDS be adjusted for inflation? In general, a number of the SRDS items appear dated and should be revised and recalibrated using current data.
The variance explained by the CVS measure was substantial (Reckase 1979) at 45%, and the percentage of residual variance explained by the first factor of residuals was 11% (below our 15% criterion). Both of these results were supportive of the interpretation that the CVS was unidimensional, a requirement of the Rasch model.
Two CVS items had problematic fit statistics (Table 2, which also contains the complete items keyed by number): 1.DiscussCalmlySettleIt, with 1.52 infit and 3.01 outfit, and 2.LeftRatherThanArgue, with 1.32 infit and 3.36 outfit. While the latter did not quite reach the 1.33 criterion for poor infit, it was close enough to cause concern. Of course, these are the two CVS items that are not crimes. Therefore, we concluded that while they may be useful in the questionnaire to set up other items (i.e., to create a positive response bias before asking questions about stigmatizing or illegal behavior), they should be dropped from the CVS score and analysis. 29.TradedSex had the highest outfit (mnsq = 4.18). 15.ForgeryBadChecks was the next most misfitting item on outfit, followed by 24.ForcedSex, 27.DUI, and 16.Theft(store). These last five items had good infit. We interpreted this to mean that they were good items, but that a subgroup of people endorsed them without endorsing more common crimes and may be worth studying further.
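The infit and outfit mean-squares cited above are standard Rasch fit statistics computed from standardized residuals. The following is a minimal sketch for a single dichotomous item, our own illustration rather than the estimation software used in the study (which handles polytomous responses and joint estimation):

```python
import math

def rasch_fit(responses, abilities, difficulty):
    """Infit/outfit mean-squares for one dichotomous Rasch item.

    responses:  observed 0/1 answers, one per person
    abilities:  person measures (logits)
    difficulty: the item's calibration (logits)

    Outfit is the unweighted mean of squared standardized residuals, so
    it is sensitive to outliers (unexpected responses by persons far
    from the item); infit is information-weighted and emphasizes
    persons near the item's difficulty.
    """
    z2, info = [], []
    for x, theta in zip(responses, abilities):
        p = 1.0 / (1.0 + math.exp(-(theta - difficulty)))  # Rasch probability
        var = p * (1.0 - p)                                # response variance
        z2.append((x - p) ** 2 / var)                      # squared std. residual
        info.append(var)
    outfit = sum(z2) / len(z2)
    infit = sum(z * v for z, v in zip(z2, info)) / sum(info)
    return infit, outfit
```

Mean-squares near 1.0 indicate data that fit the model; values above roughly 1.33 (the infit criterion used here) signal unmodeled noise.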
The Cronbach’s alpha (which includes extreme scores) of CVS was high (.91). The Rasch person reliability (which excludes extreme scores) of CVS was good (.81). These estimates were calculated after deleting the two misfitting items.
The figures below present easily interpretable graphs of the relationships of the various groups on the CVS items, which are arranged in ascending measure order, i.e., increasing seriousness. The data behind these graphs are provided in Conrad et al. (2009), available at www.chestnut.org/li/gain, which contains the information needed to compute differences between groups on each item. In the following section, we provide interpretation alongside the results to avoid both a meaningless list and unnecessary repetition.
The items 5.ActuallyThrewThingAtOne (.91), 7.SlapAnotherPerson (1.29), 31.IllegalGambling (−1.41), 20.ArmedTheft(money) (−.96), and 29.TradedSex (2.19), with DIF contrasts in parentheses, were significantly different for males and females, where a negative sign indicates the item was easier (more common) for males to endorse and a positive sign easier for females. The “Gender DIF” figure (Figure 1) shows that 24.ForcedSex was easier for females to endorse than for males. This is counterintuitive: the item is intended to indicate the crime of rape, historically one of the least frequent crimes of females, so it should not be easier for females to endorse. The actual item reads: “Made someone have sex with you by force when they did not want to have sex?” This may be an item that males refuse to endorse out of fear of punishment or the desire to give a socially appropriate response. In fact, 24 people endorsed Forced Sex: 15 males and 9 females, about the same percentage, both less than 1% of the respective gender. Is it possible that some females interpreted the word “made” as “persuaded” or “seduced”? Would men interpret the item the same way? In any case, this item, the most misfitting crime, appears to be ineffective in that men seem to be endorsing it too little and women, perhaps, too much. The item should be revised and the revision thoroughly tested in qualitative interviews.
The more serious crimes 20.ArmedTheft(money) and 23.ArmedTheft(other), and one less serious crime, 31.IllegalGambling, were substantially easier for males to endorse. For females, it was substantially easier to endorse less serious offenses such as 7.SlapAnotherPerson, 5.ActuallyThrewThingAtOne, and 29.TradedSex. The complete item for 29.TradedSex is: “Traded sex for food, drugs, or money?” Because 29.TradedSex was harder for males to endorse, and males constituted two-thirds of the sample, 29.TradedSex attained a much higher calibration than it should have, i.e., the third highest overall but second highest for males alone. In other words, since trading sex is very rare for males and males made up two-thirds of the sample, 29.TradedSex was one of the rarest crimes overall even though it was not that rare among females alone, i.e., the 10th highest for females.
In Figure 2, the items 13.PropertyDamage (−.95), 15.ForgeryBadChecks (1.58), 21.FightingHitting (−.91), 24.ForcedSex (2.01), 26.Arson (−1.31), and 29.TradedSex (2.52) were significantly different for adolescents and adults, where a negative sign indicates the item was easier for adolescents to endorse and a positive sign easier for adults. 29.TradedSex, 15.ForgeryBadChecks, and 24.ForcedSex exceeded the .9-logit criterion in the positive direction, i.e., were easier for adults to endorse. This mirrors the gender DIF findings, because adult females in particular tended to endorse these three items; thus adults and females were driving the DIF on them. Additionally, one could argue that child prostitution is more serious than adult prostitution, so that it would warrant a higher seriousness calibration. Or one could argue that child prostitution is not a crime that the child is committing; rather, the child is the victim. If so, the question arises as to whether the item should be dropped from a child's measure.
African Americans tended to endorse 24.ForcedSex more than the other groups, but the small numbers involved and the other problems noted above do not allow meaningful interpretation of this finding. We concluded that we could not determine any meaningful differential item functioning by race, and we dropped race from further analysis. Full details can be found in Conrad et al. (2009), available at: www.chestnut.org/li/gain.
Figure 3 and Table 4 display the interaction of gender and age DIF. Figure 3 makes it clear that the major differences in item responses are not simply by gender or by age but by the interaction of these two demographics. Additionally, the significant DIF is usually between adult females and adolescent males. Examining Figure 3 from left to right, there is no significant DIF, i.e., ≥ .9 logit, for the first five items. The significant contrasts where items are easier for adolescent males than for adult females occur for 21.FightingHitting, 10.BeatUpSomeone, 13.PropertyDamage, 22.HurtOtherNeedMedAttn, 31.IllegalGambling, 18.BreakAndEnter, and 26.Arson. The significant contrasts that are easier for adult females than for adolescent males are 7.SlapAnotherPerson, 15.ForgeryBadChecks, 29.TradedSex, and 24.ForcedSex. Because of the small number of responses and the suspicion of refusal to answer or misinterpretation by males, we disregard 24.ForcedSex.
We can also see in Figure 3 how the item calibrations were driven by adolescent males, who made up the largest proportion of the sample (3,900, or 53%; Table 4). Specifically, the dash symbol, which represents adolescent males in Figure 3, rises from left to right almost monotonically, indicating that this group has the greatest influence on the calibrations. In contrast, the adult female symbol, an asterisk, diverges most from the adolescent males as well as from the other groups. As one would expect, adult females tracked adolescent females most closely. Additionally, Table 4 shows that person reliability was lowest for adult females, with the greatest difference between adolescent males and adult females. The CVS had higher person reliability for adolescents (.84) than for adults (.74).
An important issue raised here is the construct validity of the hierarchy, overall and for males and females split into adult and adolescent groups. Figure 3 shows that adult females usually had the highest or lowest item calibration, and the most extreme differences were between adolescent males and adult females. In other words, adolescent males and adult females had very different hierarchies for crime. More to the point, when a sample is predominantly composed of adolescent males, as ours was in this study, the measures of adult females will tend to be biased upward. This is because women's measures will tend to be composed of less violent acts and less serious crimes such as 7.SlapAnotherPerson, 15.ForgeryBadChecks, and 29.TradedSex, whereas adolescent boys will be endorsing more serious and violent crimes such as 22.HurtOtherNeedMedAttn, 20.ArmedTheft(money), and 23.ArmedTheft(other).
A typical way to handle this is simply to declare the former three items female-biased and drop them. If we drop these “biased” items, everyone's measures should drop, but we would expect females' measures to drop more because females tended to endorse these items more. This is, in fact, what happened: when we dropped the three items, 917 (38%) of the females' measures decreased, as did 918 (18%) of the males'. Using a .5-logit decrease as an arbitrary cutoff for comparison, 40 females (1.64%) had decreases greater than .5, versus only 18 males (0.36%). Thus, on average, females' criminality measures dropped more than males', showing that gender does make a difference on certain items in calculating criminality. The problem with dropping items, however, is that the differences in the patterns of crime are real, not bias due to bad items. In other words, the items work well within groups, but not when the groups are pooled.
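The before/after bookkeeping described above is straightforward; a minimal sketch with hypothetical measures (not the study's data) would look like this:

```python
def summarize_drops(before, after, cutoff=0.5):
    """Count how many person measures decreased after re-scoring,
    and how many dropped by more than `cutoff` logits.

    before, after: lists of person measures (logits), same person order.
    """
    diffs = [b - a for b, a in zip(before, after)]
    decreased = sum(d > 0 for d in diffs)
    big_drops = sum(d > cutoff for d in diffs)
    return decreased, big_drops

# Hypothetical measures for four persons, before and after dropping items.
result = summarize_drops([1.2, 0.4, -0.3, 2.1], [0.5, 0.4, -0.4, 2.0])
```

Run separately for males and females, the two counts yield the within-group percentages compared in the text.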
Another way to handle this problem without dropping the items is to anchor all of the items at their common calibration except for the three biased items (see Conrad et al. 2007 for details of this procedure). These three are allowed to “float.” Anchoring ensures that both males and females will be measured on the same scale, thereby enabling comparison. However, the three items that we judged as biased will be unanchored, so that their calibrations may be estimated separately for males and females. To do this, we must have separate runs for males and females. When we did this analysis, all of the females’ measures decreased by an average of .078 logit while all of the males’ measures increased by an average of .035 logit. While this is only about a tenth of a logit difference (.078 + .035 = .113) between men and women on average, depending on where the seriousness cutoff was set, some women would drop below that cutoff while some men would go above. Another way of saying this is that the rankings would change for some men and women whereby removing the bias against women would tend to lower their criminality scores relative to the men while men’s scores would go up relative to the women.
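The anchoring strategy can be sketched as follows. This is a simplified illustration with hypothetical calibrations and a dichotomous Rasch model (the study's software estimates measures differently): most item difficulties are fixed at their common values for both runs, while the floating item takes a group-specific value, so the resulting person measures remain on one shared scale.

```python
import math

def person_measure(responses, difficulties, tol=1e-6):
    """Maximum-likelihood person measure under the dichotomous Rasch
    model, given fixed (anchored) item difficulties, via Newton-Raphson.
    Assumes a non-extreme raw score (neither 0 nor all items)."""
    theta = 0.0
    for _ in range(100):
        ps = [1.0 / (1.0 + math.exp(-(theta - d))) for d in difficulties]
        grad = sum(x - p for x, p in zip(responses, ps))  # score residual
        info = sum(p * (1.0 - p) for p in ps)             # test information
        step = grad / info
        theta += step
        if abs(step) < tol:
            break
    return theta

# Anchored common calibrations shared by both runs (hypothetical values),
# plus one "floating" item calibrated separately per group:
common = [-1.0, 0.0, 1.0]
float_female, float_male = 1.5, 3.0   # hypothetical group calibrations

resp = [1, 1, 0, 0]  # same response string scored against each ruler
female_theta = person_measure(resp, common + [float_female])
male_theta = person_measure(resp, common + [float_male])
# Because most items are anchored, the two measures sit on the same
# scale and can be compared directly.
```

The point of the sketch is the design choice, not the numbers: anchoring the common items creates one ruler, while the unanchored item is allowed to take a different difficulty in each group's run.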
To recap, in either scenario, dropping items or calibrating them separately, women's criminality measures were adjusted downward relative to men's. Items were judged biased based on fit statistics and item logic; for example, there was no reason to assume that violent crimes should be more serious for women, so the violent-crime items were neither dropped nor calibrated separately. The items judged to be biased were either dropped or calibrated separately, which enabled an “objective” adjustment. In the separate calibration, anchoring most of the items created a common ruler for both men and women, so that their measures could be compared even though three items were calibrated separately.
Of course, one can object to the assumptions we made. However, our point is not to insist on these assumptions but to demonstrate how to adjust measures when the judgment is made that certain items are biased or work differently by gender or other subgroup. Another potential objection is that the item hierarchies are simply different for men and women so that entirely different rulers should be used depending on gender or that separate rulers are needed based on gender and age. The problem with this argument is that most items do not appear to be biased either psychometrically or logically. Additionally, the advantage of having common items is that they may be anchored to create a common ruler so that men and women may be compared to each other in terms of criminality. If we had separate rulers, we could not use them to compare the criminality of men vs. women. As a result, there would still be potential for bias depending on how the rulers were used.
However, one observation that would support building separate rulers, or adding appropriate items to the current CVS, is that the CVS was less reliable for adults than for adolescents, and least reliable for adult females (.72 vs. .85 for adolescent males). This is a substantial difference, meaning that adult female measures carry much more error, and more error in the measures increases the likelihood of error in clinical decision-making. This may suggest the need for more qualitative cognitive work with adult women to see whether the items and reliability can be improved.
The CVS is useful as a measure of the construct of crime and violence, but we found areas for improvement. Two items, 1.DiscussCalmlySettleIt and 2.LeftRatherThanArgue, may be dropped from scoring since they are not crimes and, as a result, had high misfit estimates. The person reliability of the CVS is especially strong for adolescent males (.85) and adolescent females (.82), but relatively weaker for adult males (.76) and adult females (.72). The observation of different hierarchies by gender, age, and their interaction is important theoretically: it shows that one size does not fit all when calibrating the seriousness of crimes for males vs. females, for youth vs. adults, and for the interaction of gender and age. This finding supports the need to develop better items and concomitant measures for adults, especially adult females.
The development of this paper was supported by the Center for Substance Abuse Treatment (CSAT), Substance Abuse and Mental Health Services Administration (SAMHSA) via Westat under contract 270-2003-00006 to Dr. Dennis at Chestnut Health Systems in Bloomington, Illinois using data provided by the following grants and contracts from CSAT (TI-11320, TI-11317, TI-11321, TI-11323, TI-11324, TI-11422, TI-11424, TI-11423, TI-11894, TI-11874, TI-11888, TI-11892, TI-11871, TI-13309, TI-13356, TI-13305, TI-13340, TI-13344, TI-13322, TI-13323, TI-13345, TI-13308, TI-13354, TI-13313, TI-14254, TI-14376, TI-14311, TI-14196, TI-14214, TI-14261, TI-14090, TI-14189, TI-14252, TI-14283, TI-14355, TI-14272, TI-14103, TI-14267, TI-14315, TI-14188, TI-14271, TI-15686, TI-15671, TI-15486, TI-15545, TI-15672, TI-15475, TI-15678, TI-15447, TI-15461, TI-15433, TI-15481, TI-15514, TI-15478, TI-15413, TI-15483, TI-15670, TI-15674, TI-15479, TI-15682, TI-15467, TI-15511, TI-15562, TI-13601, TI-13190, TI-12541, TI-00567; Contract 207-98-7047, Contract 277-00-6500), the National Institute on Alcohol Abuse and Alcoholism (NIAAA) (R01 AA 10368), the National Institute on Drug Abuse (NIDA) (R37 DA11323; R01 DA 018183), the Illinois Criminal Justice Information Authority (95-DB-VX-0017), and the Illinois Office of Alcoholism and Substance Abuse (PI 00567). The opinions are those of the authors and do not reflect official positions of the contributing project directors or government. We appreciate the editorial support provided by Jessica Mazza.