Investigation of sexual behavior involves many challenges, including how to assess sexual behavior and how to analyze the resulting data. Sexual behavior can be assessed using absolute frequency measures (also known as “counts”) or with relative frequency measures (e.g., rating scales ranging from “never” to “always”). We discuss these two assessment approaches in the context of research on HIV risk behavior. We conclude that these two approaches yield non-redundant information and, more importantly, that only data yielding information about the absolute frequency of risk behavior have the potential to serve as valid indicators of HIV contraction risk. However, analyses of count data may be challenging due to non-normal distributions with many outliers. Therefore, we identify new and powerful data analytical solutions that have been developed recently to analyze count data, and discuss limitations of a commonly applied method (viz., ANCOVA using baseline scores as covariates).
Research on sexual behavior influences public policy as well as educational, clinical, and public health practice for a diverse range of health domains, including family planning, infertility, unintended pregnancy, sexual functioning, and sexually transmitted infections (STIs). The quality of the information yielded by sexual behavior research depends on the methodological rigor of that research. Because of the private (and often stigmatized) nature of sexual behavior, the dyadic (rather than individual) aspect, the multiple motives for sexual behavior, and the large intra- and inter-individual differences in behavioral frequency, research on sexual behavior involves many challenges for investigators (1, 2).
In this paper, we address two of the challenges that researchers confront when investigating sexual behavior, namely, decisions regarding (a) the assessment of sexual behavior (i.e., item content and scaling), and (b) the analysis of sexual risk behavior data. We focus on conceptual differences and data analytical problems that distinguish counts from relative frequency measures of condom use.i We discuss these two challenges in the context of HIV research, with a focus on unprotected (vaginal or anal) intercourse because, as noted by Jemmott and Jemmott, unprotected intercourse is “the best indicator of risk of sexually transmitted infection inasmuch as it indicates the number of exposures to risk” (p. S50)(4). Our purposes are (a) to raise awareness about the need to differentiate between count and relative frequency measures, (b) to discuss options suitable for the analysis of count data, and (c) to identify needs for further methodological research.
Perhaps the most important decision that a sexual health researcher must make involves item content and scaling. Two major categories of sexual risk measures can be found in the literature, namely, count data and relative frequency measures.ii In this section, we identify options for the measurement of unprotected intercourse, review recent trends in the assessment of sexual risk behavior, and discuss the rationale and utility of the most common measurement approaches.
Count measures and relative frequency measures are two distinct categories of sexual risk behavior measures. Most (but not all) assessment methods can be subsumed under these two major categories.iii We define and discuss each of these measures.
Theoretically, count items represent measures of discrete events on a ratio scale. Count measures ask participants to indicate the exact number of times they engaged in a sexual risk behavior during a specified period of time. Items assessing counts typically employ an open response format and often assess unprotected intercourse with two or more related questions. For example, the respondent may be asked “How many times did you have vaginal sex during the past three months?” and “How many of these times did you use a condom?” The number of unprotected vaginal intercourse occasions is then computed as the difference between the total number of vaginal intercourse occasions and the number of times condoms were used. Alternatively, count measures can be collected by diary or Timeline Followback (TLFB) methods, which assess sexual risk behavior at the event level. That is, each single event is recorded as either protected or unprotected, and count measures are derived by summing all occasions of unprotected intercourse that a person reports.
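The two ways of deriving a count described above can be sketched in a few lines of Python. This is a minimal illustration only; the function names and the sample responses are hypothetical and do not come from any of the reviewed studies.

```python
def unprotected_from_totals(total_acts, condom_acts):
    """Two-question format: unprotected acts = total acts - condom-protected acts."""
    if total_acts < 0 or condom_acts < 0 or condom_acts > total_acts:
        raise ValueError("inconsistent report: check the two answers")
    return total_acts - condom_acts


def unprotected_from_events(events):
    """Event-level format (diary or TLFB): one boolean per act, True = condom used."""
    return sum(1 for condom_used in events if not condom_used)


# Hypothetical questionnaire answers: 15 acts, condom used on 5 of them.
questionnaire_count = unprotected_from_totals(total_acts=15, condom_acts=5)

# Hypothetical diary with five recorded acts.
diary = [True, False, False, True, False]
diary_count = unprotected_from_events(diary)

print(questionnaire_count, diary_count)
```

The consistency check in the first function reflects a practical point: when the count is computed as a difference, a respondent who reports more condom uses than acts produces an impossible value that has to be caught before analysis.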
The primary disadvantage of count data is their usually extreme deviation from a Gaussian distribution, which creates difficulties for data analysis. A secondary disadvantage is that, when derived from event-level data (i.e., those obtained with diary, TLFB, or similar event-by-event reporting techniques), count data can take more time to collect than measures that ask only for an overall frequency estimate for a specified time period.
Relative frequencies of unprotected intercourse emerge from four kinds of measures: (a) proportions, (b) percentage ratings, (c) categorical measures, and (d) dichotomies. The common feature shared by these measures is the assessment of unprotected intercourse relative to the total number of intercourse occasions.
Proportions or percentages are derived from count data; however, they are relative frequency measures because they represent the ratio of protected or unprotected intercourse to the total number of intercourse occasions. The proportion of condom-protected vaginal intercourse is computed as the number of condom-protected intercourse occasions divided by the total frequency of vaginal intercourse during the reference time interval. For example, a person reporting condom use on 5 of 15 occasions of intercourse would receive a value of .33 on the proportion scale, or a value of 33% on a percentage scale. Proportions derived from counts may also deviate from normal distributions. Occasionally, a negative kurtosis emerges with high frequencies on both ends of the distribution and low frequencies between the values of 0 and 1.
Percentage ratings do not originate from count data but emerge when respondents rate the use of condoms on a percentage scale. Subjects may be asked: “How often did you use condoms when you had sex in the past three months?” They may, for example, respond on an 11-point scale ranging from 0 to 100 percent in ten percent increments. Percentage ratings are estimates only and provide ordinal data in the form of ordered categories. They, too, may be affected by skewness or negative kurtosis, with high frequencies on one or both ends of the distribution and lower frequencies in the intermediate categories.
Categorical measures of relative condom use comprise Likert scales and dichotomous measures. Both yield information similar to proportions and percentage ratings in that they ask participants to report the frequency of condom use relative to the frequency of intercourse. Likert scales work with multiple response options, such as using condoms “every time,” “sometimes,” or “never.” Regardless of the number of response options employed, the typical feature of Likert scales for safer sex is a range from “never” to “always.” Likert-type categories can be derived from proportions as well by dividing the sample into those who report condom use in 100% of their sexual encounters (always), those who report 0% condom use (never), and those who have percentages > 0 and < 100 (sometimes), which may be divided further into two or more categories, such as ≤ 50% and > 50%. An advantage of Likert ratings is their greater approximation to a Gaussian distribution. Such data usually do not require transformation, and their more favorable distribution allows researchers to apply parametric significance tests, which tend to be more powerful than non-parametric analyses. Two disadvantages are their lower precision and their potential limitation as an indicator of sexual risk behavior, two issues that we discuss in more detail later.
Dichotomous measures are similar to ordinal measures, reduced to two categories. For example, a common dichotomy identifies “low-risk” individuals as those respondents who use condoms consistently (“always”) versus “high-risk” individuals who use condoms inconsistently or not at all (“not always”). Dichotomies can also be derived from ordinal data or count measures by data reduction. They have to be regarded as measures of relative condom use as well, as they provide information only about condom use relative to the total number of intercourse occasions. Dichotomous measures facilitate group comparisons (e.g., using odds ratios) to explore or test hypotheses regarding the correlates of high-risk sexual behavior. The primary disadvantage of dichotomous measures is the loss of quantitative information. Dividing a sample into two groups often leads to a heterogeneous group of “high-risk” individuals, including – for example – those who (a) have used condoms all but one time, (b) never use condoms because they believe they are in a mutually monogamous relationship, and (c) engage in extremely high-risk activities such as sex trading with multiple partners without using condoms. Even if sub-samples are analyzed separately, the results are still based on a rough measure of HIV contraction risk, reduced to two categories only.
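To make the data reduction concrete, the following sketch derives a proportion, a three-level category, and an “always” versus “not always” dichotomy from the same pair of counts. The function name and the three example respondents are hypothetical; the category labels follow the text above.

```python
def condom_use_measures(total_acts, condom_acts):
    """Derive relative frequency measures from two counts.

    Categories are assigned from the integer counts rather than from the
    floating-point proportion, which avoids rounding problems at 0 and 1.
    """
    if total_acts <= 0 or not 0 <= condom_acts <= total_acts:
        raise ValueError("relative measures need at least one act and consistent counts")
    if condom_acts == 0:
        category = "never"
    elif condom_acts == total_acts:
        category = "always"
    else:
        category = "sometimes"
    return {
        "proportion": condom_acts / total_acts,
        "category": category,
        "dichotomy": "always" if category == "always" else "not always",
        "unprotected_count": total_acts - condom_acts,
    }


# The example from the text: condom use on 5 of 15 occasions.
example = condom_use_measures(15, 5)

# Three hypothetical respondents who all land in the "not always" group.
almost_always = condom_use_measures(20, 19)
never_few_acts = condom_use_measures(4, 0)
never_many_acts = condom_use_measures(200, 0)

for m in (example, almost_always, never_few_acts, never_many_acts):
    print(m)
```

The last three respondents share one dichotomous value although their unprotected counts are 1, 4, and 200, which is the loss of quantitative information described above.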
Given the many assessment options, investigators must decide whether to use count data, relative frequency data, or both. This choice is typically guided by factors such as the goals of a study, the level of precision desired, anticipated data collection costs, data analytical considerations, and feasibility. In order to determine how the choice of assessment methods is related to study goals, we reviewed studies published between 1995 and 2001. We searched PsycINFO and Medline using as keywords “HIV or AIDS” and “condom use” or “unprotected intercourse.” Further, we searched in the reference sections of identified studies and included publications from 1995 or later that matched the purpose of this review. We selected all studies that analyzed self-reported condom use as a distinct measure of sexual risk behavior. We excluded studies that did not (a) analyze condom use as a separate outcome (i.e., those studies that used unprotected intercourse only as part of a composite measure), (b) focus on condom use or sexual risk behavior as a primary outcome of interest, or (c) provide enough information to allow us to categorize the measure as a count or a relative frequency measure.
Table 1 lists the resulting sample of 116 studies that illustrate the range of measures used to assess unprotected intercourse (5–120). We classified the studies according to their purpose as (a) intervention studies, (b) correlational studies (i.e., non-experimental research investigating relationships between risk behavior and potential predictors), or (c) methodological studies (testing the reliability and validity of condom use data; see column b). We also distinguished among studies analyzing count measures, proportions, percentage ratings, and categorical/dichotomous measures (see column c).
Of the 116 studies listed in Table 1, the majority (n = 74) relied exclusively on relative frequency data, most often in the form of categorical or dichotomous data. Count data were also used commonly, either alone or in combination with proportional measures (n = 42). A closer look at the relationships between item category and study goals (see Table 2) reveals that count data were employed more often in intervention research and methodological studies (n = 34, 81%) and less often in correlational studies (n = 8, 19%). In contrast, relative frequency data were used more often in correlational studies (n = 50, 68%) than in intervention or methodological investigations (n = 24, 32%). Thus, the use of relative frequency and count data differs by study type, χ² = 25.23, df = 1, p < .001.
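The reported test statistic can be reproduced from the four cell counts given in the text. The sketch below computes an uncorrected Pearson chi-square for the 2 × 2 table; the choice of the uncorrected statistic is an assumption, made because it matches the reported value.

```python
def pearson_chi_square(table):
    """Pearson chi-square for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Rows: count data, relative frequency data.
# Columns: intervention or methodological studies, correlational studies.
table = [[34, 8],
         [24, 50]]

chi2 = pearson_chi_square(table)
df = (len(table) - 1) * (len(table[0]) - 1)
print(round(chi2, 2), df)
```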
In comparison to earlier reviews (121–123) and a recent meta-analysis on correlates of condom use (124), the current review indicates increased use of count measures and proportions since 1995 (121, 122, 124, 125). In their meta-analysis, Sheeran et al. (124) identified 121 studies published between 1981 and 1996, which focused nearly exclusively on categorical and relative frequency measures of condom use. In contrast, the current review reveals that approximately 36% of the studies used counts, or a combination of counts and proportions. This trend might be interpreted as reflecting an increasing need for precision for HIV risk behavior research, particularly in intervention and methodological research. This pattern also suggests an emerging consensus that relative condom use measures may not suffice to inform about the extent of sexual risk behavior before and after a treatment and may not be useful to evaluate intervention success. We will discuss these hypotheses in detail in the following section.
The increasing interest in counts in intervention and methodological research, and the frequent use of categorical data / percentage ratings in correlational research, require explanation. The aforementioned trend towards including count measures has not affected all types of studies equally, suggesting that counts are not universally regarded as the more useful measures. In this section we discuss the utility of count and relative frequency measures relative to study goals. First, we compare these measures as indicators of HIV contraction risk and markers of HIV intervention success. Second, we discuss their utility in studies testing theoretical models of health behavior.
The most crucial question to be answered in intervention research is whether the treatment effectively reduces the frequency of exposure to unprotected intercourse in the target population. If we agree that the focus in intervention studies needs to be the reduction of risk behavior rather than a mere increase in condom use, then the most important criterion in evaluating treatment effects needs to be a decrease in the absolute number of risk exposures. Relative frequency measures of condom use may not suffice to evaluate intervention success. We argue that (a) relative frequency measures are usually imprecise indicators of HIV contraction risk, (b) count data yield important and non-redundant information, and (c) results obtained with relative frequency measures may not be generalized beyond the limited information that they provide. Further, we discuss (d) situations in which relative frequency measures may be useful in intervention research.
Several recent publications indicate an increasing consensus that count data are needed in order to evaluate intervention effectiveness. According to Jemmott and Jemmott (4), unprotected intercourse is “the best indicator of risk of sexually transmitted infection inasmuch as it indicates the number of exposures to risk” (p. S50). Similarly, Jaccard concludes that “from a public health perspective, a major criterion of interest is the sheer number of instances of unprotected sexual intercourse that occurs in a population over a given time period”(50). Further, in a recent paper discussing biological and behavioral markers of intervention success, Fishbein and Pequegnat (126) come to the same conclusion, stating that “if one is truly interested in preventing disease or pregnancy, it is the number of unprotected sex acts and not the percentage of times condoms are used that should be the critical variable.” In general, the risk of HIV infection that an uninfected person takes increases as a function of the number of times this person exposes him- or herself to unprotected sex with an infected partner, all other factors held constant, as each single event adds to the risk of HIV contraction. If we control for co-factors such as the amount of risk behavior displayed by a sexual partner, infectiousness of a seropositive partner at a given time, a person’s biological vulnerability, and viral load, then the likelihood of becoming infected with HIV is proportional to the number of times he or she is exposed to the virus.
The need for count data in evaluating HIV contraction risk and intervention success becomes most apparent in formulas that try to quantify factual HIV risk based on absolute frequencies of unprotected intercourse. For example, the Vaginal Episode Equivalent (VEE) Index developed by Susser et al. (127) assumes that counts of unprotected intercourse are the best indicator of HIV contraction risk, which is expressed by giving weight to each single occasion of unprotected intercourse depending on the relative amount of risk connected with unprotected vaginal (ω=1), anal (ω=2), and oral (ω=.1) intercourse. Similarly, mathematical models for the prediction of HIV contraction risk and the spread of HIV, such as the Bernoulli Process Model (128, 129), require highly accurate and detailed information, including the exact number of times a person engages in unprotected sex; such data can be obtained only from count measures of sexual risk behavior. The same applies to the evaluation of HIV risk reduction programs and estimates of cost-effectiveness of interventions.
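Both kinds of formula take counts as their input, as a short sketch shows. The VEE weights are those given above. The cumulative risk function uses the simplest Bernoulli form, 1 − (1 − p)^n for n unprotected acts with a partner assumed to be infected; that simplification, the per-act probability, and the act counts are illustrative assumptions, not values from the cited models.

```python
# Weights per unprotected act, as stated in the text for the VEE Index.
VEE_WEIGHTS = {"vaginal": 1.0, "anal": 2.0, "oral": 0.1}


def vee_index(unprotected_counts):
    """Weighted sum of unprotected acts, keyed by act type."""
    return sum(VEE_WEIGHTS[act] * n for act, n in unprotected_counts.items())


def bernoulli_cumulative_risk(per_act_probability, n_unprotected_acts):
    """Probability of at least one transmission over n independent acts.

    Simplified single-partner form; real applications add partner
    prevalence, number of partners, and condom effectiveness.
    """
    return 1.0 - (1.0 - per_act_probability) ** n_unprotected_acts


# Hypothetical respondent: 10 vaginal, 2 anal, 5 oral unprotected acts.
score = vee_index({"vaginal": 10, "anal": 2, "oral": 5})

# Hypothetical per-act probability, chosen only to show the shape of the curve.
p = 0.001
risk_1 = bernoulli_cumulative_risk(p, 1)
risk_50 = bernoulli_cumulative_risk(p, 50)
print(score, risk_1, risk_50)
```

Neither function can be evaluated from a proportion or a category alone: each needs the number of unprotected acts.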
Accordingly, the primary task of HIV risk behavior interventions has to be seen in reducing the risk of HIV contraction, which needs to be evaluated by comparing the absolute frequencies of engagement in risky behaviors before and after treatment. Although relative condom use measures may supplement count measures in intervention research, we argue against the reliance on relative frequency measures in intervention studies for the following reasons.
An important disadvantage of relative frequency measures of condom use is that they usually do not inform about the absolute frequency of unprotected intercourse, as the following example will show. Imagine two cases: Person A reports two occasions of sexual intercourse, one of them with a condom; Person B reports 100 occasions, with condom use on 50 of them. In a count approach, A has a risk score of 1 and B a risk score of 50, reflecting the fact that B had 50 exposures to infection risk, 50 times as many as A. In terms of proportions, however, both A and B get the same value because each has had unprotected sex in 50% of their sexual encounters. The same applies to categorical measures of relative condom use. On a three-point scale, for example, ranging from “never” to “sometimes” and “always,” both persons are assigned to the same middle category (for a similar example, see Fishbein and Pequegnat (126)). This lack of precision in the evaluation of HIV contraction risk argues against the use of relative frequencies as quantitative measures of sexual risk behavior.
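The Person A and Person B example can be written out directly. The helper name below is hypothetical; the numbers are those from the text.

```python
def summarize(total_acts, condom_acts):
    """Return (count of unprotected acts, proportion of unprotected acts)."""
    unprotected = total_acts - condom_acts
    return unprotected, unprotected / total_acts


count_a, proportion_a = summarize(total_acts=2, condom_acts=1)
count_b, proportion_b = summarize(total_acts=100, condom_acts=50)

print("A:", count_a, proportion_a)
print("B:", count_b, proportion_b)
print("ratio of exposures, B to A:", count_b / count_a)
```

The proportions are identical while the counts differ by a factor of 50, so any analysis based on the proportion alone treats the two respondents as equivalent.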
This disadvantage of relative frequency measures may account for the fact that only about 50% of the intervention studies analyzing count data report proportions of condom use as a supplemental criterion of treatment effects. Thus, counts or total frequencies of unprotected intercourse seem to be regarded as the primary outcome of interest (18, 55, 62, 63, 69, 101, 105, 116, 130).
Further, as Table 1 reveals, authors using count data rarely report proportions of unsafe sex; instead, if proportions are analyzed, the proportion of protected intercourse is reported. Statistically, it should be irrelevant whether the proportion of protected or unprotected intercourse is reported, as the magnitude of one determines the other. The preference for proportions of protected intercourse may reflect the conviction that count data of condom-protected intercourse yield valuable information only relative to the total frequency of intercourse. In contrast to unprotected intercourse, count data of protected intercourse are not very informative without comparison to the total number of sexual events. If we know that person A reported 15 occasions of condom-protected intercourse, we still do not know what this means in terms of the consistency of A's self-protective behavior. However, if we know that A uses condoms 75% of the time, we have an idea of how condom use relates to the overall sexual behavior of A. Thus, whereas count measures of unprotected intercourse are better indicators of HIV contraction risk, relative condom use measures may be the better choice in evaluating condom use as a habit, that is, the conditional likelihood that a condom is used if a person engages in sexual intercourse.
In principle, relative frequency measures of condom use could substitute for count measures of unprotected intercourse in intervention research if we could rely on a strong association between the two measures. However, there exists no mathematical reason to assume a consistently high correlation between counts and relative frequency measures. Further, a variety of factors (e.g., partner type) are likely to influence the association between counts and relative frequency measures. Therefore, before results obtained with relative condom use can be generalized to counts of unprotected intercourse, the correlation between these two measures has to be determined. This, however, requires the simultaneous assessment of absolute frequencies – counts – of sexual risk behavior.
The uncertain association between counts and relative frequency measures can be demonstrated, as the following examples show. First, we created the hypothetical data depicted in Table 3 to illustrate how the correlation between an absolute and a relative frequency measure might range from −1.00 to 0.00 to +1.00. Second, using actual data from our ongoing and published studies (18, 19, 21), we calculated correlations between counts and proportions of protected and unprotected vaginal intercourse. Proportions were further transformed into various categorical condom use measures. Pearson correlations between (normalized) count data of unprotected intercourse and the various relative condom use measures ranged from r = −.23 to r = −.78, and the correlations between (normalized) counts of condom-protected intercourse and relative condom use ranged from r = −.04 to r = .82. Our conclusions regarding the divergence of relative and absolute frequency measures are further supported by several previously published studies reporting only low to moderate associations between these measures. Fishbein and Pequegnat (126) found correlations between r = −.20 and r = −.40 for the number of unprotected sex acts and the percentage of condom-protected intercourse, and O’Leary et al. (86) reported a correlation of r = .01 between the frequency of unprotected intercourse (i.e., counts) and the proportion of occasions in which a condom was used. In addition, several counter-intuitive findings in the literature provide evidence that caution is warranted in generalizing results obtained on the basis of the relative frequency of condom use. In a recent meta-analytic study, Sheeran et al. (124) found an overall negative correlation of r = −.18 between the frequency of condom use and the frequency of sexual intercourse based on a sample of studies that usually employed measures of relative condom use (categorical measures, dichotomies, percentage ratings). This result is extremely unlikely to occur with count data because a person with few occasions of sexual intercourse has fewer opportunities to display safer sex than a more sexually active person. Similarly, Kasprzyk et al. (65) reported a nonsignificant negative correlation of r = −.16 between unwanted pregnancy and the relative frequency of condom use in vaginal sex but a strong negative correlation of r = −.40 between the relative frequency of condom use in anal intercourse and unwanted pregnancy. Again, this finding is unlikely to occur if count measures of unprotected intercourse are used. Overall, then, this pattern of results indicates that relative frequency measures of condom use cannot be used as a proxy for count measures.
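The point that no mathematical relationship ties counts to proportions can be checked with small constructed data sets. The values below are invented for this sketch and are not the authors' Table 3 data. In all three scenarios the five respondents have the same counts of unprotected acts; only the total number of acts changes.

```python
import math


def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


unprotected = [1, 2, 3, 4, 5]  # count of unprotected acts per respondent

# Hypothetical total acts per respondent under three scenarios.
totals = {
    "positive": [10, 10, 10, 10, 10],  # everyone equally active
    "negative": [2, 5, 10, 20, 50],    # activity rises faster than risk
    "zero": [2, 8, 30, 16, 10],        # no linear relationship
}

correlations = {}
for label, total in totals.items():
    proportion_unprotected = [u / t for u, t in zip(unprotected, total)]
    correlations[label] = pearson_r(unprotected, proportion_unprotected)
    print(label, round(correlations[label], 3))
```

Because the proportion divides each count by a total that can vary freely across respondents, the sign and size of the correlation depend entirely on how total activity is distributed in the sample.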
Although our general perspective is that count data provide a more sensitive measure of HIV contraction risk and should, therefore, be preferred in HIV intervention research, we do not wish to imply that relative frequency measures of condom use are incapable of indicating HIV infection risk. Indeed, several partner studies reported a dose-response relationship between categories of condom use (e.g., “never,” “sometimes,” and “always”) and HIV seroconversion. For example, a study involving commercial sex workers in Kenya revealed an association between a categorical measure of condom use and HIV seroconversion (131). Similarly, a meta-analysis of HIV serodiscordant partner studies revealed a dose-response relationship between 3 categories of condom use (“never,” “sometimes,” and “always”) in predicting seroconversion (132). However, because these studies did not provide information about absolute frequencies of risk behavior, it is not possible to judge how relative condom use measures compare to count data as indicators of risk for HIV contraction. Indeed, we would assume a relationship with HIV contraction risk for any measure that distinguishes between “consistent” condom users and “inconsistent” or “never” users. However, the fact that such a relationship exists for categorical measures does not demonstrate that relative frequency data are superior to count data as indicators of HIV contraction risk.
Even with this evidence as background, the fact remains that relative condom use measures do not, by themselves, yield sufficient information about the extent of individual risk behavior, as our former examples have demonstrated. Relative condom use measures may provide valuable information about the extent of risk behavior if combined with background information about the absolute frequency of engagement in sexual intercourse. For example, in a homogeneous sample of highly active sex workers, relative condom use measures are highly informative about the extent to which each individual engages in unprotected intercourse. However, homogeneity of behavior frequencies cannot be assumed without testing, which again points to the need for count data in evaluating intervention success. In intervention research, relative frequency measures of condom use may be employed in combination with information about absolute frequencies of intercourse. However, we discourage the use of relative frequency measures if they are not indicative of the absolute frequency of sexual risk behavior.
One final example may demonstrate the problem inherent in the exclusive use of relative condom use measures. In principle, it is possible that HIV risk interventions induce participants to engage more often in sexual activities. In this case, even if the relative frequency of condom use may be enhanced after treatment, there is still the possibility that the overall frequency of unprotected intercourse remained unchanged or has increased at the same time. Thus, without any additional background information about absolute frequencies of intercourse or unprotected intercourse (i.e., counts), relative frequency measures of condom use cannot be recommended for the evaluation of intervention success.
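A worked case with invented numbers shows how this can happen. Suppose a participant reports 10 acts with 5 condom uses before the intervention and 30 acts with 24 condom uses afterwards.

```python
def describe(total_acts, condom_acts):
    """Return the proportion of protected acts and the count of unprotected acts."""
    return {
        "proportion_protected": condom_acts / total_acts,
        "unprotected_count": total_acts - condom_acts,
    }


# Hypothetical pre- and post-intervention reports for one participant.
before = describe(total_acts=10, condom_acts=5)
after = describe(total_acts=30, condom_acts=24)

relative_change = after["proportion_protected"] - before["proportion_protected"]
count_change = after["unprotected_count"] - before["unprotected_count"]

print(before, after)
print("change in proportion protected:", round(relative_change, 2))
print("change in unprotected acts:", count_change)
```

Relative condom use rises from 50% to 80%, which looks like a treatment success, while the number of unprotected acts rises from 5 to 6.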
In sum, our review of intervention and methodological research indicates a recent trend toward inclusion of count measures, particularly in intervention trials. This trend reflects the view that studies focusing on unprotected intercourse as the main outcome and as quantitative indicator of individual HIV contraction risk require information about the actual numbers of risk behaviors enacted. The lack of conceptual and consistent empirical overlap between counts and relative condom use measures leads to the conclusion that results obtained with measures of relative condom use should not be generalized to research programs employing count measures without assuring that a strong empirical relationship between the two measures is present in the respective population.
Although count data of unprotected intercourse appear to be more sensitive indicators of HIV contraction risk, ratings of relative condom use have been preferred in correlational studies modeling safer sex in HIV risk populations. Count measures of unprotected intercourse are particularly unlikely to be used as the outcome in studies testing theoretical models of health behavior (5, 16, 34, 44, 65, 82, 84, 91, 97, 99, 104, 120, 133). For example, Bryan, Fisher, Fisher, and Murray (16) tested the Information-Motivation-Behavioral Skills (IMB) model of Fisher and Fisher (134) using a percentage rating of condom use as the outcome. Bryan, Aiken, and West (15) and Thompson, Anderson, Freedman, and Swan (111) tested the prediction of safer sex, using percentage ratings of condom use as the criterion. Schroder, Hobfoll, Jackson, and Lavin (99) included ratings of relative condom use as one of the outcomes to test a model of safer sex among African- and European-American women. In their meta-analysis, Sheeran et al. (124) found strong correlations between psychosocial predictors and condom use, with almost all of the studies using percentage ratings, categorical, or dichotomous measures of relative condom use.
One explanation for the apparent preference for relative condom use measures in model testing research is that these measures may reflect a “latent disposition” to use condoms rather than being a precise indicator of behavior frequencies. If interpreted (broadly) as latent tendencies towards safer sex, relative frequency measures can be expected to relate more strongly (than would count measures) to social-cognitive predictors. A person who reports always using a condom is more likely to believe in the effectiveness and utility of condoms, and less likely to anticipate negative effects of condom use. In testing such hypotheses it is most reasonable to conceptualize safer sex as a “habit,” which may be indicated most precisely by the conditional likelihood of condom use. In contrast, the absolute frequency of condom use may provide less information about the motivational basis of a person’s sexual behavior. This is because the absolute frequency of intercourse as measured by counts is a function of many other individual and dyadic factors (e.g., opportunity, drive, social skills, time, general attitudes towards sex).
Although relative condom use may qualify as the better indicator of a “latent likelihood of safer sex behavior” and thus be preferable in model testing research, the question remains whether results obtained with these measures can be generalized to research programs targeting count measures of sexual risk behavior. Our earlier discussion of generalization issues does not support this notion. For that reason, we hope that future model testing studies take up the challenge and test their predictive power simultaneously using count measures as outcomes. Model testing results obtained with counts of unprotected intercourse may offer a more appropriate empirical reference frame for theory-based HIV prevention programs that employ these measures as the primary criterion for intervention success.
Analysis of sexual behavior data, especially count data, can be challenging even for experienced investigators. Next, we identify a variety of options for the analysis of count data and discuss the advantages and disadvantages of each approach. We focus on methods available for the analyses of counts. Count data of sexual behaviors are often characterized by extreme skewness, variance, and kurtosis, thus deviating strongly from normality. The analysis of count data requires a number of difficult decisions (e.g., how to define outliers in a given distribution or identify the most appropriate and powerful analytical strategy) for which there is no single “correct” choice. Solutions for the analysis of non-normal count data strive to balance the weights of high- and low-frequency cases, and to reduce the biases introduced by extreme skewness and variance.
Our goals in this section are (a) to raise awareness about the variety of data analytical options, and (b) to identify alternatives to traditional statistical methods. Because the need for count data is particularly apparent in intervention research, the following review focuses on methods suitable for the analysis of randomized controlled trials (RCTs). Further, because results with relative frequency measures of condom use cannot be generalized to counts of unprotected intercourse, we call for model testing research using count measures of sexual risk behavior. For that reason, the following overview includes correlational analytical methods as well. The list of data analytical options we discuss cannot provide the kind of in-depth discussion found in books devoted entirely to a single analytic approach, so we refer interested readers to such sources throughout (135–137).
As a general organizing framework for the following review, we refer to the Generalized Linear Model (GLM). The GLM provides a unifying approach to regression and experimental designs for both linear and non-linear regression models as well as normally and non-normally distributed data. The GLM requires that the distribution of the outcome belong to the exponential family of distributions, such as the Gaussian, binomial, Poisson, inverse normal, negative binomial, exponential, and gamma distributions (138). In our discussion, we overview (a) linear models that require normal distributions, (b) generalized linear regression models for the analysis of non-normal count data, (c) non-parametric data analytical options, and (d) distribution-free solutions developed specifically for non-normal interval- or ratio-scaled outcomes.
Linear models usually assume interval level data, normal distributions, and homoscedasticity, and lead to exact significance levels only under these conditions. Count data of sexual risk behavior may occasionally approximate these assumptions and assume a normal distribution if assessed in a homogeneous group of highly sexually active individuals (e.g., sex workers) in which zero-counts and extreme outliers may occur less often. However, behavioral count data often violate the assumptions of linear parametric tests. Data transformations may provide an approximation to a normal distribution but may not suffice due to some extreme outliers. Thus, if count data deviate from a normal distribution, usually two procedures precede linear parametric analyses: (a) data transformations, and (b) the treatment of outliers. Therefore we begin by commenting on these two preliminary steps that may be needed prior to using linear models for normally distributed data.
Data transformation aims at an approximation of non-normal data to a Gaussian distribution. However, with counts of sexual risk behavior characterized by a high number of zero-counts, data transformation may not suffice. Often it is assumed that violations of normality after data transformation may be tolerated if the sample is large; however, this assurance does not apply to the analysis of rare outcomes with a majority of zero-counts. One solution is the two-step approach used by Carey et al. (19). In this study, participants who reported no risk (i.e., never having unprotected sex) were first compared with participants who reported any unprotected intercourse using logistic regression. In a second step, the group reporting any unprotected intercourse was analyzed further, excluding non-risk cases.
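The logic of such a two-step analysis can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the cited study: the counts are hypothetical, and step 1 is reduced to a sample odds ratio, which coincides with the exponentiated coefficient of a logistic regression only when group is the single binary predictor.

```python
import math
import statistics

# Hypothetical counts of unprotected intercourse per participant.
treatment = [0, 0, 0, 0, 2, 1, 0, 3]
control = [0, 4, 2, 0, 6, 1, 9, 3]

def any_risk_odds(counts):
    """Step 1: odds of reporting any unprotected intercourse."""
    at_risk = sum(1 for c in counts if c > 0)
    return at_risk / (len(counts) - at_risk)

def mean_log_count(counts):
    """Step 2: mean log10(x + 1) among participants reporting any risk."""
    return statistics.fmean([math.log10(c + 1) for c in counts if c > 0])

# Step 1: compare "no risk" versus "any risk" between the groups.
odds_ratio = any_risk_odds(treatment) / any_risk_odds(control)

# Step 2: analyze risk frequency, excluding the zero-count cases.
step2_treatment = mean_log_count(treatment)
step2_control = mean_log_count(control)

print(f"odds ratio (any risk): {odds_ratio:.2f}")
print(f"mean log10(x + 1) among risk cases: "
      f"{step2_treatment:.2f} vs. {step2_control:.2f}")
```

In a full analysis, step 2 would be followed by a significance test on the positive counts; the sketch only shows how the sample is partitioned.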
The most common transformation applied in HIV research is the log10 (x + 1) transformation (e.g., see studies 18, 61, 69 in Table 1). However, log10 (x + 1) transformations do not repair all skewed distributions. The NIMH Multisite Prevention Trial Group (139) used square root transformations, and O'Leary et al. (86) applied a cubic root transformation in order to approximate their data to a normal distribution. In general, there is no universal transformation for normalizing extremely skewed behavioral count data. The degree of approximation to a Gaussian distribution that can be accomplished by diverse transformations may be tested using the Kolmogorov-Smirnov test or similar statistics. Alternatively, the most effective approximation may be determined by the Box-Cox power transformation (implemented, for example, in SAS, Stata, and Statistica), which iterates the response variable in a regression model through a series of power functions until normality is maximized. Thus, the Box-Cox method identifies the optimal transformation parameter for the dependent variable that most successfully improves the fit of the specified linear regression model.iv,v
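To illustrate how the transformations named above can be compared, the following Python sketch computes the skewness of a small set of counts before and after each transformation. The counts are hypothetical, and skewness is used here only as a simple stand-in for the formal normality tests mentioned in the text.

```python
import math

def skewness(values):
    """Population skewness (third standardized moment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical counts: many zeros and a few extreme scores.
counts = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2, 3, 4, 6, 10, 25, 120]

transforms = {
    "raw": lambda x: x,
    "log10(x + 1)": lambda x: math.log10(x + 1),
    "square root": math.sqrt,
    "cube root": lambda x: x ** (1 / 3),
}

skew_by_transform = {
    name: skewness([f(x) for x in counts]) for name, f in transforms.items()
}

for name, value in skew_by_transform.items():
    print(f"{name:>14}: skewness = {value:.2f}")
```

Which transformation comes closest to symmetry depends on the data at hand, which is the reason no single transformation can be recommended in general.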
It is sometimes recommended to either exclude extreme outliers or to reduce their impact by assigning a limit value. Tabachnick and Fidell (140), for example, recommend using a z-score of ≥ +3.29 (or 3.40, one-sided) for the definition of univariate outliers, which equals a likelihood of p < .001. For extremely skewed distributions, we recommend defining and treating outliers on the basis of previously normalized scores. In linear parametric analyses, only the distribution of the normalized scores is relevant, and this distribution should not yield any extreme outliers.vi
Multivariate outliers (i.e., cases with an unusual combination of scores) can be identified using regression procedures that offer outlier diagnostics such as the Mahalanobis distance (i.e., the deviation of a case from the centroid of the remaining cases established by the means of all variables involved), Cook’s distance, and the leverage of a case; the latter two provide information about the influence of this case on the regression coefficient. Cut-off scores recommended for the definition of multivariate outliers based on regression diagnostics can be found in Tabachnick and Fidell (140). Again, outlier diagnostics should be performed using the normalized scores of non-normally distributed count data, because these statistics are based on linear regression procedures and apply to Gaussian distributions.
The methods for defining and treating outliers in behavioral count measures have to be chosen with care. Depending on the distribution of the target behavior in the population, the existence of unusual and extreme cases has to be expected as a naturally occurring event. Consequently, the elimination of outliers may satisfy statistical needs at the expense of losing valid cases with the greatest risk of HIV contraction. For that reason, we cannot recommend the removal of outliers as a general strategy. However, there are exceptions to consider. If, for example, test-retest correlations of behavioral measures produce an extreme outlier in the bivariate distribution, indicating that an individual is either not able or unwilling to provide reliable reports, the removal of this bivariate outlier, using Mahalanobis distance, Cook’s distance, and leverage statistics as criteria, might be preferable to accepting an overly influential and unreliable case. Decisions regarding the elimination of outliers should be made based on validity considerations and should not be misused as a strategy to improve the psychometric properties of a measure or the outcomes of an intervention. The removal of an outlier needs to be reported together with evidence for the unreliability of the participant (20).
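For the bivariate test-retest case, the Mahalanobis distance of each participant from the centroid of the remaining cases can be computed directly. The sketch below uses hypothetical normalized test and retest scores; the cut-off of 13.816 is the chi-square value for 2 degrees of freedom at p = .001 and is assumed here as a conventional criterion, not taken from the cited source.

```python
import statistics

def mahalanobis_sq(point, others):
    """Squared Mahalanobis distance of a 2-D point from the centroid of others."""
    xs = [p[0] for p in others]
    ys = [p[1] for p in others]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    n = len(others)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in others) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2 x 2 covariance matrix
    dx, dy = point[0] - mx, point[1] - my
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

# Hypothetical normalized (test, retest) scores; the last case is discordant.
scores = [(0.3, 0.3), (0.5, 0.6), (0.8, 0.7), (1.0, 1.1), (1.2, 1.2),
          (0.0, 0.1), (0.6, 0.5), (0.9, 1.0), (1.8, 0.0)]

CUTOFF = 13.816  # assumed criterion: chi-square, df = 2, p = .001

distances = [
    mahalanobis_sq(case, scores[:i] + scores[i + 1:])
    for i, case in enumerate(scores)
]
flagged = [i for i, d in enumerate(distances) if d > CUTOFF]
print("flagged case indices:", flagged)
```

A flagged case is only a candidate for closer inspection; whether it is removed should rest on the validity considerations discussed above.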
Alternatively, the impact of outliers can be reduced (winsorized) by assigning a limit value (e.g., a z-score with p ≤ .001) in the normalized distribution. Winsorizing of extreme outliers reduces their disproportional weight and is more likely to preserve the results for the majority of the sample. An additional justification for a treatment of single outliers may be that outlier scores are likely to yield the highest measurement error. Thus, it would be preferable to weigh an outlier less than a lower but more reliable score. The treatment of outliers is usually used in combination with other strategies for the analysis of skewed distributions (data transformations, non-linear analyses).
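A minimal sketch of winsorizing in the normalized distribution is given below. It assumes a log10 (x + 1) transformation and a limit of z = 3.29; the function name and the counts are hypothetical.

```python
import math
import statistics

def winsorize_normalized(counts, z_limit=3.29):
    """Cap log10(x + 1) scores at mean + z_limit * SD of the normalized scores."""
    scores = [math.log10(c + 1) for c in counts]
    upper = statistics.fmean(scores) + z_limit * statistics.stdev(scores)
    return [min(s, upper) for s in scores], upper

# Hypothetical sample: 39 low-frequency cases and one extreme report.
counts = [0] * 20 + [1] * 10 + [2] * 5 + [3] * 4 + [400]

capped, upper = winsorize_normalized(counts)
limit_in_counts = 10 ** upper - 1  # limit value expressed as a raw count

print(f"limit (normalized) = {upper:.3f}, limit (count) = {limit_in_counts:.1f}")
```

Only the extreme case is changed; all other normalized scores are returned as they were.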
Two potential problems may occur when assigning a limit score to the upper end of the distribution. First, relationships with connected variables (e.g., proportions of (un)protected intercourse, or behavior change in a longitudinal design) may not be preserved for the respective outliers. Outlier reduction requires decisions regarding the adjustment of related behavioral scores (e.g., how to adjust a score for protected intercourse if unprotected intercourse is set to a limit value, or whether and how to adjust post-intervention scores if a person’s pre-intervention scores were truncated). If outliers in the frequency distribution of unprotected intercourse need to be reduced, we recommend computing proportions of (un)protected intercourse prior to any treatment of outliers in order to preserve the information about relative condom use for those cases.
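The order of operations recommended here can be shown with a small Python sketch. The records, field names, and the limit value of 40 are hypothetical.

```python
# Hypothetical records; participant 3 reports an extreme frequency.
records = [
    {"id": 1, "unprotected": 2, "protected": 8},
    {"id": 2, "unprotected": 0, "protected": 5},
    {"id": 3, "unprotected": 150, "protected": 50},
]
LIMIT = 40  # hypothetical limit value for unprotected intercourse

for r in records:
    total = r["unprotected"] + r["protected"]
    # Compute the proportion from the reported counts first ...
    r["prop_unprotected"] = r["unprotected"] / total if total else None
    # ... and only then reduce the outlier in the count variable.
    r["unprotected_capped"] = min(r["unprotected"], LIMIT)

# For comparison: the proportion obtained if the capped score were used.
extreme = records[2]
distorted = extreme["unprotected_capped"] / (
    extreme["unprotected_capped"] + extreme["protected"]
)
print(extreme["prop_unprotected"], round(distorted, 2))
```

Computing the proportion after capping would make the extreme case appear to use condoms more consistently than it reported.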
Second, statistical outlier diagnosis can indicate a high number of extreme values, specifically in distributions of behaviors that are uncommon in the target population. Reducing all statistically identified outliers to a defined limit value may lead to a bimodal distribution that has no advantages compared to the original distribution of the data.
In sum, the reduction of single outliers can be helpful if these cases exert a strong bias on study findings (i.e., if the results differ considerably depending upon whether the outlier is included or not). The treatment of outliers should not be overused. Instead of reducing a high number of statistically defined outliers, it may be preferable to define outliers more liberally than usually recommended, perform distribution-free statistical tests, use robust estimation, or analyze the data on an ordinal or categorical level only (discussed later). Decisions regarding the treatment of outliers remain difficult and need to be considered for each individual case. If there is reason to believe that an outlier involves a disproportionately large measurement error (i.e., extreme over-reporting), treatment of this outlier (removal, reduction) is indicated. If an extreme outlier may be valid but merely indicates an extremely unrepresentative case (e.g., a single sex worker in the sample), decisions regarding this outlier should consider whether and to what extent its inclusion or exclusion may bias inferences regarding the target population. A single overly influential case that outweighs the majority of other scores in the sample may be removed in order to come to valid conclusions in the remainder of the sample. Such a decision implies that the case is a member of a specific, unrepresentative sub-population, which needs to be addressed in a different study. The removal of this outlier requires explicit statements regarding the limited generalizability of the results. In test-retest reliability studies, an extreme bivariate outlier seems to indicate unreliability of an instrument that may otherwise provide highly reliable and valid results. Because decisions regarding this case have an immediate effect on instrument evaluation, we would recommend reporting the results both with and without the unrepresentative case (20).
Once an approximation to a normal distribution has been achieved, almost all linear parametric analyses can be applied. Instead of listing the many well-known data analytical options for normally distributed variables, we will focus in the remainder of this section on specific problems connected with the analysis of longitudinal data from RCTs. Our purposes are to discuss the available options and to identify specific problems regarding the use of repeated measures analysis of covariance (ANCOVA), which is still widely used to analyze intervention effects.
Randomized controlled trials with a pretest posttest control-group design are traditionally analyzed with Analysis of Variance (ANOVA) if the outcome variable approximates a Gaussian distribution. Even with non-normal count data, data transformations and subsequent ANOVA may be preferred over alternative methods (a) because of their simplicity, transparency, and ease of interpretation and (b) to perform multivariate analyses for which no equivalents exist among non-linear or non-parametric methods. Multivariate Analysis of Variance (MANOVA) offers a solution for multiple criteria of intervention success such as (a) an estimate of overall treatment effects without inflating the Type I error probability; (b) the possibility to find overall treatment effects even if each single dependent variable fails to indicate intervention effects; (c) accounting for correlations among the outcome variables, which are ignored in univariate analyses, and (d) reducing the number of outcome criteria by grouping them according to hypotheses or outcome types into a smaller number of multivariate outcome sets (141).
Controversy exists regarding the validity of results obtained with repeated measures ANOVA, difference score analysis, and Analysis of Covariance (ANCOVA) using the pretest scores of the outcome as covariate (142–148). In the following discussion, we challenge the notion that difference scores are biased indicators of change; instead, we claim that the analysis of residuals in ANCOVA models, using pre-intervention scores of the outcome as covariate, leads to biased results in RCTs and does not provide a sensitive approach to the question of behavior change. We will discuss these three topics in turn, beginning with change score analysis and repeated measures ANOVA, and ending with a critique of the ANCOVA approach.
In repeated measures ANOVA, the outcome of interest is the treatment-by-time interaction. A difference between the experimental and control group is expected after, but not before, the treatment. Alternatively, difference (or change) scores can be computed between pre- and post-intervention scores, using one-way ANOVA (or t-test) in order to test the main effect of groupvii.
It can be shown that, in a simple pretest-posttest design, the group-by-time interaction in repeated measures ANOVA leads to the same results as the main effect of group in analyzing change scores by one-way ANOVA or t-test (144, 145). However, current HIV intervention research often employs multiple post-intervention assessments to evaluate treatment effects over time. For analyses with 2 or more post-intervention assessments, the results of repeated measures ANOVA and difference score analyses will not be the same. In a multiple post-test design, the computation and analysis of difference scores, using repeated measures ANOVA, may have important advantages over pretest-posttest repeated measures ANOVA. First, in pretest-posttest ANOVA, group-by-time interactions are likely to become increasingly inflated with an increasing number of post-intervention assessments. Second, group-by-time interactions may not be clearly interpretable as they confound treatment-group interactions with interactions between group and post-intervention development. In contrast, computing difference scores between each single post-intervention assessment and pre-intervention scores allows clear interpretation of effects: Group differences in the change scores are interpretable as treatment effects, time effects indicate post-intervention change over time, and group-by-time interactions can be clearly interpreted as differential post-intervention change in the two groups.
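The computation of difference scores for a design with two post-intervention assessments can be sketched as follows. The data are hypothetical normalized scores, and the group comparison at each assessment is reduced to a pooled-variance t statistic for illustration; a full analysis would enter both sets of change scores into a repeated measures ANOVA as described above.

```python
import math
import statistics

def pooled_t(a, b):
    """Student's t statistic for two independent groups (pooled variance)."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    diff = statistics.fmean(a) - statistics.fmean(b)
    return diff / math.sqrt(sp2 * (1 / na + 1 / nb))

# Hypothetical normalized scores per participant: (pre, post 1, post 2).
treatment = [(1.2, 0.6, 0.7), (0.9, 0.3, 0.5), (1.5, 1.0, 1.1),
             (0.8, 0.4, 0.3), (1.1, 0.5, 0.8)]
control = [(1.0, 0.9, 1.0), (1.3, 1.2, 1.1), (0.7, 0.8, 0.7),
           (1.4, 1.2, 1.3), (0.9, 0.9, 1.0)]

def change_scores(group, wave):
    """Post-intervention score at the given wave (1 or 2) minus the pre-score."""
    return [row[wave] - row[0] for row in group]

# Group difference in change, separately for each post-intervention wave.
t_by_wave = {
    wave: pooled_t(change_scores(treatment, wave), change_scores(control, wave))
    for wave in (1, 2)
}
for wave, t in t_by_wave.items():
    print(f"post {wave}: t({len(treatment) + len(control) - 2}) = {t:.2f}")
```

A negative t value indicates a larger reduction in the treatment group than in the control group at that assessment.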
In the following discussion, we focus on a comparison between difference score analysis/repeated measures ANOVA and analysis of covariance (ANCOVA) controlling for pre-intervention scores of the outcome. Difference score analysis and repeated measures ANOVA have often been displaced by ANCOVA due to a widely accepted critique of a bias inherent in difference score measures (142–145). We challenge this critique and question instead the validity of the ANCOVA approach.
Although the difference score is an unbiased estimate of true change and follows a simple compelling logic, it has been applied infrequently due to criticism regarding (a) the unreliability of difference scores, (b) the correlation between difference scores and pre-measures, and (c) the claim that difference scores do not account for both an imperfect relationship between pre- and post-measures and the likelihood of a regression to the mean (142, 143). We address each of these criticisms briefly.
First, difference scores are said to be unreliable because they sum measurement errors yielded in both pre- and post measures. Instead, the application of ANCOVA has been recommended, using pre-intervention scores of the dependent variable as a covariate. The ANCOVA approach attempts to control for “error” in terms of pre-existing individual differences. The part of the variance that can be explained by pre-intervention scores is removed, and the remaining residuals are analyzed for treatment-induced behavior change. The residual scores result from linear regression and represent a combination of “true change” in the relative position in the distribution of the scores and an unexplained “rest” (i.e., measurement error, which should occur at random).
Although this criticism is widely cited, there is little support for the claim that residuals contain less error than change scores: Predictions of the true scores are based on the correlation between pre- and post-measures, which is strongly affected by the error in both measures. That change scores are not necessarily less reliable than residuals has been demonstrated by Llabre, Spitzer, Saab, Ironson, and Schneiderman (146); these authors showed that difference scores may be of similar and even higher reliability than residuals. Llabre et al. (146) as well as Rogosa, Brandt, and Zimowski (147) argue that the (statistical) reliability of difference scores is not a trustworthy indicator of the upper limit of their validity (see also Malgady (149)). Imagine a simple pre-post intervention design that is 100% successful in reducing risk behavior, eliminating the variance in the post-intervention scores to zero. In this case, the retest reliability of the scores (and thus the statistical reliability of the difference scores) also will be zero although the pre- and post-intervention scores and their difference may be perfectly accurate.
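The thought experiment can be made concrete with a few hypothetical scores. In the sketch below, every participant reports zero risk behavior after the intervention, so the pre-post covariance vanishes even though each difference score reflects the reported change exactly.

```python
# Hypothetical counts for five participants in a fully successful intervention.
pre = [0, 2, 5, 9, 14]
post = [0, 0, 0, 0, 0]

change = [b - a for a, b in zip(pre, post)]

n = len(pre)
mean_pre = sum(pre) / n
mean_post = sum(post) / n
covariance = sum(
    (a - mean_pre) * (b - mean_post) for a, b in zip(pre, post)
) / (n - 1)
post_variance = sum((b - mean_post) ** 2 for b in post) / (n - 1)

print("change scores:", change)
print("pre-post covariance:", covariance, "| post variance:", post_variance)
```

Because the post-intervention variance is zero, no stability coefficient can be estimated from these data, yet the change scores remain a faithful record of behavior change.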
A second criticism of the difference score approach involves the correlation between pre-measures and difference scores. Cohen and Cohen (142) claim that a good indicator of change would remove pre-existing inter-individual variance in the true scores and, thus, should be uncorrelated with the pre-measures (assuming that measurement errors occur at random). However, this claim seems unreasonable and may not be applicable to intervention research, in which the measurement of change is most crucial. On the contrary: Because we are interested in the reduction of a behavioral outcome variable that has an absolute limit of zero (no risk behavior), we can assume a correlation between pre-measures and difference scores; it seems illogical to expect that the possible range of risk behavior reduction could be the same for high-risk and low-risk individuals. Thus, a good indicator of change should be allowed to correlate with the pre-measure.
The third criticism is that difference scores do not adjust for an imperfect correlation in defining the unit of change. Cohen and Cohen claim that “the trouble with using the simple change score is that it presumes that the regression of x on y has a slope of 1.00 instead of the actual Bxy”, which may be a regression coefficient of .60 instead of 1.00 (142). This means that one unit in x would only account for a .60 unit in y. However, this criticism does not seem valid for two reasons. First, difference scores make no such assumption of a perfect correlation between pre- and post-measures; they merely indicate change without any regard for a possible correlation between pre- and post-measures. Second, in behavioral research, one behavioral unit (e.g., one occasion of unprotected intercourse) and the related amount of risk a person takes in performing such behavior indicates the same risk at both pre- and post-intervention, regardless of the variance in both measures or an imperfect relationship between pre- and post-scores as expressed by a beta-coefficient < 1. In principle, scores indicating behavioral change may be perfectly accurate irrespective of an imperfect correlation between pre- and post-measures. An imperfect correlation, in turn, is not a valid indicator of unreliability of either pre- or post-measures.
In sum, the rejection of absolute change scores results from an invalid application of regression theory to the difference score approach. It seems inappropriate to reject difference scores on account of regression theory because (in contrast to residual change score analysis) they are not based on the regression paradigm and thus do not need to satisfy its assumptions.
Based on our arguments in defense of difference scores, one might question whether the analysis of residuals can be regarded as an appropriate method of analyzing behavior change. Maris (148) criticized the use of ANCOVA in RCT studies. He argues that covariance adjustment is necessary if group assignment is based on pre-intervention scores; however, in RCTs, covariance adjustment leads to biased estimation of intervention effects, whereas difference scores are an unbiased estimator of change.
We agree with Maris’ critique for the following reasons. First, residuals of baseline scores do not exactly reflect “true change” within the unexplained rest left after removing the “true score variance.” This is because the pre-post correlation of the scores cannot be interpreted as a stability coefficient, due to the change induced in the treatment group. Further, change in the treatment group may not follow the assumptions of a linear increase or decrease as assumed by the ANCOVA approach. Thus, the correlation between pre- and post-test is likely to produce incorrect estimates of true individual change.
Second, ANCOVA assumes the regression coefficients to be the same in the groups to be compared (141). However, this assumption contradicts the treatment hypothesis in RCTs: Stability (and thus, a strong correlation between pre and post scores) is expected in the control group only; the treatment group is expected to show greater behavior change, meaning that the correlation with the pre-scores will be reduced. Thus, for both theoretical and empirical reasons, ANCOVA and similar methods accounting for the correlation between pre- and post-measures cannot be recommended for the analysis of RCT outcomes.
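Whether the within-group regression slopes are in fact similar can be inspected before an ANCOVA is considered. The following sketch estimates the slope of post-scores on pre-scores separately in each group; the data are hypothetical and were constructed to mirror the pattern argued above (stability in the control group, reduction toward zero in the treatment group), so they illustrate the argument and do not test it.

```python
def ols_slope(x, y):
    """Ordinary least squares slope of y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

# Hypothetical pre- and post-intervention counts.
pre = [1, 3, 5, 8, 12]
post_control = [1, 3, 4, 8, 11]     # behavior largely stable
post_treatment = [0, 1, 1, 2, 2]    # behavior reduced toward zero

slope_control = ols_slope(pre, post_control)
slope_treatment = ols_slope(pre, post_treatment)

print(f"control slope   = {slope_control:.2f}")
print(f"treatment slope = {slope_treatment:.2f}")
```

Markedly different slopes, as in these constructed data, indicate that the assumption of a common regression coefficient does not hold.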
In sum, the analysis of difference scores/ repeated measures ANOVA should be preferred to the traditional covariance approach. From our point of view it is the more appropriate strategy to evaluate intervention success.
The generalization of linear models to non-normal cross-sectional and panel data offers a variety of options for the analysis of normal and non-normal data. In the following section, we introduce log-linear techniques before turning to Generalized Estimating Equations (GEE) and Mixed Models, or Hierarchical Linear Models (HLM), as further options for longitudinal research. Although increasingly used by biomedical, behavioral, and public health researchers, the two latter strategies are still rarely used in HIV intervention research compared to more traditional ANCOVA and MANCOVA analyses (see Table 1). For comparison with ANOVA, we will again focus on the analysis of intervention effects when discussing GEE and HLM.viii
If count data are extremely skewed, as is often the case, analyses of the linear relationships with other variables will not be appropriate. Ordinary linear regression can be replaced by Poisson or negative binomial regression. Both negative binomial and Poisson distributions meet the characteristics of discrete behavioral data with a disproportionately high number of zero scores and a low number of extremely high scores (137, 152). The Poisson model is defined by the assumption that the conditional mean (i.e., the mean of y at a given score of x) is equal to the conditional variance (i.e., equidispersion). Although this assumption accounts for a correlation between observed score and measurement error, which is typical for most count distributions but violates the assumptions of most parametric tests, the Poisson model rarely fits count data distributions, which often have a conditional variance greater than the conditional mean. The negative binomial model allows for overdispersion and is therefore more appropriate than the Poisson model in most cases. Several statistical procedures allow a specification of the distribution characteristics and can be performed based on the Poisson or negative binomial model. Both the Generalized Estimating Equations (GEE) approach and Hierarchical Linear Modeling (HLM) offer regression analyses for panel data with non-normal distributions.
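A first, informal check of the equidispersion assumption is to compare the sample variance with the sample mean. The sketch below does this for hypothetical counts; note that it examines the marginal distribution only, whereas the Poisson assumption concerns the variance conditional on the predictors, so it is a rough screen and not a model diagnostic.

```python
import math
import statistics

# Hypothetical counts: many zeros and a few extreme scores.
counts = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2, 3, 4, 6, 10, 25, 120]

mean = statistics.fmean(counts)
variance = statistics.variance(counts)

# Under a Poisson distribution this ratio is expected to be close to 1.
dispersion_ratio = variance / mean

# Proportion of zeros implied by a Poisson distribution with this mean,
# compared with the proportion actually observed.
poisson_zero_share = math.exp(-mean)
observed_zero_share = counts.count(0) / len(counts)

print(f"variance / mean = {dispersion_ratio:.1f}")
print(f"zeros: observed {observed_zero_share:.3f}, "
      f"Poisson-implied {poisson_zero_share:.6f}")
```

A ratio far above 1, together with many more zeros than a Poisson distribution implies, points toward a negative binomial or zero-inflated specification.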
Further, models for zero-truncated count distributions can be applied when zero-counts do not occur because inclusion of a case in a sample is only possible after the occurrence of at least one event (e.g., the recruitment of sexually active persons only). The frequency of zero-counts can be estimated based on the variance of the zero-truncated distribution and the application of a particular model for complete distributions (Poisson, negative binomial). This might be useful in some contexts (e.g., if the probability of the target event or behavior in an unknown population is being evaluated). Zero-truncated and further zero-inflated models are described in detail by Long (137).
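As an illustration of how the unobserved zero-counts can be estimated, the sketch below fits a zero-truncated Poisson distribution. It is a simplification under stated assumptions: the rate is estimated from the sample mean (which, for the zero-truncated Poisson, satisfies mean = λ / (1 − exp(−λ))), not from the variance as described above, and the counts and function name are hypothetical.

```python
import math

def fit_zero_truncated_poisson(counts, tol=1e-10):
    """Solve mean = lam / (1 - exp(-lam)) for lam by bisection."""
    m = sum(counts) / len(counts)
    if m <= 1:
        raise ValueError("the mean of zero-truncated counts must exceed 1")
    lo, hi = 1e-9, m  # the solution always lies below the sample mean
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mid / (1 - math.exp(-mid)) < m:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical counts from a sample recruited after at least one event.
counts = [1, 1, 1, 2, 2, 3, 1, 4, 2, 1]

lam = fit_zero_truncated_poisson(counts)
p_zero = math.exp(-lam)  # probability of a zero-count in the full population

# Expected number of zero-count cases that the sampling rule excluded.
expected_zeros = len(counts) * p_zero / (1 - p_zero)

print(f"lambda = {lam:.3f}, P(0) = {p_zero:.3f}, "
      f"expected unobserved zeros = {expected_zeros:.1f}")
```

The same logic applies to a zero-truncated negative binomial model, which would additionally require an estimate of the dispersion parameter.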
In longitudinal research with more than two assessments, repeated measures of discrete, normal and non-normal outcome variables can be analyzed using GEE. GEE and HLM have important advantages that include the possibility (a) to model non-normal outcome variables, (b) to account for individual differences in behavior change, and (c) to model the variance-covariance structure of the longitudinal data. We discuss each of these advantages in turn.
GEE is a particularly useful tool for longitudinal group comparisons with non-normal outcomes and multiple post-intervention assessments, as is needed in HIV intervention research. GEE assumes that a known transformation of the marginal distribution of the outcome (e.g., log10 (x+1)) is a linear function of the covariates or predictors. In contrast to the common fixed-effects models (e.g., ANOVA), GEE estimates “population-averaged” modelsix, using an extension of the quasi-likelihood approach. Quasi-likelihood makes few assumptions about the distribution of the dependent variable and, for that reason, is applicable to a wide variety of non-normally distributed outcome variables (153). The only requirement involves the specification of the mean-covariance structure. GEE uses an iterative procedure for the development of an estimator whose error has a mean of zero and is asymptotically multivariate Gaussian. However, this requires that missing observations be “missing at random” (153, 154). In GEE, the data are modeled by specifying the appropriate distribution family for the dependent variable (e.g., Poisson, negative binomial). If the data are not normally distributed, GEE is likely to yield considerably more test-power compared to repeated measures ANOVA with normalized variables. However, it is important to ensure that the specified distribution family provides a good fit for the dependent variable. A Poisson model is quite restrictive in its assumptions and may not be appropriate for most count measures. Falsely specifying a Poisson distribution may produce misleading results and indicate significant effects that may not apply to the true distribution of the outcome. In general, the significance achieved with the specification of a particular distribution family or correlation structure is not a valid indicator of the appropriateness of the model. 
The applicability of a distribution needs to be tested using diagnostics that are available in some standard statistical software (e.g., STATA™).
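For orientation, a population-averaged model of the kind estimated by GEE can be written as follows for participant i at assessment j. This is a generic textbook sketch and not a specification taken from any of the cited studies: the log link, the group-by-time structure, and the symbols φ, α, and ρ are assumptions chosen for illustration.

```latex
\begin{aligned}
\log \mu_{ij} &= \beta_0 + \beta_1\,\mathrm{group}_i + \beta_2\,\mathrm{time}_j
  + \beta_3\,(\mathrm{group}_i \times \mathrm{time}_j),
  \qquad \mu_{ij} = E(Y_{ij}),\\
\operatorname{Var}(Y_{ij}) &= \phi\, v(\mu_{ij}),
  \qquad v(\mu) = \mu \ \text{(Poisson)}
  \quad \text{or} \quad v(\mu) = \mu + \alpha\mu^{2} \ \text{(negative binomial)},\\
\operatorname{Corr}(Y_{ij}, Y_{ik}) &= R_{jk}(\rho)
  \quad \text{(working correlation; e.g., exchangeable: } R_{jk} = \rho \text{ for } j \neq k\text{)}.
\end{aligned}
```

The three lines correspond to the choices an analyst has to make: the link and predictors for the mean, the distribution family through the variance function, and the within-person correlation structure.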
Even when working with normally distributed outcome variables, GEE might be preferred over classical ANOVA models because GEE treats individual change as a random variable. The advantages of a mixed design with a random individualized change variable and a fixed treatment group effect factor may be seen in a potentially increased test power. Further, GEE analyses are flexible in that they allow specifying the within-group correlation structure for the panels.
Despite their flexibility, increased test power, and easy implementation, GEE analyses are rarely applied in HIV intervention research; an exception is the study reported by Otto-Salaj, Kelly, Stevenson, Hoffmann, and Kalichman (89). As a consequence of this rare use, many published results may underestimate intervention effects. Because of the advantages discussed, we recommend GEE or equivalent methods for the analysis of non-normal count data in longitudinal research.
Mixed models, also called hierarchical linear models (HLM), specify both fixed and random effects. Fixed effects apply to a factor whose levels are fully represented in a study. Random effects refer to a factor whose levels are considered a random sample of potential levels only, instead of a full representation of its levels. In intervention studies, the treatment and the time of assessment represent fixed effect factors, and individuals represent the random effects factor. This model can be described as a randomized block design with fixed treatment effects and random individual (block) effects. Individuals are treated as experimental units that are grouped into blocks (in this case the repeated observations within individuals), to which the treatments are randomly assigned. The advantage of mixed models is the estimation of individual effects over time compared to the averaging procedures offered in ANOVA models.
Mixed models share three features with GEE: (a) the possibility to model the covariance structure of the repeated measures, (b) the possibility to model time as a regression variable and to specify different regression effects for time (e.g., quadratic), and (c) the possibility to apply repeated measures models to non-normal data by specifying the correct distribution family. In addition, mixed models can be used for longitudinal data sets with missing data, thus offering a convenient alternative to listwise or pairwise deletion or missing value substitution. This requires that data be missing at random.
The crucial difference between GEE and mixed models is that the random factor in mixed models gives estimates on individual level (that is, for each person), whereas the population-averaged method used by GEE provides estimates for the “average person.” In practice, however, both estimators tend to deliver similar results. For a more detailed discussion of differences between population-averaged models and mixed models, we refer to Neuhaus (155) and Neuhaus, Kalbfleisch, and Hauck (156).
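The distinction can be written out for one simple special case. The sketch assumes a log link and a normally distributed random intercept that is independent of the predictors; these assumptions, and the symbols u_i and τ², are introduced here for illustration only.

```latex
\begin{aligned}
\text{Subject-specific (mixed model):}\quad
  & \log E(Y_{ij} \mid u_i) = \beta_0 + \beta_1\,\mathrm{group}_i + \beta_2\,\mathrm{time}_j
    + \beta_3\,(\mathrm{group}_i \times \mathrm{time}_j) + u_i,
    \qquad u_i \sim N(0, \tau^2),\\
\text{Population-averaged (implied):}\quad
  & \log E(Y_{ij}) = \left(\beta_0 + \tfrac{\tau^2}{2}\right) + \beta_1\,\mathrm{group}_i
    + \beta_2\,\mathrm{time}_j + \beta_3\,(\mathrm{group}_i \times \mathrm{time}_j).
\end{aligned}
```

In this special case only the intercept differs between the two formulations, which is consistent with the observation that both estimators tend to deliver similar results; with other link functions (e.g., the logit) or with random slopes, the regression coefficients themselves generally differ.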
Higher-order hierarchical linear models (HLM) are a special case of mixed models (151, 157) needed when the data yield a hierarchical structure. In randomized block designs, for example, longitudinal data are doubly-nested, with repeated measurements nested within individuals, who are nested within groups that are defined by blocks (classrooms, schools), or clusters. On the first level, HLM includes repeated measures as nested within individuals. On higher levels, HLM evaluates random group or cluster effects (e.g., of intervention sites), which in turn can be predicted by higher-level independent variables. HLM offers a solution to the dilemma of either ignoring the hierarchical structure and performing the analyses on the individual level (with n = number of subjects) or aggregating individual-level data to the higher level and performing the analyses with these blocks or clusters only (with n = number of blocks). Because individuals in the same block can be assumed to be more similar than individuals belonging to other blocks, independence of observations cannot be assumed, which argues against analyses on the individual level. However, performing the analyses on the higher level only means losing (a) information available on the individual level, and (b) test power through the reduction of n to the number of blocks. HLM allows testing the effects of a treatment on the individual level without ignoring the effects of higher-order organizing structures. Littell et al. (158), Raudenbush (151, 157), and Raudenbush and Bryk (157) provide more detailed information about mixed models and HLM.
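The degree to which individuals within the same block resemble each other can be quantified before a multilevel model is chosen. The sketch below computes the intraclass correlation from a one-way analysis of variance for equally sized clusters; the intraclass correlation is introduced here as an illustrative screening statistic, and the site labels and scores are hypothetical.

```python
# Hypothetical normalized scores of four participants at each of three sites.
sites = {
    "site_a": [1, 2, 1, 2],
    "site_b": [4, 5, 4, 5],
    "site_c": [8, 9, 8, 9],
}

k = len(sites)                        # number of clusters
m = len(next(iter(sites.values())))   # participants per cluster (equal sizes)
all_scores = [s for scores in sites.values() for s in scores]
grand_mean = sum(all_scores) / len(all_scores)

# Between-cluster and within-cluster mean squares from a one-way ANOVA.
site_means = {name: sum(v) / m for name, v in sites.items()}
ms_between = m * sum((mu - grand_mean) ** 2 for mu in site_means.values()) / (k - 1)
ms_within = sum(
    (s - site_means[name]) ** 2 for name, v in sites.items() for s in v
) / (k * (m - 1))

# Intraclass correlation for equal cluster sizes.
icc = (ms_between - ms_within) / (ms_between + (m - 1) * ms_within)
print(f"ICC = {icc:.3f}")
```

A value near zero suggests that clustering can be ignored with little consequence, whereas a substantial value indicates that observations within a block are not independent.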
In sum, the application of GEE and mixed models is most appropriate and efficient for the analysis of count data when two or more post-intervention assessments are available, when differences in individual change suggest that individual trajectories should be taken into account, and when the data follow a non-normal distribution that can be modeled in the framework of the GLM (e.g., Poisson, negative binomial regression). Higher-order, multilevel models are indicated when intra-group similarities, or the effects of environmental clusters (e.g., multiple sites), have to be taken into account.
Non-parametric analyses are taken into consideration when the data quantify a variable on ordinal level only, or when the distribution of a variable violates the assumptions of linear parametric significance tests. For the latter case, a variety of solutions have been developed that overcome some of the limitations of non-parametric analyses. For that reason, these alternative methods may be preferred over non-parametric tests for the analysis of count data. However, if there is reason to believe that the data yield high measurement error and that the counts reported do not approximate interval or ratio-level, an investigator may choose to apply non-parametric analyses.
In general, we do not recommend the use of non-parametric means because of the loss of valuable quantitative information yielded in count data. However, it is a possible solution in dealing with extremely skewed data that are suspected to offer no more than rank order information. In this section, we briefly discuss two options: (a) analyzing untransformed count data on an ordinal level, and (b) analyzing counts that are reduced into ordered categories.
Although non-parametric tests usually have a lower test-power, this is not necessarily true when working with data that deviate extremely from a normal distribution. Spearman correlations, for example, can be higher and more significant than Pearson correlations when working with variables that differ strongly in their distributions. Similarly, a Wilcoxon or a Mann Whitney U-test may yield higher test power than a t-test when comparing sub-samples or experimental groups if the outcome variables are extremely skewed (159). A variety of non-parametric tests have been developed allowing multiple regression and analyses of variance with non-normal data (160, 161). Path analyses and complex theoretical model tests can be performed by LISREL (162) or EQS (163) using polychoric, polyserial, or Spearman rank correlation matrices instead of Pearson correlation or covariance matrices.
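The contrast between Pearson and Spearman correlations for skewed counts can be illustrated with a minimal sketch. The data below are hypothetical and chosen only to show the effect of a single extreme score; the function names (`pearson`, `average_ranks`, `spearman`) are illustrative and not taken from any of the packages cited in this paper.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data: a predictor and a skewed count with one extreme score.
predictor = [1, 2, 3, 4, 5, 6, 7, 8]
counts = [0, 0, 1, 1, 2, 3, 4, 150]

r_pearson = pearson(predictor, counts)
r_spearman = spearman(predictor, counts)
print(f"Pearson r = {r_pearson:.3f}, Spearman rho = {r_spearman:.3f}")
```

In this constructed example the relation is perfectly monotone apart from ties, so the rank-based coefficient is markedly higher than the Pearson coefficient, which is dominated by the single outlier.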
One strategy to deal with non-normal count data and count derivatives is to reduce the distribution into ordered categories. Similar to the distinction between counts and relative measures of condom use, categorizations of counts need to be distinguished from categorizations of proportions. The categorization of counts results in Likert-type answer options that define count intervals, with the highest category covering all scores that reach or exceed its limit (e.g., ≥ 25 times). Thresholds of count distributions can be determined, for example, by the cumulative proportion of cases to be expected at certain percentiles of the normal distribution. This strategy can usually only be applied to the upper part of the distribution because of the accumulation of scores at the lower end and because of the discrete nature of the scores. The resulting categories may, for example, define 0 ( = never), 1 ( = 1–2 times), 2 ( = 3–5 times), 3 ( = 6–10 times), 4 ( = 11–25 times), and 5 (> 25 times).
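The example categorization above can be sketched in a few lines. This is only an illustration of the six categories named in the text; the function name `categorize_count` and the sample counts are hypothetical.

```python
from bisect import bisect_left

# Upper bounds of categories 0..4, taken from the example in the text:
# 0 = never, 1 = 1-2, 2 = 3-5, 3 = 6-10, 4 = 11-25, 5 = more than 25 times.
UPPER_BOUNDS = [0, 2, 5, 10, 25]

def categorize_count(count):
    """Map a non-negative integer count to an ordered category 0..5."""
    if count < 0:
        raise ValueError("counts must be non-negative")
    # bisect_left returns the index of the first bound >= count,
    # or len(UPPER_BOUNDS) when the count exceeds every bound.
    return bisect_left(UPPER_BOUNDS, count)

# Hypothetical reported counts of unprotected intercourse.
counts = [0, 1, 2, 3, 5, 6, 10, 11, 25, 26, 140]
categories = [categorize_count(c) for c in counts]
print(list(zip(counts, categories)))
```

Note that counts of 26 and 140 fall into the same top category, which makes the loss of information at the upper end of the distribution concrete.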
Count data reduced to categories deliver no more information than a “count” measure that uses a categorical response scale to assess risk behavior. Thus, in choosing to reduce data into ordered categories, the question arises: Why were count measures of unprotected intercourse collected in the first place? We recommend considering whether this strategy threatens the validity of test results in a particular study. For example, categorization (as well as truncation of counts) may affect the distributions of several sub-samples differently, making them appear more similar or divergent than indicated by the counts, which may render the results of subsequent statistical group comparisons questionable. Further, the reduction of counts into categorical data involves losing quantitative information about the cases that are likely to bear the highest risk.
In the following section, we briefly discuss logit models and structural equation modeling with ordinal or categorical outcome variables without further commenting on the utility of reducing counts into categories.
Logit models are the most commonly used and recommended ordinal regression models but have been applied rarely in HIV risk behavior research. Logit models allow interpretation of parameters in terms of average odds for the likelihood to score in one of two categories of the ordered distribution.x In general, the ordered logit model expresses relationships between predictors and ordered categories of the dependent variable as “average discrete change” in the predicted probability for each comparison between two categories of the outcome for a unit change in the predictor, holding all other variables constant (137). A variety of logit models are available that enhance flexibility in analyzing ordinal outcomes (164). For example, log odds can be defined for cumulative probabilities, informing about the odds of a response at or below a particular category simultaneously for all possible transitions of the outcome (“cumulative odds”). Alternatively, the probabilities of a response in a particular category in comparison to the next higher, adjacent category can be defined simultaneously for all transitions of the outcome. However, the assumption of parallelism (or proportionality of odds), which requires that the effects of the predictor variables are invariant across the entire scale of the ordered outcome, has to be tested before using ordered logit models. In contrast to ordered logit regression, the ordered probit model is most often inappropriate for the analysis of categorized counts (as well as for assessments by Likert scales) because it requires normally distributed errors. For more detailed information about logit models see Agresti (165, 166) and O’Connell (164).
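The cumulative odds formulation and the parallelism assumption can be made concrete with a small numerical sketch. It assumes the common parameterization logit P(Y ≤ j | x) = a_j − b·x with a single predictor; the threshold values, the slope, and the function names are hypothetical and serve illustration only.

```python
import math

def category_probabilities(thresholds, beta, x):
    """Category probabilities under a cumulative (proportional odds) logit model.

    Assumed parameterization: logit P(Y <= j | x) = thresholds[j] - beta * x,
    with thresholds in increasing order.
    """
    cumulative = [1.0 / (1.0 + math.exp(-(a - beta * x))) for a in thresholds]
    cumulative.append(1.0)  # P(Y <= highest category) is always 1
    probs = [cumulative[0]]
    for j in range(1, len(cumulative)):
        probs.append(cumulative[j] - cumulative[j - 1])
    return probs

def cumulative_odds(thresholds, beta, x):
    """Odds of a response at or below each category cut-point."""
    return [math.exp(a - beta * x) for a in thresholds]

# Hypothetical values: five cut-points separate six ordered categories.
thresholds = [-1.0, 0.0, 1.0, 2.0, 3.0]
beta = 0.5

probs_x0 = category_probabilities(thresholds, beta, 0.0)
probs_x1 = category_probabilities(thresholds, beta, 1.0)

# Under parallelism, a unit change in x multiplies the cumulative odds
# by the same factor, exp(-beta), at every cut-point.
odds_ratios = [
    o1 / o0
    for o0, o1 in zip(
        cumulative_odds(thresholds, beta, 0.0),
        cumulative_odds(thresholds, beta, 1.0),
    )
]
print(probs_x0, probs_x1, odds_ratios)
```

If the odds ratios estimated separately at each cut-point in real data differ substantially, the parallelism assumption is in doubt and the single-slope model sketched here would not be adequate.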
SEM with non-normal data involves a number of problems including increased Chi-Square values, underestimation of fit indices, and, most importantly, a severe underestimation of standard errors of the parameter estimates (167). Coarse categorization of counts and proportions as well as Likert-type assessments of relative condom use may offer a solution in approximating a normal distribution. However, alternative methods and estimation techniques tailored to ordinal and non-normal data should be preferred, in particular if categorizations into ordinal outcome variables do not lead to the desired normal distribution and thus do not eliminate the problem of underestimated errors.
Muthén (168, 169) has developed a method for latent variable mixture modeling that can handle any combination of categorical, ordinal, and continuous variables. In contrast to linear models, the estimator developed by Muthén and implemented in Mplus delivers unbiased, consistent, and efficient parameter estimates for categorical and ordinal outcomes. This method also allows modeling of growth curves that involve categorical and ordinal variables (growth mixture modeling), which can be understood as a multiple-group analysis, similar to simultaneous group comparisons, except that group membership is unobserved and has to be inferred from the data (169). A particularly interesting application of growth mixture modeling is “complier average causal effect estimation” in which compliance is treated as a latent class (169, 170). Class membership (compliance) is observed in the treatment group only, whereas potential compliers in the control group have to be identified by an estimation procedure in order to allow comparisons with actual compliers.
Several “distribution-free” analytical methods have been developed specifically for interval and ratio-level data that ask for parametric analyses but do not fit the distribution assumptions required for their application. Similar to non-parametric analyses, these methods are distribution-free in the sense that they do not make any assumptions regarding the distribution of the outcomes. However, unlike truly non-parametric analyses, distribution-free estimators can be used to deliver parametric results. We briefly discuss two methods that may be applied to non-normal count data: (a) asymptotically distribution free (ADF) estimation of effects, and (b) bootstrapping.
An option for the analysis of extremely skewed data is ADF estimation combined with parametric or non-parametric tests. For example, structural equation modeling can be performed using asymptotic covariance matrices of non-normal bivariate distributions instead of analyzing polychoric and polyserial (non-parametric) correlations (162, 171). Model fit estimation with maximum likelihood (ML), generalized least squares (GLS), or unweighted least squares (ULS) methods requires multivariate normality, an assumption that cannot be met if a variable is extremely skewed. Thus, ADF estimation in combination with the weighted least squares (WLS) method may be a solution for skewed count data if the sample size is sufficiently large. The derivation of asymptotic covariance matrices from raw data requires samples of at least 200 to 500 participants even for simple models, and may require several thousand for complex models, depending on the number of variables involved (172). However, EQS offers an ADF statistic, called the Yuan-Bentler Corrected AGLS Chi-Square, which can be applied to smaller samples (167, 173).
Further, for testing group differences or correlations by non-parametric techniques, many statistical packages offer exact test options for diverse statistical procedures. These tests calculate exact significance levels on the basis of asymptotic results even if the sample size is too small or otherwise violates the requirements of traditional significance tests or standard asymptotic analyses (174). The ADF method will lead to incorrect estimation of significance if samples or sub-samples are small, and if variables with a high proportion of zero-counts are analyzed. The exact test option offers efficient p-values for this kind of data (however, see the discussion of permutation tests).
Bootstrapping allows the application of parametric statistics that otherwise would not be appropriate with non-normal data. The most important feature of the bootstrap might be the estimation of standard errors without the need to meet specific assumptions regarding the distribution of the scores. The bootstrap substitutes for statistical inference based on sampling theory. Instead of inferring from sample data to population parameters and standard errors, sample data are used to “simulate” the sampling distribution. The desired statistic is computed many times by repeatedly drawing random samples of size N from the available data, with replacement. The resulting distribution of the sample statistics is used to provide information about the limit scores that determine the confidence interval of any desired size (usually 95%). A high number of repetitions ensures that the results are asymptotically correct.
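The resampling procedure described above can be sketched as a percentile bootstrap for the mean. This is a minimal illustration, not the implementation of any particular package; the function name `bootstrap_ci`, the number of resamples, the fixed seed, and the sample of counts are all assumptions made for the example.

```python
import random
import statistics

def bootstrap_ci(data, statistic=statistics.mean, n_resamples=5000,
                 level=0.95, seed=0):
    """Percentile bootstrap confidence interval for an arbitrary statistic."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    # Draw n_resamples samples of size n with replacement and compute
    # the statistic on each one; sort to read off percentiles.
    estimates = sorted(
        statistic(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lower = estimates[round(alpha * n_resamples)]
    upper = estimates[round((1.0 - alpha) * n_resamples) - 1]
    return lower, upper

# Hypothetical, strongly skewed counts of unprotected intercourse.
counts = [0, 0, 0, 0, 1, 1, 2, 2, 3, 4, 5, 8, 12, 30, 75]

lower, upper = bootstrap_ci(counts)
print(f"mean = {statistics.mean(counts):.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

With skewed data such as these, the resulting interval is typically asymmetric around the sample mean, reflecting the shape of the observed distribution rather than a normal-theory standard error.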
In principle, the bootstrap can be used to estimate confidence intervals for any parametric test statistic that, otherwise, would require uni- or multivariate normal distributions and homoscedasticity (175, 176), and would be likely to deliver biased results with non-normal data (176–179). As a simple example, testing a mean or mean difference by t-tests is inappropriate with skewed variables. Bootstrapping solves this problem by simulating the sampling distributions of the means and providing model-free, distribution-based confidence intervals for the mean of each group, taking the given distribution characteristics (e.g., skewness) into account. Non-overlapping confidence intervals of the two means for p ≥ 95% (or p ≥ 90% in one-sided testing) indicate a significant difference. Similarly, the bootstrap can be used to test between-group or pre-post difference scores, correlation and regression coefficients, and ANOVA models without the need to rely on a pre-specified distribution model. The substitution of the bootstrap for parametric significance testing with non-normal data is justified because: (a) confidence intervals can be derived without any inference and reference to normal theory; (b) confidence intervals derived with bootstrap re-sampling will be adjusted to the skewness of the data; and (c) according to sampling theory, the sampling distribution (whether inferred or derived by an appropriate number of re-samplings) should approximate a normal distribution even for extremely skewed data with N ≥ 100 (thus allowing the application of parametric significance tests).
In SEM applications, AMOS can be used to estimate structural paths with bootstrap standard errors and confidence intervals. The Bollen-Stine bootstrap estimator implemented in AMOS optimizes the conditions for unbiased estimates of global model fit under multivariate non-normality also with smaller sample sizes (180).
The bootstrap has the disadvantage that confidence intervals for estimated parameters of extremely skewed data are usually large. Similar to parametric significance tests, bootstrap results may turn out to be unduly conservative (i.e., have less test power) compared to conventional non-parametric tests, if applied to skewed count data, which are likely to mask systematic effects of a predictor. Also, the bootstrap cannot ensure generalization of the results. As with other significance tests, bootstrap estimates are subject to sampling error and cannot resolve the problem of bias that may be due to the impact of some extreme, unrepresentative outliers or systematic drop-out. However, the use of bootstrap for count distributions that are transformed to approximate a normal distribution provides an alternative to the application of standard linear parametric tests if the transformed scores, as often with count data, still deviate strongly from normality.
Permutation tests, also called randomization, re-randomization, and exact tests, provide exact significance levels and are an almost distribution-free alternative to parametric tests. Originally developed in the 1930s, preceding computer technology, permutation tests are computer-intensive statistical techniques, applied primarily to the analysis of very small samples (175, 181). Permutation tests have much in common with the bootstrap technique. Similar to the bootstrap, the desired test statistic is computed many times with varying drawings from the sample. However, unlike the bootstrap, permutation tests permute systematically through all possible combinations of the sample observations, thereby leading to an exhaustive distribution of all possible test results and their respective probabilities. Based on this exhaustive likelihood distribution of all possible test scores, exact significance levels can be derived. Thus, the bootstrap can be regarded as an approximation to a permutation test: Whereas the bootstrap leads to asymptotically exact results, the permutation test provides exact test statistics (181).
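Exhaustive permutation can be sketched for the simplest case, a two-group comparison of means in a very small sample. The sketch below enumerates every possible assignment of the pooled observations to the two groups; the function name `exact_permutation_test` and the data are hypothetical, and the two-sided p-value based on the absolute mean difference is one of several possible choices of test statistic.

```python
from itertools import combinations

def exact_permutation_test(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in means.

    Enumerates all ways of assigning the pooled observations to groups of
    the original sizes, so it is feasible only for small samples.
    """
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    n_extreme = 0
    n_total = 0
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance guards against floating-point rounding in ties.
        if diff >= observed - 1e-12:
            n_extreme += 1
        n_total += 1
    return n_extreme / n_total, n_total

# Hypothetical counts for an intervention and a control group.
intervention = [0, 1, 1, 2, 3]
control = [4, 6, 9, 12, 40]

p_value, n_assignments = exact_permutation_test(intervention, control)
print(f"{n_assignments} assignments, exact p = {p_value:.4f}")
```

Because the number of assignments grows combinatorially with sample size, larger samples are usually handled by drawing a random subset of permutations instead of enumerating all of them.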
Permutation tests are extremely versatile and almost universally applicable because they can accommodate any kind of data, providing exact test results on categorical, rank, interval, or ratio-scale level. In fact, common non-parametric statistics are usually realizations of permutation tests applied to categorical or ordinal data (181). For the analysis of interval- or ratio-level data, permutation tests are as powerful as their parametric counterparts. Their major disadvantage is the computation capacity required to perform exhaustive permutation with other than small samples. However, with the rapidly increasing capabilities of desk-top computers, the disadvantage of high computation demands connected with exhaustive permutation will soon be overcome. Thus, permutation tests are likely to gain popularity as a substitute for parametric analyses and bootstrap estimation in the future, also with large samples.
Permutation tests are not distribution-free, but very few assumptions have to be met for their application. For single-sample statistics, the distribution needs to be symmetric (181). Thus, in sexual risk behavior research, permutation tests may be especially helpful in analyzing pre-post difference scores or proportions of condom use derived from counts, which may assume a symmetric distribution but fail to approximate a normal distribution. For multiple samples, permutation tests lead to exact test results only under the assumption that observations are exchangeable, transformably exchangeable, or asymptotically exchangeable under the null hypothesis. Observations are exchangeable if they are independent and identically distributed, or if they are jointly normal with identical covariances (181, 182). Exchangeability is given if the joint distribution of the observations is invariant under permutations of the units.
Although these assumptions are markedly relaxed compared to the requirements of parametric tests, they may render the application of permutation tests to strongly skewed count data problematic. Given the existence of single extreme outliers, permuted sample drawings may not lead to invariant distributions. Hayes (187), for example, has shown that the permutation test, although providing “statistically exact” results for the sample at hand, does not necessarily outperform parametric tests in drawing inferences on the population if distributions deviate strongly from their assumptions. For example, under conditions of extreme marginal non-normality, asymmetric marginal distributions, heteroscedasticity, or non-independent observations, the permutation test showed a similar likelihood to inflate the usual alpha-limits and lead to false rejection of the null hypothesis as the parametric counterpart.
Thus, although providing exact likelihood of sample results on data level, permutation tests may not lead to better decisions regarding the validity of study hypotheses in the population. For example, in RCTs with an uneven distribution of extreme outliers over treatment and control groups (heteroscedasticity), the permutation test may lead to similarly biased results as parametric solutions, although homoscedasticity is not explicitly a requirement of this test (187). Further, as Hayes (187) has pointed out, exhaustive permutation of all observations may not apply to real-world scenarios, in which certain combinations of scores (such as a combination of anal and vaginal intercourse in a homosexual man) may never occur and thus do not need to be taken into consideration in decisions regarding the likelihood of the study results. In this case, exchangeability is violated, and the permutation test is likely to lead to biased results.
In sum, sexual behavior frequency data are likely to violate even the liberal assumptions of permutation tests. The statistically exact solutions provided by the permutation test may still remain of greatest advantage if samples are very small. Otherwise, for extremely non-normally distributed count data, parametric tests accommodating Poisson or negative binomial models may still be most appropriate.
In this article, we discussed count measures of sexual risk behavior and compared them to relative frequency measures of condom use. Further, we discussed a variety of data analytical options for the analysis of counts. Our discussion and review of the literature leads to the following conclusions:
First, count data and measures of relative condom use yield different information. Count data provide a more precise indicator of HIV contraction risk (and risk reduction), as needed for the evaluation of HIV risk reduction interventions and the calculation of cost-effectiveness of HIV prevention programs. Measures of relative condom use inform about the conditional likelihood of condom use (that is, the likelihood of safer sex if a person engaged in sexual intercourse) and are preferred in correlational studies and model-testing research. However, they do not yield sufficient information about the absolute frequency of unprotected intercourse. This conceptual difference is reflected by empirical findings showing a wide range of correlations between counts and relative frequency measures. With regard to their general utility, count data appear to be more versatile because they can be transformed into proportions and categorical data if needed. Because count data yield both absolute and relative frequency measures, they may be useful for both the quantification of HIV risk reduction in HIV intervention research as well as for testing theoretical models.
Second, because of the conceptual difference and lack of consistent empirical overlap between count and relative condom use measures, model-testing results obtained with relative measures may not be generalized to count measures of sexual risk behavior. Thus, although relative frequency measures of condom use may be well-suited to test theoretical models of health behavior, future model testing research should also attempt to predict absolute frequencies of unprotected intercourse with cognitive-motivational variables. Results obtained with counts would inform about the applicability of theoretical models to the outcome that may be of greatest interest in intervention evaluation and public health.
Third, count measures have several disadvantages that require sophisticated data preparation and analytical methods. Collecting and analyzing count data can be more time-consuming and expensive than using ratings or dichotomous measures, in particular if event-level data are collected that request the recall of each single sexual occasion over a specified time period. Further, count data are more challenging to analyze due to their distributional properties (i.e., strong deviations from normality). A variety of methods have been developed that are well-suited for the analysis of non-normal count data, including data transformations in combination with ordinary parametric significance tests, log-linear regression, GEE, mixed models, ADF estimation, permutation tests and bootstrapping, non-parametric tests, and the analysis of ordered categories. In discussing these data analytical options, we hope to encourage more flexibility in choosing the specific analytic strategy that is most appropriate for the data to be analyzed rather than applying the most common and best-known approach.
In randomized controlled trials testing the effects of HIV interventions on risk behavior reduction, GEE as well as mixed models (HLM) are promising alternatives to repeated measures ANOVA or change score analysis. Both GEE and HLM take individual trajectories into account, allow specification of the appropriate distribution family, and offer flexibility in modeling the variance-covariance structure. In most cases, they may increase test power compared to traditional linear parametric techniques. These advantages of GEE and HLM apply if multiple post-intervention assessments are available. If only a single post-intervention assessment is used to evaluate treatment effects, and if the data approximate a normal distribution (or can be transformed accordingly), ANOVA models may be most suitable. However, we discourage the use of ANCOVA for the analysis of RCTs. ANCOVA cannot resolve the problems connected with pre-existing differences between treatment groups, and the analysis of residual change does not guarantee unbiased estimates of intervention effects, as is still widely believed.
Fourth, we discourage methods that involve a loss of information such as reducing count measures to ordinal or categorical data. Categorization of counts not only reduces the quantitative information about HIV contraction risk contained in count measures; categorizations and truncations of counts may also affect the distributions in diverse sub-populations differentially, which may render the results of subsequent statistical group comparisons questionable. Further, the reduction of counts into categorical data involves losing quantitative information about the cases that are likely to bear the highest risk.
We conclude by acknowledging the complexity of data preparation and analysis with count data. Because of this complexity and the many data analytical strategies available, it is imperative that authors communicate their strategy clearly. Readers of journal articles need to be able to understand authors' decisions and their rationale. This need will become more important in the future. The growing number of statistical options and the development of new procedures reduce the likelihood that analytical methods can be regarded as “standard knowledge.” The complexity of count data analysis renders clear communication of the methods employed an essential step of a high-quality research project.
All authors are with the Center for Health and Behavior, Syracuse University. Correspondence and request for reprints to Michael P. Carey, Center for Health and Behavior, 430 Huntington Hall, Syracuse University, Syracuse, NY 13244-2340; email: mpcarey/at/syr.edu. This work was supported by the National Institute of Mental Health to Michael P. Carey (grants # R01-MH54929 and K02-MH01582). The authors thank Kate B. Carey, Martin J. Sliwinski, and Dan Neal and the anonymous reviewers for their comments on earlier versions of this paper.
i. In a companion paper, we address other assessment questions, including what sources of error affect the accuracy of sexual behavior self-reports, to what degree we can rely on retrospective self-reports, and how the accuracy of the data is affected by the population, assessment instruments, the assessment interval, and computerized methods (3).
ii. The present paper focuses on conceptual differences between absolute and relative frequency data. The validity of sexual frequency self-reports and the influence of moderating factors (e.g., self-reporting intervals, and assessment modes) are discussed in the companion paper (3). Similarly, the current paper does not comment on the effects of item construction or question sequence on cognitive processing of item content or the use of different recall heuristics. For discussion of these matters, readers are referred to the work of Schwarz (183, 184, 186) and Rothman et al. (185).
iii. An option not addressed in this paper is the dichotomous assessment of condom use at last intercourse. This measure can neither be categorized as a count nor as a relative frequency measure of condom use because it refers to a single event only. Although assessment of last intercourse is likely to be very accurate, it may not be representative of a person’s sexual behavior and it provides no information regarding the cumulative risk of multiple events.
iv. The Box Cox power transformation may also be used to transform both predictors and outcomes in a regression model simultaneously, or to perform regressions based on the transformation of predictor variables only.
v. Stata also offers the “ladder” command, which indicates the alternative with the strongest approximation to normality among a number of possible transformations.
vi. We recommend this strategy also in preparation of non-linear parametric analyses, as this allows applying a statistically defined limit score – based on normal theory – to the retranslated, non-normal outcome.
vii. One of the most compelling arguments for using difference scores in the analysis of count data may be that they are more likely to approximate a normal distribution. Difference scores eliminate (or at least reduce) the problem of extreme skewness inherent in count measures of sexual risk behavior, although they do not necessarily improve the over-sized kurtosis and dispersion of the data. Transforming count data prior to the computation of difference scores cannot eliminate this problem but can improve the distribution.
viii. Latent curve analysis via Structural Equation Modeling (SEM) is a further option to be mentioned in the framework of the GLM (150, 151). However, this approach is unusual for the analysis of intervention effects on behavioral count data and seems inappropriate for absolute frequencies of manifest behavior in specified time intervals before and after treatment.
ix. An effect is random if the levels of the factor are considered to be a random sample from a larger population of potential levels.
x. Logistic and ordered logistic regression are non-parametric in the sense that they do not rely on Gaussian errors. However, they do rely on an underlying assumption of some distribution (Bernoulli); consequently, these procedures may, by some people's definition, be considered a parametric test.