The reliability of self-reported sexual behavior is a question of utmost importance to HIV-prevention research. The Timeline Followback (TLFB) interview, which was developed to assess alcohol consumption on the event level, incorporates recall-enhancing techniques that result in reliable information. In this study, the TLFB interview was adapted to assess HIV-related sexual behaviors and their antecedents, and its reliability was assessed. The interview was administered to 110 participants (46% women, M age = 19.7; range = 18 – 41), and 58 participants who reported sexual behavior during the previous three months returned one week later for a second interview. Test-retest intraclass correlations (ρ) from the TLFB protocol showed that all sexual behaviors were reported reliably (ρ range = .86 to .97, median = .96). Bootstrapping, a non-parametric statistical technique, was used for significance testing in the reliability analyses. Reliability was equivalent across each of the three months assessed with the TLFB, and was equivalent to conventional assessment methods (i.e., single-item questions). These findings show that the TLFB sexual behavior interview provides reliable reports of sexual behavior over 3 months and yields event-level data that are extremely valuable for sexual behavior and HIV-prevention research.
Measurement of sexual behavior is crucial to HIV-prevention research, but obstacles related to recall are often not addressed in this domain. Croyle and Loftus (1) detailed the influence of the constructive nature of memory on self-reports of sexual behavior. Simple forgetting, telescoping (distorting the recency of particularly memorable events), exposure to misleading information since the event, and the use of heuristics to estimate behavior frequencies are some of the factors related to recall that can contribute to unreliable self-reports. Further, other factors such as intentional over- and underreporting of sensitive behaviors are suspected to bias results (2). Because there are limited assessment alternatives, retrospective self-report sexual behavior data are used to determine the prevalence of behaviors that place individuals at risk for HIV infection, track the transmission of HIV, identify antecedents of risk behavior, identify subpopulations to be targeted for intervention, and evaluate interventions designed to reduce risk behavior (3). Therefore, it is important that researchers use measures of sexual behavior that are psychometrically sound. However, measures typically used in HIV-prevention research are not generally standardized or tested for reliability.
A promising technique to assess sexual behavior is the Timeline Followback method (TLFB; 4, 5). The TLFB was first used to obtain retrospective self-reports of daily alcohol consumption during a specified reporting period. Several interview aids are used to facilitate recall, including presenting respondents with a calendar, marking salient events in the reporting period (e.g., hospitalizations), identifying lengthy periods of abstinence or patterned drinking (e.g., drinking only on weekends), and anchoring the highest and lowest quantity consumed in the target interval. In a brief interview, data are yielded regarding the number of drinking days, as well as the average and total quantity consumed during drinking episodes. Advantages of the TLFB procedure include the extensive information yielded on the patterning of drinking behavior, as well as the ability to summarize data on quantity and frequency over a variety of intervals.
Research on the TLFB resulted in its current use to assess alcohol, marijuana, cocaine, and inhalant use (6). Furthermore, it has been administered successfully to a variety of populations including problem drinkers in inpatient, outpatient, and residential alcohol treatment (7); normal drinkers (8); and psychiatric outpatients (9). In addition to the original face-to-face interview format, the protocol has been adapted for phone interviews and computerized assessment (10). Reporting periods ranging from 1 month to 1 year have been used. There is ample evidence that the TLFB method yields reliable and valid self-reports of socially-sensitive substance-use behaviors (6, 11).
A major strength of the TLFB method is that it yields event-level data. Event-level inquiry allows more detailed assessment and analysis than traditional single-item quantity or frequency assessments. To illustrate this benefit, consider investigations of the association between substance use and risky sexual behavior. Studies that have used global assessments of frequencies of substance use and sexual behavior typically have found an association between substance use and unprotected sex (e.g., 12, 13). However, these studies do not provide evidence that drinking and unsafe sexual behavior occurred on the same occasions.
In the first published application of a modified version of the TLFB to assess substance use and sexual behavior, Crosby, Stall, Paul, Barrett, and Midanik (14) studied 131 gay and bisexual men. Event-level data revealed, for example, that men who had unprotected anal sex consistently under the influence of alcohol and/or drugs had less education, lower income, and were more likely to use amyl nitrite and/or cocaine. Thus, the use of event-level data allowed for a more fine-grained analysis of the relationship between substance use and sexual behavior. Consequently, one advantage of the TLFB approach is that it allows event-level analysis of risk behavior, which is valuable for intervention planning and evaluation. Single-item frequency measures typically used in HIV-prevention research cannot provide such information.
The modified TLFB interview that Crosby et al. (14) used is a significant methodological advance, and suggests that the TLFB can be applied to self-reports of sexual behavior. However, there were three limitations to Crosby et al.'s research. First, the interview assessed only protected and unprotected anal intercourse. This may reduce its value for use with populations other than men who have sex with men. Second, the interview assessed behavior occurring 1 month prior to the most recent sexual encounter. Because the frequency of sexual behavior can vary widely over time for an individual, a longer reporting period needs to be examined. Finally, Crosby et al. did not evaluate the reliability of their measure. Therefore, the purposes of this project were to extend the pioneering efforts of Crosby et al. by (a) adapting the TLFB technique to assess a variety of sexual behaviors; (b) extending the interval to three months, a more common reporting period in HIV-prevention research (15); (c) evaluating the test-retest reliability of the instrument in a face-to-face interview format; and (d) comparing the reliabilities of the TLFB variables to those of commonly-used single-item sexual behavior frequency questions.
One hundred ten college students (51 women, 59 men; M age = 19.7; range = 18 – 41) enrolled in an introductory psychology course participated for course credit. This is a sexually active population, which is at risk for a variety of STDs; among college students, 1 in 500 are HIV-positive (16). Ninety percent of the participants were in their first or second year of college.
The protocol was adapted from the current TLFB manual and materials (6). Specific sexual behaviors (e.g., insertive and receptive vaginal, oral, and anal intercourse) were defined explicitly to minimize ambiguity. Consistent with recommendations for sexual interviewing (17, 18), participants were encouraged to use terms they preferred for these behaviors, and interviewers used the participants' terms throughout the interview.
Next, interviewers presented a calendar that included the beginning and end dates of the reporting period, with campus events and holidays already identified. Participants indicated other days in the reporting period that were memorable for them, such as exams, road trips, newsworthy events, sporting events, and family visits, by writing the event on the calendar. Participants were instructed to identify all days on which they were sexually active. To do this, participants were prompted to consider the memorable events they had recalled and any patterns to their sexual encounters (e.g., weekend visits to partners). Beginning with the most recent sexual activity, and for every sexual event, participants were asked a series of questions to assess (a) the type of relationship with the sexual partner (i.e., monogamous, nonmonogamous), (b) type and number of sexual activities, as well as the occurrence of (c) possible antecedents of HIV-risk behavior (e.g., alcohol consumption), and (d) HIV-preventive behavior (e.g., talking with a partner about using condoms before having sex). Interviewers coded participant responses directly on the TLFB calendar on each day that sexual activity was reported.
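To make the structure of these calendar entries concrete, the sketch below shows one way event-level records of this kind could be stored and then collapsed into the frequency counts that single-item questions ask for. The class name, field names, dates, and values are hypothetical and chosen only for illustration; they are not taken from the TLFB manual or from the study data.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class SexualEvent:
    """One coded calendar entry (hypothetical field names)."""
    day: date
    partner_type: str        # e.g., "monogamous" or "nonmonogamous"
    activity: str            # e.g., "vaginal", "oral", "anal"
    condom_used: bool
    alcohol_before: bool     # possible antecedent of risk behavior
    discussed_condoms: bool  # HIV-preventive behavior


# Invented entries for one participant over a three-month reporting period.
events = [
    SexualEvent(date(2024, 1, 13), "monogamous", "vaginal", True, False, True),
    SexualEvent(date(2024, 1, 27), "monogamous", "vaginal", False, True, False),
    SexualEvent(date(2024, 2, 10), "monogamous", "oral", False, False, False),
    SexualEvent(date(2024, 2, 24), "nonmonogamous", "vaginal", False, True, False),
    SexualEvent(date(2024, 3, 9), "monogamous", "vaginal", True, False, True),
]


def unprotected_count(events, activity):
    """Number of events of the given activity without condom use."""
    return sum(1 for e in events if e.activity == activity and not e.condom_used)


def monthly_counts(events, activity):
    """Number of events of the given activity in each calendar month."""
    return Counter(e.day.strftime("%Y-%m") for e in events if e.activity == activity)


print(unprotected_count(events, "vaginal"))
print(dict(monthly_counts(events, "vaginal")))
```

Because each record keeps its date and context, the same list can be summarized over the whole reporting period or month by month without asking the participant additional questions.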
A series of 28 single-item (SI) sexual behavior frequency questions was also included. For example, one SI was “How many times in the past three months did you have vaginal intercourse without using a condom?” These items are similar to those recommended for assessment by Kelly (19), and to those commonly used in HIV-prevention research (e.g., 20–23). Each SI frequency question corresponded to a sexual behavior item yielded by the TLFB interview.
One male and one female research assistant were trained to conduct interviews with participants of the same gender. Training focused on sexual behavior interviewing, HIV-prevention, and research ethics and confidentiality. Interviewers were trained through several role-played interviews simulating typical and difficult interview situations, which were observed by the first author. They received supervision throughout the course of the investigation.
Interviews were conducted individually in sound-attenuated private offices. After obtaining informed consent, the interviewer audiotaped the interview, which was later reviewed for adherence to the interview script. Participants completed the demographics questionnaire. Next, the interviewer conducted the structured interviews, which covered general health, sexual functioning, history of sexually transmitted diseases, and the single-item frequency questions, and then administered the TLFB interview. The order in which questions were asked was planned to gradually increase the sensitivity of content, which is advised in sexual behavior interviewing (18). The order of the sexual behavior measures (single-item questions followed by the TLFB procedure) was planned to move from less detailed to more detailed questioning. Finally, participants completed a debriefing questionnaire. Participants who reported vaginal, oral, or anal intercourse during the reporting period were invited to return one week later for a second interview.
During the second interview, participants were reminded of the purpose and confidential nature of the project. They were re-interviewed about sexual behavior with the single-item frequency questions and the TLFB procedure. Behaviors were assessed for the same three-month reporting period used in the first interview.
Two characteristics of the data influenced our approach to reliability analysis. First, as is typical of sexual behavior and other low base-rate behaviors, the data distributions were positively skewed (i.e., for each behavior, many people reported a frequency of 0 or 1, whereas a smaller proportion reported frequencies of 2 or more). This violates the assumption of normality on which tests of significance of correlation coefficients are based. Second, each comparison of TLFB to SI reliability, and each comparison of the reliability of individual months assessed with the TLFB, involved four variables (e.g., r_wx and r_yz). Therefore, a traditional approach to comparing dependent correlations (24, pp. 215–216), which is based on the relations among three variables (e.g., r_xy and r_xz), could not be applied.
Because of these two characteristics, significance testing for the reliability analyses was done with the aid of bootstrapping (25) performed with Stata statistical software (26). The advantage of bootstrapping to evaluate the magnitude of correlation coefficients is that there is no assumption that the population distribution is Gaussian in form (27). Bootstrapping is a nonparametric technique that simulates the effects of repeated sampling of the variables being analyzed from the population of interest by repeatedly sampling from the available sample. Calculating a correlation from each such resample creates an approximation to the sampling distribution of the correlation, an approximation that imposes no assumptions about the population distribution. Thus, the magnitude of observed correlations can be evaluated relative to what would be expected from this distribution.
In our reliability analyses, the bootstrap technique was implemented as follows. A random sample of size N (in this case, 58) of the variable pairs (e.g., number of occasions of unprotected vaginal intercourse reported at first and second assessments) was selected from the dataset, with replacement. The reliability coefficient was computed from this randomly selected sample and converted to a z score using Fisher's r to z transformation. This procedure was repeated 1000 times for each behavior, resulting in a distribution of Fisher’s z transformations from which the mean, bias, standard error, and 95% confidence interval (CI) were calculated. When comparing two reliability coefficients (i.e., reliability of individual months of the TLFB assessment, or TLFB versus SI reliability), the same procedure was implemented, with random selection of sets of four variables with replacement; the Fisher's z distributions were generated simultaneously for the two coefficients. From the resulting distribution, the mean, bias, standard error, and 95% CI were obtained for each of the two reliability coefficients and for the difference between the two reliability coefficients.
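The resampling procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Stata code used in the study: the Pearson correlation stands in for the reliability coefficient, the data rows are invented, the interval is a simple normal-approximation interval (observed value plus or minus 1.96 bootstrap standard errors), and the helper names `pearson_r`, `fisher_z`, `z_of`, and `bootstrap` are hypothetical.

```python
import math
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation; a stand-in for the reliability coefficient."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)  # ZeroDivisionError if a column is constant


def fisher_z(r, eps=1e-6):
    """Fisher's r-to-z transformation, clamped so |r| = 1 stays finite."""
    r = max(-1.0 + eps, min(1.0 - eps, r))
    return math.atanh(r)


def z_of(rows, i, j):
    """Fisher z of the correlation between columns i and j of the rows."""
    return fisher_z(pearson_r([row[i] for row in rows], [row[j] for row in rows]))


def bootstrap(rows, stat, reps=1000, seed=0):
    """Resample participants (whole rows) with replacement and summarize stat."""
    rng = random.Random(seed)
    observed = stat(rows)
    draws = []
    while len(draws) < reps:
        sample = rng.choices(rows, k=len(rows))
        try:
            draws.append(stat(sample))
        except ZeroDivisionError:
            continue  # degenerate resample with no variance; draw again
    mean = statistics.fmean(draws)
    se = statistics.stdev(draws)
    return {"observed": observed, "mean": mean, "bias": mean - observed,
            "se": se, "ci95": (observed - 1.96 * se, observed + 1.96 * se)}


# Invented frequencies for 12 participants:
# (TLFB interview 1, TLFB interview 2, SI interview 1, SI interview 2)
rows = [
    (0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 1), (1, 2, 2, 1),
    (2, 2, 2, 3), (3, 3, 4, 3), (4, 5, 5, 4), (6, 6, 5, 7),
    (8, 7, 8, 10), (12, 12, 10, 12), (20, 19, 20, 18), (30, 28, 25, 30),
]

# One coefficient: test-retest reliability of the TLFB counts.
single = bootstrap(rows, lambda s: z_of(s, 0, 1))

# Two coefficients compared: all four variables are resampled together,
# so both z values in each replicate come from the same participants.
diff = bootstrap(rows, lambda s: z_of(s, 0, 1) - z_of(s, 2, 3))

print(single)
print(diff)
```

Resampling whole rows is the design choice that matters for the comparison: it preserves the dependence between the two coefficients, which is why a set of four variables is drawn at once. A coefficient is judged different from zero, or two coefficients different from each other, when the resulting interval excludes zero.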
No participants refused to complete the initial interview. Overall, participants rated their experience with the study as somewhat interesting and “not-at-all” to “slightly” embarrassing. Participants indicated that they were very comfortable with the protocol, and that they viewed the study as “somewhat” to “very” important. Sixty-four (58%) of the participants reported sexual behavior during the previous 3 months; these individuals were invited to return for a second interview approximately one week later (M test-retest interval = 9.2 days, SD = 2.8, range = 6–16 days). Five eligible men declined to participate in the second interview because they had completed the required hours of research participation for course credit; one woman could not be re-contacted. Thus, the retest sample consisted of 58 individuals (50% women), 91% of the eligible participants. The single-item frequency questions required approximately five minutes to administer. The time required for the TLFB portion of the interview varied from 5 to 20 minutes as a function of frequency of sexual behavior.
Analyses using the Bonferroni adjustment for multiple comparisons indicated that the SI and TLFB assessments yielded equivalent frequencies on all sexual behavior variables. Table 1 reports the percentage of the sexually active retest sample that reported each behavior. Several behaviors, including the use of a barrier (e.g., dental dam) during any type of oral sex, and anal sex (protected and unprotected, insertive and receptive), were not reported by a sufficient number of participants to compute reliability coefficients and were not included in the reliability analysis.
The test-retest reliability of participants’ responses on each behavior variable was computed using the intraclass correlation (ρ). When computing the intraclass correlation for each behavior, we retained data from participants who reported a frequency of zero for that behavior (see Table 1 for proportion of participants who endorsed each item), because all participants in the final sample were sexually active. Table 2 reports the test-retest reliability intraclass correlation coefficients for each behavior, by gender. Table 3 reports the test-retest reliability intraclass correlation coefficients for each reported behavior, by assessment method, and by month assessed with the TLFB.
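As an illustration of how a test-retest intraclass correlation can be computed from two interviews, the sketch below uses the one-way random-effects form. The text does not state which form of the intraclass correlation was used, so this choice is an assumption, and the function name `icc_oneway` and the example frequencies are hypothetical.

```python
def icc_oneway(first, second):
    """One-way random-effects intraclass correlation for two reports per person.

    first, second: frequencies reported at the first and second interviews,
    in the same participant order. Participants reporting zero are kept.
    """
    n = len(first)
    k = 2  # two interviews per participant
    person_means = [(a + b) / 2 for a, b in zip(first, second)]
    grand_mean = sum(person_means) / n
    # Between-person mean square
    ms_between = k * sum((m - grand_mean) ** 2 for m in person_means) / (n - 1)
    # Within-person mean square
    ms_within = sum((a - m) ** 2 + (b - m) ** 2
                    for a, b, m in zip(first, second, person_means)) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Invented frequencies for four participants at two interviews.
time1 = [0, 1, 5, 12]
time2 = [0, 2, 4, 12]
print(round(icc_oneway(time1, time2), 3))
```

Unlike a Pearson correlation, this coefficient is lowered by systematic shifts between interviews (for example, everyone reporting one more event at retest), which is one reason it is commonly preferred for test-retest agreement.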
Analysis of reliability by assessment mode and gender for each behavior (see Table 2) revealed no significant differences between women's (TLFB range ρ = .79 to .98, median ρ = .94; SI range ρ = .71 to .98, median ρ = .90) and men's (TLFB range ρ = .86 to .97, median ρ = .96; SI range ρ = .62 to .96, median ρ = .83) reports. Using the bootstrapping technique described above, we evaluated the magnitude of the difference between the reliability of the TLFB and single-item frequency questions by gender for each behavior by constructing a 95% normal-theory CI (see Footnote 1) around the difference between the mean Fisher’s z transformation scores. Zero was contained within the 95% CI for each behavior. Comparisons of the reliabilities between TLFB responses from men and from women and between SI responses from men and from women were conducted in a similar manner. Again, zero was contained within the 95% CI for each comparison. Thus, the magnitude of reliability did not differ as a function of gender. Because there were no gender differences, men's and women's data were combined for subsequent analyses.
In the total sample, the TLFB yielded high reliability coefficients for all behaviors (TLFB range ρ = .86 to .97, median ρ = .96; see Table 3). Using bootstrapping, we constructed a 95% CI around the mean Fisher’s z transformation for each behavior. Zero was not included in the CI for any behavior, indicating that TLFB test-retest coefficients were significant for all behaviors, taking into account the positively skewed distributions. The single-item frequency questions resulted in reliability coefficients ranging from ρ = .81 to .96; median ρ = .89 (see Table 3). The magnitude of difference between reliability of TLFB and single-item frequency questions was evaluated for each behavior by using bootstrapping to construct a 95% CI around the difference between the mean Fisher’s z transformation scores. Zero was contained within the 95% CI for each behavior. Thus, the magnitude of the correlations did not differ as a function of method.
Table 3 also contains the test-retest reliability coefficients for each behavior by each month assessed with the TLFB protocol. Bootstrap analysis revealed that zero was contained within the 95% CI of the difference between the mean Fisher’s z transformation scores compared for each behavior. Thus, reliability did not differ significantly across the three months for any behavior.
The primary finding emerging from this investigation was that the Timeline Followback techniques resulted in test-retest reliability coefficients between ρ = .86 and .97 for all behaviors reported. These coefficients, as well as those from the SI frequency questions, were well within the acceptable range for both research and clinical use (28, pp. 264–265). Reliability remained stable for each of the three months assessed, indicating that the TLFB protocol resulted in reliable reporting for a period of up to three months with no appreciable drop-off in reliability after the first or second month. Kauth, St. Lawrence, and Kelly (29) suggested that three months may be the upper limit for acceptably reliable reporting of sexual behavior using SI questions. The ability of the TLFB to produce reliable reports for longer reporting periods warrants further investigation.
Based on these data, and other reliability-evaluation studies involving single-item sexual behavior frequency questions (e.g., 30–32), it appears that both the SI and the TLFB methods are reliable. However, the TLFB interview has the added benefit of producing more specific and useful event-level information (33). This is particularly important for investigating the correlates of HIV-related risk-taking and for evaluating the effectiveness of risk-reduction interventions. It would be difficult to determine with single-item frequency questions, for example, whether participants are more or less likely to talk with a sexual partner about condom use when they have been drinking, and whether this relationship changes as a result of an intervention. However, data collected with the TLFB procedure allow this type of analysis (14). Other examples include evaluating the degree of co-occurrence of HIV risk behavior with risk factors such as substance use and/or mood changes (34), as well as preventive factors, such as assertive communication about prevention with partners (35). Such relationships can be identified in target populations and addressed in interventions. The effectiveness of interventions designed to target the hypothesized determinants of risky sex can then be evaluated at the event level.
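The kind of co-occurrence question raised above can be illustrated with a small sketch. It assumes each sexual event has been coded for whether alcohol was consumed beforehand and whether condom use was discussed; the data and the function name `discussion_rate` are invented for illustration and do not come from the study.

```python
# Each tuple is one sexual event: (alcohol_before, discussed_condoms).
events = [
    (True, False), (True, False), (True, True),
    (False, True), (False, True), (False, False), (False, True),
]


def discussion_rate(events, alcohol_before):
    """Proportion of events with a condom discussion, given drinking status."""
    matching = [talked for drank, talked in events if drank == alcohol_before]
    return sum(matching) / len(matching) if matching else float("nan")


rate_drinking = discussion_rate(events, True)   # 1 of 3 drinking events
rate_sober = discussion_rate(events, False)     # 3 of 4 non-drinking events
print(rate_drinking, rate_sober)
```

A comparison like this cannot be made from two separate global frequencies, because those do not indicate which events coincided. In a real analysis, events are nested within participants, so the comparison would also need to account for that clustering.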
The need for this type of event-level information must be weighed against the resources available for assessment. At present, the TLFB sexual behavior assessment has been examined only in a face-to-face interview format, which requires more personnel resources than does a survey administered to groups of participants. Also, because of the specificity of the data gathered, a single administration of the TLFB typically takes more time than traditional single-item frequency questions. Given these relative costs, the unique level of analysis afforded by TLFB data will not be necessary or desirable in all studies of sexual behavior.
Two limitations of this study should be acknowledged. First, it is possible that the order in which content was presented affected participant self-reports. However, prior studies with college students found that order did not influence measurement error for sexual questions in face-to-face interview format (36). Second, we did not assess the validity of the TLFB. The goal of this initial investigation was to assess the reliability of the TLFB sexual behavior interview, a prerequisite to establishing its validity. Validation of self-report measures of sexual behavior is challenging because direct observation is neither ethical nor practical. Validation using strategies such as collateral partner interviews and concurrent self-monitoring data is needed.
We conclude that the TLFB sexual behavior interview yields reliable, event-level data on self-reports of sexual HIV-risk behaviors and HIV-preventive behaviors for a three-month reporting period. Further research is encouraged (a) to determine if the TLFB provides reliable data over longer intervals, and (b) to evaluate evidence for the validity of self-reported sexual behavior. In addition, it would be useful to determine the importance of the interview-facilitated memory aids for eliciting reliable self-reports. Although both the SI and TLFB approaches provide reliable estimates of sexual behavior, the TLFB also provides additional event-level information that is needed for more precise understanding of the context of risk behavior, and the design and evaluation of HIV-prevention interventions.
This research was supported by grants from the National Institute of Mental Health to Lance S. Weinhardt (F31-MH11125) and Michael P. Carey (K21-MH01101 and RO1-MH54929), National Institute on Alcohol Abuse and Alcoholism to Stephen A. Maisto (AA10291), and National Institute on Drug Abuse to Kate B. Carey (R29-DA07635). The authors thank John R. Gleason and Andrew D. Forsyth for statistical consultation, and Christopher M. Gordon for comments on a draft. Correspondence concerning this article should be sent to Michael P. Carey, Ph.D., Department of Psychology, 430 Huntington Hall, Syracuse University, Syracuse, NY 13244-2340.
Footnote 1. Stata’s bootstrapping function produces three versions of confidence intervals: normal theory, percentile, and bias corrected. Because the three types of confidence intervals were indistinguishable for practical purposes, and the Fisher’s z transformation distributions were Gaussian, the normal theory confidence intervals were used for all analyses.