In this paper, we consider a two-group comparative (ie, treatment vs. control) LS in both clinical and simulated examples. Although we use a two-group design here, the statistical inferences can be readily generalized to LS involving a different number of groups.

Hypothesis Testing for a Two-Group Comparative LS

The most common research question in a two-group comparative LS is whether the treatment group follows the same trend in the mean outcome over time as the control group. This is equivalent to testing the significance of the interaction effect between treatment and time. To prepare for hypothesis testing at the analysis stage, a study needs to set a null hypothesis (H0) and an alternative hypothesis (H1) prior to its start. The null hypothesis, as the name suggests, specifies that the treatment and control groups follow the same mean trend in outcome over time (ie, no interaction effect). Ideally, the alternative hypothesis specifies the smallest clinically important effect of the treatment that, if demonstrated, would prompt the treatment to be adopted in clinical care. For simplicity, we assume the mean treatment and control group effects follow linear trends over time (). The hypothesis test for an interaction effect can therefore be converted to a comparison of the slopes of the 2 straight lines (St: slope of treatment; Sc: slope of control). As illustrated in , when there is no interaction effect (H0), the 2 lines are parallel (ie, St=Sc); otherwise, St≠Sc. The difference between St and Sc represents the magnitude of the interaction effect.
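The equivalence between the interaction effect and the difference in slopes can be sketched numerically. The following Python snippet (illustrative only; the values are hypothetical and noiseless) fits the slope of each group by ordinary least squares and takes their difference:

```python
# Hypothetical noiseless group means at 4 time points.
times = [0, 1, 2, 3]
treatment = [5.0, 6.55, 8.10, 9.65]   # exact linear trend with slope St = 1.55
control   = [5.0, 6.00, 7.00, 8.00]   # exact linear trend with slope Sc = 1.00

def ols_slope(x, y):
    """Least-squares slope of y on x: Sxy / Sxx."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

st = ols_slope(times, treatment)  # slope of the treatment line
sc = ols_slope(times, control)    # slope of the control line
interaction = st - sc             # magnitude of the group-by-time interaction
```

When the lines are parallel (St = Sc), the interaction term is 0 and H0 holds; any nonzero difference is the interaction effect to be tested.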

As shown in , there are 2 correct and 2 incorrect actions (types of error) associated with hypothesis testing. Type I error, or false positive, is the probability of rejecting the null hypothesis of equal slopes when, in fact, St=Sc. Type II error, or false negative, is the probability of failing to reject the null hypothesis of equal slopes when, in actuality, St≠Sc. Power, the complement of the Type II error rate (ie, 1 − Type II error), is the probability of detecting a difference between the 2 slopes when, in fact, St≠Sc. By convention, to help prevent the investigator from making false claims, the Type I and Type II error rates are set at low levels of 5% and 20% (ie, 80% power), respectively. We ultimately reject or fail to reject the null hypothesis based upon statistical analysis of the collected data.
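The relationship between the two-sided 5% significance level, Type II error, and power can be made concrete with a normal-approximation sketch. This is an illustrative calculation only (the standardized effect of 2.8 is an assumed value, chosen so power lands near the conventional 80%):

```python
import math

def norm_cdf(x):
    """Standard normal CDF computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

z_crit = 1.959964          # two-sided 5% critical value of N(0, 1)
effect_over_se = 2.8       # hypothetical (St - Sc) / SE of the difference

# Two-sided power: probability of rejecting H0 when H1 is true.
power = norm_cdf(effect_over_se - z_crit) + norm_cdf(-effect_over_se - z_crit)
type_ii = 1.0 - power      # Type II error (false negative) rate
```

With a standardized slope difference of 2.8, power is approximately 0.80 and the Type II error rate approximately 0.20, matching the conventional design targets described above.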

Repeated Measures ANOVA

RM-ANOVA relates the study outcome variable to a set of covariates (eg, treatment group, time) and compares the mean outcome at multiple time points or between groups. Although RM-ANOVA (one of the earliest proposed methods for analyzing correlated responses) has gained widespread popularity, it has several unattractive features. First, RM-ANOVA requires the outcome variable to be quantitative (ie, a continuous variable) and normally distributed. It also requires the covariates to be discrete (ie, categorical variables). Second, RM-ANOVA requires that the outcome have constant variance across time points as well as constant correlation between any 2 time points (ie, the assumption of sphericity). The assumption of constant correlation of repeated measures is often unrealistic in medical research, as repeated measures often become less correlated with increasing time from treatment. This kind of violation of the sphericity assumption may cause inflated Type I error.^{20} Third, RM-ANOVA can only handle longitudinal studies in which all subjects have the same number of repeated measurements. Specifically, RM-ANOVA excludes those subjects who have missing observations at 1 or more time points (a common occurrence in a LS). Inclusion of only those subjects who have “complete” data for all variables has unfavorable consequences. The group of subjects with “complete” data may not represent a random sample from the target population, thus producing biased results. Further, statistical power is reduced by this artificial attrition in sample size.
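The listwise (complete-case) deletion described above can be illustrated with a toy data set. All values below are hypothetical; `None` marks a missed visit:

```python
# Each row: one subject's outcome at 4 time points; None = missing visit.
data = [
    [4.1, 5.0, 5.9, 6.8],
    [3.9, 4.7, None, 6.4],   # misses time point 3 -> excluded by RM-ANOVA
    [4.5, 5.2, 6.1, 7.0],
    [4.0, None, 5.8, None],  # misses time points 2 and 4 -> excluded
    [4.2, 5.1, 6.0, 6.9],
]

# RM-ANOVA keeps only subjects observed at every time point.
complete_cases = [row for row in data if all(v is not None for v in row)]
# Here, 3 of 5 subjects remain: a 40% loss of sample size, and a biased
# sample if the missingness is related to the outcome.
```

GEE and MEM, discussed next, retain the partially observed subjects instead of discarding them.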

Generalized Estimating Equations (GEE)

The GEE method focuses on average changes in response over time and the impact of covariates on these changes. The method models the mean response as a linear function of covariates of interest via a transformation or link function. To accommodate various types of outcomes that are not necessarily normally distributed, different link functions are employed for modeling the relationship between outcome and covariates. For example, an identity link function is used for a continuous outcome, a logit link function for a binary outcome, and a log link function for count data.^{21} These transformations can be considered repeated measures analogs of linear regression, logistic regression, and Poisson regression, respectively. In addition, to account for variation in correlation between repeated measures, GEE allows specification of the correlation structure from a wide variety of choices. Popular choices, among others, include the compound symmetry (CS) correlation structure and the autoregressive (AR(1)) correlation structure. The CS correlation structure assumes a common correlation for any pair of responses at different time points, while the AR(1) correlation structure assumes that measurements closer in time have a higher correlation than those that are further apart. GEE also has appealing robustness properties in parameter estimation: its estimates of the regression parameters remain valid even when the working correlation structure is misspecified. Unlike RM-ANOVA, GEE does not require the outcome variable to have a particular distribution. This feature can greatly benefit studies in which data are skewed or the distribution of data is difficult to verify due to a small sample size.
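The two working correlation structures named above are easy to write down explicitly. The following sketch builds each as a plain matrix (r time points, correlation parameter rho), so their difference is visible entry by entry:

```python
def cs_corr(r, rho):
    """Compound symmetry: the same correlation rho between every pair
    of distinct time points, 1.0 on the diagonal."""
    return [[1.0 if i == j else rho for j in range(r)] for i in range(r)]

def ar1_corr(r, rho):
    """AR(1): correlation rho**|i - j| decays as time points grow apart."""
    return [[rho ** abs(i - j) for j in range(r)] for i in range(r)]

cs = cs_corr(4, 0.5)    # every off-diagonal entry is 0.5
ar = ar1_corr(4, 0.5)   # lag-1 pairs: 0.5, lag-2 pairs: 0.25, lag-3: 0.125
```

Under CS, the first and last measurements are as correlated as adjacent ones; under AR(1), that correlation has decayed to 0.125, which matches the clinical intuition that responses drift apart with time from treatment.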

Mixed Effects Models (MEM)

MEM describes how the response of the individual participant changes over time. It takes into account between-individual heterogeneity by adding random effects to a subset of covariates of interest. These added random effects allow covariate coefficients to vary randomly from 1 individual to another, thereby providing an individual response trajectory over time. The most common MEM in longitudinal studies are those with random effects attached to baseline values or time-dependent variables (eg, postoperative day), reflecting heterogeneity among individual responses at baseline (eg, heterogeneous pain scores at baseline) or variation between individual trajectories over time (eg, heterogeneous rates of change in pain). In addition, like GEE, MEM allows specification of the correlation structure between repeated measurements from similar choices, such as CS and AR(1).
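The random-intercept, random-slope structure described above can be sketched generatively. The snippet below (an illustration with assumed parameter values, not the paper's model) draws a subject-specific intercept shift and slope shift around the fixed effects, producing a distinct trajectory per subject:

```python
import random

random.seed(2024)

def simulate_mem(n_subjects=20, times=(0, 1, 2, 3),
                 beta0=4.0, beta1=1.0,   # fixed effects: intercept, slope
                 sd_b0=1.0, sd_b1=0.3,   # SDs of random intercept and slope
                 sd_e=0.5):              # residual (within-subject) SD
    """Simulate trajectories from a random-intercept, random-slope MEM."""
    trajectories = []
    for _ in range(n_subjects):
        b0 = random.gauss(0.0, sd_b0)    # subject-specific baseline shift
        b1 = random.gauss(0.0, sd_b1)    # subject-specific rate-of-change shift
        y = [beta0 + b0 + (beta1 + b1) * t + random.gauss(0.0, sd_e)
             for t in times]
        trajectories.append(y)
    return trajectories

trajs = simulate_mem()
```

Setting sd_b1 to 0 would force every subject onto a common slope, which is exactly the homogeneity that adding the random slope relaxes.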

Pattern of Missing Data

In the statistical literature, missing completely at random (MCAR) and missing at random (MAR)^{22} are 2 common missing-data mechanisms in the context of GEE and MEM. Data are MCAR if the occurrence of missing data is independent of both observed and unobserved outcomes. For example, data missing from a patient who has dropped out of a longitudinal trial because he/she has relocated are considered MCAR; this ‘missingness’ has nothing to do with either the treatment or the outcome. Alternatively, when the occurrence of missing data depends only on the observed outcomes (and not on the unobserved values), data are considered MAR. For example, when a patient drops out of a trial due to treatment-related adverse effects, any data missing for this patient are classified as MAR. The latter is considered a more serious kind of ‘missingness,’ so special methodological adjustments must be made for data with this issue.
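The distinction between the two mechanisms can be made operational in a short sketch (hypothetical data and thresholds; the dropout rule is an assumption for illustration). Under MCAR each observation is dropped with a fixed probability; under MAR a visit is dropped whenever the previously *observed* outcome crossed a threshold, so missingness depends on observed data only:

```python
import random

random.seed(7)

def apply_mcar(rows, p=0.2):
    """MCAR: every observation is missing with the same probability p,
    regardless of any observed or unobserved value."""
    return [[None if random.random() < p else v for v in row] for row in rows]

def apply_mar(rows, threshold=7.0):
    """MAR: a visit is missing whenever the previous observed outcome
    exceeded a threshold (eg, dropout after a bad pain score)."""
    out = []
    for row in rows:
        new = [row[0]]                       # baseline is always observed
        for j in range(1, len(row)):
            prev = new[j - 1]
            new.append(None if (prev is not None and prev > threshold)
                       else row[j])
        out.append(new)
    return out

rows = [[5.0, 6.0, 7.5, 8.0], [5.0, 5.5, 6.0, 6.5]]
mar_rows = apply_mar(rows)
# Subject 1's final visit is dropped because the observed 7.5 exceeded 7.0;
# subject 2, whose observed values stayed low, remains fully observed.
```

Because the MAR dropout is driven entirely by observed values, likelihood-based methods such as MEM can still give valid inferences; the MCAR mask carries no such dependence at all.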

Clinical Data Example and Analysis

A study^{7} conducted in the Department of Anesthesiology at the Hospital for Special Surgery and published in a 2010 issue of RAPM is used for illustration. Thirty-four patients undergoing unilateral total knee arthroplasty (TKA) under tourniquet ischemia were enrolled and randomized 50:50 to either an episode of limb preconditioning before induction of ischemia for surgery or to a control group with no preconditioning. C-reactive protein (CRP) level and postoperative pain scores were the 2 outcomes of interest. CRP, a marker of inflammation, was measured at baseline and at 6, 12, and 24 hours postoperatively; CRP serves as the continuous outcome in this example. A median pain score for each patient was also obtained for every 6-hour interval during the first 48 hours postoperatively. We convert the pain score to a binary variable for this example: the pain indicator is ‘1’ if a patient’s median pain score at a given time point is greater than 0, and ‘0’ in all other cases (implying no pain).
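The dichotomization rule just described is a one-line transformation. The scores below are hypothetical stand-ins, not values from the study:

```python
def pain_indicator(median_score):
    """Binary pain indicator: 1 if the median pain score exceeds 0, else 0."""
    return 1 if median_score > 0 else 0

# Hypothetical median pain scores for one patient, one per 6-hour interval
# over the first 48 hours (8 intervals):
scores = [0, 2, 3, 1, 0, 0, 1, 0]
indicators = [pain_indicator(s) for s in scores]  # -> [0, 1, 1, 1, 0, 0, 1, 0]
```

The resulting 0/1 sequence is the binary longitudinal outcome analyzed with the logit link in GEE and with MEM below.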

This example illustrates longitudinal data analysis for a continuous outcome (ie, CRP) and a binary outcome (ie, pain). There were also missing CRP and pain scores at various time points for 7 and 26 patients, respectively. For illustrative purposes, only time, treatment group, and the interaction between time and treatment were included in the models. All 3 methods were used to model the continuous outcome (CRP), but only GEE and MEM were used to model the binary outcome (pain) because RM-ANOVA cannot handle non-continuous outcomes. All statistical analyses were performed in SAS version 9.2 (SAS Institute, Cary, NC).

Simulated Data Generation, Analysis, and Reporting

Guided by the literature review, we generated data using 2 sample size settings [low (~8 per group) and moderate (~20 per group)] with 4 repeated measurements. We induced scenarios with complete data and with incomplete data for 20% of the subjects at different time points. We contrasted the operating characteristics of the 3 methods in terms of empirical Type I error and power to detect a significant interaction effect between treatment and time. Mean outcomes for the treatment and control group over time are represented by linear trends (). Data were generated for the Type I error analysis by assuming the slopes of the treatment and control groups were equal (ie, no interaction effect, ), and for the power analysis by assuming the 2 slopes were different (ie, interaction effect exists, ). For illustrative purposes, we chose St=1.55 and Sc=1 in order to have a power of 80% with complete data for a moderate sample size (n=20). With St=1.55 and Sc=1, we mimicked a clinical study in which the mean outcome of 1 group increased 55% faster per time unit than that of the other. To further evaluate the impact of the number of repeated measures (r) and sample size (n) on power, we simulated additional data over a wide range of n (8, 20, 30, 40, 50) and r (4, 6, 8, 10, 12). Data generation and statistical analyses were performed in R (R Foundation for Statistical Computing, Vienna, Austria) and SAS version 9.2 (SAS Institute, Cary, NC), respectively. Statistical significance was set at 0.05.
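The logic of the simulation study can be sketched compactly. The Python version below is a simplified stand-in for the paper's procedure (which was run in R and SAS): the residual SD, the two-stage analysis (per-subject OLS slopes followed by a two-sample z-test on mean slopes), and the normal critical value are all assumptions made for illustration, not the interaction tests actually compared in the paper.

```python
import math
import random
from statistics import NormalDist

random.seed(42)

def ols_slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

def simulate_once(n, st, sc, times=(0, 1, 2, 3), sd=1.0):
    """One trial: generate n subjects per group with true slopes st and sc,
    estimate each subject's slope, then z-test the group mean slopes."""
    slopes_t = [ols_slope(times, [st * t + random.gauss(0, sd) for t in times])
                for _ in range(n)]
    slopes_c = [ols_slope(times, [sc * t + random.gauss(0, sd) for t in times])
                for _ in range(n)]
    mt, mc = sum(slopes_t) / n, sum(slopes_c) / n
    pooled_var = (sum((s - mt) ** 2 for s in slopes_t)
                  + sum((s - mc) ** 2 for s in slopes_c)) / (2 * n - 2)
    z = (mt - mc) / math.sqrt(pooled_var * 2 / n)
    return abs(z) > NormalDist().inv_cdf(0.975)   # reject H0 at alpha = 0.05?

def rejection_rate(n, st, sc, reps=400):
    """Empirical rejection rate over repeated simulated trials."""
    return sum(simulate_once(n, st, sc) for _ in range(reps)) / reps

type_i = rejection_rate(20, 1.00, 1.00)  # equal slopes: empirical Type I error
power  = rejection_rate(20, 1.55, 1.00)  # unequal slopes: empirical power
```

Under equal slopes the rejection rate estimates the empirical Type I error and should sit near the nominal 0.05; under St=1.55 vs Sc=1 it estimates power, whose exact value depends on the assumed residual SD.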