Participants analyzed actual and simulated longitudinal data from the Framingham Heart Study for various metabolic and cardiovascular traits. The genetic information incorporated into these investigations ranged from selected single-nucleotide polymorphisms to genome-wide association arrays. Genotypes were incorporated using a broad range of methodological approaches including conditional logistic regression, linear mixed models, generalized estimating equations, linear growth curve estimation, growth modeling, growth mixture modeling, population attributable risk fraction based on survival functions under the proportional hazards model, and multivariate adaptive splines for the analysis of longitudinal data. The specific scientific questions addressed by these different approaches also varied, ranging from a more precise definition of the phenotype, bias reduction in control selection, and estimation of effect sizes and genotype-associated risk, to direct incorporation of genetic data into longitudinal modeling approaches and the exploration of population heterogeneity with regard to longitudinal trajectories. The group reached several overall conclusions: 1) The additional information provided by longitudinal data may be useful in genetic analyses. 2) The precision of the phenotype definition as well as control selection in nested designs may be improved, especially if traits demonstrate a trend over time or have strong age-of-onset effects. 3) Analyzing genetic data stratified for high-risk subgroups defined by a unique development over time could be useful for the detection of rare mutations in common multi-factorial diseases. 4) Estimation of the population impact of genomic risk variants could be more precise. The challenges and computational complexity demanded by genome-wide single-nucleotide polymorphism data were also discussed.
The identification of genetic risk factors is commonly based on case-control or cross-sectional phenotype data in mixed-age population samples. These study designs, however, do not allow for exploration of genetic and environmental risk factors that might influence the development of traits over time. If the genetic risk factors are sensitive to time effects, the assignment of case and control status in genetic association studies may be biased.
Longitudinal data analysis in genetic studies is a unique, although rarely implemented, strategy that provides several advantages and challenges. The longitudinal design provides additional information regarding time of onset and therefore may allow for a more precise definition of case and control status in association analysis. The longitudinal information is particularly valuable for traits with variable age of onset and for traits that are heterogeneous with regard to development over time. Longitudinal studies also allow for the prospective measurement of time-varying factors that are not typically included in genetic case control association studies focused solely on genetic effects, even though interactions with time-varying environments may be important. This implies that time-variant covariates can be measured with improved precision in longitudinal data analyses. Finally, the genetic trait of critical interest may not be the occurrence of an event, but rather the trajectory of performance or decline over time. Longitudinal designs are uniquely able to capture this type of genetic trait via trajectory classes or quantitative slopes.
Besides design issues that address phenotype definition and population heterogeneity, longitudinal analysis also provides several advantages for the statistical analysis of genetic data. Instead of binary or cross-sectional quantitative traits, intercepts or slopes of repeated measures can be used as quantitative phenotypes, which may capture certain characteristics of population subgroups with unique genetic or environmental risk factors. The analysis could capitalize on non-linear shapes of the growth curves, knot-points, or other features of trajectories that may capture variability or stability over time.
In the Genetic Analysis Workshop 16 (GAW16), participants in Group 14 explored the advantages and challenges of a design that integrated either actual or simulated longitudinal phenotype data of the Framingham Heart Study (FHS) with genome-wide single-nucleotide polymorphism (SNP) data. The participating groups took a variety of approaches capitalizing on the additional information provided by this longitudinal study design. The approaches included use of incidence-density sampling rather than cross-sectional case-control selection, linear mixed modeling [McLean et al., 1991], generalized estimating equations [Zeger and Liang, 1986], multivariate linear growth curve modeling [Duncan et al., 1999], growth mixture modeling (GMM) [Muthén, 2004; Muthén and Asparouhov, 2008], population attributable risk fraction (PAF) based on survival functions under the proportional hazards model, and multivariate adaptive splines for the analysis of longitudinal data (MASAL) [Zhang, 1997; Zhang, 2004]. These contributions will be described in the rest of the paper.
The Framingham Heart Study (FHS) is the oldest prospective longitudinal cohort study of cardiovascular risk factors in the U.S. [Cupples et al., 2007]. The original FHS cohort included presumably unrelated healthy individuals from Framingham, MA. A total of 5,209 subjects (Original Cohort, Cohort 1) were recruited between 1948 and 1953 and were followed by biennial exams for potential cardiovascular risk factors. The FHS was extended (Offspring Cohort, Cohort 2) to ascertain children and spouses of the Original Cohort, enrolling an additional 5,124 individuals between 1971 and 1975. Finally, 4,095 children of the Offspring Cohort (the grandchildren of the original participants; Generation 3, Cohort 3) extended the study over three generations. Participants were examined at regular intervals; however, because the study spanned three generations, the age at each exam was not uniform. For GAW16, longitudinal data were available only for Cohort 1 and Cohort 2. In contrast to other cohort studies, the participants of the FHS in the combined set of cohort collections are related to one another. Indeed, some families even consist of several hundred individuals. Each investigative team that participated in Group 14 selected different subsets of the data to address their specific hypothesis of interest, although all topics were related to longitudinal data analysis. To highlight their contributions, Group 14 participants' names are given in bold throughout.
In this paper, we will review how the Group 14 investigators applied general longitudinal data analyses: in all individuals in Cohorts 1 and 2 [Luan et al., 2009; Zhu et al., 2009], male individuals from Cohort 1 and Cohort 2 [Kerner and Muthén, 2009], incident cases and controls [Fradin and Fallin, 2009], unrelated individuals from Cohort 2 [Roslin et al., 2009; Yan et al., 2009], and select persons from Cohort 2 and Cohort 3 [Park et al., 2009] (Table I). Lastly, one group used simulated data with known results in order to compare the power of three different methodological approaches [Chang et al., 2009].
A key issue in the analysis of FHS data is the modeling of the non-independence among relative pairs. Four of the eight Group 14 contributions directly modeled the family structure of the data [Kerner and Muthén, 2009; Luan et al., 2009; Park et al., 2009; Zhu et al., 2009]. Zhu et al. split the families into sibships to reduce computational complexity. Three groups selected only unrelated individuals [Fradin and Fallin, 2009; Roslin et al., 2009; Yan et al., 2009].
Integrating longitudinal data analysis with genome-wide SNP data caused computational complexities and challenges that were addressed by the different groups in unique ways (Table I). Several groups selected only SNPs or chromosomes with prior evidence of association to the trait of interest to demonstrate their approach to longitudinal data [Chang et al., 2009; Fradin and Fallin, 2009; Kerner and Muthén, 2009; Luan et al., 2009; Yan et al., 2009]. Three groups used genome-wide SNP data [Park et al., 2009; Roslin et al., 2009; Zhu et al., 2009]. The traits included in the analysis covered a wide range of metabolic and cardiovascular traits measured on FHS participants including systolic blood pressure (SBP), type 2 diabetes mellitus, coronary artery calcification, body mass index (BMI), weight, metabolic syndrome (MBS), high-density lipoprotein levels (HDL), low-density lipoprotein levels (LDL), fasting triglyceride levels (TG), and coronary heart disease (CHD) (Table I). The data were integrated either in a one-step approach, in which genotype and phenotype data were modeled simultaneously, or in a two-step approach that first modeled the phenotype and then integrated the genotype information in an independent analysis (Table I). All groups incorporated covariates into their analyses.
Several computer programs were used for the analysis including SAS [Chang et al., 2009; Fradin and Fallin, 2009; Luan et al., 2009; Park et al., 2009], Mplus [Chang et al., 2009; Kerner and Muthén, 2009; Roslin et al., 2009], PLINK 9 [Roslin et al., 2009], GOLDENHELIX [Kerner and Muthén, 2009], Stata [Luan et al., 2009], and MASAL [Zhu et al., 2009] (Table I).
The specific scientific focus of each group varied, with two overarching themes: more precise definition of the genetic phenotype in a longitudinal context and application of various statistical approaches for assessing the relationship between genes and longitudinal phenotypes. The particular topics included assessment of changes in cross-sectional genetic effects over time, bias reduction in control selection using incidence density sampling, direct incorporation of genetic data in longitudinal modeling approaches, and the exploration of population heterogeneity with regard to longitudinal trajectories.
Because every group used a different analytical approach and a unique design to answer a very specific scientific question, a comparison of the methods and the results across projects was not feasible. We therefore will describe each study separately highlighting the significant results.
Park et al. focused on the comparison of genetic association results for the synthetic trait MBS at multiple visits over time via cross-sectional association analyses. The investigators found that the significance of the SNP association increased over time. They were able to replicate some of the significant associations previously reported by FHS and identified new loci as well (Table II). They emphasized that attention to the timing of case-control analysis and to consistency in the trend of effects over time are important advantages of longitudinal data.
Fradin and Fallin compared two commonly employed methods of selecting age-matched controls for nested case-control studies. In case exclusion sampling (CE), both prevalent and incident cases are included, but only those free of disease at censoring are used as controls. In incidence density sampling (ID), only incident cases are included, and controls are selected from all participants remaining at risk for the disease at the time a case occurs. The latter allows those who develop disease at a later time to act as controls for cases occurring earlier in the study, which has been shown to be less biased than the case-exclusion approach in other areas of epidemiology. Using both the FHS data and simulated data, they demonstrated that ID sampling was indeed associated with less bias in odds-ratio estimates than CE, although CE appeared to be more powerful due to the upward bias of its point estimates. They concluded that ID sampling was an appropriate option for nested case-control genome-wide association studies and that this design could be a very efficient approach to obtain unbiased estimates of the relative risk associated with genetic variants, especially when age is a strong risk factor for a disease phenotype. However, the small number of incident cases in the study limited this approach.
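The difference between the two control-selection schemes can be made concrete with a minimal sketch. It is not the authors' implementation: the record fields (`id`, `onset`, `exit`), the function names, and the example subjects are all hypothetical, and age matching is omitted for brevity. Under ID sampling, the risk set for each incident case contains everyone still under follow-up and disease-free at the case's onset time, including people who become cases later.

```python
import random

def incidence_density_sample(subjects, n_controls=1, seed=0):
    """Risk-set sampling. Each subject is a dict with hypothetical keys:
    'id', 'onset' (time of disease onset, None if never diseased),
    and 'exit' (end of follow-up; equal to 'onset' for cases)."""
    rng = random.Random(seed)
    cases = sorted((s for s in subjects if s["onset"] is not None),
                   key=lambda s: s["onset"])
    matched = []
    for case in cases:
        t = case["onset"]
        # Risk set at time t: still followed at t and not yet diseased at t.
        # Future cases qualify as controls for earlier cases.
        risk_set = [s for s in subjects
                    if s["id"] != case["id"]
                    and s["exit"] >= t
                    and (s["onset"] is None or s["onset"] > t)]
        controls = rng.sample(risk_set, min(n_controls, len(risk_set)))
        matched.append((case["id"], [c["id"] for c in controls]))
    return matched

def case_exclusion_controls(subjects):
    """Case-exclusion sampling: only subjects who are disease-free
    at the end of their follow-up are eligible as controls."""
    return [s["id"] for s in subjects if s["onset"] is None]

# Hypothetical toy cohort: A and B become cases, C and D never do.
subjects = [
    {"id": "A", "onset": 50, "exit": 50},
    {"id": "B", "onset": 60, "exit": 60},
    {"id": "C", "onset": None, "exit": 70},
    {"id": "D", "onset": None, "exit": 45},
]
pairs = incidence_density_sample(subjects, n_controls=2)
```

In this toy cohort, subject B is eligible as a control for case A because B is still disease-free at time 50, whereas subject D has already left follow-up by then. Case-exclusion sampling would instead draw controls only from C and D.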
Yan et al. focused on the population effect of candidate polymorphisms for CHD in the FHS Cohort 2 and the change of the population effect over time. They estimated the time-dependent PAF for each SNP using the Cox proportional hazards model and then estimated the PAF for all significant SNPs combined. Because conventional PAF estimation does not account for age-of-onset data, this group extended the PAF analysis to create a more comprehensive estimate of population impact over the life-course of disease by incorporating the age of onset of CHD into the formulation of the hazard function. They also explored the association with a risk score, which was constructed by summing the number of risk alleles across three CHD susceptibility SNPs (rs1333049 close to the CDKN2A/2B gene, rs618675 in the GJA4 gene, and rs1376251 in the TAS2R50 gene) out of 23 that had been significantly associated with incident CHD (Table II). The risk score was significantly associated with incident CHD risk (p=0.0004). They concluded that this novel tool for population impact may improve the understanding of genetic risk factors at the population level.
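The risk score described above is a simple allele count, which a short sketch can illustrate. The three SNP identifiers come from the text; the risk alleles assigned to them here are placeholders chosen for illustration only, since the text does not state which allele at each SNP confers risk, and the function name and genotype encoding are likewise assumptions.

```python
# Placeholder risk alleles (assumption): the text names the SNPs
# but does not say which allele at each locus is the risk allele.
RISK_ALLELE = {
    "rs1333049": "C",
    "rs618675": "T",
    "rs1376251": "A",
}

def risk_score(genotypes):
    """Sum the number of risk alleles carried across the three SNPs.

    genotypes maps each SNP identifier to a two-character genotype
    string such as 'CG'. The score ranges from 0 to 6.
    """
    return sum(genotypes[snp].count(allele)
               for snp, allele in RISK_ALLELE.items())

# Hypothetical individual: two, one, and zero risk alleles respectively.
example = {"rs1333049": "CC", "rs618675": "CT", "rs1376251": "GG"}
score = risk_score(example)
```

An unweighted count of this kind treats every risk allele as contributing equally; the resulting score can then be entered as a single covariate in a survival model.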
Zhu et al. explored trajectories as a genetic phenotype in a genome-wide modeling approach in independent individuals using MASAL [Zhang, 1997; Zhang, 2004]. This approach uses a nonparametric regression (forward/backward regression) to estimate regression coefficients for gene, environment, time, and interaction effects. The method can accommodate time-varying covariates and can test for interactions between genes and environmental factors as well as between time and covariates. The group identified 13 significant SNP associations for Cohort 2 and 6 SNP associations for Cohort 1 (Table II), as well as significant SNP-SNP and SNP-environment interactions. However, the significant SNPs in the Cohort 1 and Cohort 2 samples did not overlap.
Luan et al. took a candidate gene approach to the longitudinal data, estimating the SNP effects on BMI and weight over time. The group fitted a mixed effects model and compared the intercepts and slopes of the growth curves in groups stratified by the genotypes of known risk alleles. This analysis confirmed the effect of the risk alleles of the SNPs in cross-sectional data (intercept results) and demonstrated a significant effect of the SNPs on the slopes of the developmental trajectories as well.
Roslin et al. attempted to incorporate the information provided by the longitudinal design into genome-wide SNP association analyses. For this approach they first estimated intercepts and slopes for four different traits in separate multivariate growth models of: 1) SBP, 2) HDL, 3) LDL, and 4) TG. Due to the computational resources required to process data for millions of SNP genotypes, the group followed a two-step design, first estimating the individual intercepts and slopes of the multivariate growth curves and then regressing these (scalar) features on the SNP genotypes. Slopes and intercepts were regarded as separate traits in eight separate linear regression genetic analyses. The group found a significant association of SNP rs599839 on chromosome 1p13 with the intercept of the trait LDL but not with the slope. Marker rs765547 was associated with both the HDL slope and the TG slope (Table II). This marker is located about 41 kb downstream from the gene lipoprotein lipase (LPL) on chromosome 8p21.
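The two-step design can be sketched in a few lines. This is a simplified stand-in, not the group's analysis: it replaces the multivariate growth models with an ordinary least-squares line fitted separately to each individual, assumes an additive genotype coding (0, 1, or 2 copies of an allele), and uses hypothetical function names and data.

```python
def ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def two_step(trajectories, genotypes):
    """trajectories: id -> (times, values); genotypes: id -> allele count.

    Step 1 summarizes each individual's repeated measures by an
    intercept and a slope. Step 2 regresses each summary on genotype.
    Returns the per-allele effect on the intercept and on the slope.
    """
    # Step 1: per-individual growth features (genotype is not used here).
    features = {pid: ols(times, values)
                for pid, (times, values) in trajectories.items()}
    ids = sorted(features)
    g = [genotypes[pid] for pid in ids]
    # Step 2: the intercept and the slope are treated as separate traits.
    _, beta_intercept = ols(g, [features[pid][0] for pid in ids])
    _, beta_slope = ols(g, [features[pid][1] for pid in ids])
    return beta_intercept, beta_slope

# Hypothetical data: three individuals measured at three time points.
trajectories = {
    "p1": ([0, 1, 2], [100, 101, 102]),
    "p2": ([0, 1, 2], [102, 104, 106]),
    "p3": ([0, 1, 2], [104, 107, 110]),
}
genotypes = {"p1": 0, "p2": 1, "p3": 2}
beta_intercept, beta_slope = two_step(trajectories, genotypes)
```

Because step 1 never touches the genotype data, it runs once per trait, and only the inexpensive step 2 has to be repeated for every SNP. The trade-off is that uncertainty in the estimated intercepts and slopes is ignored in step 2.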
Chang et al. used the genotype information of previously associated SNPs to predict latent class growth curves and latent class membership (the probability that a genotype affects the probability of the trajectory component membership). In order to accomplish this goal, they compared three different approaches: 1) the likelihood ratio test statistic (LRTS), 2) a direct test of genetic model coefficients, and 3) the chi-square test classifying subjects based on the trajectory model's posterior Bayesian probability. The group found that the LRTS was not usable due to non-normal distribution of the outcome and non-independence of the individuals in the study. The other two tests were satisfactory. Power was still substantial when markers near the gene rather than the gene itself were used. For markers near the actual gene, there was somewhat greater power for the direct test of the coefficients and lesser power for the posterior Bayesian probability chi-square test. Time-varying covariates did not increase overall power due to instability.
Kerner and Muthén explored heterogeneity in the data with respect to SBP trajectories over time using GMM in Mplus. In order to incorporate genotype information into the model, they took a two-step approach. First, they estimated the class membership probability based on SBP development over time; then they used the latent class membership probability as the phenotype in a quantitative trait analysis. This approach allowed for a more homogeneous classification of individuals with regard to the phenotype, and it facilitated the identification of an association with a rare coding variant in a subgroup of individuals. They concluded that GMM could be a useful approach to phenotype heterogeneity in mixed-age samples for traits with strong age effects.
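The quantity carried from the first step into the second, the posterior probability of class membership, follows from Bayes' rule once the mixture parameters are known. The sketch below is a deliberately reduced illustration rather than a growth mixture model: it assumes each individual's trajectory has already been summarized by a single number (for example an SBP slope), assumes normally distributed classes, and takes the class weights, means, and standard deviations as given, whereas a GMM estimates the trajectory parameters and class memberships jointly. All names and parameter values are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_class_probs(x, classes):
    """Posterior probability of each latent class given the summary x.

    classes is a list of (weight, mean, sd) tuples, one per class,
    with weights summing to 1. The result is proportional to
    weight * likelihood, normalized to sum to 1.
    """
    joint = [w * normal_pdf(x, m, s) for w, m, s in classes]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical two-class solution: a stable class and a rising class.
classes = [(0.5, 0.0, 1.0), (0.5, 2.0, 1.0)]
probs = posterior_class_probs(1.0, classes)  # midway between the two means
```

The posterior probability for the class of interest can then serve as a quantitative phenotype in a standard association test, in the spirit of the second step described above.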
GAW16 participants revisited the FHS data for their analysis and participants of Group 14 in particular focused on the longitudinal aspect of the data. In previous workshops, the longitudinal focus had been on the use of intercepts and slopes as phenotypes in genetic analysis [Gauderman et al., 2003]. This year's contributions extended these approaches to include a variety of research questions for which longitudinal data could be useful. This exercise revealed several challenges, often related to handling large amounts of phenotype and genotype data and to limited numbers of events for many phenotypes. As the amount of genetic data in longitudinal studies available to researchers continues to increase, the challenges in attempting to incorporate longitudinal data analysis with genetic analysis will become increasingly critical.
Advantages of longitudinal data analysis include the possibility of finding more homogeneous groups of individuals that share a common trajectory. Under this scenario, mean effects appear to be easier to detect than slopes or other trajectory features, and the longitudinal approach, borrowing information on individuals across time, may improve precision over analyses of a single visit. Longitudinal data analysis is advantageous for detecting genes that affect trajectories, rather than simple differences in phenotype values, because the trajectory information provides a more specific trait for genetic analysis aimed at detecting such genes. Longitudinal data are also useful for identifying genetic causes that have strong age-of-onset effects, because age at disease onset is a more specific phenotype for genetic analysis in this context than cross-sectional information about disease status. Certain traits show strong heterogeneity in their underlying pathophysiology depending on the age of onset. SBP is a classic example. High blood pressure early in life often has its cause in renal pathology, whereas elevated blood pressure later in life is more commonly caused by cardiovascular changes. Attention to the timing and trajectory of a phenotype can help to clarify these distinctions.
Even though the longitudinal cohort study design has clear advantages for some questions, there are limitations. For example, analyses that focused on incident cases in cohort studies were underpowered due to a small number of events. Further, the results presented within this analysis group show the current lack of clear analytic strategies to deal with complex longitudinal data structures. No group was able to capitalize on all the data available, in particular to integrate the large number of genotypes in the context of both repeated measures and family relationships. Future work should focus on the development of analytical methods and computer software that can handle these longitudinal data in the context of other complexities that are often found in cohort studies. The lack of software solutions that can handle millions of data points in a practical amount of time led many groups to a two-step design in which the phenotype data were analyzed separately from the genotype data. For example, the Mplus analyses performed by Kerner and Muthén, which allow inclusion of up to 50 auxiliary variables, would have taken approximately 20,000 days (~55 years) for a one-step analysis of 500,000 markers on a 64-bit 8-core machine. The two-step solution is far from ideal, however, and potentially biased. Other groups limited the analysis to only a small number of selected SNPs, assumed to be genetic risk factors for a given trait or phenotype. This design is also not ideal because it excludes the majority of SNPs that could potentially be important as risk factors in a particular study sample.
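The run-time figure quoted above can be unpacked with a short back-of-the-envelope calculation. The sketch assumes that a one-step analysis would process the markers in batches of 50, one batch per model run, which is an interpretation of the 50-auxiliary-variable limit rather than something stated in the text; the per-run time derived from it is therefore an inference, not a reported number.

```python
markers = 500_000      # markers in the genome-wide panel (from the text)
batch_size = 50        # auxiliary variables per run (from the text)
total_days = 20_000    # quoted total run time (from the text)

# Assumption: one model run per batch of 50 markers.
runs = markers // batch_size        # number of model runs required
days_per_run = total_days / runs    # implied time for a single run
years = total_days / 365.25         # total run time expressed in years

print(f"{runs} runs, {days_per_run:.1f} days per run, about {years:.0f} years")
```

Under this reading, each run would take on the order of two days, so the total cost is driven by the 10,000 repetitions rather than by any single model fit.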
The GAW16 data sets offered a unique opportunity to explore many approaches to longitudinal data analysis. From the diversity of strategies applied and evaluated in GAW16, especially those applied and evaluated as part of Group 14, it became apparent that there are many important methodological approaches available for implementing longitudinal data analysis. Unfortunately, the wide array of analyses performed in Group 14 did not allow for direct comparison of the different approaches. Future studies could make a more targeted effort to evaluate the approaches and their usefulness under specific design conditions. Taken together, Group 14 contributions demonstrate the opportunities provided by longitudinal data and highlight the need for a combination of strategies to implement longitudinal data analysis.
We thank all Group 14 participants for discussion during the GAW16 meeting and the GAW16 organizers for providing this opportunity. We also thank NIH for funding of this important exercise and meeting (R01 GM031575). Dr. Kerner's work on this manuscript was partially funded by NIMH K08 MH074057.
Conflict of Interest: Dr. Kerner is a close collaborator and co-author with Bengt Muthén, who is the creator and distributor of the commercially available computer software program Mplus.