The authors compared dietary pattern methods—cluster analysis, factor analysis, and index analysis—with colorectal cancer risk in the National Institutes of Health (NIH)–AARP Diet and Health Study (n = 492,306). Data from a 124-item food frequency questionnaire (1995–1996) were used to identify 4 clusters for men (3 clusters for women), 3 factors, and 4 indexes. Comparisons were made with adjusted relative risks and 95% confidence intervals, distributions of individuals in clusters by quintile of factor and index scores, and health behavior characteristics. During 5 years of follow-up through 2000, 3,110 colorectal cancer cases were ascertained. In men, the vegetables and fruits cluster, the fruits and vegetables factor, the fat-reduced/diet foods factor, and all indexes were associated with reduced risk; the meat and potatoes factor was associated with increased risk. In women, reduced risk was found with the Healthy Eating Index-2005 and increased risk with the meat and potatoes factor. For men, beneficial health characteristics were seen with all fruit/vegetable patterns, diet foods patterns, and indexes, while poorer health characteristics were found with meat patterns. For women, findings were similar except that poorer health characteristics were seen with diet foods patterns. Similarities were found across methods, suggesting basic qualities of healthy diets. Nonetheless, findings vary because each method answers a different question.
Cluster analysis, factor analysis, and index analysis use distinct statistical approaches to approximate dietary patterns. Experts have recommended comparing these methods in relation to a disease outcome to better understand the different patterns, but such investigation has been limited (1–4).
To address this gap, we designed a comparison of the 3 most common dietary pattern methods—cluster analysis, factor analysis, and index analysis—with colorectal cancer within the same cohort, the National Institutes of Health (NIH)–AARP Diet and Health Study (n = 492,306). To our knowledge, such a comparison has not been done before. We planned 4 analyses: 1) cluster analysis (5), 2) factor analysis (6), and 3) index analysis (7) to separately investigate colorectal cancer risk and 4) a comparative analysis of the 3 approaches. Our goal here is to compare the findings from the earlier work side-by-side, illustrate if/how individuals are categorized into different patterns, and examine the health behavior characteristics within the pattern groups.
Table 1 highlights the different key questions, distinguishing features, and strategies used to assess the risk associated with each method. Cluster analysis and factor analysis are broadly categorized as “data-driven” approaches that derive a posteriori patterns, while index analysis is an “investigator driven” approach that creates patterns based on a priori decisions. Patterns identified via cluster analysis and factor analysis are influenced by the given population and an investigator-driven food-grouping strategy, while index analysis patterns are influenced by an investigator-driven schema and food-grouping strategy. Although cluster analysis and factor analysis are both empirical methods, they are distinct in their approaches. Cluster analysis finds people who share similar frequency patterns for consumption of foods, whereas factor analysis finds foods that are correlated and then scores people based on the degree to which their diets show the same pattern of variation. Index-based analysis, however, imposes an external structure and assesses the degree to which individuals fit within it. Additional details about each method are included below.
In cluster analysis, the k-means method identifies aggregates of individuals in multidimensional space, where each food variable constitutes an axis (using the squared Euclidean distances between observations to determine cluster position). Each individual is positioned in space on the basis of intake of numerous foods. Food choices common to all contribute less to cluster formation than those choices made by some and not by others. The interindividual variation of food variables within a defined cluster is no longer considered once the cluster position is established, despite the fact that the variation of intake within some clusters is greater than within others. Thus, no differentiation is made among individuals within the same cluster who have somewhat dissimilar diets; that is, there is no gradient.
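The assignment and update steps of k-means can be illustrated with a minimal sketch. This is not the procedure used in the study: the data are hypothetical standardized intakes on 2 food axes, the function names are our own, and the starting centroids are supplied by hand rather than chosen by any statistical package.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two intake vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, initial_centroids, max_iterations=100):
    """Plain k-means: alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in initial_centroids]
    k = len(centroids)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each person joins the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical standardized intakes: (vegetables, red meat) for six people.
people = [(1.0, -1.0), (1.2, -0.8), (0.9, -1.1),
          (-1.0, 1.0), (-1.1, 0.9), (-0.9, 1.2)]
labels, centroids = k_means(people, initial_centroids=[people[0], people[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Every member of a cluster receives the same label regardless of how far that person sits from the centroid, which is the absence of a gradient described above.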
Factor analysis (or principal component analysis) examines the correlation matrix of food variables and searches for underlying traits (or factors) that explain most of the variation in the data. Thus, a large number of food variables are reduced to a smaller set of variables that capture the major dietary traits in the population. Commonly, the emerging factors are adjusted by using an “orthogonal rotation” so that the final factors are uncorrelated. For each factor, scores are obtained that define the position of each individual along a gradient.
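As a rough illustration of how a factor yields a score gradient, the sketch below extracts only the first principal component of a small correlation matrix by power iteration and scores each person on it. It is a simplification under stated assumptions: the intakes are hypothetical, no orthogonal rotation is applied, and the helper names are ours rather than those of any statistical package.

```python
import math


def standardize(column):
    """Rescale one food variable to mean 0 and (sample) SD 1."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [(x - mean) / sd for x in column]


def correlation_matrix(z_columns):
    """Correlation matrix of already standardized columns."""
    n = len(z_columns[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in z_columns]
            for ci in z_columns]


def leading_component(matrix, iterations=200):
    """Power iteration for the eigenvector with the largest eigenvalue."""
    v = [1.0] + [0.0] * (len(matrix) - 1)
    for _ in range(iterations):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical intakes for six people (arbitrary units).
foods = {
    "fruit": [1, 2, 3, 4, 5, 6],
    "vegetables": [2, 1, 4, 3, 6, 5],
    "red_meat": [6, 5, 4, 3, 2, 1],
}
z = [standardize(col) for col in foods.values()]
loadings = leading_component(correlation_matrix(z))
# Each person's score is the loading-weighted sum of standardized intakes.
scores = [sum(l * zc[i] for l, zc in zip(loadings, z)) for i in range(6)]
for name, loading in zip(foods, loadings):
    print(name, round(loading, 2))
```

In this toy example, fruit and vegetables load positively and red meat loads negatively on the component, and each person receives a continuous score rather than a group label.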
Index-based analysis uses a numerical scoring system defined on the basis of a priori knowledge. Indexes generate scores for different sets of dietary components based on the researcher's scoring approach and interpretation of dietary guidance. The individual components of the index are summed to a total score so that all participants are ranked from the minimum to maximum score. Index-based analysis allows for comparability across cohorts as the scoring is not driven by the specific population. Indexes may differ in design, structure, and interpretation of dietary guidance.
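The summing of component scores can be sketched as follows. The components, standards, and point values here are invented for illustration only and are not the scoring standards of the HEI-2005, AHEI, MDS, or Recommended Food Score; the proportional scoring rule is likewise an assumption.

```python
# Hypothetical index: name -> (intake per 1,000 kcal earning full points, maximum points).
HYPOTHETICAL_COMPONENTS = {
    "fruit_cups": (1.0, 10),
    "vegetable_cups": (1.5, 10),
    "whole_grain_oz": (2.0, 10),
}


def component_score(intake, standard, max_points):
    """Score proportionally from 0 up to max_points once the standard is met."""
    fraction = min(max(intake, 0.0) / standard, 1.0)
    return max_points * fraction


def index_score(intakes):
    """Sum the component scores into a total index score."""
    return sum(
        component_score(intakes[name], standard, max_points)
        for name, (standard, max_points) in HYPOTHETICAL_COMPONENTS.items()
    )


# Hypothetical participant: half the fruit standard, meets the other two.
participant = {"fruit_cups": 0.5, "vegetable_cups": 1.5, "whole_grain_oz": 3.0}
print(index_score(participant))  # 25.0
```

Because the standards are fixed in advance, the same function would return the same score for the same diet in any cohort, which is the comparability noted above.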
We used data from the NIH–AARP Diet and Health Study, a prospective cohort study designed to investigate diet and cancer. AARP members who were aged between 50 and 71 years and residents of 6 states (California, Florida, Louisiana, New Jersey, North Carolina, Pennsylvania) or 2 metropolitan areas (Atlanta, Georgia; Detroit, Michigan) were contacted in 1995–1996 to participate in the NIH–AARP Diet and Health Study; 18% (n = 617,119) returned the questionnaire. After reviewing surveys with satisfactory dietary data (n = 566,407), we excluded questionnaires completed by proxy (n = 15,760), respondents who reported previous cancer (n = 52,867) or end-stage renal disease (n = 997), and individuals with energy outliers, as defined by a Box-Cox transformation (n = 4,401) (8). Finally, we excluded cluster outliers (n = 76) as determined through cluster analysis, and therefore these analyses included 492,306 people (293,576 men and 198,730 women).
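The exclusion counts above can be checked with simple arithmetic; the short sketch below uses only the numbers reported in this paragraph, with variable names of our own choosing.

```python
# Counts as reported in the text.
satisfactory_dietary_data = 566_407
exclusions = {
    "completed by proxy": 15_760,
    "previous cancer": 52_867,
    "end-stage renal disease": 997,
    "energy outliers": 4_401,
    "cluster outliers": 76,
}

analytic_cohort = satisfactory_dietary_data - sum(exclusions.values())
men, women = 293_576, 198_730

print(analytic_cohort)  # 492306
print(men + women)      # 492306
```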
Study participants were followed from enrollment in 1995–1996 through December 31, 2000. Vital status was determined by annual linkage of the cohort to the Social Security Administration Death Master File on deaths in the United States, follow-up searches of the National Death Index for matched subjects, cancer registry linkage, questionnaire responses, and responses to other mailings. Incident cases of cancer were identified by probabilistic linkage between the NIH–AARP membership and 8 state cancer registry databases. In a previous analysis to study the validity of this approach, approximately 90% of all cancers were assessed (9). Further details on study design have been described elsewhere (9). The NIH–AARP Diet and Health Study was approved by the Special Studies Institutional Review Board of the National Cancer Institute.
During follow-up, we identified 3,110 incident colorectal cancer cases (2,151 in men and 959 in women). Cases were invasive and defined on the basis of International Classification of Diseases for Oncology, Third Edition, codes C180–C189, C199, C209, and C260. If multiple cancers were diagnosed in the same participant, we included the colorectal cancer case only if it was the first malignancy diagnosed during the follow-up period.
At baseline, study participants completed a 124-item food frequency questionnaire, an early version of the Diet History Questionnaire, to assess dietary intake over the past year. The Diet History Questionnaire has been calibrated (10, 11), and further validation was done with the AARP food frequency questionnaire and two 24-hour recalls within the NIH–AARP Diet and Health Study (12).
Our methods to identify dietary patterns by use of cluster analysis, factor analysis, and index analysis were the same as those described previously (5–7). Excluding the 76 study participants identified as cluster outliers in the cluster analysis did not change the factor analysis and index analysis findings. To create the clusters and factors, we used 181 food groups based on the 204 food items drawn from the food frequency questionnaire (because line items contain more than one food item, the final number of food items from the 124-item food frequency questionnaire was 204). We energy adjusted the food groups (expressed as grams per day) by dividing the intake of each food group by total energy and multiplying by 1,000 and then standardized these variables to a mean of 0 and standard deviation of 1. To construct index scores, we used the food group and nutrient variables from the food frequency questionnaire.
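The energy adjustment and standardization described above amount to two short transformations, sketched here with hypothetical intakes. The use of the sample standard deviation is an assumption, as the text does not specify which form was used.

```python
import statistics


def energy_adjust(grams_per_day, kcal_per_day):
    """Express each person's intake as grams per 1,000 kcal."""
    return [1000 * g / e for g, e in zip(grams_per_day, kcal_per_day)]


def standardize(values):
    """Rescale to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD; an assumption
    return [(v - mean) / sd for v in values]


# Hypothetical intakes of one food group for three people.
grams = [100, 200, 300]
kcal = [2000, 2000, 3000]

density = energy_adjust(grams, kcal)
z_scores = standardize(density)
print(density)  # [50.0, 100.0, 100.0]
```

The second and third people eat different absolute amounts but have the same energy-adjusted intake, so they receive the same standardized value.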
Wirfält et al. (5) identified 4 large and stable clusters for men (many foods, vegetables and fruits, fatty meats, fat-reduced foods) and 3 large and stable clusters for women (many foods, vegetables and fruits, diet foods/lean meats). For both men and women, smaller clusters were found (<10,000 individuals), but these were characterized by very specific foods and therefore were not included. Flood et al. (6) identified 3 factors for men and 3 similar factors for women: fruits and vegetables, fat-reduced and diet foods, and meat and potatoes. Reedy et al. (7) scored 4 indexes, including the Healthy Eating Index-2005 (HEI-2005) (13), the Alternate Healthy Eating Index (AHEI) (14–16), the alternate Mediterranean Diet Score (alternate MDS, modified for an American diet) (17–19), and the Recommended Food Score (20). The scoring standards are the same for men and women for all indexes except the MDS, which is based on sex-specific median intake.
We used SAS, version 8.1, software (SAS Institute, Inc., Cary, North Carolina) for statistical analyses. We defined clusters, factors, and index scores as described previously (5–7), separately for men and women. We examined the adjusted relative risks and 95% confidence intervals for colorectal cancer risk on the basis of previous analyses for cluster analysis (using the largest cluster, many foods, as the reference category); factor analysis (comparing the highest with the lowest quintiles for factor scores on each factor, quintile 5 vs. quintile 1); and index analysis (comparing the highest with the lowest quintiles for each index score, quintile 5 vs. quintile 1). We calculated the percentage of men and women from each cluster in the highest and lowest quintiles of each factor and index score. Finally, we compared health behavior characteristics for men and women in key clusters, the highest quintile for each factor, and the highest quintile for each index. The variables that we compared were as follows: energy intake (kilocalories); protein (all nutrients based on grams or milligrams per 1,000 kcal); total fat; carbohydrate; calcium; dietary fiber; folate; body mass index (18.5–24.9, 25–29, 30–34, 35–39, ≥40 kg/m2); education (less than high school, high school, some college, college graduate); smoking (never smoker, former smoker of ≤1 pack per day, former smoker of >1 pack per day, current smoker of ≤1 pack per day, current smoker of >1 pack per day); and physical activity (≥20 daily minutes reported rarely or never, 1–3 times per month, 1–2 times per week, 3–4 times per week, ≥5 times per week).
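The cross-classification of clusters by quintile can be sketched as below. This is an illustration in Python rather than the SAS code used for the analyses; the scores and cluster labels are hypothetical, and ties in score are broken by input order, which is an arbitrary choice.

```python
def quintile_ranks(scores):
    """Assign quintile 1 (lowest) through 5 (highest) by rank order."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    quintiles = [0] * n
    for rank, i in enumerate(order):
        quintiles[i] = rank * 5 // n + 1
    return quintiles


def percent_in_quintile(cluster_labels, quintiles, cluster, quintile):
    """Percentage of one cluster's members that fall in a given quintile."""
    members = [q for c, q in zip(cluster_labels, quintiles) if c == cluster]
    return 100 * sum(1 for q in members if q == quintile) / len(members)


# Hypothetical factor scores and cluster memberships for ten people.
factor_scores = [0.1, 0.5, 0.9, 1.3, 1.7, 2.1, 2.5, 2.9, 3.3, 3.7]
clusters = ["fatty_meats"] * 3 + ["many_foods"] * 3 + ["vegetables_fruits"] * 4

quintiles = quintile_ranks(factor_scores)
print(percent_in_quintile(clusters, quintiles, "vegetables_fruits", 5))  # 50.0
print(percent_in_quintile(clusters, quintiles, "vegetables_fruits", 1))  # 0.0
```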
Table 2 presents adjusted relative risks and 95% confidence intervals for colorectal cancer for men and women based on previous cluster analysis, factor analysis, and index analysis (5–7). In men, the vegetables and fruits cluster, fruits and vegetables factor, fat-reduced/diet foods factor, and all indexes (HEI-2005, AHEI, MDS, Recommended Food Score) were associated with reduced risk for colorectal cancer; the meat and potatoes factor was associated with increased risk. In women, a significantly reduced risk was found with the HEI-2005, and an increased risk was found only with the meat and potatoes factor.
Table 3 examines the percentage of men and women from each cluster in the highest quintile of each factor and index score. Fifty-seven percent of men in the vegetables and fruits cluster were classified in the highest quintile of the fruits and vegetables factor, and 48% of men in the vegetables and fruits cluster were in the highest quintile of the HEI-2005. Although 86% of men in the fat-reduced foods cluster were in the highest quintile of the fat-reduced/diet foods factor, only 37% were in the highest quintile of the HEI-2005. Again, 86% of men in the fatty meats cluster were in the highest quintile of the meat and potatoes factor, and 5% were in the highest quintile of the HEI-2005.
For women, 48% and 41% of the vegetables and fruits cluster were also in the highest quintiles for the fruits and vegetables factor and the HEI-2005, respectively. Forty-three percent of women in the diet foods/lean meats cluster were classified in the highest quintile of the fat-reduced/diet foods factor, and just 22% of women in the diet foods/lean meats cluster were also in the highest quintile of the HEI-2005.
Table 4 presents the converse relation: the percentage of men and women from each of the clusters in the lowest quintile of each factor and index score. For men, the classification in the lowest quintiles appears to be clearer than that in the highest quintiles, as just 1% of the men in the vegetables and fruits cluster are also in the lowest quintiles for the fruits and vegetables factor and the HEI-2005. This percentage is similarly low (0%–1%) with the fat-reduced foods cluster and lowest quintile of the fat-reduced/diet foods factor, as well as with the fatty meats cluster and the lowest quintile of the meat and potatoes factor. This pattern is consistent for women as well. A small percentage (3% and 2%) of the women in the vegetables and fruits cluster are also in the lowest quintile of the fruits and vegetables factor and the HEI-2005, respectively.
Tables 5 and 6 present the demographic and nutrient intake characteristics of men and women by key clusters, factors, and index scores (index scores are represented in the table by only one index, the HEI-2005, but the characteristics were consistent for all indexes; data not shown). The men in the vegetables and fruits cluster, the fat-reduced foods cluster, and the highest quintiles of the fruits and vegetables factor, fat-reduced/diet foods factor, and HEI-2005 have a similar, and generally favorable, nutrient and health behavior profile. In contrast, the men in the fatty meats cluster, the top quintile of the meat and potatoes factor, and the lowest quintile of the HEI-2005 are systematically different from the participants in the other groups. These men had less favorable health profiles; they were less likely to be nonsmokers and more likely to be overweight (reflected in their greater total energy intake and less physical activity). They also reported fewer years of education and had diets that indicated greater consumption of total fat and less calcium, fiber, and folate than those in the other groups.
The women in the vegetables and fruits cluster, as well as the women in the highest quintiles of the fruits and vegetables factor and HEI-2005, also shared favorable health behavior profiles. However, the women in the 2 so-called diet food pattern groups (diet foods/lean meats cluster and the highest quintile of the fat-reduced/diet foods factor) had a generally poor health behavior profile, similar to the women in the highest quintile of the meat and potatoes factor.
Rather than suggesting that one approach is superior, our results demonstrate that findings can vary depending on the methods used to elucidate dietary patterns, because each method is designed to answer a different question. Cluster analysis and factor analysis ask what accounts for the variation in intakes and how well those variances relate to risk, whereas index analysis asks whether variation from a predefined diet relates to risk. Nonetheless, similarities were seen across methods, suggesting some basic qualities of healthy diets.
Overall, we can summarize the evidence regarding dietary patterns and risk as follows: For men, cluster analysis, factor analysis, and index analysis come together to help us understand patterns that can reduce risk—diets rich in fruits and vegetables and diets including lower fat foods—and the evidence for patterns (based on factor analysis and index analysis) that can increase risk—diets defined by a meat and potatoes pattern. For women, the results were less consistent, as only one factor revealed increased risk (meat and potatoes factor) and one index pattern showed decreased risk (HEI-2005).
The different findings between men and women could be due to the greater heterogeneity in women's diets (9), biologic differences, increased measurement error among women (21), differences in how men and women completed the food frequency questionnaire, or other reasons. Additionally, though, we found differences in the health behavior characteristics of men and women in similar-looking patterns that might help to explain why these patterns produced different results. The women in the diet food pattern groups (defined as diet foods/lean meats cluster and fat-reduced/diet foods factor) look like “dieters,” women who are in poorer health or overweight, are trying to change their behaviors, or at least report a “good” diet. On the other hand, the men in the diet food pattern groups (defined as fat-reduced foods cluster and fat-reduced/diet foods factor) look “health-conscious.” Thus, the women and men in the diet food pattern groups had dissimilar health behavior characteristics. The “women dieters” had profiles most similar to those of individuals in the meat and potatoes factor, and the “health-conscious men” looked more like those in the fruits and vegetables pattern groups and all indexes.
Among men, however, we also saw differences in cancer risk. We did not see the same association with colorectal cancer for men in the fatty meats cluster and the meat and potatoes factor. This may be due to differences in group size, but it also reflects that, even when the health characteristics and nutrient profiles are similar, the actual foods and/or people that make up these patterns differ because they are defined by using different statistical procedures.
In cluster analysis and factor analysis, labels such as “fruits and vegetable factor” are commonly attached to factors and clusters that emerge analytically. However, similar labels can represent meaningfully different patterns. In the analyses presented by Wirfält et al. (5) and Flood et al. (6), the clusters and factors were derived separately for men and women and, despite the similar names, they are not defined by exactly the same foods, nor are they the same as clusters and factors similarly named in other studies. These methods are data driven and dependent on the intake within the population from which they are drawn. Labels help to clarify the discussion of the findings; indeed, we have used labels here regarding “fruits and vegetables,” “diet food,” and “meat” pattern groups. Using labels makes for easier presentation to an audience, but it makes less clear the fact that clusters or factors with similar or identical names may be quite different.
Other comparative work with cluster analysis and factor analysis has focused on the stability and reproducibility of clusters and factors and, to a lesser extent, on the general picture provided by the methods. Research that has compared different methods with a biomarker or health outcome includes comparisons of cluster analysis and factor analysis with plasma lipid biomarkers (22), factor analysis and reduced rank regression with biomarkers of subclinical atherosclerosis (23), and factor analysis and index analysis with plasma sex hormone concentrations (24), mortality (25), and hypertension (26). In related analyses of cluster analysis and factor analysis, Newby et al. (2) also found that some associations were significant for men and not for women (white bread cluster and lower high density lipoprotein cholesterol), some were significant when using factor analysis but not cluster analysis (sweets factor and lower high density lipoprotein cholesterol), and some were similar with cluster analysis and factor analysis (healthy pattern and lower plasma triacylglycerols). Nettleton et al. (4) found that prior information about inflammation included with reduced rank regression strengthened the ability to detect an association (no association was found for factor analysis). Although differences were found in the foods in the patterns, this did not entirely account for the lack of association when using factor analysis. This reinforces the unique information provided by different pattern analysis methods (4).
Three analyses have compared index analysis and factor analysis by using different outcomes: Fung et al. (22) found an association with index analysis (higher AHEI score and lower levels of free estradiol) but not factor analysis for plasma sex hormone concentrations; Osler et al. (23) found an association with factor analysis (prudent pattern and all-cause and cardiovascular mortality) but not index analysis; and Schulze et al. (24) found no associations with either index analysis or factor analysis for hypertension (although the third of 4 quintiles measured with a Dietary Approaches to Stop Hypertension (DASH) Index was associated with a reduced risk). Although Fung et al. (22) and Schulze et al. (24) postulate that index analysis may provide a stronger ability to find more significant effects on disease risk than factor analysis, this is likely because of the inclusion of relevant, evidence-based components within a given index (22, 24). For example, Fung et al. (22) suggest that the reason they found a relation with the AHEI and not with factor analysis may be due to the emphasis on soy in the index used. However, although an index may include a critical component, it may suffer from dilution if some dietary components are not relevant (24).
Comparisons across the methods are somewhat limited here by our decisions to define our initial food variables. Index analysis used aggregated food groups as used in food-based recommendations. However, both cluster analysis and factor analysis used single foods or minimally aggregated food groups.
Regardless of the food grouping strategy selected, we recommend using energy-adjusted variables—as we did—to account for the energy compositions of the diet rather than using variables that are derived from absolute dietary intakes. This adjustment is suggested because energy needs are determined by body size, age, physical activity, and other factors and also because diet quality is of greater interest rather than absolute intakes. Energy adjustment may also help to reduce measurement error (21), although future work is needed in this area.
The goal with dietary pattern analyses is to examine the multiple dimensions of the diet simultaneously relative to a given outcome. Thus, we consider the best way to operationalize and model the multidimensionality of the total diet. Although cluster analysis, factor analysis, and index analysis are useful and answer different questions, perhaps we should not limit ourselves to these common approaches (25). Other methods hold promise for new ways to explain the complexity of dietary data and would allow us to ask other questions: What combination of foods explains the variation in a set of intermediate health markers (reduced rank regression) (26)? What combination of foods minimizes cancer risk (neural networks) (27)? What features of the diet are most strongly associated with a reduced risk of cancer (classification and regression trees) (28)?
Dietary pattern analyses play a unique role in assessing the relations between diet and disease. Although dietary patterns have generally been shown to be more strongly related to risk of disease than individual parts of the diet (29), the World Cancer Research Fund Panel stated that there was insufficient evidence to make judgments regarding dietary patterns and cancer risk (30). Our results are consistent with their summaries for specific foods and dietary components and reinforce the Panel's recommendation that additional research be done investigating dietary patterns.
Author affiliations: National Cancer Institute, Bethesda, Maryland (Jill Reedy, Susan M. Krebs-Smith, Victor Kipnis, Douglas Midthune, Arthur Schatzkin, Amy F. Subar); University Hospital Regensburg, Regensburg, Germany (Michael Leitzmann); Lund University, Malmo, Sweden (Elisabet Wirfält); University of Minnesota, Minneapolis, Minnesota (Andrew Flood); World Cancer Research Fund, London, United Kingdom (Panagiota N. Mitrou); University of Cambridge, Cambridge, United Kingdom (Panagiota N. Mitrou); and AARP, Washington, DC (Albert Hollenbeck).
This research was supported by the Intramural Research Program of the National Cancer Institute, National Institutes of Health.
The authors gratefully acknowledge the contributions of Lisa Kahle, Leslie Carroll, and David Campbell at Information Management Services and Tawanda Roy at the Nutritional Epidemiology Branch for research assistance. Cancer incidence data from the Atlanta metropolitan area were collected by the Georgia Center for Cancer Statistics, Department of Epidemiology, Rollins School of Public Health, Emory University. Cancer incidence data from California were collected by the Cancer Surveillance Section, California Department of Health Services. Cancer incidence data from the Detroit metropolitan area were collected by the Michigan Cancer Surveillance Program, Community Health Administration, state of Michigan. The Florida cancer incidence data used in this report were collected by the Florida Cancer Data System under contract to the Department of Health. Cancer incidence data from Louisiana were collected by the Louisiana Tumor Registry, Louisiana State University Medical Center, New Orleans. Cancer incidence data from New Jersey were collected by the New Jersey State Cancer Registry, Cancer Epidemiology Services, New Jersey State Department of Health and Senior Services. Cancer incidence data from North Carolina were collected by the North Carolina Central Cancer Registry. Cancer incidence data from Pennsylvania were supplied by the Division of Health Statistics and Research, Pennsylvania Department of Health, Harrisburg, Pennsylvania.
The views expressed herein are solely those of the authors and do not necessarily reflect those of the contractor (Florida Cancer Data System) or the Florida Department of Health. The Pennsylvania Department of Health specifically disclaims responsibility for any analyses, interpretations, or conclusions.
Conflict of interest: none declared.