Dietary pattern analysis is receiving increasing attention as a means of summarising the multi-dimensional nature of dietary data. This research aims to compare principal component analysis and cluster analysis using dietary data collected from young women in the UK.
Diet was assessed using a 100-item interviewer-administered food frequency questionnaire. Principal component analysis and cluster analysis were used to examine dietary patterns.
6125 non-pregnant women aged 20 to 34 years
Principal component analysis identified two important patterns: a ‘prudent’ diet, and a ‘high-energy’ diet. Cluster analysis defined two clusters, a ‘more healthy’ and a ‘less healthy’ cluster. There was a strong association between the prudent diet score and the two clusters, such that the mean prudent diet score in the less healthy cluster was −0.73 standard deviations and in the more healthy cluster was +0.83 standard deviations; the difference in the high-energy diet score between the two clusters was considerably smaller.
Both approaches revealed a similar dietary pattern. The continuous nature of the outcome of principal component analysis was considered to be advantageous compared with the dichotomy identified using cluster analysis.
Traditionally diet has been assessed by the measurement of nutrient intake, but dietary pattern analyses (for example multiple dietary components expressed as a single exposure) may offer some benefits by summarising diet using a smaller number of variables. In particular, dietary patterns may be of greater relevance than nutrients for the education of the general public on healthier eating. Furthermore, the intakes of some foods are so highly correlated that it becomes difficult to examine their effects separately; dietary pattern analyses instead use such collinearity to advantage. Dietary patterns may also take account of the more potent effect of one food when eaten in combination with another. Factor analysis and cluster analysis have been the most commonly reported a posteriori approaches to dietary pattern analysis (Hu, 2002).
Factor analysis is a generic term that includes principal component analysis (PCA). Principal component analysis (Jolliffe and Morgan, 1992) is a statistical technique that produces new variables, each an uncorrelated linear combination of the dietary variables, chosen to maximise the explained variance. Factor analysis identifies common underlying patterns of food consumption (Hu, 2002), and is typically based on PCA in the field of dietary pattern analysis. PCA and other forms of factor analysis result in a summary score for each participant, for each pattern defined.
Kant (2004a) provides a review of dietary patterns defined by factor analysis and their associations with health outcomes. Kant notes that many studies using factor analysis have identified a pattern with higher fruit, vegetable and whole grain intake, and have chosen to name this the ‘prudent’ dietary pattern; a second theme in the literature is a pattern commonly labelled ‘Western’, indicating high fat, meat and refined grain consumption.
Associations between higher prudent diet scores and reduced all-cause mortality (Osler et al, 2001), coronary heart disease (Hu et al, 2000; Fung et al, 2001), diabetes (Williams et al, 2000) and colon cancer (Slattery et al, 1998; Fung et al, 2003) have been seen, whilst higher Western diet scores have been associated with more coronary heart disease (Hu et al, 2000; Fung et al, 2001), diabetes (van Dam et al, 2002) and colon cancer (Slattery et al, 1998). However, some studies have found no associations between eating patterns defined by PCA and disease outcomes, and inconsistencies between studies are apparent (Newby and Tucker, 2004b).
Using British data, Whichelow and Prevost (1996) identified four principal components: ‘fruit, salad, vegetable’, ‘high-starch foods, vegetables, meats’, ‘high-fat foods’ and ‘sweets, biscuits, cakes’. The ‘fruit, salad, vegetable’ pattern was inversely associated with all-cause mortality, whilst in women the ‘high-fat foods’ pattern was positively associated. Amongst four principal components identified by Williams et al (2000), the first was considered a ‘healthy balanced diet’, and was inversely associated with obesity, glucose intolerance and other features of the metabolic syndrome. In these British data a more healthy dietary pattern that could be labelled ‘prudent’ was consistently seen; however, the Western pattern was not so clearly reproduced.
Cluster analysis is an alternative technique of dietary pattern analysis used to identify mutually exclusive, homogeneous groups of participants, based on their eating habits. Kant (2004a) also reviews dietary patterns defined using cluster analysis. Many studies identified a more healthy cluster, characterised by high fruit, vegetable, whole grain and fish consumption. Eating patterns derived using cluster analysis have been associated with oesophageal and stomach cancer (Chen et al, 2002), the metabolic syndrome (Wirfält et al, 2001), and bone mineral density (Tucker et al, 2002).
Using British data, Margetts et al (1998) found two clusters, describing groups of participants with ‘more healthy’ and ‘less healthy’ diets. The National Diet and Nutrition Survey of British people aged 65 years and over was analysed by Pryer et al (2001a). Cluster analysis identified three clusters in men and women separately, including a ‘healthy’ cluster in each. The Dietary and Nutritional Survey of British Adults was similarly analysed by performing cluster analyses on men and women separately (Pryer et al, 2001b); both a ‘healthy’ cluster and a ‘healthier but sweet’ cluster were found amongst men and women. Seven clusters were identified in the UK Women’s Cohort Study (Greenwood et al, 2000).
There is limited information on the comparability of dietary patterns defined by factor analysis and cluster analysis in the same population. In studies of older US men and women, patterns defined by cluster and factor analysis were found to be comparable in predicting mortality (Kant et al, 2004b) and in relation to plasma biomarkers (Newby et al, 2004a). Amongst European men and women aged 60 years or older, three clusters were compatible with ‘vegetable-based’ and ‘sweet-fat dominated’ principal components (Bamia et al, 2005). In a Greek population with a mean age of 53 years, a PCA-derived Mediterranean dietary score was much higher in one cluster than in two others (Costacou et al, 2003).
There is a need for better understanding of the comparative value of factor analysis and cluster analysis. In particular, the use of both methods within one dataset facilitates greater knowledge of the similarities and differences between the techniques. This paper reports a comparison of factor analysis and cluster analysis using dietary data from a large, contemporary sample of young women in the Southampton Women’s Survey. The robustness of the two techniques is investigated and they are evaluated as methods of characterising diets in this survey.
The Southampton Women’s Survey has measured the diet, body composition, physical activity, hormone levels and social circumstances of a large group of non-pregnant women aged 20 to 34 years living in the city of Southampton, UK (Inskip et al, 2006). Women were recruited through general practices across the city. Each woman was sent a letter inviting her to take part in the survey, followed by a telephone call when an interview date was arranged. In total, 75% of all women contacted agreed to take part in the survey. Trained research nurses visited the women at home and collected detailed information about their health and lifestyles. Data were directly entered onto laptop computers wherever possible. Diets were assessed using a 100-item validated food frequency questionnaire (FFQ) (Robinson et al, 1996). Data are presented here from the first 6129 women who were recruited from practices in the western half of the city between April 1998 and June 2000. The Southampton Women’s Survey was approved by the Southampton and South West Hampshire Local Research Ethics Committee.
For the principal component and cluster analyses the 100 foods and food groups listed in the FFQ were grouped into 49 broader food groups by combining items of similar nutrient composition and comparable usage. For example, carrots, parsnips, swedes and turnips were included in the ‘root vegetables’ group; bacon, ham, corned beef, meat pies and sausages were included in the ‘processed meats’ group.
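The grouping step can be illustrated with a minimal sketch. It assumes that items are combined by summing their weekly frequencies, which the text does not state explicitly; the mapping `FOOD_GROUPS` and the function `group_frequencies` are hypothetical names, and only the two groups named above are included rather than the full mapping from 100 items to 49 groups.

```python
from collections import defaultdict

# Hypothetical mapping from FFQ items to broader groups; only the two
# groups named in the text are shown, not the full 100-to-49 mapping.
FOOD_GROUPS = {
    "carrots": "root vegetables",
    "parsnips": "root vegetables",
    "swedes": "root vegetables",
    "turnips": "root vegetables",
    "bacon": "processed meats",
    "ham": "processed meats",
    "corned beef": "processed meats",
    "meat pies": "processed meats",
    "sausages": "processed meats",
}


def group_frequencies(item_freqs):
    """Sum weekly frequencies of FFQ items within each broader food group."""
    grouped = defaultdict(float)
    for item, freq in item_freqs.items():
        # Items without a mapping keep their own name as a single-item group.
        grouped[FOOD_GROUPS.get(item, item)] += freq
    return dict(grouped)


# Illustrative (made-up) weekly frequencies for one participant.
example = {"carrots": 3.0, "parsnips": 0.5, "bacon": 1.0, "sausages": 2.0, "rice": 1.5}
grouped_example = group_frequencies(example)
```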
Average daily nutrient intakes for each woman were calculated by multiplying the nutrient content of a standard portion of each food (Holland et al, 1991b; Holland et al, 1988; Holland et al, 1989; Holland et al, 1991a; Holland et al, 1992a; Holland et al, 1992b; Holland et al, 1993; Chan et al, 1994; Chan et al, 1995; Chan et al, 1996; Ministry of Agriculture, Fisheries and Food, 1993; Davies and Dickerson, 1991) by her reported frequency of intake. Nutrient intakes from alcohol but not from dietary supplements were included in these analyses. Nutrient intakes were log-transformed to normality for statistical analysis. Energy-adjusted nutrient intakes were calculated according to Willett’s residual method (Willett, 1998).
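The intake calculation and the residual method of energy adjustment can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure used: it assumes that both nutrient and energy intakes are on the log scale when the regression is fitted, and that the predicted intake at the mean log energy is added back to the residuals. The function names and the data are hypothetical.

```python
import numpy as np


def daily_intake(nutrient_per_portion, portions_per_day):
    """Nutrient content of a standard portion times daily frequency, summed over foods."""
    return float(np.dot(nutrient_per_portion, portions_per_day))


def energy_adjust(nutrient, energy):
    """Residual-method energy adjustment on the log scale (an assumption).

    Regresses log nutrient intake on log energy intake and returns the
    residuals plus the predicted value at the mean log energy intake.
    """
    log_n = np.log(np.asarray(nutrient, dtype=float))
    log_e = np.log(np.asarray(energy, dtype=float))
    slope, intercept = np.polyfit(log_e, log_n, 1)
    residuals = log_n - (intercept + slope * log_e)
    return residuals + (intercept + slope * log_e.mean())


# Made-up intakes for five women (fat in g/day, energy in kJ/day).
fat = [60.0, 85.0, 70.0, 110.0, 95.0]
energy = [7000.0, 9000.0, 8000.0, 11000.0, 10000.0]
adjusted = energy_adjust(fat, energy)
```

By construction the adjusted intakes are uncorrelated with log energy intake, which is the purpose of the adjustment.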
Women were asked to give a blood sample during the second half of their menstrual cycle. Immediately after collection the samples had a full blood count performed and red cell folate was assessed by microparticle enzyme immunoassay. Measurements are available for 3981 (65%) of the analysis sample.
Principal component analysis was performed on the reported weekly frequencies of consumption of the 49 foods and food groups, based on the correlation matrix in order to adjust for unequal variances of the original variables. Principal component analysis was preferred to other factor analysis methods because, whilst factor analysis indicates axes of high variation, principal components are designed to indicate the independent axes along which participants vary the most, and the ability to differentiate between individuals is an attractive feature of a dietary score. Furthermore, preliminary analyses indicated that converting PCA scores to other types of factor scores did not improve interpretability (data not shown).
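A principal component analysis based on the correlation matrix can be sketched as below. This is not the Stata procedure used in the study; it is a generic eigendecomposition using NumPy, run on synthetic data, and the function name `pca_correlation` is hypothetical.

```python
import numpy as np


def pca_correlation(X):
    """PCA based on the correlation matrix.

    X: array of shape (participants, food groups).
    Returns (explained_variance_ratio, coefficients, scores), with
    components ordered by decreasing explained variance.
    """
    X = np.asarray(X, dtype=float)
    # Standardising each column makes the covariance of Z equal to
    # the correlation matrix of X.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvals / eigvals.sum(), eigvecs, Z @ eigvecs


rng = np.random.default_rng(0)
# Synthetic data: 200 participants, 6 food groups (not SWS data).
X = rng.poisson(lam=[2, 5, 1, 7, 3, 4], size=(200, 6)).astype(float)
explained, coefficients, scores = pca_correlation(X)
```

The columns of `coefficients` play the role of the component coefficients, and each column of `scores` gives one summary score per participant. The sign of each component is arbitrary in this sketch.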
Each component derived using principal component analysis provides a summary score for every participant. The scores were transformed using Fisher-Yates normal scores (Armitage and Berry, 2002). These have the effect of mapping the scores onto a Normal distribution with a mean of 0 and a standard deviation of 1.
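A rank-based transformation of this kind can be sketched as follows. Fisher-Yates normal scores are the expected values of Normal order statistics; the sketch below instead uses Blom's closed-form approximation to those values, which is an assumption made for brevity and not necessarily what the study software computed. The function name `normal_scores` is hypothetical.

```python
from statistics import NormalDist


def normal_scores(values):
    """Rank-based normal scores using Blom's approximation.

    Each value is replaced by the standard Normal quantile at
    (rank - 3/8) / (n + 1/4). Ties receive the average of their ranks.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - 0.375) / (n + 0.25)) for r in ranks]


raw_scores = [2.3, -0.4, 5.1, 0.0, -3.2]  # made-up component scores
z = normal_scores(raw_scores)
```

The transformation keeps the ordering of participants and discards the shape of the original distribution.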
The 49 foods and food groups were standardised to z-scores before cluster analysis so that they had equal weights when distances were computed (Wishart, 2001). Analysis commenced using Ward’s method (increase in sum of squares) to generate initial clusters, as recommended by Milligan and Cooper (1987). The tree diagram resulting from this hierarchical procedure was used to decide upon the number of clusters. K-means analysis based on squared Euclidean distances, also recommended by Milligan and Cooper, was employed as a further iterative procedure.
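The two-stage procedure can be sketched in NumPy as below. This is a naive illustration on synthetic data, not the ClustanGraphics implementation: the Ward step examines every pair of clusters at each merge and is only practical for small samples, the number of clusters is supplied directly rather than read from a tree diagram, and the function names are hypothetical.

```python
import numpy as np


def ward_partition(Z, n_clusters):
    """Naive Ward agglomeration: repeatedly merge the pair of clusters whose
    merger gives the smallest increase in the within-cluster sum of squares."""
    clusters = [[i] for i in range(len(Z))]
    centroids = [Z[i].copy() for i in range(len(Z))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                na, nb = len(clusters[a]), len(clusters[b])
                diff = centroids[a] - centroids[b]
                increase = na * nb / (na + nb) * float(diff @ diff)
                if best is None or increase < best[0]:
                    best = (increase, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters[a] = merged
        centroids[a] = Z[merged].mean(axis=0)
        del clusters[b], centroids[b]
    labels = np.empty(len(Z), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels


def kmeans_refine(Z, labels, max_iter=100):
    """Reassign each participant to the nearest centroid (squared Euclidean
    distance), starting from an initial partition, until nothing changes."""
    labels = labels.copy()
    k = labels.max() + 1
    for _ in range(max_iter):
        centroids = np.array([Z[labels == c].mean(axis=0) for c in range(k)])
        dist = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        # Stop on convergence, or if a cluster would become empty.
        if np.array_equal(new_labels, labels) or len(np.unique(new_labels)) < k:
            break
        labels = new_labels
    return labels


rng = np.random.default_rng(1)
# Synthetic frequencies: two groups of 30 participants, 4 food groups.
X = np.vstack([rng.normal(2.0, 1.0, (30, 4)), rng.normal(9.0, 1.0, (30, 4))])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # z-scores give equal weights
initial = ward_partition(Z, n_clusters=2)
final = kmeans_refine(Z, initial)
```

Starting k-means from the hierarchical partition, rather than from random centroids, makes the refined solution reproducible.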
For ‘data-derived’ methods such as PCA and cluster analysis, results for one participant may be affected by others in the study, and it is therefore important that they are robust to such influences. Three methods of assessing this were used. Firstly, a participant reporting consumption of any of the 49 foods or food groups greater than 6 standard deviations in magnitude was defined as an outlier, and the analysis was repeated on the dataset with outliers removed. Secondly, the analyses were repeated on a randomly selected half of the dataset and the resulting PCA scores compared to those generated by analyses on the full dataset. Finally, following the example of McCann et al (2001), the results from PCA and cluster analysis on the original 100 foods and food groups were compared with those from the 49 combined groups.
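The first two robustness checks can be sketched as follows. The sketch assumes that "greater than 6 standard deviations in magnitude" refers to the absolute z-score of a reported frequency relative to the sample mean for that food group; the function name `flag_outliers` and the data are hypothetical.

```python
import numpy as np


def flag_outliers(X, threshold=6.0):
    """True for participants with any food-group z-score beyond the
    threshold in magnitude."""
    X = np.asarray(X, dtype=float)
    z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    return (np.abs(z) > threshold).any(axis=1)


rng = np.random.default_rng(2)
X = rng.normal(5.0, 1.0, size=(500, 3))  # synthetic frequencies, not SWS data
X[0, 1] = 60.0  # one implausibly extreme report
outliers = flag_outliers(X)
X_clean = X[~outliers]  # dataset for the repeated analysis

# Random half of the dataset for the split-half comparison.
half = rng.choice(len(X), size=len(X) // 2, replace=False)
X_half = X[half]
```

The PCA and cluster analyses would then be rerun on `X_clean` and `X_half` and compared with the results from the full dataset.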
Statistical analysis was performed in Stata 8.0 (StataCorp, 2003), and ClustanGraphics 5 (Wishart, 2001). Comparisons between two continuous variables used Pearson’s correlation coefficient. T-tests were used to compare the means of continuous variables in two groups.
Of the first 6129 women recruited to the SWS, 6125 provided complete dietary information. These 6125 women form the analysis sample in this paper; their characteristics have been described previously (Robinson et al, 2004). The mean age of the women was 27.8 years, and 33% of them were smoking at the time of interview. 45% of the women lived with at least one child in the home, and 56% had A-levels or equivalent educational qualifications, or higher. The women were found to be representative of the population of the UK in terms of ethnicity and smoking. The sample included women of each social class with a wide range of deprivation levels.
The women’s nutrient and fruit and vegetable intakes and red cell folate levels are summarised in Table 1.
After principal component analysis of the data, the first two components explained 7.6% and 7.0% of the variation in the original data respectively, whereas subsequent components explained only 3.8% or less of the variation. The first two components were also found to be the most interpretable and robust (data not shown); their coefficients are presented in Table 2.
Component 1 was characterised by high intakes of fruit and vegetables, wholemeal bread, rice and pasta, yoghurt and breakfast cereals, and low intakes of chips and roast potatoes, sugar, white bread, red and processed meat, full-fat dairy products, crisps, Yorkshire puddings and savoury pancakes, confectionery, tea and coffee, tinned vegetables, cakes and biscuits, and soft drinks. Component 1 was termed the ‘prudent’ diet component, and a prudent diet score was calculated for each woman. Pearson’s correlation coefficient between the prudent diet score and energy intake was −0.20 (P < 0.0001). High prudent diet scores in this cohort have previously been seen to be associated with higher educational attainment, not smoking, spending less time watching television, currently dieting to lose weight, older age, taking strenuous exercise and not sharing the home with children (Robinson et al, 2004).
Component 2 was characterised by high intakes of fruit and vegetables, puddings, meat and fish, eggs and egg dishes, cakes and biscuits, full-fat spread, cooking fats and salad oils, and potatoes. It is notable that all coefficients for component 2 are positive. Whilst there are some similarities between component 2 and the Western diet described in the literature (Slattery et al, 1998), component 2 appeared more to reflect overall intake, and indeed Pearson’s correlation coefficient between the score resulting from component 2 and energy intake was 0.81 (P < 0.0001). Therefore component 2 was termed the ‘high-energy’ diet component, and a high-energy diet score was calculated for each woman.
The tree diagram resulting from Ward’s method of cluster analysis clearly suggested that two clusters were the most appropriate way of grouping the participants. After refinement of these clusters using K-means (Milligan and Cooper, 1987), the median intakes of the 49 foods and food groups in each cluster are presented in Table 3. Cluster 1 had notably higher intakes of white bread, full-fat spread, processed meat, roast potatoes and chips, crisps, added sugar, confectionery, and high-energy soft drinks than Cluster 2. Cluster 1 had notably lower intakes of wholemeal bread, breakfast cereals, yoghurt, cheese, reduced-fat spread, cooking fats and salad oils, fruit and vegetables than Cluster 2. Cluster 1 was labelled the ‘less healthy’ cluster, and Cluster 2 the ‘more healthy’ cluster. The average energy intake in the ‘less healthy’ cluster was 9098 kJ/day (2173 kcal/day), and in the ‘more healthy’ cluster was 9190 kJ/day (2195 kcal/day); this difference was not statistically significant (P = 0.18).
It is apparent from a comparison of Table 2 and Table 3 that foods associated with a high prudent diet score, such as wholemeal bread, fruit and vegetables tended to be eaten more frequently by the ‘more healthy’ cluster; foods associated with a low prudent diet score such as white bread, crisps and confectionery were eaten more frequently by the ‘less healthy’ cluster. A comparison of the prudent diet score between the two clusters showed a remarkably strong association, such that the mean prudent diet score in the ‘less healthy’ cluster was −0.73 standard deviations of the prudent diet score, whilst that in the ‘more healthy’ cluster was 0.83 standard deviations (P < 0.0001). The striking difference is illustrated in Figure 1. Of those women in the ‘less healthy’ cluster, 90.5% had prudent diet scores below the median. Of those women in the ‘more healthy’ cluster, 95.6% had prudent diet scores above the median.
The average high-energy diet score in the ‘less healthy’ cluster was −0.20 standard deviations, whereas that in the ‘more healthy’ cluster was 0.22 standard deviations. Although this difference is statistically significant (P < 0.0001) in such a large population, Figure 2 shows that the difference is relatively small compared to the prudent diet score. It is therefore evident that cluster analysis revealed two clusters that were almost a dichotomy of the prudent diet score, but were less strongly related to the high-energy diet score.
Table 4 compares the results of dietary pattern analyses and macro- and micronutrient energy-adjusted intakes. The prudent diet score was positively correlated with protein intake, dietary fibre, the majority of micronutrients and fruit and vegetables. It was negatively correlated with fat intake, particularly saturated fat and cholesterol, but positively correlated with polyunsaturated fat. The high-energy diet score was positively correlated with fat, protein, dietary fibre, all micronutrient intakes except calcium, and fruit and vegetables. It had small negative correlations with total carbohydrate and total sugar intakes.
Diets in the ‘more healthy’ cluster had lower intakes of fat, particularly saturated fat and cholesterol. They had higher intakes of polyunsaturated fats, protein, dietary fibre, the majority of micronutrients, and fruit and vegetables than those in the ‘less healthy’ cluster.
Red cell folate levels were positively correlated with the prudent diet score (r = 0.29, P < 0.0001), but uncorrelated with the high-energy diet score (r = −0.01, P = 0.73). The mean red cell folate in the ‘less healthy’ cluster was 650 nmol/l whereas that in the ‘more healthy’ cluster was 776 nmol/l (P < 0.0001).
482 participants were defined as outliers and were removed from the dataset before PCA and cluster analyses were performed again to assess robustness. This procedure made negligible difference to the principal component scores, giving a correlation of 0.998 for the prudent diet scores, and 0.995 for the high-energy diet scores. After cluster analysis 95.7% of participants were classified into the same clusters. The original cluster analysis virtually dichotomised the prudent diet scores around zero (Figure 1). Further investigation revealed that the alternative cluster analysis without outliers was also virtually dichotomising the prudent diet score, but using a slightly lower cut-point.
Comparing results on a randomly selected half of the dataset with those on the main dataset resulted in a correlation for the prudent diet score of 0.99 and a correlation for the high-energy diet score of 0.98; after cluster analysis 78.2% of participants were classified into the same clusters. Further investigation again revealed that the cluster analysis on the random half of the dataset was as successful as the cluster analysis on the whole dataset in dichotomising the prudent diet score, but used a slightly lower cut-point.
The correlation of the prudent diet score using 100 foods and food groups with the prudent diet score using the 49 combined groups was 0.97, and the corresponding correlation for the high-energy diet score was 0.98. After cluster analyses using the two grouping methods, 90.9% of participants were classified into the same clusters. Cluster analysis using the 100 food groups was again as successful as that on the 49 food groups in dichotomising the prudent diet score, but used a slightly higher cut-point.
These investigations indicate that principal component analysis is highly robust. Correlations were 0.97 or above when comparisons were made between original scores and those derived using modifications or samples of the data. Cluster analysis also appeared robust though more than 20% of the participants were classified in different clusters when only a randomly selected half of the data were analysed. Whatever approach was used, the two clusters were generally clearly defined by a cut-point of the prudent diet score.
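The two summary measures used in these comparisons, the correlation between scores and the percentage of participants classified into the same clusters, can be sketched as below. The sketch assumes two-cluster solutions with labels coded 0 and 1, and that both analyses are compared on the same participants; the function name and data are hypothetical.

```python
import numpy as np


def percent_same_cluster(labels_a, labels_b):
    """Percentage of participants given the same cluster by two two-cluster
    solutions, allowing for the arbitrary numbering of clusters."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    agree = np.mean(a == b)
    # With two clusters, swapping the labels turns agreement p into 1 - p.
    return 100.0 * max(agree, 1.0 - agree)


# Made-up cluster labels from an original and a repeated analysis.
original = [0, 0, 1, 1, 1, 0, 1, 0]
repeated = [1, 1, 0, 0, 1, 1, 0, 1]  # same grouping, labels swapped, one change
agreement = percent_same_cluster(original, repeated)

# Made-up prudent diet scores from the two analyses.
scores_full = np.array([0.10, -1.20, 0.80, 1.50, -0.60])
scores_repeat = np.array([0.12, -1.18, 0.79, 1.52, -0.61])
r = np.corrcoef(scores_full, scores_repeat)[0, 1]
```

Taking the larger of the two agreement values is needed because cluster numbering carries no meaning, so a relabelled but otherwise identical solution should count as full agreement.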
This study collected data from a large cohort of young non-pregnant women. The sample included women from each social class, with a wide range of educational achievements and living conditions. Strengths of the study were that the data were interviewer-collected and the response rate was good: 75% of the women contacted agreed to take part in the study. The high completion rate of FFQs was achieved by the use of laptop computers for data collection wherever possible. There is concern that FFQs may be subject to bias (Byers, 2004). However, in the context of dietary pattern analysis, Hu et al (1999) showed that an FFQ revealed patterns of diet similar to those from weighed diet records, and that individuals’ scores on both were strongly positively correlated.
Principal component analysis produced two components with a clear interpretation. The first component was termed the ‘prudent diet score’, in line with published data (Slattery et al, 1998; Hu et al, 1999; Fung et al, 2001; Osler et al, 2001); women with high scores had diets in accordance with recommendations from the Department of Health (Department of Health, 1994; Department of Health, 1998) and other agencies. All coefficients for the second principal component were positive, indicating that to obtain a high score a woman would have a generally high food intake. Indeed, Pearson’s correlation coefficient between the second component and energy intake was 0.81, and therefore the second component was termed the ‘high-energy diet score’.
The identification of the first principal component as a prudent dietary score is comparable with other results from British data (Whichelow and Prevost, 1996; Williams et al, 2000) where a more healthy pattern was the first principal component obtained. Similar patterns to the high-energy pattern were labelled ‘high-fat’ by McCann et al (2001) and ‘high energy-density’ by Beaudry et al (1998).
The prudent and high-energy diet scores together explain 14.6% of the variation in the 49 foods and food groups. Direct comparisons of the proportion of variation explained by a set of components cannot be made across the literature since it is highly dependent on the number of variables entered into a PCA and the number of components retained. However, when the SWS results were compared to analyses with a similar number of variables entered and components retained, the proportion of variation explained in the SWS was highly comparable (data not shown).
Cluster analysis resulted in two distinct groups of participants in the SWS. The diets of women in Cluster 1 appeared to be less healthy than those of women in Cluster 2. Similar ‘more’ and ‘less’ healthy dietary clusters were found in a large survey of English adults (Margetts et al, 1998) and a smaller study of older US lung cancer cases and controls (Tsai et al, 2003). Healthy clusters of British subjects were also identified in the National Diet and Nutrition Survey (Pryer et al, 2001a) and the Dietary and Nutritional Survey of British Adults (Pryer et al, 2001b).
PCA and cluster analysis are both useful approaches to the assessment of dietary patterns, and maximum information may be obtained when different methods are used (Newby and Tucker, 2004b). Strengths and limitations of PCA and cluster analysis are discussed in detail by Michels and Schultze (2005). A commonly cited criticism of the two techniques is that they involve several subjective but important decisions, such as grouping of foods, and possible transformations of variables. Principal component analysis involves decisions about the number of components to retain and their subsequent labelling. Cluster analysis requires choices about the method of clustering and labelling of the clusters. Another disadvantage of PCA and cluster analysis is that they generate patterns based on variation in diet, but there is no guarantee that these patterns will be predictive for a particular health outcome. However, the techniques have the advantage that they are empirically derived, and are therefore not limited by current knowledge. Furthermore PCA and cluster analysis can combine information about all aspects of diet and are based on food intakes, meaning that they may be more relevant to dietary choices than summaries involving nutrients.
Comparison of the results of the dietary pattern analyses revealed that the more healthy cluster had a much higher average prudent diet score than the less healthy cluster; in fact the two clusters were almost a dichotomy of the prudent diet score, but were less strongly related to the high-energy diet score. These results are similar to those of Costacou et al (2003) and Bamia et al (2005) who compared principal component analysis with cluster analysis. Costacou’s first principal component, which resembled a Mediterranean diet, was considerably higher in one cluster than in two others. Bamia’s first component was labelled ‘vegetable-based’, and was much higher in cluster A than in clusters B and C.
The analyses of robustness in the SWS indicate that differences in results from two similar cluster analyses may be due to a dichotomy of the prudent diet score at a different point, thus highlighting the relationship between results using the two techniques. Newby and Tucker (Newby and Tucker, 2004b; Newby et al, 2004a) also note there is evidence that underlying eating patterns are revealed by both principal component analysis and cluster analysis.
In the SWS the two-cluster solution was clearly indicated. In this context, since the two techniques reveal similar patterns, the question of whether PCA or cluster analysis is preferable may be similar to that of a researcher with the choice to analyse body mass index as a continuous variable, or to use a cut-point to dichotomise it into an overweight and a not-overweight group. In many circumstances it may be less informative to use the dichotomous variable than the continuous variable in an analysis. A solution with more than two clusters could be more informative, but was not indicated by the tree diagram as the most appropriate way of clustering the participants. A continuous score resulting from PCA can be particularly useful, and characterising an individual’s diet using a continuous score might be considered a more pragmatic choice than assignment to one of a number of discrete categories. PCA also gives the opportunity to explore more than one dimension of variation in diet, such as the high-energy diet, which was not revealed by cluster analysis in this study. PCA appeared to be somewhat less sensitive to outliers and differing groupings of the foods, and the coefficients altered little when only half of the dataset was analysed.
All dietary pattern analysis techniques have a contribution to make and may reveal similar patterns. In the context of the SWS, PCA was seen to be particularly valuable as a general discriminatory tool. Other studies are needed to replicate these results, particularly the close association between the two-cluster solution and the prudent diet score. The robustness of PCA and cluster analysis, and their associations with other measures of diet, are important considerations when assessing the usefulness of each technique. With the development of robust, meaningful dietary pattern analysis techniques it should be possible to understand better the role of dietary patterns in health and disease.
The authors are grateful to the General Practitioners in Southampton who made this study possible, to the Southampton Women’s Survey staff, and particularly to the women of Southampton who generously gave their time to be part of the study.
Contributors: SRC was responsible for statistical analysis. SMR designed the food frequency questionnaire and SEB was responsible for data collection and processing of the dietary data. HMI coordinated all aspects of the survey. All authors contributed to the interpretation of the data and the preparation of the manuscript.
Sponsorship: The study was funded by the Dunhill Medical Trust, the University of Southampton and the Medical Research Council.
Guarantor: SR Crozier