To elucidate some basic concepts involved in combining dietary reports and biomarkers, we propose a simple model in the form of a causal pathway diagram (Figure 1). In the figure, the arrow from dietary intake to biomarker represents our assumption that true dietary intake causally affects the true biomarker level. Consequently, the two are correlated. Inasmuch as the reported intake and the measured biomarker level are correlated with their true values, they will also be correlated with each other.
Figure 1 A-C: Pathway diagrams for three versions of the model. The variables typed in bold font are the observed variables; those in italic font are unobserved. Figure 1a represents the model where the diet effect on disease is not mediated by the biomarker (no ...)
The most general form of our model in Figure 1 postulates that dietary intake affects disease through the biomarker and also through other pathways. For example, consider the potential effect of N-nitroso compounds (NOC) from red meat on colon cancer, with NOC-specific DNA adducts in exfoliated colonocytes as biomarkers of NOC exposure [15]. NOCs that reach the large intestine have a direct mutagenic effect on the colonic mucosa, resulting in formation of NOC-specific DNA adducts in the colonocytes, whereas absorbed NOCs can have a systemic effect on colonic tissue, acting as tissue-specific carcinogens, directly or after metabolic activation [16].
We also consider two submodels. In the first submodel, the biomarker of intake is not a determinant of disease; thus, in Figure 1a the arrow between the marker and disease is absent. For example, levels of urinary 3-methyl-histidine, a marker for red meat intake [17], are not thought to affect the risk of colon cancer, and would add nothing to the risk model if the true dietary intake were known. In the second submodel, dietary intake does not affect risk except through the biomarker (Figure 1), and would add nothing to the risk model if the true biomarker level were known. An example is dietary carotenoid intake, which is thought to affect skin melanoma entirely through the level of carotenoid in skin tissue [18].
When combining dietary reports with biomarkers in searching for nutrition-disease relationships we are using the biomarker in two different ways. Firstly, with regard to the diet-disease pathway not mediated through the biomarker, the biomarker acts as a correlate of dietary intake and helps to improve precision of our measure of dietary intake. Secondly, with regard to the diet-disease pathway through the biomarker, introduction of the biomarker naturally strengthens our ability to detect dietary effects through this pathway.
Finally, we note that Figure 1 does not include the possibility of confounding variables that causally affect the true biomarker level and, independently, the disease. As noted earlier, individual differences in metabolism and external factors can influence biomarker levels, so the presence of such confounders is a real possibility. We will proceed assuming that such confounding can be controlled for in the analysis, and elaborate on this important problem in the Discussion.
The statistical model
Parallel to the model depicted in Figure 1, we define a statistical model. This will clarify the assumptions that are being made, and will also form the basis for generating simulated data and thereby studying the gains that can accrue from the combination methods that we will describe.
The model, depicted in Figure 1, can be represented mathematically by four inter-related statistical regression models:
(i) Biomarker-Diet: a linear regression relating true biomarker level (TBL) to true dietary intake (TDI), with an error term that is distributed normally with mean zero and constant variance, independently of dietary intake. This part of the model describes the arrow from true dietary intake to true biomarker level in Figure 1.
(ii) Biomarker Measurement: relating measured biomarker level (MBL) to true biomarker level; the measured level is the true level plus an error term that is distributed normally with mean zero and constant variance, independently of true biomarker level. This is called the classical measurement error model [19], and implies that the measured level is an unbiased measure of the true level. This part of the model describes the arrow from true biomarker level to measured biomarker level in Figure 1.
(iii) Dietary Intake Measurement: a linear regression relating reported dietary intake (RDI) to true intake, with an error term that is distributed normally with mean zero and constant variance, independently of true intake. This part of the model describes the arrow from true dietary intake to reported dietary intake in Figure 1.
(iv) Disease-Diet: a regression model relating disease (D) to true dietary intake, with coefficient α1, and to true biomarker level, with coefficient α2.
In this model, the coefficient α2 represents the effect of the biomarker level on disease, and describes the arrow from true biomarker level to disease in Figure 1; the coefficient α1 represents the effect of diet on disease through pathways independent of the biomarker and describes the arrow from dietary intake to disease in Figure 1. Assuming dietary intake causally affects biomarker level, the total effect of diet is α1 plus a multiple of α2, the multiple being the slope of the biomarker-diet regression in part (i). Setting α2 equal to zero is equivalent to deleting the arrow from biomarker to disease, as in Figure 1a. Setting α1 equal to zero is equivalent to deleting the arrow from dietary intake to disease, as in the full-mediation version of Figure 1.
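The four parts (i) to (iv) can be written compactly as follows. This is a sketch under stated assumptions rather than a definitive statement of the model's notation: only α1 and α2 are named in the text, so the symbols β0, β1, γ0, γ1, α0 and the error terms are assumed here, as is the logit link in part (iv), which is suggested by the use of logistic regression in the analyses.

```latex
% Assumed notation: beta_0, beta_1, gamma_0, gamma_1, alpha_0 and the
% epsilon terms are illustrative; only alpha_1 and alpha_2 come from the text.
\begin{align*}
\text{(i)}\quad   & TBL = \beta_0 + \beta_1\,TDI + \varepsilon_{TBL},
                  & \varepsilon_{TBL} &\sim N(0,\sigma^2_{TBL}) \\
\text{(ii)}\quad  & MBL = TBL + \varepsilon_{MBL},
                  & \varepsilon_{MBL} &\sim N(0,\sigma^2_{MBL}) \\
\text{(iii)}\quad & RDI = \gamma_0 + \gamma_1\,TDI + \varepsilon_{RDI},
                  & \varepsilon_{RDI} &\sim N(0,\sigma^2_{RDI}) \\
\text{(iv)}\quad  & \operatorname{logit} P(D = 1) = \alpha_0 + \alpha_1\,TDI + \alpha_2\,TBL \\[4pt]
% Substituting (i) into (iv) shows the coefficient of TDI:
                  & \operatorname{logit} P(D = 1)
                    = (\alpha_0 + \alpha_2\beta_0)
                    + (\alpha_1 + \beta_1\alpha_2)\,TDI
                    + \alpha_2\,\varepsilon_{TBL}
\end{align*}
```

Under this notation the coefficient of TDI after substitution is α1 + β1α2, which is the "α1 plus a multiple of α2" described above; α2 = 0 gives the no-mediation model and α1 = 0 the full-mediation model.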
The main statistical assumptions underlying this four-part model and implied by Figure 1 are as follows.
1. Measurement errors in dietary intake are independent of disease, that is, non-differential.
2. Measurement errors in biomarker level are non-differential.
3. Measurement errors in dietary intake and in biomarker level are independent of each other. This seems reasonable since reporting errors are mostly cognitive whereas biomarker errors are mostly related to physiology or to laboratory conditions.
4. Any confounders of the biomarker-disease relationship and of the dietary intake-disease relationship have been controlled for (and are thus omitted from Figure 1). This is the strongest assumption, and we elaborate on it in the Discussion.
The assumptions regarding linearity of the regression models are not central to the main argument in this paper. If any of the regressions is non-linear then the dietary intake or biomarker level may be replaced by an appropriately transformed variable that will conform more closely to a linear relationship. Such transformation would not substantially change the results regarding statistical efficiency reported here.
Statistical Methods of Relating Self-reported Intake and Biomarker Level to Disease
We assumed that: each cohort participant provides a self-reported dietary intake, a related biomarker measurement, and a binary disease outcome; self-report and biomarker values are transformed, if necessary, so their distributions are approximately normal; and relationships between dietary intake and disease are to be investigated using logistic regression. We considered 5 analytic approaches; the last three represent different ways of combining self-report and biomarker.
1. Univariate analysis (i.e. logistic regression with one explanatory variable) of self-reported intake;
2. Univariate analysis of biomarker level;
3. Bivariate analysis (i.e. logistic regression with two explanatory variables) of self-reported intake and biomarker level, testing the joint null hypothesis that the coefficients for self-reported intake and biomarker level are simultaneously zero. This joint hypothesis uniquely represents no association between diet and disease, assuming that dietary intake and biomarker do not affect disease in opposing directions.
4. Howe's method [20]. Each of the two variables is grouped into k quantile groups, and the score j1 + j2 is calculated, where j1 is a participant's quantile group for self-reported intake and j2 the quantile group for biomarker level. The score is then used as the explanatory variable in the logistic regression. For k = 5, the range of possible scores is from 2 to 10. We studied the versions of the method with k = 3, 4, 5 and n (the sample size). With the last version, the score is the sum of the ranks of the two variables. We present results for this last version, as it was consistently the most efficient in our simulations.
5. Univariate analysis of the first principal component score. Principal components analysis [21] is performed on self-reported diet and biomarker level, the first principal component is formed and the scores of the first component are computed for each participant. Logistic regression is then conducted with the score as the explanatory variable. Principal components analysis is conducted on the correlation matrix, and the first principal component is the sum of the reported dietary intake and biomarker level weighted by the inverse of their respective standard deviations.
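The two combined scores in approaches 4 and 5 can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the authors' code: the function names are hypothetical, ties in ranks are broken by order of appearance, and the principal component formula assumes the two measures are positively correlated (otherwise the first component of the correlation matrix is a difference rather than a sum).

```python
import math


def ranks(values):
    """Return 1-based ranks; ties are broken by order of appearance."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def howe_score(rdi, mbl, k=None):
    """Howe-type score j1 + j2 from k quantile groups of each variable.

    With k=None the number of groups equals the sample size n, so the
    score is simply the sum of the two ranks.
    """
    n = len(rdi)
    k = n if k is None else k

    def group(rank):
        # Integer ceiling of rank * k / n: quantile group 1..k.
        return (rank * k + n - 1) // n

    return [group(a) + group(b) for a, b in zip(ranks(rdi), ranks(mbl))]


def standardize(values):
    """Centre on the mean and scale by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def first_pc_score(rdi, mbl):
    """First principal component score from the 2x2 correlation matrix.

    For two positively correlated variables the loadings are
    (1, 1) / sqrt(2), i.e. a sum weighted by inverse standard deviations.
    """
    z1, z2 = standardize(rdi), standardize(mbl)
    return [(a + b) / math.sqrt(2) for a, b in zip(z1, z2)]
```

Either score would then be entered as the single explanatory variable in the logistic regression.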
For simulating data, values of the coefficients in each of the four models described above must be specified, as well as the means and variances of the variables. Our aim was to quantify the potential gains in statistical power from using combined diet-biomarker analyses in realistic situations. We therefore chose two diet-disease hypotheses (to be described), and used results from the literature to determine the parameters for the simulation. The first hypothesis concerned dietary lutein and age-related macular degeneration (ARMD).
There is now considerable evidence that dietary lutein intake could reduce the incidence of ARMD [22]. Lutein, which occurs in dark green leafy vegetables, is found in the macula and is thought to be protective through its antioxidant functions and as a blue light filter that protects underlying tissue from light damage [22]. Macular degeneration is an irreversible process that is a major cause of blindness in the elderly, and may be preventable through increased intake of lutein as well as zeaxanthin, by increasing the macular pigment [22].
The biomarker that we considered for dietary lutein intake was serum lutein. We considered two possible methods for self-report of lutein intake: a food frequency questionnaire (FFQ) or 6 repeated 24 hour recalls (24 HR). The FFQ is the instrument most commonly used to assess dietary intake in large prospective studies. Multiple 24 HR's are hypothesized to be more accurate than a FFQ [24] and are becoming more feasible to apply in large studies with the development of computerized versions [25].
To choose the parameters for the simulations, we scanned the literature for carotenoid feeding studies [26], cross-sectional studies of self-reported carotenoid intake and serum carotenoid levels [30], and epidemiologic studies relating carotenoid intake or serum levels to ARMD [33]. We also used unpublished data from the OPEN study [34]. The values of the parameters are shown in Table 1 and their determination is described in Additional File 1: Appendix, Part A.
Table 1 Parameters for the Lutein - Age Related Macular Degeneration Model
The second example, beta-cryptoxanthin and stomach cancer, is fully described in Additional File 1: Appendix, Part B.
We simulated cohort studies with 400 individuals, approximately half developing the disease, the other half remaining disease-free. One may regard these as representing nested case-control studies arising from cohort studies with a low incidence rate. To the data from each study, we applied the five statistical analyses listed previously.
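One simulated study of this kind could be generated along the following lines. This is a minimal sketch under loud assumptions: every parameter value below is an illustrative placeholder (not a value from the paper's parameter table), the linear forms and logit link are assumed, true intake is standardized, and a zero intercept is used as a simple stand-in for case-control sampling so that about half of the participants are cases.

```python
import math
import random


def simulate_study(n=400, beta0=0.0, beta1=0.69, sd_tbl=0.72,
                   sd_mbl=0.51, gamma0=0.0, gamma1=1.0, sd_rdi=1.69,
                   alpha0=0.0, alpha1=-0.3, alpha2=-0.3, seed=None):
    """Simulate one study from the four-part model (placeholder parameters).

    Returns a list of dicts with the unobserved variables (TDI, TBL) and
    the observed ones (RDI, MBL, D).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        tdi = rng.gauss(0.0, 1.0)                             # true intake
        tbl = beta0 + beta1 * tdi + rng.gauss(0.0, sd_tbl)    # part (i)
        mbl = tbl + rng.gauss(0.0, sd_mbl)                    # part (ii)
        rdi = gamma0 + gamma1 * tdi + rng.gauss(0.0, sd_rdi)  # part (iii)
        # Part (iv): assumed logistic disease model.
        lin = alpha0 + alpha1 * tdi + alpha2 * tbl
        p = 1.0 / (1.0 + math.exp(-lin))
        d = 1 if rng.random() < p else 0
        rows.append({"TDI": tdi, "TBL": tbl, "MBL": mbl, "RDI": rdi, "D": d})
    return rows


study = simulate_study(seed=1)
n_cases = sum(row["D"] for row in study)
```

Setting `alpha2=0.0` gives a no-mediation scenario and `alpha1=0.0` a full-mediation scenario; only `RDI`, `MBL` and `D` would be passed to the five analyses.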
After applying each analysis, we examined (a) whether a statistically significant relationship between disease and exposure was found at the 5% level on a two-sided test, and (b) the estimated relative risk (RR) between the 90th and 10th percentiles of the exposure variable distribution. For RR, Howe's method could not be compared with the other methods.
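As an illustration of quantity (b), a fitted logistic slope can be converted to a risk ratio between the 90th and 10th percentiles of the exposure. The sketch below assumes the exposure is approximately normal, so that the two percentiles are about 2.56 standard deviations apart, and treats the odds ratio from the logistic model as an approximation to the RR; the function name is hypothetical, and empirical percentiles could be used instead.

```python
import math
from statistics import NormalDist


def rr_90_vs_10(slope, sd):
    """Approximate RR between the 90th and 10th exposure percentiles.

    slope: fitted logistic regression coefficient per unit of exposure.
    sd:    standard deviation of the (approximately normal) exposure.
    """
    std_normal = NormalDist()
    # Distance between the two percentiles in standard deviation units.
    z_gap = std_normal.inv_cdf(0.90) - std_normal.inv_cdf(0.10)
    return math.exp(slope * sd * z_gap)
```

A negative slope gives a value below one, corresponding to a protective association.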
We examined 6 scenarios, three where the dietary report instrument was a FFQ and three where it was 6 repeats of a 24 HR. Each set of three scenarios comprised the three disease risk models of Figure 1: a model where the dietary effect on disease was not mediated through the biomarker (no mediation, as in Figure 1a), a model of full mediation, and a model of partial mediation. For each scenario we simulated 1000 cohort studies.
From the results on each scenario, we estimated statistical power as the proportion of statistically significant results, and summarized the RRs by their geometric mean. We converted differences in statistical power to the ratio of the sample size required by each analysis to that required if a univariate analysis of reported dietary intake were used. This conversion was based on assuming that the test statistics were normally distributed.
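One way to carry out such a conversion is sketched below. It assumes a two-sided test whose statistic is normal with unit variance and a mean proportional to the square root of the sample size, and it ignores the negligible rejection probability in the opposite tail; the function name and the example power values are hypothetical.

```python
from statistics import NormalDist


def relative_sample_size(power_method, power_reference, alpha=0.05):
    """Sample size needed by a method relative to a reference analysis.

    Both powers are estimated at the same sample size. Under the normal
    approximation, power = Phi(delta - z), with z the two-sided critical
    value and delta the standardized effect, which grows like sqrt(n).
    Matching the reference power therefore needs (delta_ref / delta_m)^2
    times the reference sample size. Powers should exceed alpha / 2.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    delta_method = z_crit + std_normal.inv_cdf(power_method)
    delta_reference = z_crit + std_normal.inv_cdf(power_reference)
    return (delta_reference / delta_method) ** 2


# Hypothetical example: a method with 90% power against a reference with 50%.
rss = relative_sample_size(0.90, 0.50)
```

A value below one means the method reaches the reference analysis's power with a smaller sample.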
Results: Correlations between the exposure variables
The chosen model parameters shown in Table 1 gave rise to correlations between the exposure variables (Table 2). True dietary intake (TDI) was most strongly correlated with 6 × 24 HR reported intake (0.68), somewhat less strongly correlated (0.61) with observed serum lutein level (MBL), and least strongly correlated with FFQ reported dietary intake (0.51). True serum lutein level (TBL) was most strongly correlated with observed serum lutein level (0.89), and not very highly correlated with reported dietary intake (0.47 for 6 × 24 HR and 0.35 for FFQ).
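These correlations are mutually consistent with the path structure of the model: with linear relations and independent errors, the correlation between two variables that are linked only through an intermediate variable is the product of the two link correlations. The check below is a sketch that assumes the 0.69 correlation between serum lutein and true intake, quoted in the comments on application of the method, refers to the true serum level (TBL).

```python
# Correlations quoted in the text.
r_tdi_tbl = 0.69   # assumed to be corr(TDI, TBL); see lead-in
r_tbl_mbl = 0.89   # corr(TBL, MBL)
r_tdi_rdi = {"6 x 24 HR": 0.68, "FFQ": 0.51}  # corr(TDI, RDI) by instrument

# TDI -> TBL -> MBL: corr(TDI, MBL) is the product along the path.
implied_tdi_mbl = r_tdi_tbl * r_tbl_mbl

# RDI <- TDI -> TBL: corr(RDI, TBL) is the product of the two branches.
implied_rdi_tbl = {name: r * r_tdi_tbl for name, r in r_tdi_rdi.items()}

# RDI <- TDI -> TBL -> MBL: implied correlation between the two observed
# measures (not quoted in the text; shown only as a derived quantity).
implied_rdi_mbl = {name: r * implied_tdi_mbl for name, r in r_tdi_rdi.items()}

print(round(implied_tdi_mbl, 2))
print({name: round(r, 2) for name, r in implied_rdi_tbl.items()})
print({name: round(r, 2) for name, r in implied_rdi_mbl.items()})
```

The first two printed results can be compared with the 0.61, 0.47 and 0.35 reported above.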
Table 2 Lutein: Correlations Between Measurements Derived From the Chosen Model
Results for scenarios where a FFQ was the dietary instrument are shown in Table 3. Estimated RRs (between the 90th and 10th percentiles of the measured exposure) were less than one, indicating the protective effect of lutein. For univariate analyses they varied between 0.32 and 0.64 according to the disease risk model and method of analysis. In most cases, the lower the RR in univariate analyses, the higher was the statistical power.
Table 3 Lutein and Age Related Macular Degeneration (ARMD), With Dietary Intake Assessed by FFQ: Standardized Relative Risks (RR*), Statistical Power and Relative Sample Size** (rss) Required for Various Analysis Strategies Under Different Disease Risk Models
The univariate analysis of FFQ reported intake was less powerful than that of serum level. This was due to FFQ reported intake having a lower correlation with true dietary intake (r = 0.51) and with true serum level (0.35) than did measured serum level (0.61 and 0.89 respectively) (Table 2).
The combination methods generally performed much better than the univariate analysis of FFQ. Whether or not they improved on the univariate analysis of serum level depended on the disease risk model. When there was no mediation through the serum level, combination methods, especially principal components, produced moderate gains over the analysis of serum level alone. When there was partial mediation, principal components was only slightly more efficient than using serum level alone. When there was full mediation, then the univariate serum level analysis was optimal, although the principal components method was not much inferior.
Among the combination methods, principal components and Howe's method performed equally well. Bivariate analysis was less powerful than univariate analysis of serum level in the models with full and partial mediation, and less powerful than principal components and Howe's method in the models with no or partial mediation through the biomarker.
Projected sample size savings compared to univariate analysis of FFQ were substantial. Under full mediation the univariate serum analysis would require only 16% of the sample size needed for a dietary intake analysis, and under partial mediation 32%. Combination methods also gave substantial sample size savings, with the principal components yielding sample sizes between 21% (full mediation) and 52% (no mediation) of that required for univariate analysis of FFQ. In parallel with these sample size savings, observed RRs between the 10th and 90th percentiles were well below 0.5 using univariate serum level analysis or principal components, but above 0.5 for univariate analysis of FFQ.
Results where 6 × 24 HR's were the dietary instrument are shown in Table 4. These results show that when the dietary instrument was improved (correlation with true intake = 0.68), the gains from including the serum biomarker were less dramatic but still potentially useful. In the no mediation model, univariate analysis of 24 HR's gave more statistical power than univariate analysis of serum level, but was less powerful than the principal components method. The latter yielded a sample size requirement 77% that of the univariate analysis of 24 HR's.
Table 4 Lutein and Age Related Macular Degeneration (ARMD), With Dietary Intake Assessed by 6 × 24 HR's: Standardized Relative Risks (RR*), Statistical Power and Relative Sample Size** (rss) Required for Various Analysis Strategies Under Different Disease Risk ...
When there was partial or full mediation through the serum level, then the power gains from univariate analysis of serum level and from the combination methods were substantial, with reduction of sample size to 30%-50%, relative to analysis of 24 HR's. However, in these models the combination methods did not perform better than the univariate serum level analysis.
The results for β-cryptoxanthin and stomach cancer were quite similar to those shown in Tables 3 and 4, and are described in Additional File 1: Appendix, Part B.
Comments on the application of the method
The principal component or Howe's score has no recognized units, the first being a sum of two standardized scores, the second a sum of two rank scores. Other nutritional measures, such as "prudent diet" scores or the Healthy Eating Index, share this property. We propose that the principal components score be used as a more efficient first means of establishing the existence of a nutrition-disease relationship. Analyses that explore in more depth the relationships between dietary intake, biomarker level and disease risk will be motivated by such a positive result.
Markers that will be potentially useful in combination with dietary reports are those demonstrated in controlled feeding studies to be quantifiably modified by changes in diet. A causal relationship between marker and disease then implies that dietary intake will also affect disease, making it acceptable to combine the two measures. The level of the correlation between biomarker and reported dietary intake need not be high. In fact, from simulations not reported here, it appears that biomarkers are likely to be most helpful when reported intake is a poor measure of true intake, and in this situation the reported intake will also have low correlation with the biomarker. What is important is that the biomarker has a correlation with true intake that is similar to, or preferably higher than, the correlation between reported intake and true intake. Some notion of whether this is so may be available from controlled feeding studies.
Another helpful characteristic is that the biomarker is not known to be affected by risk factors for the disease. This is related to the assumptions implicit in Figure 1. If there were risk factors that affected the biomarker, then the biomarker-disease association would be at least partly indirect. Factors that affect the metabolism of the dietary constituent or interact to change the biomarker levels, such as hypo-absorption or oxidative stress, may also have independent effects on the disease, and could thereby confound the diet-biomarker-disease relationship. In the worst case, modifying the biomarker level through diet change would not affect disease. The problem here is the familiar one of confounding that has been a consideration in previous studies of biomarkers and disease. In the event that a strong risk factor for the disease is known to affect the marker, that risk factor should at least be included in the disease risk model so as to avoid misattributing its effect to nutrition. For example, there is now some evidence that beta-cryptoxanthin is negatively associated with smoking [35], and smoking is a known risk factor for stomach cancer [36]. Thus, one should include smoking in the model linking disease to a combined measure of dietary intake/serum level of beta-cryptoxanthin. A price to pay for using a combined dietary intake-biomarker measure is therefore the extra care needed in considering confounding factors, since these could enter either through confounding with self-reported intake or through confounding with the biomarker level, together with the uncertainty over whether introducing the biomarker has actually introduced an unwelcome confounder alongside the extra information on dietary intake. Here again, the higher the correlation between the biomarker level and true intake, the more likely is the success of our proposed analytic strategy. For example, our analysis of the literature indicates an encouragingly high correlation of 0.69 between serum lutein and true dietary lutein intake (Table 2).
The practicality of including biomarker measurements in all participants in a large cohort study needs consideration. As mentioned in the introduction, collecting biological samples from participants is no longer uncommon and their uses are manifold. Thus, while sample collection can be extremely expensive, the proposed approach may be feasible for studies with an already established "biobank". Furthermore, the sample size savings shown in Tables 3 and 4 indicate that adding a biomarker could lead to a two- to five-fold decrease in required sample size, which may partly offset the extra cost of collecting the specimens. Note that the analytic cost of the bioassays need not be prohibitive if analyses are based on a nested case-control design.