The data used in this study were from the NHIS, which is publicly available through the Centers for Disease Control and Prevention (CDC). The NHIS is a nationally representative, cross-sectional household survey of the non-institutionalized civilian population of the United States (U.S.) and is conducted annually [28]. Households and non-institutional sample units with special living arrangements (e.g., dormitories, boarding houses) were randomly sampled. For each unit sampled, core health information was collected from one randomly selected adult and one randomly selected child (if present). Beginning in 1997, individuals aged 65 and above who were black, Hispanic or Asian were oversampled. We combined data from 1997 through 2000. Our sample was linked to the National Death Index (NDI) with mortality follow-up through December 31, 2005, ensuring that all respondents were tracked for at least five years after completing the NHIS. Self-reported weight and height (without shoes) were used to construct BMI. Smoking history was a dichotomous variable indicating whether an individual had ever smoked. Individuals younger than 18 years or with a BMI below 18.5 were excluded from our analysis. We also excluded 4,599 respondents who had missing BMI measurements or a BMI above 99.99, whose values were truncated in the data, and 229 observations with unknown smoking status. The final sample contained 117,961 respondents.
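The exclusion rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the field names (`age`, `weight_kg`, `height_m`, `ever_smoked`) are hypothetical, the inputs are assumed to be in metric units already, and the order in which the rules are checked is an assumption.

```python
def bmi(weight_kg, height_m):
    """Body mass index in kg/m^2, or None when an input is missing."""
    if weight_kg is None or height_m is None or height_m <= 0:
        return None
    return weight_kg / height_m ** 2


def exclusion_reason(record):
    """Return the reason a respondent is dropped, or None if retained."""
    if record["age"] < 18:
        return "younger than 18"
    value = bmi(record.get("weight_kg"), record.get("height_m"))
    if value is None or value > 99.99:
        return "BMI missing or truncated"
    if value < 18.5:
        return "BMI below 18.5"
    if record.get("ever_smoked") is None:
        return "smoking status unknown"
    return None


# Hypothetical respondents, one per rule.
respondents = [
    {"age": 45, "weight_kg": 70.0, "height_m": 1.75, "ever_smoked": 1},
    {"age": 16, "weight_kg": 60.0, "height_m": 1.70, "ever_smoked": 0},
    {"age": 30, "weight_kg": None, "height_m": 1.80, "ever_smoked": 0},
    {"age": 52, "weight_kg": 50.0, "height_m": 1.80, "ever_smoked": 1},
    {"age": 67, "weight_kg": 82.0, "height_m": 1.68, "ever_smoked": None},
]

for person in respondents:
    print(exclusion_reason(person))
```

Returning a reason instead of a boolean makes it straightforward to tabulate how many respondents each rule removes, as reported in the text.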
We investigated the relationship between mortality and obesity, as measured by BMI, using a logistic regression model stratified by gender and adjusted for age and smoking history. The logistic model was chosen over the Cox proportional hazards model because the proportional hazards assumption did not hold for the BMI fractional polynomial (FP) terms. We used 5-year all-cause mortality as the outcome because the annual incidence of death was very low. To check robustness, we also estimated all models using 3-year mortality as the outcome, which produced similar results. Analyses were stratified by gender because the biological processes by which men and women gain and maintain weight differ [29]. We adjusted for smoking status because it confounds the BMI-mortality relationship; ignoring it may result in overestimation of the BMI associated with minimum mortality [30]. Sample adult weights from the NHIS, which denote the inverse probability of inclusion in the sample, were used in the logistic regression model to correct for potential biases resulting from the NHIS sampling design. Because data were pooled across survey years, sampling weights were divided by the number of years pooled so that the weighted sample was representative of the U.S. population on average from 1997 through 2000.
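The weight adjustment for pooled survey years can be illustrated with a small Python sketch. The records, field names, and weight values below are invented for illustration and are not NHIS data; only the division by the number of pooled years comes from the text.

```python
def pooled_weights(weights, n_years):
    """Divide each annual sampling weight by the number of pooled years."""
    return [w / n_years for w in weights]


def weighted_proportion(outcomes, weights):
    """Weighted share of records with outcome == 1."""
    return sum(w * y for w, y in zip(weights, outcomes)) / sum(weights)


# Hypothetical pooled file: each annual weight is the number of people
# that respondent represents in his or her own survey year.
records = [
    {"year": 1997, "weight": 1000.0, "died_5yr": 0},
    {"year": 1997, "weight": 3000.0, "died_5yr": 1},
    {"year": 1998, "weight": 4200.0, "died_5yr": 0},
    {"year": 1999, "weight": 4400.0, "died_5yr": 0},
    {"year": 2000, "weight": 2600.0, "died_5yr": 1},
    {"year": 2000, "weight": 2000.0, "died_5yr": 0},
]

annual = [r["weight"] for r in records]
deaths = [r["died_5yr"] for r in records]
n_years = len({r["year"] for r in records})  # four survey years

pooled = pooled_weights(annual, n_years)

# The pooled weights sum to the average annual population total,
# not to the sum over the four years.
print(sum(annual), sum(pooled))
print(weighted_proportion(deaths, pooled))
```

Dividing every weight by the same constant leaves weighted proportions unchanged; what changes is the population total that the weights sum to, which becomes an average over the pooled years.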
We maintained BMI as a continuous variable in our analysis. To account for the nonlinear and asymmetric relationship between BMI and mortality, we first applied the fractional polynomials method [31, 32]. To allow for flexibility in fitting a curve with a single turning point, we considered second-degree polynomial transformations for BMI. We used the closed test procedure [33], which first determined the best-fitting second-degree FP by choosing power transformations from the set {-2, -1, -0.5, 0, 0.5, 1, 2, 3}, where 0 denotes the log transformation. The best-fitting second-degree FP was then compared against the null model using a deviance difference test with four degrees of freedom to determine whether BMI should be included in the model. If the first test was statistically significant, a second deviance difference test with three degrees of freedom compared the best-fitting second-degree FP against the linear model. If the second test was significant, a final deviance difference test with two degrees of freedom compared the best-fitting second-degree FP with the best-fitting first-degree FP. If the final test was significant, the second-degree FP was included; otherwise, the first-degree FP was chosen. To prevent collinearity and overfitting, the best-fitting first-degree polynomial was chosen for age. The powers for BMI and age were selected simultaneously using the multivariable fractional polynomials (MFP) method [26, 27], which combined FP function selection with backward elimination to select the best-fitting model. The regression model we estimated was
$$\operatorname{logit}(\pi_i) = \ln\!\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1\,\mathrm{BMI}_i^{p_1} + \beta_2\,\mathrm{BMI}_i^{p_2} + \beta_3\,\mathrm{Age}_i^{q_1} + \beta_4\,\mathrm{Smoke}_i \qquad (1)$$
where $\pi_i$ was the 5-year death probability for individual $i$, $p_1$ and $p_2$ were the fractional powers for BMI, and $q_1$ was the fractional power for age. The MFP method also scaled and centered variables during the model selection process to improve numerical stability and to provide a model intercept that was easier to interpret. A nominal significance level of 0.05 was used for all hypothesis tests. To evaluate the validity of the FP model for BMI, we graphically compared three alternative models with the main FP model defined by (1). First, we estimated a model that categorized BMI into 30 narrow bins (one bin for each BMI unit between 18 and 40, one for every two BMI units between 40 and 54, and a single bin for BMI above 54) while also adjusting for age and smoking status. We then estimated separate FP models, one omitting subjects with early death (< 1 year from baseline) and one omitting subjects with extreme BMI values (> 50).
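The FP transformations and the closed test procedure described earlier can be sketched in Python as follows. This is an illustrative sketch, not the Stata implementation used in the study. The deviances passed to `closed_test` are hypothetical inputs that would come from fitted logistic models. The repeated-power rule in `fp2_terms` follows the usual FP convention, which is an assumption here because the text does not state it. `chi2_sf` is a helper written for this example that covers only the three degrees of freedom the procedure needs.

```python
import math
from itertools import combinations_with_replacement

POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp_term(x, p):
    """x raised to the power p, where p == 0 denotes ln(x)."""
    return math.log(x) if p == 0 else x ** p


def fp2_terms(x, p1, p2):
    """The two terms of a second-degree FP.

    Assumed convention: when p1 == p2 the second term is the
    first term multiplied by ln(x).
    """
    first = fp_term(x, p1)
    second = first * math.log(x) if p1 == p2 else fp_term(x, p2)
    return first, second


# Candidate power sets: 8 first-degree and 36 second-degree FPs.
FP1_CANDIDATES = [(p,) for p in POWERS]
FP2_CANDIDATES = list(combinations_with_replacement(POWERS, 2))


def chi2_sf(x, df):
    """Upper-tail chi-square probability, closed forms for df 2, 3, 4."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df == 2:
        return math.exp(-h)
    if df == 3:
        return math.erfc(math.sqrt(h)) + math.sqrt(2.0 * x / math.pi) * math.exp(-h)
    if df == 4:
        return math.exp(-h) * (1.0 + h)
    raise ValueError("only df = 2, 3, 4 are supported in this sketch")


def closed_test(dev_null, dev_linear, dev_fp1, dev_fp2, alpha=0.05):
    """Choose a functional form from the deviances of four nested fits."""
    if chi2_sf(dev_null - dev_fp2, 4) >= alpha:
        return "omit"      # best FP2 no better than leaving the variable out
    if chi2_sf(dev_linear - dev_fp2, 3) >= alpha:
        return "linear"    # best FP2 no better than a straight line
    if chi2_sf(dev_fp1 - dev_fp2, 2) >= alpha:
        return "FP1"       # best FP2 no better than best FP1
    return "FP2"


# Hypothetical deviances for the null, linear, best FP1 and best FP2 fits.
print(closed_test(1000.0, 990.0, 985.0, 970.0))
print(closed_test(1000.0, 990.0, 972.0, 970.0))
```

The degrees of freedom follow from counting each estimated power as one parameter: a second-degree FP uses four (two powers, two coefficients), a first-degree FP two, and a linear term one.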
Interactions involving the adjustment variables were tested to address the possibility that the BMI-mortality curve differed across the age distribution and by smoking history. The multivariable fractional polynomial interaction (MFPI) algorithm [26] was used to assess interactions; it first determined the best-fitting polynomial functions for BMI and age using MFP and then tested for significant interactions between the fractionally transformed variables and smoking history using a deviance difference test. We then graphically verified the interactions found by the MFPI algorithm using lowess-smoothed curves. Modeling BMI as a continuous variable with FPs avoided the spurious interactions that can arise when the model is restricted to strictly linear terms.
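As a rough illustration of the interaction step, the Python sketch below builds product terms between FP-transformed BMI and the smoking indicator and applies a deviance difference test. It is a simplified sketch under stated assumptions, not the MFPI implementation: the powers `(-1, 2)` and the deviances are hypothetical, the two powers are assumed to be distinct, and the two degrees of freedom assume exactly two product terms are added.

```python
import math


def fp_term(x, p):
    """x raised to the power p, where p == 0 denotes ln(x)."""
    return math.log(x) if p == 0 else x ** p


def bmi_columns(bmi, ever_smoked, powers, with_interaction):
    """FP terms for BMI, optionally followed by BMI-by-smoking products."""
    main = [fp_term(bmi, p) for p in powers]
    if not with_interaction:
        return main
    return main + [term * ever_smoked for term in main]


def interaction_p_value(dev_main, dev_interaction):
    """Deviance difference test on 2 df (chi-square upper tail for df = 2)."""
    diff = dev_main - dev_interaction
    return 1.0 if diff <= 0 else math.exp(-diff / 2.0)


# Hypothetical BMI powers; the product terms are zero for never-smokers.
print(bmi_columns(25.0, 1, (-1, 2), with_interaction=True))
print(bmi_columns(25.0, 0, (-1, 2), with_interaction=True))

# Hypothetical deviances without and with the two product terms.
print(interaction_p_value(980.0, 973.0) < 0.05)
```

Because the product terms vanish for never-smokers, their coefficients measure how far the BMI curve of ever-smokers departs from that of never-smokers.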
The BMI associated with minimum mortality was calculated by first estimating the final MFP model (including interaction terms) using logistic regression. To derive the optimal BMI, we set the first derivative of the estimated FP model with respect to BMI equal to 0 and solved for BMI. As an example, the optimal BMI for the model with linear and quadratic BMI terms without interactions is

$$\mathrm{BMI}_{\mathrm{opt}} = -\frac{\hat{\beta}_1}{2\hat{\beta}_2},$$

where $\hat{\beta}_1$ and $\hat{\beta}_2$ are the logistic regression coefficients for the linear and quadratic BMI terms, respectively. Confidence intervals were based on standard errors computed using the delta method.
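For the linear-plus-quadratic example, the optimal BMI and its delta-method standard error can be worked through in Python. This is a sketch under assumptions: the coefficient values, variances, and covariance are invented for illustration and are not estimates from the study, and the 1.96 multiplier assumes a normal-approximation 95% interval.

```python
import math


def optimal_bmi(b1, b2):
    """Turning point of b1*BMI + b2*BMI^2, a minimum when b2 > 0."""
    if b2 <= 0:
        raise ValueError("b2 must be positive for the turning point to be a minimum")
    return -b1 / (2.0 * b2)


def delta_method_se(b1, b2, var_b1, var_b2, cov_b12):
    """First-order (delta method) standard error of -b1 / (2*b2)."""
    g1 = -1.0 / (2.0 * b2)        # derivative with respect to b1
    g2 = b1 / (2.0 * b2 ** 2)     # derivative with respect to b2
    variance = g1 * g1 * var_b1 + g2 * g2 * var_b2 + 2.0 * g1 * g2 * cov_b12
    return math.sqrt(variance)


def confidence_interval(b1, b2, var_b1, var_b2, cov_b12, z=1.96):
    """Normal-approximation interval around the optimal BMI."""
    centre = optimal_bmi(b1, b2)
    half_width = z * delta_method_se(b1, b2, var_b1, var_b2, cov_b12)
    return centre - half_width, centre + half_width


# Hypothetical estimates and (co)variances for the two BMI coefficients.
b1, b2 = -0.25, 0.005
var_b1, var_b2, cov_b12 = 4.0e-4, 1.6e-7, -7.8e-6

print(round(optimal_bmi(b1, b2), 2))
print(round(delta_method_se(b1, b2, var_b1, var_b2, cov_b12), 3))
print(tuple(round(v, 2) for v in confidence_interval(b1, b2, var_b1, var_b2, cov_b12)))
```

The covariance term matters: the linear and quadratic coefficients are typically strongly negatively correlated, so omitting it would overstate the uncertainty in the optimal BMI.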
We compared the BMI-mortality curves derived using the MFP method with those from the continuous BMI model containing linear and quadratic BMI terms and from the categorical model based on WHO BMI classifications. Models were compared on the basis of model fit, the shape of the BMI-mortality curve, the magnitude and uncertainty of the BMI associated with minimum mortality, and mortality estimates. All statistical models were fit using Stata statistical software (Version 11; College Station, TX). The Stata command mfp was used to determine the functional forms for age and BMI, and the command nlcom was used to calculate estimates and confidence intervals for the BMI associated with minimum mortality.