Changes in gene expression with age: CEU grandparents
If the expression of a gene responds to the progress of senescence by rising or falling over the adult lifespan, then expression differences among chronologically age-matched adults may reflect variation in rates of biological aging. To establish age-related changes in gene expression levels, we first identified expressions most strongly associated with age at blood draw. Of the 8793 total measured expressions (probesets on Affymetrix HG-Focus arrays), we used only the 2151 always-expressed genes (in all 362 cell lines and three generations of Utah CEU families), to examine the relationship between individual expression levels and age at draw in the CEU grandparent cell lines. We modeled expression as a simple linear function of age at draw. Expression levels are reported on a log2 scale, so that a linear increase or decrease in measured expression level corresponds to a multiplicative increase in gene expression. Because the grandparents were effectively unrelated to one another, no adjustment for kinship was necessary, and conventional linear regression models were used.
A more complex analysis of changes in gene expression with age over all three generations of CEU families revealed a large number of highly significant associations. However, interpretation of age-related changes in expression across three generations, ages 5–97, is difficult because senescence can be confounded with sexual or physiological maturation, as well as secular trends occurring over such a long span of donor birth years in these families. Methods and results of this analysis are described in the Supporting Information.
Of the 2151 always-expressed genes, 345 (16%) expressions were associated with age at draw in the CEU grandparents at a nominal P
< 0.05. Of these, 125 increased with age and 220 decreased with age. shows the magnitude, direction, and significance of the linear regression of expression as a function of age at draw for the top ten age-associated expression levels for the 104 CEU grandparents, after adjusting for sex. None of the always-expressed genes was linearly associated with age at draw at a nominal P
-value below the Bonferroni 5% threshold of 2.3 × 10−5
. The strongest association of expression level with age was CDC42 (cell division cycle 42), which exhibited strongly increased expression with age, with a sex-adjusted P
-value of 3.1 × 10−5
, and an unadjusted P
-value of 1.3 × 10−5
. Among the ten most strongly age-associated expression levels (), equal numbers (five) increase and decrease with age. TableS3
(Supporting Information) shows complete results for the 2151 always-expressed genes.
Top ten age-associated expression levels in CEU grandparents
Also shown in is the estimated heritability of expression (H2) for each gene listed, and the correlation of expression between spouses. Most of the genes listed in have heritabilities between 0.2 and 0.5. CORO1A has the highest estimated heritability (0.66), while expression levels for RNH1 and TMEM142C do not appear to be heritable. Spouse correlations are generally very low, with moderate positive correlations observed for CDC6 (0.28) and CORO1A (0.23).
Proportional hazards models of survival vs. gene expression
We also tested for gene expressions most strongly associated with survival, independently of their associations with age at blood draw. shows the ten strongest associations between age-adjusted expression and survival after blood draw among the 104 CEU grandparents. Corresponding data for all genes are given in TableS3
(Supporting Information). None of the observed effects exceeds the Bonferroni threshold of 2.3 × 10−5
, although CORO1A (coronin) comes quite close. Nine of the ten strongest associations of age-adjusted expression with mortality are negative, meaning that relative overexpression of the gene is associated with reduced mortality. Interestingly, the one exception is TERF2IP (telomeric repeat binding factor 2, interacting protein), which is thought to protect telomeric DNA from nonhomologous end-joining. Note that only one gene (CORO1A) appears in both , supporting the notion that gene expressions strongly associated with age at blood draw are not necessarily strongly associated with survival, too. As in , most of the estimated heritabilities in are 0.25 or greater, while CORO1A is again highest (0.66). No evidence for heritability of expression was seen for TERF2IP, KIF2C, or EMP3. We observed moderately strong positive spouse correlations for IQGAP1 (0.22) and CORO1A (0.23).
Top 10 survival-associated expression levels in CEU grandparents
A total of 167 (7.8%) expression levels were associated with survival in the CEU grandparents at a nominal P
< 0.05. Of these, 48 were associated with age at blood draw at a nominal P
< 0.05. TableS2
(Supporting Information) shows complete results for the 2151 always-expressed genes.
Combining information on age at draw and survival after draw
The results shown in have limited power relative to the large number of comparisons because expression data were available for only 104 grandparent samples. However, both the association of age with expression level, and the association of expression level with survival after blood draw, contains distinct and important information about the relationship of gene expression to aging in humans. We used two strategies to combine these pieces of information (see Methods, below): (i) a test of the joint null hypothesis that expression is related to neither age nor survival; and (ii) a two-stage model that constructs a multivariate estimator of biological age, then uses it to predict survival. We describe these results below.
Tests of the joint hypotheses that expression is related to neither age nor survival
shows the relationship between Z
-scores for age effects and survival effects on (age-adjusted) expression levels, for the CEU grandparents. Using Fisher’s likelihood ratio approach (Fisher, 1932
), we compare the observed Z
-scores (large red dots) to those generated by a 10% sample of 1000 random permutations of the phenotypic (age, sex, and survival) data, while keeping the expression vectors constant (small black dots). The dashed ellipse is drawn at the 50th largest chi-squared value observed for the 2151 always-expressed genes and 1000 permutations. The results are shown this way to approximate the fifth percentile of the null distribution, adjusted for 2151 comparisons. Three genes fall outside the threshold: CORO1A (0%), CDC42 (0.2%), and AURKB (aurora kinase B; 1.5%). Expression of CDC42 increases with age among the CEU grandparents, and higher age-adjusted expression is associated with higher mortality. CORO1A and AURKB represent the opposite extreme of this same pattern: expression decreases with age, and higher age-adjusted expression is associated with lower mortality.
Fig. 1 The standardized linear effect of age on each of 2151 expression levels among 104 CEU grandparents is plotted on the x-axis against the standardized effect of each expression level on mortality (log hazard rate ratio). Larger red dots indicate the observed (more ...)
Overall there is a fairly strong positive correlation (r = 0.51; P< 2.2 × 10−16) between Z-scores for age-related expression change and age-adjusted survival in . The orientation of this general pattern is described by the contrast between CORO1A in the lower left quadrant, and CDC42 in the upper right quadrant. Points located in the lower left quadrant (e.g. CORO1A) represent expression levels that decrease with age, and where relative underexpression is associated with higher mortality. Points located in the upper right quadrant (e.g. CDC42) represent expression levels that increase with age, and where relative overexpression is also associated with higher mortality. In contrast, the randomly permuted data are distributed roughly equally around the origin, including many points in the upper left and lower right quadrants. The observed distribution includes relatively few points in these quadrants, and none near the extremes of the null distribution.
It is likely that the results shown in do not fully illuminate all the patterns of age-related changes in gene expression and survival. Given that cross-sectional data were used to estimate longitudinal patterns of gene expression, we might worry that particular age-related trends in expression are confounded by differential survival, which could cause expressions that do not change with age to appear to do so, or mask true patterns of change. Associations of this type should occur in the upper left or lower right quadrants of , because a positive relationship between expression and mortality should be associated with apparently declining expression, while a negative relationship should be associated with apparently increasing expression.
It is also possible to imagine examples of upper left and lower right quadrant effects with a biological basis. For example, expressions that reflect damage repair or detoxification processes might rise with age and overexpression of these genes might be beneficial to longevity. However, our results offer no persuasive evidence that such associations are very common.
Multivariate models of biological age vs. survival
It is perhaps intuitively less appealing to examine the association of variation in gene expression for complex traits one gene at a time, than to allow all measured genes to compete with one another simultaneously for inclusion in a model that adjusts for correlations among expression levels. If the number of samples were larger than the number of genes considered, standard multiple linear regression methods might be a suitable approach to this problem. However, like most studies involving microarray data, we have many more potential predictor variables than independent samples. Numerous approaches to the problem of more dimensions than data points have been proposed, but many of them sacrifice interpretability because they collapse correlated expression patterns into composite variables without regard to biological plausibility. Penalized regression approaches can be interpreted much the same as multiple linear regression because they preserve the original data; overfitting is avoided by incorporating a ‘regularization parameter’ that penalizes the likelihood function according to the number of terms in the model. Several different penalized regression schemes have been proposed, including ridge regression (Hoerl & Kennard, 2000
), bridge regression (Frank & Friedman, 1993
), and the LARS/LASSO (for least angle regression/least absolute shrinkage and selection operator) family of approaches developed by Efron et al. (2004)
In the Methods section, below, we describe estimation and cross-validation procedures for the LASSO model of biological age. shows the leave-one-out cross-validation (LOOCV) curve observed over the first 40 steps (black line), compared to LOOCV curves generated under 100 random permutations of the phenotypic data. In the blue line plots the probability, at each step, that a model generated from random data has a mean-squared error (MSE) as low as the observed model; the red line plots P-values generated by proportional hazards regression using the biological age estimate as a predictor of mortality. It is clear from that the observed LOOCV curve is below the fifth percentile of the distribution of random curves by step 14, and the observed curve remains lower than any randomly generated curve for all steps after step 28. Meanwhile, the predicted biological age generated from the observed model is strongly significantly related to survival after blood draw from steps 2 to 30. The estimated MSE at step 14 is 57, corresponding to a prediction error of ± 7.6 years (7.4 years at step 28). shows the slope coefficients at steps 14 and 28.
Fig. 2 Least absolute shrinkage and selection operator model of biological age. At each step, an additional gene may be added to or subtracted from the model. (a) Mean square classification error (MSE) estimated by cross validation, for the observed data (black (more ...)
The most parsimonious model with an MSE lower than 95% of simulations occurs at step 14. Coefficients of the model, in decreasing order of absolute value, are: CDC42 (5.5), SEPT2 (1.6), PBX3 (1.1), CIB1 (−0.91), SH3BGRL (0.77), UBE2A (−0.71), RNH1 (−0.60), PPP1R11 (0.55), QDPR (−0.48), DDX24 (0.36), GINS2 (−0.30), LPXN (0.24), and ACAA1 (0.16). Clearly, the positive association of CDC42 with age at draw dominates this model. This remains true at steps 20, 40, 60, and 80 (data not shown). Expression levels of CDC42, SH3BGRL, QDPR, and RNH1 were also among the ten most strongly associated with age at draw for CEU grandparents in the univariate analysis reported in . In the LOOCV models, all the selected terms were included over 85% of the time at step 14, with the exception of LPXN (39%), DDX24 (4%), and ACAA1 (0%).
We used the step 14 model from to generate estimated biological ages for the CEU grandparents, then include them together with chronological age at draw and sex, in a proportional hazards model of survival. As expected, predicted biological age was positively associated with mortality. The hazard rate ratio (an estimate of relative risk) for a single year increase in estimated biological age was 1.33 (95% CI: 1.10–1.62).
An alternative approach to modeling survival as a function of estimated biological age would be to model it as a function of the difference between biological and chronological age. This is equivalent to forcing both biological age and chronological age to have the same slope (with opposite signs), and is efficient only if both variables are scaled identically. We evaluated this approach by rescaling biological age to have 0 mean and unit variance, then multiplying by the standard deviation of chronological age (7.9) and adding the mean chronological age (71.5); that technique produced results (not shown here) essentially identical to our original method, reported above.
Multivariate models of survival vs. expression level
The above analysis demonstrates that biological age, estimated from gene expression levels that change with age, is a significant predictor of remaining life span. An important potential weakness of this analysis is that it places too much emphasis on gene expressions that vary systematically with age in cross-sectional data. As noted above, the confounding of differential survival effects with age-related changes in expression could mask the effects of genes that regulate or contribute to survival.
In an effort to circumvent this limitation, we also applied the LASSO approach to model survival as a direct multivariate function of expression levels. We computed deviance residuals from a baseline survival model adjusted for age at draw and sex, and used LASSO to identify expression levels associated with variation in the deviance residual [see Methods and reference (Segal, 2006
)]. We used the same permuted LOOCV approach for cross validation and permutation tests of significance (described above and in Methods). Results are shown in .
Fig. 3 Least absolute shrinkage and selection operator model of survival. At each step, an additional gene is added to or subtracted from the model. (a) Mean square classification error (MSE) estimated by cross validation, for the observed data (black line), (more ...)
shows the cross-validation results compared to results of 100 random permutations of the phenotype data; a minimum is reached at step 7. In , the observed cross-validation MSE was smaller than 94% of those observed in permuted data at step 7. Model coefficients are in order of decreasing absolute value: CORO1A (−0.27), FXR2 (0.21), CBX5 (−0.074), PIK3CA (−0.0094), AKAP2 (−0.0086), and CUL3 (−0.0081). The model is dominated by the positive association between FXR2 expression and mortality, and the negative association between CORO1A expression and mortality. Negative associations of AKAP2, CUL3, CBX5, and PIK3CA with mortality also contribute to the model. shows that the linear predictors generated by this model are strongly associated with survival: P-value = 4.0 × 10−8; inter-quartile relative risk (IQRR) = 2.35; median estimated survival difference = 5.5 years. Predicted mortality from the model accounts for 23% of the variation (R2 = 0.23) in survival among the CEU grandparents. Model coefficients for individual genes are converted into IQRR in and plotted on a log scale.
Performance of LASSO mortality model by sex, cause of death, and time since blood draw
The nominal P-value given in for the overall data is very low, but unsurprising, given that the same survival data were used for estimation and testing. Evaluating the ability to predict survival in selected subgroups (e.g. males vs. females, for various causes of death, or varying numbers of years after blood draw) may be more informative than showing how well the model fits the overall data. shows that the model predicts similar mortality risks for males and females. As age is the single largest risk factor for multiple life-threatening diseases, a biomarker that truly reflects biological age (or rate of aging) should be associated with risks of dying from not one, but several common causes of death due to age-related diseases. Therefore, we tested the panel of gene expressions from the survival model (LASSO) for associations with mortality risks for the common causes of death. shows that the LASSO model predicts risk from multiple causes of death, in spite of very small sample sizes, with a particularly strong (but imprecisely estimated) effect for deaths attributed to diabetes. It is worth noting that the number of causes of death listed in (6) is larger than the number of genes contributing importantly to the model (3), so the possibility that these associations are caused by overfitting seems slight.
also shows that the LASSO model remains strongly predictive of mortality for at least 10 years following blood draw. Thus, these associations are not likely due to the presence of terminal diseases in some research subjects at the time of enrollment in the study.
Although demonstrates that the predicted model is not strongly affected by subgroup influences on gene expression, a more robust assessment of whether the fit of the model was produced by chance is given by the permutation distribution of cross-validation curves shown in . The minimum cross-validation MSE of the model is smaller than 94% of those generated by random permutation of the phenotype data (step 7). Therefore, there is approximately a 6% probability that the LASSO model of survival, based on 2151 measures of gene expression, fits our data this well by chance.
Inter-individual differences in gene expression profiles in the Utah CEU lymphoblastoid cell lines may reflect not only heritable genetic influences, but also environmental exposures experienced at any time prior to blood draw. Therefore, we considered the possibility that gene expressions associated with survival may simply reflect exposures (or nonexposure) to a common toxic agent, such as cigarette smoke. The data available to us do not contain information on environmental exposures; however, affiliation with the Church of Jesus Christ of Latter-day Saints (or LDS church), available from the UPDB, indirectly provides information about exposures that affect mortality risks. Merrill et al. (2003)
using data from the 1996 statewide Utah Health Status Survey, report that 9.2% of LDS men (vs. 24.5% of non-LDS Utah men) reported being current smokers, while only 4.1% of LDS women reported smoking (vs. 23.1% of non-LDS women). Of the 104 grandparents with expression data who linked to the UPDB, 77 (74%) were strongly affiliated with the LDS church, and this is probably an underestimate because UPDB data are very incomplete in this regard. We thus would expect only a small number (probably less than ten) of the grandparents to have been smokers. In previous work with the UPDB, we have shown that reduced smoking and alcohol consumption among active church members probably accounts for 1.3 additional years of life expectancy compared to Utahans unaffiliated or inactive in the church (Kerber et al., 2001
). Inclusion of church affiliation as a covariate in the survival models slightly strengthens the relationship of the model predictions to survival (data not shown), indicating that the results reported in are not confounded by smoking. Furthermore, none of the genes listed in , or included in the LASSO survival model, have been reported to be significantly affected by cigarette smoking (Ryder et al., 2004
; van Leeuwen et al., 2005
It is apparent in that spouse correlations for some individual gene expressions are moderately high. Across the entire set of 2151 always-expressed genes, the highest observed spouse correlation is 0.50 (PRSS3). CORO1A and IQGAP1 exhibit spouse correlations > 0.2, although the estimated heritability for each is quite high. The mean spouse r2 across the entire dataset is 0.962 while the maximum r2 for 100 random pairs of grandparents was slightly smaller (0.958; minimum = 0.952, mean = 0.955). Overall, then, spouse expression profiles are slightly more strongly correlated than expected by chance, although the level of correlation among all expression profiles is very high. Interestingly, however, the correlation of mortality risk between spouses (using the deviance residuals from the baseline proportional hazards model – see Methods) is only 0.075 (P-value 0.45), so correlations in expression are not likely to confound the survival analysis.