|Home | About | Journals | Submit | Contact Us | Français|
Mathematical modeling of cancer development is aimed at assessing the risk factors leading to cancer. Aging is a common risk factor for all adult cancers. The risk of getting cancer in aging is presented by a hazard function that can be estimated from the observed incidence rates collected in cancer registries. Recent analyses of the SEER database show that the cancer hazard function initially increases with the age, and then it turns over and falls at the end of the lifetime. Such behavior of the hazard function is poorly modeled by the exponential or compound exponential-linear functions mainly utilized for the modeling. In this work, for mathematical modeling of cancer hazards, we proposed to use the Weibull-like function, derived from the Armitage-Doll multistage concept of carcinogenesis and an assumption that number of clones at age t developed from mutated cells follows the Poisson distribution. This function is characterized by three parameters, two of which (r and λ) are the conventional parameters of the Weibull probability distribution function, and an additional parameter (C0) that adjusts the model to the observational data. Biological meanings of these parameters are: r—the number of stages in carcinogenesis, λ—an average number of clones developed from the mutated cells during the first year of carcinogenesis, and C0—a data adjustment parameter that characterizes a fraction of the age-specific population that will get this cancer in their lifetime. To test the validity of the proposed model, the nonlinear regression analysis was performed for the lung cancer (LC) data, collected in the SEER 9 database for white men and women during 1975–2004. Obtained results suggest that: (i) modeling can be improved by the use of another parameter A- the age at the beginning of carcinogenesis; and (ii) in white men and women, the processes of LC carcinogenesis vary by A and C0, while the corresponding values of r and λ are nearly the same. Overall, the proposed Weibull-like model provides an excellent fit of the estimates of the LC hazard function in aging. It is expected that the Weibull-like model can be applicable to fit estimates of hazard functions of other adult cancers as well.
Mathematical models can help researchers in elucidating the fundamental mechanisms of cancer development and progression. Mathematical models enable a quantitative representation of transformations affecting cell states and cell numbers. One area of cancer modeling is an assessment of risk factors leading to cancer. In adults, aging is a common risk factor associated with cancer development.
The main aim of this work is mathematical modeling of cancer development in aging. Our modeling is based on the commonly accepted multi-stage concept of carcinogenesis and is expected to provide a better understanding of the biological processes underlying cancer development. We utilized observational data on cancer incidence rates collected in the Surveillance Epidemiology and End Results (SEER) database. These incidence rates were used to estimate a cancer hazard function by the log-linear age-period-cohort (LLAPC) model that accounts for age, time period and birth-cohort effects.1–4
Until recently, the exponential or compound exponential-linear functions have been widely utilized for mathematical modeling of cancer development in aging.5–8 According to these mathematical functions, the cancer risk should monotonically increase with aging. However, recent analyses of data collected in the SEER database showed that,9 after an initial increase, the cancer hazard functions in aging have turnover and even fall at the end of the lifetime.10 To model such behavior of the cancer hazards, beta-like functions have been utilized.11–13 It was shown that the observational data can be well-fitted by these functions. The use of beta-like functions, however, still lacks a sound biological background.
In the present work, we proposed to use Weibull-like functions for modeling the cancer hazards in aging. This model is derived from the Armitage-Doll multi-stage concept of carcinogenesis and an assumption that a number of clones at age t that developed from mutated cells follows the Poisson distribution. Besides two conventional parameters (r and λ) of the Weibull probability distribution function (pdf), this model has an additional parameter, C0, for the observational data adjustment. The model provides meaningful biological interpretation of the modeling parameters: r—the number of stages in carcinogenesis, λ—an average number of clones developed from the mutated cells during the first year of carcinogenesis, and C0—a data adjustment parameter that characterizes a fraction of the age-specific population that will get this cancer in their lifetime.
To test the Weibull-like model of cancer development in aging, we analyzed lung cancer (LC) data collected in the SEER 9 database during 1975–2004 for white men and women. The estimated values of the LC hazard function in aging were obtained from the observed incidence rates. It was found that the quality of modeling has been improved by introducing another parameter, A- the age (in years) at which a process of carcinogenesis starts. Performed modeling suggests that the LC presentation in aging in white men and women are different mainly by the parameters A and C0, while the parameters r and λ are nearly the same.
Overall, it was shown that the Weibull-like model of cancer presentation in aging provided an excellent fit of the estimates of the LC hazard function in aging. The proposed model explains the observed behavior of the hazard function in aging that initially increases with age, turns over and then falls at the end of the lifetime.
The main concepts of survival analysis include survival probability and hazard function. According to the notation given in textbooks,14,15 a survival probability is defined as the probability, S(t), that at the time t, a person is alive (in mortality studies) or at the age t a person is free from a given disease (in disease incidence studies). A hazard function, h(t), measures the relative risk of death at the time t, or getting a given disease at a specific age t, compared to the survival probability at the same time/age:
In survival modeling, time t is assumed to follow some statistical distribution with probability density function (pdf ) − f(t).15 Let us assume also that −dS(t)/dt can be presented in the form of C0 f (t) where C0 is a parameter. If f(t) and C0 are known, the following differential equation can be written:
with the initial condition
This condition means that at time t = 0 the person is alive or is free from the disease.
By solving the differential equation (2) with the initial condition (3), one can obtain S(t) by the following formula:
From (2) and (4), it follows that the formula (1) for a hazard function can be presented as:
If assumption (2) is valid, then, depending on the epidemiological problem under consideration, values of C0 can vary between 0 and 1. In fact, when t → ∞, formula (4) gives:
In mortality studies, S(∞) = 0, when t → ∞. From (6) it also follows that C0 = 1, and therefore the hazard function (5) can be written as:
In the case of a rare disease, a person’s risk of getting a given disease is very small, i.e. survival probability is close to 1:
From (8) and (4) it follows:
For a given age-specific population, parameter C0 characterizes a fraction of this population that will be exposed to the disease during their lifetime.
From (4), (5) and (8) it follows that for a rare disease, its hazard function can be presented as:
The aforementioned formulas (4), (5), (7) and (10) for survival probability and hazard function are based on assumption (2). Generally, the validity of this assumption can be tested by the methods of regression analysis using the formula (5), the right side of which presents a regression line with parameters to be determined. In (5), the time t can be considered as a predictor and h(t)—as a response variable that can be estimated from the observations. For a rare disease, according to (10), the regression line as a function of t can be approximated by C0 f (t). Therefore, in the regression analysis performed below we assume that:
To perform regression analysis, one needs to estimate the values of the response variable h(t). In the absence of time period and cohort effects, the hazard function h(t), can be interpreted as an instantaneous incidence rate.16 By definition, an incidence rate, I(t), is the number of new cases of cancer incidence over a period of time divided by the person-time at risk.14 Estimates of values of hazard function, h*(t), can be obtained from the observed incidence rates that have to be corrected for time period and cohort effects (see below).
To estimate the values of a cancer hazard function in aging, the recently proposed method can be utilized.4,10 This method allows one to correct the observed age-specific incidence rates I(t) for time period and cohort effects. These corrections can be done by the use of the LLAPC model that presents the expectation of the observed incidence rates as a product of the hazard function, hc(t), the time period effect coefficient, v, and the birth cohort effect coefficient, u. In practice, the observed values of I(t) are presented as Ii, j,c (ti), where ti is a given age interval, j—a time period interval of observation and c—indicates a given categorical risk factor (for example, gender, race, etc.). The procedure allows one: (i) to separate the problem of estimating the time period and birth cohort coefficients from the problem of estimating the unknown hazard function; (ii) to resolve the identifiability problem by an assumption that neighboring cohorts almost equally influence the Ii, j,c(ti) and by anchoring the time period and birth cohort effects to the selected time period and cohort; and (iii) after obtaining the time period and birth cohort coefficients, to estimate values of the hazard function, , and their standard errors, SEi, in each age interval, ti.4,10 Here and below estimates of statistical parameters as well as hazard function values are designated by asterisk (*).
The mathematical problem of modeling of cancer hazard functions can be reduced to the problem of fitting of by the model curve hc(t). This implies choosing an appropriate mathematical form of the curve, (t, C0 f (t)), which defines a model with the predictor t and the response variable C0 f (t). Thus, this problem is reduced to estimating unknown values of parameters in pdf, f (t), and an additional parameter, C0. These parameters can be estimated by a set of the observed values, , and their standard errors, SEi, in each age interval, ti. Estimation of these parameters can be done using the least squares method to solve the following system of the conditional nonlinear equations:
Since each has its own standard error, SEi, a system of conditional equations with weights:
has to be solved. In formula (13), SE is the standard error of the observation with the weight 1, calculated by formula:
In general case, the weighted system, (12)–(14), is solved by a method of least squares, utilizing an iterative technique (see, for example, MATLAB’s Statistical Toolbox 7.3, Weighted Nonlinear Regression).17
The mathematical model of cancer development is expected to be related to an appropriate biological concept of carcinogenesis. We used the Armitage-Doll multi-stage concept of carcinogenesis and demonstrated that this biological concept mathematically leads to the Weibull-like mathematical form of cancer hazard functions in aging.
According to the Armitage-Doll concept, a normal cell can be transformed into a cancer cell, after the r required mutations occurred within this cell. Below, we applied this concept using the notations and logic described in literature.18,19
Let us assume that the process of carcinogenesis begins at the age t = 0. Suppose that θj is the mutation rate at the j-th gene per year and the probability that a mutation at the j-th gene occurs in a cell prior to the age t is a small number and approximately equals to θj · t. Then, a product estimates the probability that in the given cell the required r mutations occurred prior to time t. The parameter r defines the number of stages in carcinogenesis. The average number of clones that were developed from the mutated cells up to the time t can be presented as:
where cn is the number of cells at risk of mutation. In this formula, the parameter λ is the average number of clones developed from the mutated cells during one year (t = 1) after the beginning of carcinogenesis.
Let N(t) be the cumulative number of mutated clones that occurred from age 0 to t and let one follow to the homogeneous Poisson process (HPP).20 In such a case, according to the definition of HPP, a probability P0 that the cumulative number of mutated clones is equal to a given number k could be obtained by the formula:
where the expected number of mutated clones, μ0(t), is proportional to the first order of the age t. In HPP, a cumulative distribution function (cdf ), F0(t), of the waiting time to the first occurrence of the mutated clone will be expressed by the formula:
In the Armitage-Doll model, however, occurrence of mutation can be considered as a non-homogenous Poisson process (NHPP). In this case, parameter μ(t) is given by formula (15), where λ and r are constants. It should be noted, that in contrast to HPP, in NHPP the expected number of mutated clones μ(t) is proportional to the r-th order of the age t.
For the NHPP model, a probability P0 that the cumulative number of mutated clones equals to a given number k could be obtained by the formula analogous to (16), and cdf of the waiting time to the first occurrence of mutated clone can be calculated by formula analogous to (17), where μ0(t) is substituted by μ(t):
Then, using (15) and (18), the pdf of the waiting time t to the first occurrence of cancer of mutated clone, f (t) = d/dt F(t) can be presented as a Weibull distribution with the shape parameter r and the combined scale-shape parameter λ:
It is known that for most of the cancers, only a tiny fraction, say C0, of a given age-specific population will (early or later) develop cancer, while the majority of this population will not get this disease in a lifetime.23 In this case we can assume that a person’s overall risk of getting the cancer equals to C0, and the corresponding hazard function, h(t), can be presented by formula (11). As was shown previously, according to the Armitage-Doll concept, the pdf f(t) of cancer occurrence at age t, can be presented by the Weibull distribution in form (19). Therefore, according to (11) and (19), h(t) should have a form of Weibull-like function, which can be presented as a product of Weibull pdf and an additional adjustment parameter, C0:
The parameter r of this function indicates the number of stages in carcinogenesis, the parameter λ is an average number of clones developed from the mutated cells during the first year of carcinogenesis, and parameter C0 is an adjustment parameter.
Formula (22) states that when age t of cancer occurrence has the Weibull pdf (19), then the corresponding hazard function is modeled by the Weibulllike function (22). In contrast to the classical paper,5 where the exponential function was used for modeling of the cancer hazard in aging, we propose here that the cancer hazard function has a Weibull-like form (22), which has a turnover and falls at old ages.
Validity of our assumption that the h(t) will have a form of Weibull-like function (22) can be tested by methods of regression analysis, considering the age t as a predictor and h(t)—as a response variable to be estimated from the observations.
Since in statistical literature the Weibull pdf is usually presented by formula (20), the corresponding Weibull-like hazard function can be written as:
where the parameter r defines the shape of h(t), the parameter b scales the distribution of t along the abscissa axis and the parameter C1 = C0 r/b scales the curve along the vertical axis.
We have shown that cancer hazard functions in aging in a population can be described by the Weibull-like function, presented by the formula (24). The parameters C1, b and r can be assessed from bivariate data (ti, ) by the aforementioned least squares method assuming that in this specific case, can be presented in the following form:
Goodness of fitting of the regression line to the observed data can be quantified by the weighted sum-of-squares (SS) of the residuals ri of the system of the corresponding weighted conditional equations:
or by the coefficient of determination:
The curve fitting is getting better as R2 approaches values close to 1.25
To compare the quality of fitting of the same dataset by different regression lines, the Akaike’s Information Criterion (AIC) can be used.25 Assuming that the scatter of points around the regression line follows a Gaussian distribution, the AIC is defined by the following equation:
where K = p + 1 and p is the number of parameters used for curve fitting. When the number of data points n are at least two times greater than the number of the assessing parameters, p, a second-order (corrected, c) criterion is used:
when the qualities of fitting of the same dataset by different regression lines are compared, the curve fitting is better for the line with the smallest AICc.25
The lung cancer (LC) incidence data, collected in the SEER 9 database during 1975–2004 for white men and women, were used to estimate values of the age-specific hazard function in five year age intervals ti. Data preparation and filtration were performed by the previously described protocol.4 LC incidence rates, expressed per 100,000 person per year, were age-adjusted by the direct method to the 2000 United States standard population.26 Estimates of the age-specific hazard functions and their standard errors SEi, anchored to the 2000–2004 time period and to the 1925–1929 birth cohort, were obtained using our recently proposed approach.4,10
To perform the curve fitting of bivariate data (ti, ), we modified the aforementioned procedure. In the modified procedure, instead of age t, the period of “effective exposure”, t − A, (where A is the age at the beginning of cancer) was utilized. The use of the “effective exposure” period was proposed in literature to improve the quality of curve fitting.5 In this case, the system of the weighted conditional equations can be defined as:
It should be noted that adding a new “shift” parameter (A) to the three parameters that have been estimated by regression analysis makes this analysis computationally unstable. Our numerical experiments have shown that by introducing an additional parameter, one has dealt with a typical “ill-posed problem”, in which small errors in observed data-cause big errors in estimated values. An analogous problem arises in the beta-like modeling of cancer in aging.27 To avoid such computational instability, we used a method of regularization of solution.28 For this purpose, by fixing different values of a possible “effective exposure” period (A), we regularized the process of our regression analysis and found the best solution for the other three parameters.
The parameters C1, b and r can be assessed from bivariate data (ti, ) using the the aforementioned least squares method Specifically, to estimate these parameters we utilized the “nlinfit” function from the MATLAB’s Statistical Toolbox 7.3. This function allows one to determine the estimates of weighted parameters and perform curve fitting.24 In the process of the parameter estimation, “nlinfit” utilizes an iterative technique that requires one to provide appropriate starting values of parameters to be estimated. Our computational experiments showed that the values 5, 60 and 0.01 can be used as appropriate starting values for the estimates of the parameters, r*, b*, and , correspondingly.
The output data of the “nlinfit” was used as input for two other MATLAB functions, “nlpredci” and “nlparci”. The function “nlpredci” returns as output: (i) the error estimates on predictions, ypred; and (ii) the half-widths of the 95% prediction intervals for future observations, delta (note, 2 × delta predicts the observations with the weights of w = 1, or the observations with the variance, SE2). For , r* and b*, the function “nlparci” returns as output the estimates of their errors. To estimate covariance between parameters, the Matlab function “nlinfit” provides the covariate matrix, Sigmaw. This matrix was used as input for the Matlab function “nlparci” to obtain 95% confidence intervals (CIs) for parameter estimates.
To evaluate the quality of fitting of the same dataset by different regression lines, we used the Akaike’s Information Criterion (AIC) as described previously.
The obtained values of the age-specific LC hazard functions and their standard errors for the age groups for which these estimates are statistically distinguishable from zero are presented in Table 1.
Using bivariate data, (ti − A, ), we performed curve fitting for several possible “effective exposure” periods. For this purpose, A values were varied from 0 to 30 years with five-year steps. Our calculations showed that for white men the best fitting is achieved when A = 20; while for white women it is best when A = 15. Table 2 presents the best fitted values of the parameters, , r* and b*. In addition to the , r* and b*, the estimates of the parameters, λ* and , are also given in Table 2. These estimates were calculated by the following formulas:
It should be noted that the CIs of the estimates , r* and b* could be underestimated, because are not directly observed, but are estimated by observed incidence rates. In practice, when response variable is indirectly observed, to assess CIs of parameters of regression, a bootstrap method can be used. Our computational experiments showed, however, that the CIs for parameters of the nonlinear regression, obtained by the use of the bootstrap method, were insignificantly different from those obtained by the approach proposed in this paper.
Figure 1 presents the Weibull-like curve that is fitted to the LC observational data for white men, assuming that the effective exposure period starts at the age A = 20. Figure 2 shows an analogous curve fitted to the LC observational data for white women with A = 15. These figures demonstrate that the LC observational data are well-fitted by the Weibull-like functions.
In addition, to verify the goodness of curve fitting, we constructed probability plots for weighted residuals, as described in the statistical textbook.29 The obtained normal probability plots did not show any significant trends that can be considered as an additional evidence of goodness of fitting.
Based on these fitting data, we hypothesized that the LC carcinogenesis starts earlier in women than in men.
In white men and women, the corresponding estimated values of the parameter r* are very similar and close to 5. Analogously, the estimated values of the parameter λ that defines the average number of clones, developed from the mutated cells during the first year since the beginning of carcinogenesis, are also similar for both genders. This suggests that in both genders the processes of carcinogenesis in aging are different mainly due to the starting age of carcinogenesis and the fraction of the population exposed to the LC. The starting age of carcinogenesis in white men is about 20 years old and the parameter characterizing the fraction of the age-specific population exposed to the LC is about 197 per 10,000. For white women, the starting age of carcinogenesis is about 15 years old and the parameter characterizing the fraction of the age-specific population exposed to the LC is about 132 per 10,000.
In this paper, we modeled cancer hazards in aging by the Weibull-like function: h(t) = C0λrtr−1 exp(−λtr). This form of hazard function was derived from the Armitage-Doll multistage concept of carcinogenesis and an assumption that the number of clones developed from the mutated cells follows the Poisson distribution. The proposed modeling function is characterized by three parameters: r and λ are the conventional parameters of the Weibull probability distribution function, and the additional parameter, C0, adjusts the model to the observational data. These parameters have the following biological meanings: r is the number of stages of carcinogenesis; λ is the average number of clones developed from the mutated cells during the first year since the beginning of carcinogenesis; and C0 is the data adjustment parameter characterizing a fraction of the age-specific population that will develop the considered type of cancer in their lifetime.
Validity of the Weibull-like model for cancer development in aging was tested by the methods of nonlinear regression analysis using the lung cancer data, collected in the SEER 9 database during 1975–2004 for white men and women. The performed analysis showed that the use of the period of “effective exposure”, t − A, improved the quality of the modeling. For white men the best quality was obtained when A = 20, while for white women the best quality was when A = 15. The number of stages of carcinogenesis in white men and women was shown to be similar and close to five, and the average number of clones developed from the mutated cells during the first year since the beginning of carcinogenesis are also similar. The obtained results suggest that in white men and women, the processes of carcinogenesis are different mainly by the starting ages of carcinogenesis and the data adjustment parameters characterizing the corresponding fractions of the age-specific population exposed to the LC.
Overall, we can conclude that the used incidence rate data is consistent with a Weibull model of carcinogenesis that is adjusted for the age of initial cancer exposure. Specifically, the Weibull-like model was shown to fit very well the estimates of the LC cancer hazards in aging that initially increase with the age, turn over and then fall at the end of the lifetime. It is expected that the Weibull-like model can be applicable to other adult cancers as well.
In conclusion, in our work we have studied only age-related correlative factors. These factors should not be directly equated with causation of cancer development. To elucidate causative factors, comprehensive biological models should be further developed.
This work was partially supported by 1 R01 CA140940-01 A1 (NIH, SS the PI) grant and an LB506 grant (Nebraska Department of Health, SS the PI). The authors acknowledge Dr. Leo Kinarsky and Mr. Michael X. Gleason for fruitful discussions and help in the preparation of this work for publication. The authors also acknowledge unknown referees whose comments and suggestions helped to improve the manuscript.
This manuscript has been read and approved by all authors. This paper is unique and is not under consideration by any other publication and has not been published elsewhere. The authors and peer reviewers of this paper report no conflicts of interest. The authors confirm that they have permission to reproduce any copyrighted material.