Stat Appl Genet Mol Biol. Jan 1, 2010; 9(1): Article 41.
Published online Nov 22, 2010. doi:  10.2202/1544-6115.1617
PMCID: PMC3004784
Predicting Patient Survival from Longitudinal Gene Expression*
Yuping Zhang,* Robert J. Tibshirani, and Ronald W. Davis
*Stanford University, yupingz/at/stanford.edu
Stanford University, tibs/at/stat.stanford.edu
Stanford University, dbowe/at/stanford.edu
Characterizing dynamic gene expression patterns and predicting patient outcomes are already important tasks, and they will attract more interest as large-scale clinical microarray studies become common. However, no method has yet been developed for predicting patient outcome from longitudinal gene expression, where the gene expression of patients is monitored across time. Here, we propose a novel approach for predicting patient survival time that makes use of the time course structure of gene expression. The method is applied to a burn study. The genes involved in the final predictors are enriched in inflammatory response and immune system related pathways. Moreover, our method is consistently better than prediction methods that use gene expression from individual time points or simply pool gene expression across time points.
Keywords: prediction, time course, gene expression, survival
Microarray technology is dramatically changing studies in molecular biology, permitting the generation of high-throughput data on a global scale. In particular, one of the most exciting areas of application has been its use in clinical studies. Microarray technology helps investigators gain a fresh understanding of disease processes at the molecular level, and aids in the classification of diseases as well as the identification of new disease subtypes. Furthermore, based on an individual’s genetic information, investigators will be able to make a diagnosis and prognosis, as well as devise a treatment plan. This will lead to the realization of “personalized medicine”.
In methodology, finding associations between expression profiles and phenotypic data is the key to applying gene expression to a clinical study. In the stationary gene expression scenario, the main challenge is the well-known “p ≫ n” paradigm. Dimension reduction methods have been studied extensively for the case where mRNA abundances are measured from DNA microarrays at individual time points. Two strategies are distinguished in dimension reduction: feature selection and feature extraction. Partial least squares regression, see Nguyen and Rocke (2002), Aastveit and Martens (1986), Martens et al. (1986), Jacobsen, Kolset, et al. (1986), Cramer, Bunce, et al. (1988), Helland (1990), Faigle, Poppi, et al. (1991), Wakeling and Morris (1993), Yeh and Spiegelman (1994), Young (1994), Holcomb, Hjalmarsson, et al. (1997), Chun and Keles (2010), and ridge regression, see Jain (1985), Lecessie and Vanhouwelingen (1992), Hoerl and Kennard (1990), Vangaans and Vriend (1990), Pliskin (1990), Hertz (1991), Gana (1995), Forrester and Kalivas (2004), Hu (2005), Farkas and Heberger (2005), Maruyama and Strawderman (2005), Ozkale (2008), Akdeniz and Tabakan (2009), are techniques of dimension reduction. In contrast to the L2 penalty used in ridge regression, the Lasso method of Tibshirani (1996) uses an L1 penalty, which performs feature selection by shrinking some of the coefficients exactly to zero. In 2002, Tibshirani et al. proposed a nearest shrunken centroid method to classify subtypes of a disease, see Bair and Tibshirani (2004). Another competitive method is supervised principal component analysis (SPC), developed by Bair, Hastie, et al. (2006). It contains two dimension reduction steps: it first performs a univariate supervised feature selection, then constructs principal components from the selected features, which is a feature extraction step.
This method can be used with both quantitative outcomes and survival outcomes. These methods based on stationary gene expression data have been successfully applied to identify biomarkers and predict patient outcomes, especially for cancer, see Korde, Lusa, et al. (2010), Desmedt, Haibe-Kains, et al. (2009), Pandit, Kennette, et al. (2009), Pandit et al. (2009), Karlsson, Delle, et al. (2008), Schorge, D., et al. (2009), Naderi, Teschendorff, et al. (2007), Lin, Friederichs, et al. (2007), Schneeweiss, Lichter, et al. (2006), Nimeus-Malmstrom, Ritz, et al. (2006).
In clinical studies, more and more investigators monitor the status of patients and measure longitudinal gene expression. In the study of longitudinal gene expression, many methods have been employed to extract temporal patterns of differential expression. Storey, Xiao, et al. (2005) used a spline-based approach to detect changes in expression over time within a single biological group, or to detect differences in the behavior of expression over time between two groups. Yuan (2006) and Yuan, Li, et al. (2008) applied hidden Markov models to analyze microarray time course data under multiple biological conditions. Tai (2006) developed a multivariate empirical Bayes statistic to detect differentially expressed genes. Ma, Zhong, and Liu (2009) proposed a functional ANOVA mixed-effect model to model time course gene expression observations and identify differentially expressed genes. Zhou, Xu, Herndon, Tompkins, Davis, Xiao, and Wong (2010) proposed TANOVA to simultaneously handle the time course and factorial structure in microarray data. However, to our knowledge, no prediction method using the longitudinal structure has been published yet.
In this paper, we propose a novel prediction framework using longitudinal gene expression. We applied our method to the survival prediction of burn patients. By utilizing the longitudinal structure, we obtain much better performance of prediction than using the individual time points or simply pooling the gene expression from all time points.
We assume there are p features (e.g. genes) measured on N observations across t time points. Let Xg denote the N × t expression matrix of gene g across the t time points. Let Y be an N-vector of outcome measurements. The outcome can be a survival time subject to censoring. To make use of the time course information, we evaluate the gene response based on information pooled across time. Specifically, we search for the direction in the t-dimensional time space that has the strongest response signal of interest, and extract a predictor based on the projection onto this direction. We call this direction the “optimal direction”. It captures the gene response related to the outcome. The paragraphs below are ordered as follows. First, we describe the estimation of the “optimal direction”. Second, we introduce the procedure for model selection and the extraction of the final predictors.
2.1. Estimating the optimal direction
The unknown optimal direction is gene specific and depends on the outcome variable of interest as well as on the objective function we choose. To estimate the direction, we first define an appropriate objective function, which is required to capture the correlation between the outcome and the longitudinal gene expression. We propose a gene specific multivariate Cox regression model to capture the correlation between time course gene expression and survival outcome. For each gene, we treat the gene expression at each time point as a variable. Consider a model in which we start with a baseline hazard function λ0(t). We then want to model the influence of time course gene expression on this baseline hazard. We omit the gene subscript. We use an exponential function to model the influence, i.e. λ(t) = λ0(t) exp(Xβ), where X is an N × t matrix. As this model shows, the hazards are proportional, as first proposed by Cox (1972). Let Rj denote the risk set at time yj, that is, {i : yi ≥ yj, i ∈ {1, ..., N}}. The Cox proportional hazards regression model relates the time that passes before the event (death of the patient) occurs to the covariates (gene expression) that may be associated with that quantity. The effect of a unit increase in gene expression is multiplicative with respect to the hazard rate. The Cox model avoids making potentially untenable distributional assumptions about the hazard by working with conditional probabilities: given that an event happened at a particular time yj, what is the probability that it happened to individual l (with covariates Xl)?
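$$\frac{\lambda_0(y_j)\exp(X_l\beta)}{\sum_{i \in R_j}\lambda_0(y_j)\exp(X_i\beta)} = \frac{\exp(X_l\beta)}{\sum_{i \in R_j}\exp(X_i\beta)} \qquad (1)$$
where the numerator of the leftmost fraction is the hazard for individual l at time yj, and the denominator is the sum of the hazards at yj of all the individuals who were at risk at yj.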
Each observed event contributes one term like above to the partial likelihood. The log partial likelihood is shown as follows:
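$$l(\beta) = \sum_{j=1}^{N} \delta_j \Big[ X_j\beta - \log \sum_{i \in R_j} \exp(X_i\beta) \Big] \qquad (2)$$
where i, j ∈ {1, ..., N}, and δj = 1 if the event was observed for individual j and δj = 0 if the observation was censored.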
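As a minimal sketch of how the log partial likelihood in (2) can be evaluated for a single gene, the following Python function adds one term per observed death, summing over that patient's risk set. The function name, the argument layout and the toy data are illustrative assumptions and are not taken from the original study; `delta` is the event indicator (1 for an observed death, 0 for censoring).

```python
import math

def cox_log_partial_likelihood(beta, X, y, delta):
    """Log partial likelihood for one gene.

    beta  : list of t coefficients, one per time point
    X     : list of N rows, each a list of t expression values
    y     : list of N observed times
    delta : list of N event indicators (1 = death observed, 0 = censored)
    """
    # Linear predictor X_i * beta for every patient.
    eta = [sum(b * x for b, x in zip(beta, row)) for row in X]
    total = 0.0
    for j in range(len(y)):
        if delta[j] != 1:
            continue  # censored patients contribute no term of their own
        # Risk set R_j: every patient still under observation at time y_j.
        risk = [i for i in range(len(y)) if y[i] >= y[j]]
        total += eta[j] - math.log(sum(math.exp(eta[i]) for i in risk))
    return total

# Hypothetical toy data: 4 patients, 2 time points.
X = [[1.0, 2.0], [0.5, 0.1], [1.5, 1.0], [0.2, 0.3]]
y = [2.0, 5.0, 3.0, 7.0]
delta = [1, 0, 1, 1]
ll_zero = cox_log_partial_likelihood([0.0, 0.0], X, y, delta)
```

With all coefficients set to zero, every patient in a risk set is equally likely to be the one who dies, so the value depends only on the sizes of the risk sets at the observed death times.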
We use the above log partial likelihood function as the objective function and seek the direction that maximizes it. Let β̄ denote the coefficients estimated by maximizing the log partial likelihood function. The amplitude of β̄ reflects the relationship between the hazard function and the gene expression at each time point. The signs of β̄ reflect the relationship among the gene expression of all time points. We then project the gene expression of individual time points onto this direction and obtain the weighted gene expression, i.e. x̄ = Xβ̄. We use the estimated weighted x̄ as a new feature representing that gene. The weighted gene expression matrix X consists of p vectors x̄, each of length N.
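The estimation and projection steps for one gene can be sketched as follows. The text does not say which optimizer was used, so this example assumes plain gradient ascent with step halving on the log partial likelihood; the function names and the toy data are hypothetical.

```python
import math

def log_pl(beta, X, y, delta):
    """Cox log partial likelihood for one gene (rows of X are patients)."""
    eta = [sum(b * x for b, x in zip(beta, row)) for row in X]
    out = 0.0
    for j in range(len(y)):
        if delta[j] == 1:
            risk = [i for i in range(len(y)) if y[i] >= y[j]]
            out += eta[j] - math.log(sum(math.exp(eta[i]) for i in risk))
    return out

def gradient(beta, X, y, delta):
    """Gradient of the log partial likelihood with respect to beta."""
    eta = [sum(b * x for b, x in zip(beta, row)) for row in X]
    grad = [0.0] * len(beta)
    for j in range(len(y)):
        if delta[j] != 1:
            continue
        risk = [i for i in range(len(y)) if y[i] >= y[j]]
        denom = sum(math.exp(eta[i]) for i in risk)
        for k in range(len(beta)):
            # Risk-set average of covariate k, weighted by exp(eta).
            avg = sum(X[i][k] * math.exp(eta[i]) for i in risk) / denom
            grad[k] += X[j][k] - avg
    return grad

def optimal_direction(X, y, delta, n_iter=200):
    """Gradient ascent with step halving (an assumed optimizer)."""
    beta = [0.0] * len(X[0])
    current = log_pl(beta, X, y, delta)
    for _ in range(n_iter):
        grad = gradient(beta, X, y, delta)
        step = 1.0
        while step > 1e-8:
            trial = [b + step * g for b, g in zip(beta, grad)]
            value = log_pl(trial, X, y, delta)
            if value > current:
                beta, current = trial, value
                break
            step /= 2.0
        else:
            break  # no improving step found, so stop iterating
    return beta

def project(X, beta):
    """Weighted gene expression: one value per patient."""
    return [sum(b * x for b, x in zip(beta, row)) for row in X]

# Hypothetical toy data: 6 patients, 2 time points for one gene.
X = [[1.0, 0.2], [0.3, 0.9], [0.8, 0.5], [0.1, 0.4], [0.6, 0.7], [0.4, 0.1]]
y = [2, 9, 4, 7, 3, 8]
delta = [1, 1, 1, 0, 1, 1]
beta_hat = optimal_direction(X, y, delta)
x_bar = project(X, beta_hat)
```

In practice an established Cox regression routine would normally replace the hand-written optimizer; the point here is only that each gene yields one coefficient vector and one projected value per patient.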
2.2. Prediction model
After projecting the gene expression of individual time points to the “optimal direction”, one can obtain the weighted gene expression. The relationship between survival and the weighted gene expression varies across genes. If we could discover the underlying genetic factors of the patients, often reflected by a group of genes acting together in pathways, then we would do a better job of predicting patient survival. Thus, variable selection is necessary. Traditional variable selection techniques for stationary gene expression data can be applied to the weighted gene expression. Among the many variable selection methods, SPC and the Lasso are two of the most important ones. Bair et al. (2006) extended and modified the SPC by preselecting a subset of genes based on a score statistic. In survival prediction, the score statistic for predictor g can be chosen as
$$s_g = \frac{U_g(0)^2}{I_g(0)}$$
where U_g = l′_g, I_g = −l″_g and l_g is the log partial likelihood relating the data for a single predictor g to the outcome. The Lasso is a shrinkage and selection method for linear regression. It minimizes the usual sum of squared errors, with a bound on the sum of the absolute values of the coefficients. It has connections to soft-thresholding of wavelet coefficients, forward stagewise regression, and boosting methods. Instead of minimizing the sum of squared errors, Park and Hastie (2007) and Friedman, Hastie, et al. (2010) extended the Lasso technique of Tibshirani (1996) to the Cox regression model by maximizing the log-likelihood function. Let {(x̄i, yi, δi) : x̄i ∈ ℝ^p, yi ∈ ℝ+, δi ∈ {0, 1}, i = 1, ..., N} be N triples consisting of the p estimated weighted gene expression values, the survival time, and a binary indicator with δi = 1 for complete (died) observations and δi = 0 for right-censored patients. Rj is the risk set at time yj. The objective function with the L1 constraint is as follows.
$$l(\alpha) = \sum_{j=1}^{N} \delta_j \Big[ \bar{x}_j\alpha - \log \sum_{i \in R_j} \exp(\bar{x}_i\alpha) \Big] \qquad (3)$$
subject to Σ|αi| ≤ s. The coefficients α are chosen to maximize the above objective function. Variable selection is achieved via the coordinate descent algorithm. Variables with non-zero coefficients are selected as the predictors.
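For reference, a constrained problem of this kind is commonly solved in its equivalent penalized (Lagrangian) form, which is the form handled by coordinate descent software such as that of Friedman et al. (2010). The expression below assumes the log partial likelihood l(α) of (3); the multiplier λ is a symbol introduced here for illustration, and each bound s corresponds to some λ ≥ 0.

```latex
\hat{\alpha}(\lambda) = \arg\max_{\alpha} \Big\{ l(\alpha) - \lambda \sum_{k=1}^{p} |\alpha_k| \Big\}, \qquad \lambda \ge 0
```

A larger λ (equivalently, a smaller s) sets more coefficients exactly to zero, so tuning s by cross-validation amounts to tuning λ.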
We can use either of these two methods to do variable selection based on the weighted gene expression. In the supervised principal component analysis scenario, the algorithm is as follows:
  • Compute the multivariate Cox proportional hazard regression coefficients for all time points of each feature. Obtain the weighted predictors X, with Xi denoting the usual vector of predictors (x̄1, ..., x̄p) for the ith individual.
  • Center the weighted gene expression.
  • Compute the score statistic for each gene based on the weighted gene expression.
  • Form a reduced data matrix consisting of only those genes whose score statistic exceeds a threshold θ in absolute value (θ is estimated by cross-validation).
  • Compute the first (or first few) principal components of the reduced data matrix.
  • Use these principal component(s) in a Cox proportional hazard regression model to predict the outcome.
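The selection and extraction steps listed above can be sketched in Python as follows, starting from weighted gene expression that has already been computed. The sketch assumes the score statistic has the form U(0)²/I(0), uses power iteration as a stand-in for a full singular value decomposition, and omits the final Cox fit; all function names, the threshold value and the toy data are hypothetical.

```python
import math

def cox_score(x, y, delta):
    """Score statistic U(0)^2 / I(0) for a single predictor x."""
    u, info = 0.0, 0.0
    for j in range(len(y)):
        if delta[j] != 1:
            continue
        risk = [x[i] for i in range(len(y)) if y[i] >= y[j]]
        mean = sum(risk) / len(risk)
        u += x[j] - mean                                       # l'(0)
        info += sum((v - mean) ** 2 for v in risk) / len(risk)  # -l''(0)
    return u * u / info if info > 0 else 0.0

def first_principal_component(Z, n_iter=200):
    """Power iteration on Z^T Z; Z is N patients by q genes, centered."""
    q = len(Z[0])
    v = [1.0 / math.sqrt(q)] * q
    for _ in range(n_iter):
        zv = [sum(z * w for z, w in zip(row, v)) for row in Z]
        w_new = [sum(Z[i][k] * zv[i] for i in range(len(Z))) for k in range(q)]
        norm = math.sqrt(sum(w * w for w in w_new))
        if norm == 0.0:
            break
        v = [w / norm for w in w_new]
    scores = [sum(z * w for z, w in zip(row, v)) for row in Z]
    return v, scores

def supervised_pc(genes, y, delta, theta):
    """genes: list of p weighted expression vectors, each of length N."""
    centered = [[v - sum(g) / len(g) for v in g] for g in genes]
    stats = [cox_score(g, y, delta) for g in centered]
    keep = [k for k, s in enumerate(stats) if abs(s) > theta]
    if not keep:
        raise ValueError("no gene exceeds the threshold")
    Z = [list(row) for row in zip(*(centered[k] for k in keep))]
    loadings, pc1 = first_principal_component(Z)
    return keep, loadings, pc1

# Hypothetical toy data: 3 genes, 6 patients.
genes = [
    [1.0, 0.3, 0.8, 0.1, 0.6, 0.4],
    [0.5, 0.4, 0.6, 0.5, 0.4, 0.6],
    [0.9, 0.2, 0.7, 0.3, 0.7, 0.3],
]
y = [2, 9, 4, 7, 3, 8]
delta = [1, 1, 1, 0, 1, 1]
keep, loadings, pc1 = supervised_pc(genes, y, delta, theta=1.0)
```

The vector `pc1` would then serve as the single covariate in a Cox proportional hazard regression, and `loadings` are the right singular vector entries that can be reused to score new patients.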
To choose the value of the tuning parameter θ, we use two-fold cross-validation. One note about cross-validation is that the coefficients need to be recalculated within cross-validation, rather than reusing the coefficients obtained from the training data. For selecting an appropriate value of θ, we recalculate X* using the coefficients β̄ obtained from the whole training data. We then center each component of X* using the means derived from the training data X. Instead of computing the SVD of X*, we derive the predictor using the right singular vectors derived from the training data.
Alternatively, we can choose the Cox regression model with the Lasso penalty (Park and Hastie (2007), Friedman et al. (2010)) to select variables based on the weighted gene expression. The L1 penalty Cox regression model is used in the second stage as follows:
  • Compute the multivariate Cox proportional hazard regression coefficients for all time points of each feature. Obtain the weighted predictors X, with Xi denoting the usual vector of predictors (x̄1, ..., x̄p) for the ith individual.
  • Variable selection is achieved by maximizing the following objective function.
    $$l(\alpha) = \sum_{j=1}^{N} \delta_j \Big[ \bar{x}_j\alpha - \log \sum_{i \in R_j} \exp(\bar{x}_i\alpha) \Big] \qquad (4)$$
    subject to Σ|αi| ≤ s. The tuning parameter s is estimated by cross-validation.
  • The predictors with non-zero α are the selected predictors. Use the selected non-zero predictors in a Cox proportional hazard regression model to predict the outcome.
Given the test data, we project the time course gene expression of the test data onto the optimal direction estimated from the training data and obtain the weighted gene expression. We use these weighted gene expressions in a Cox proportional hazards model. The coefficients in this model are the estimated coefficients α obtained from the training data. To separate patients into two groups, we calculate the mean of αXtrain and use this value as a cutoff. For each patient in the test data set, we compare αXtest with the cutoff. Patients with αXtest > E(αXtrain) belong to the “high-risk” group; the rest belong to the “low-risk” group. We use the log-rank statistic to test the significance of the separation. We call our method TSPC when SPC is used in the model selection step and TCoxL1 when the Lasso is used.
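The grouping and testing step can be sketched as follows, assuming the risk scores αXtrain and αXtest have already been computed. The log-rank statistic is written in its standard two-group form with a chi-square approximation on one degree of freedom; the function names and the toy numbers are hypothetical.

```python
import math

def risk_groups(score_train, score_test):
    """Label test patients using the mean training score as the cutoff."""
    cutoff = sum(score_train) / len(score_train)
    return ["high" if s > cutoff else "low" for s in score_test]

def log_rank(y, delta, group):
    """Two-group log-rank chi-square statistic and its p-value."""
    event_times = sorted({t for t, d in zip(y, delta) if d == 1})
    observed = expected = variance = 0.0
    for t in event_times:
        n = sum(1 for yi in y if yi >= t)                        # at risk
        n1 = sum(1 for yi, g in zip(y, group) if yi >= t and g == "high")
        d = sum(1 for yi, di in zip(y, delta) if yi == t and di == 1)
        d1 = sum(1 for yi, di, g in zip(y, delta, group)
                 if yi == t and di == 1 and g == "high")
        observed += d1
        expected += d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0.0:
        return 0.0, 1.0  # groups cannot be compared
    chi2 = (observed - expected) ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical risk scores and outcomes.
score_train = [0.2, 1.1, -0.4, 0.9, -0.8, 0.6]
score_test = [1.5, -0.2, 0.8, 0.1, 1.2, -0.9]
y_test = [2, 9, 3, 8, 4, 10]
delta_test = [1, 1, 1, 0, 1, 1]

groups = risk_groups(score_train, score_test)
chi2, p_value = log_rank(y_test, delta_test, groups)
```

Because the cutoff comes from the training scores only, the test outcomes play no part in assigning the groups, which is what makes the log-rank p-value on the test set a fair measure of prediction.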
3.1. Simulation study
We performed a simulation study to validate the performance of our method. The simulated data set Xk at time point k consisted of 1000 “genes” (rows) and 100 “patients” (columns). Here we considered three time points. Let equation M6, k ∈ {1, 2, 3}, denote the “expression level” of gene g and patient j at time point k. We generated the data as follows:
equation M7
and
equation M8
Here, εb ~ N(0, 2.5) and εt ~ N(0, 2). μj is a uniform random variable on (0, 1) and I(x) is an indicator function. We introduced the time course structure in the genes indexed from 1 to 100.
The survival times were generated as follows. First, we generated the true survival times (denoted by yj, j ∈ {1, ..., 100}) of the patients. For patients 1 to 50, the true survival times were generated as normal random numbers with a mean of 6 and a standard deviation of 1. For patients 51 to 100, the true survival times were generated as normal random numbers with a mean of 10 and a standard deviation of 1. Second, we generated censoring times, denoted by cj, j ∈ {1, ..., 100}. For each patient, a censoring time was generated as a normal random number with a mean of 12 and a standard deviation of 2. Third, the observed survival times of the patients were generated as tj = min(yj, cj). If the censoring time turned out to be less than the true survival time, the observation was considered censored. After generating the training data sets, we generated the test data sets independently in the same manner.
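A minimal Python sketch of this survival time generation is given below. The means and standard deviations follow the description above; the function name, the seed and the use of Python's `random` module are assumptions made for illustration.

```python
import random

def simulate_survival(n_patients=100, seed=0):
    """Return observed times t_j = min(y_j, c_j) and event indicators."""
    rng = random.Random(seed)
    observed, delta = [], []
    for j in range(n_patients):
        # First half of the patients: mean 6; second half: mean 10.
        mean = 6.0 if j < n_patients // 2 else 10.0
        true_time = rng.gauss(mean, 1.0)
        censor_time = rng.gauss(12.0, 2.0)
        observed.append(min(true_time, censor_time))
        # Censored when the censoring time is less than the true time.
        delta.append(1 if true_time <= censor_time else 0)
    return observed, delta

train_time, train_delta = simulate_survival(seed=0)
test_time, test_delta = simulate_survival(seed=1)  # independent test set
```

Using a different seed for the test set mirrors the independent generation of training and test data.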
According to the above survival time generation method, patients 1 to 50 were defined as “high-risk”, while patients 51 to 100 were defined as “low-risk”. We generated the simulation data sets 10 times independently and applied the prediction to each simulation. For each value of the tuning parameter, we recorded the prediction performance and averaged it across the 10 independent simulations.
We compared TSPC using the time course data with SPC using individual time points. The performance of prediction was characterized by the log-rank test p-values. In the original SPC method, the tuning parameter is the Cox score, which has different ranges for different time points. To make the prediction performance comparable, we used the number of features (genes) as the tuning parameter. As shown in Figure 1, SPC on individual time points 1, 2, and 3 gives unsatisfactory prediction results, irrespective of how many features are used. To investigate whether simply combining the three time points (without considering the time course structure as TSPC does) can generate better results, we built one large matrix by pooling the simulated gene expression from the three time points. Again, applying SPC to the pooled data set does not improve the prediction performance (Figure 1). In contrast, TSPC gives much better results than SPC on individual time points or on the data set simply pooled across the three time points (Figure 1). This suggests that TSPC successfully captures the time course structure of the genes related to the outcome.
Figure 1: Methods comparison using simulated data. Black: TSPC, using SPC as the second step; red: SPC 1, SPC using the first time point; green: SPC 2, SPC using the second time point; blue: SPC 3, SPC using the third time point; cyan: SPC C, applying SPC to the ...
We likewise compared TCoxL1 with the ordinary L1 penalty Cox regression model using individual time points. Similarly, TCoxL1 performs better than the L1 penalty Cox regression model on the three individual time points, as well as on the simply pooled data set (Figure 2). The prediction performance of TCoxL1, however, is not as high as that of TSPC. In the subsequent analysis, we focused on the application of TSPC to real data.
Figure 2: Methods comparison using simulated data. Black: TCoxL1, using Cox model with L1 penalty as the second step; red: CoxL1 1, using the first time point; green: CoxL1 2, Cox model with L1 using the second time point; blue: CoxL1 3, Cox model with L1 using ...
3.2. Survival analysis of burn patients
We applied our method to burn injury. Patients can have dramatically different outcomes after suffering a burn injury. We aim to uncover the biological reasons and improve the current medical prognosis by building genetic predictors. Patients are monitored for changes in their clinical status, such as organ failure. Blood samples are drawn over the first days, months and years after the injury to obtain gene expression over time. We focus on the survival prediction of the burn patients, using as the outcome the number of days between injury and death or the day monitoring stopped. We begin by describing the preprocessing of the gene expression data, then describe the application of our method to the burn data as well as the comparison with straightforward prediction methods using individual time points.
In the Glue Grant project, a large scale collaborative effort, blood samples of 123 burn patients were collected and the gene expression levels were measured with Affymetrix HU133 Plus 2.0 arrays. The data were downloaded from the Glue Grant website (http://www.gluegrant.org/). Sample preparation was described in Zhou et al. (2010). Each array consisted of about 50,000 probe sets. Patients were monitored over time. In our study, we used the data from the early stage (day 0 to day 10, with a median time of three days) and the middle stage (day 11 to day 49, with a median time of 19 days). Gene expression values were calculated and normalized by dChip (Li and Wong (2001)) and further reduced to the 7354 probe sets with a coefficient of variation (standard deviation/mean) larger than 0.8. We used gene expression from the early and middle stages to build the predictors. For patients with several measurements during the early or middle stage, we took the median gene expression. The outcome of our prediction study is the day of death since the injury. The maximum death day is 195. Patients who survived are treated as censored at the 196th day.
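The filtering and aggregation steps described above can be sketched as follows. The data layout, the probe set identifiers and the use of the sample standard deviation are assumptions made for illustration; they are not details given in the text.

```python
import statistics

def coefficient_of_variation(values):
    """Standard deviation divided by mean (sample SD is an assumption)."""
    mean = statistics.mean(values)
    if mean == 0:
        return float("inf")  # the ratio is undefined for a zero mean
    return statistics.stdev(values) / mean

def filter_probe_sets(expr, threshold=0.8):
    """Keep probe sets whose coefficient of variation exceeds the threshold.

    expr maps a probe set identifier to its expression values across arrays.
    """
    return {probe: values for probe, values in expr.items()
            if coefficient_of_variation(values) > threshold}

def stage_value(measurements):
    """Median expression over repeated measurements within one stage."""
    return statistics.median(measurements)

# Hypothetical probe sets measured on four arrays.
expr = {
    "probe_a": [10.0, 200.0, 15.0, 400.0],   # highly variable
    "probe_b": [100.0, 110.0, 95.0, 105.0],  # nearly constant
}
kept = filter_probe_sets(expr)
early_stage = stage_value([3.0, 5.0, 4.0])  # three early-stage samples
```

Filtering on the coefficient of variation does not use the survival outcome, so it can be applied to all arrays before the data are split into training and test sets.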
We first applied TSPC to the survival analysis of burn patients using the early and middle stage gene expression data set. The 123 patients were randomly divided into a training set of size 62 and a test set of size 61. The training set was used to learn the weights (β̄1 and β̄2) for the early stage and the middle stage gene expression, respectively. Predictors were constructed as the weighted gene expression data. Two-fold cross-validation was employed to select the number of predictors from the training set, based on the first principal component. This resulted in 259 genes as the predictors. Then, the first principal component was used for the prediction on the test data. Based on the scores of the predictors, patients in the test data set were divided into two groups, the high-risk group and the low-risk group, using the median of the scores obtained from the training data as the cutoff. We applied the log-rank test to the two groups of patients. The log-rank test on the test set is highly significant, with a p-value of 4.3 × 10^−5 (Figure 3). This suggests that, with TSPC, the early and middle stage gene expression levels of burn patients are predictive of survival time.
Figure 3: Log-rank test on the test data set of burn patients by TSPC.
We next investigated whether the predictions of TSPC were associated with other important clinical outcomes. We studied the distributions of the clinical outcomes among the patients predicted by TSPC to be high-risk and low-risk. In Figure 4 we plotted the distribution of continuous clinical outcomes for the high-risk and low-risk patients. The two clinical outcomes are the maximum Denver multiple organ failure (MOF) score (MAX_DENVER_2_SCORE), which grades four organ dysfunctions (lung, kidney, liver, and heart), and the acute care length of stay and discharged to inpatient rehabilitation (ACUTE_CARE_LENGTH_OF_STAY). A one-sided Wilcoxon test indicates that high-risk patients have a higher MOF score (p-value = 1.5 × 10^−3). Figure 5 illustrates the distribution of categorical outcomes for the predicted high-risk and low-risk patients. It shows that high-risk patients have a higher death rate, a higher noninfectious complication rate, lower acute care and a lower rate of burn wound infection. The results indicate that the TSPC predictions are suggestive of related clinical outcomes.
Figure 4: Box plots of continuous outcomes of burn patients from predicted high-risk and low-risk patients.
Figure 5: Bar plots of categorical outcomes of burn patients from predicted high-risk and low-risk patients. Left top, red bar reflects the number of patients survived; blue bar reflects the number of patients died. Left bottom: red bar reflects the number of patients ...
We also studied the biological meaning of the 259 selected predictor genes. We used the Ingenuity pathway analysis tool (http://www.ingenuity.com/) to perform functional analysis on the genes. The enriched canonical pathways are shown in Figure 6. These pathways are involved in signal transduction, inflammatory response, immune response, etc., consistent with the nature of burn injury. The heatmap of the gene expression in the “response to wounding” and “immune response” pathways is shown in Figure 7. It demonstrates that although gene expression at either the early or the middle stage distinguishes the high-risk and low-risk patients only slightly, the weighted gene expression levels are very distinct between the two groups.
Figure 6: Enriched canonical pathways for survival predictors of burn patients.
Figure 7: Heatmap of gene expression in the “response to wounding” and “immune response” pathways. Left panel, gene expression from early stage and middle stage; right panel, the projected gene expression. For each heatmap, rows ...
The weights of the early and middle stage expression of the 259 genes are shown in Figure 8. The amplitude of (β̄1, β̄2) reflects the contribution of the gene expression from each time point. The sign of β̄1 × β̄2 reflects the relationship between the two time points. If the sign of β̄1 × β̄2 is positive, the two time points have additive effects on the outcome. If the sign of β̄1 × β̄2 is negative, the outcome is related to the trend of that gene over time. The amplitude of the weights is a significant factor in the time course effect: the odds ratio is 2.24 with 95% confidence interval (1.16, 4.33). The proportion of weights with opposite signs is 0.47 within the “inflammatory response” pathways, while the proportion with |β̄2| > |β̄1| is 0.84. This is consistent with the observation from Figure 7 that, for patients with different outcomes, gene expression in the middle stage is higher than in the early stage.
Figure 8: Weights of the early stage and the second stage for the projection. X-axis: weights of the early stage; Y-axis: weights of the middle stage. Top panel: histogram of the weights from the early stage; right panel: histogram of the weights from the second ...
To investigate the significance of gene expression for the survival analysis of burn patients, we compared the prediction performance of longitudinal gene expression data with that of clinical variables. Ryan, Schoenfeld, et al. (1998) studied the probability of death of burn patients based on clinical variables such as age, sex, and the percentage of the body-surface area injured. We performed survival analysis using the univariate Cox model for the clinical variables. The p-values based on the Wald test and the hazard ratio were calculated for each clinical variable, as well as for the longitudinal gene expression levels. As shown in Table 1, the gene predictors are much better than the tested clinical variables.
Table 1: Comparison with clinical variables.
Finally, to check whether predictors using the longitudinal gene expression are better than those using individual time points, we compared the prediction performance of TSPC with that of SPC on the early stage, the middle stage, or the pooled data from the two stages. We repeated the prediction 30 times, each time randomly splitting the 123 samples into training and test sets. In each randomization, the model parameters were learned from the training set and tested on the test set for all methods. The averaged prediction performance on the test data is shown in Figure 9. One can see that our method using the longitudinal gene expression has the best performance.
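The repeated random splitting can be sketched as below. The training size of 62 is carried over from the single split described earlier and is an assumption for the repeated splits; the function name and the seed are hypothetical.

```python
import random

def repeated_splits(n_samples=123, n_train=62, n_repeats=30, seed=0):
    """Return n_repeats random (train, test) partitions of sample indices."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        train = sorted(indices[:n_train])
        test = sorted(indices[n_train:])
        splits.append((train, test))
    return splits

splits = repeated_splits()
```

Each method would be fitted on `train` and evaluated on `test` for every partition, and the resulting test performance averaged over the 30 partitions.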
Figure 9: Methods comparison using burn data. Black: TSPC; red: SPC E, SPC using the early time point; green: SPC M, SPC using the middle time point; blue: SPC C, applying SPC to the data set combining the early and middle time points. Y-axis is the −log(pvalue ...
We have proposed a new statistical prediction method for survival analysis using longitudinal gene expression. The key idea is first to project the time course gene expression onto an “optimal direction”, and then to incorporate model selection and dimension reduction strategies to extract predictors. Because the weights used to project the test data are the weights obtained from the training data, the homogeneity of the data is critical. If there are strong batch effects between the training data and the test data, the estimated “optimal direction” can differ greatly from the true “optimal direction” of the test data. Batch effects should therefore be removed first. In the second stage, we used two particular methods, SPC and Cox regression with an L1 penalty. However, the second stage is not limited to these two methods. Other methods such as forward stagewise regression and boosting can be used for variable selection. One can choose an appropriate variable selection method depending on the situation and preference. Besides survival outcomes, it will be interesting in the future to predict other types of outcomes, such as binary, categorical and continuous outcomes, using longitudinal gene expression.
Our studies on the burn data and the simulated data show that appropriate use of the longitudinal structure of gene expression can improve the power of prediction. From the clinical study perspective, it is straightforward to monitor patient status and measure time course gene expression. The genome-wide dynamic regulation of gene expression is attracting more and more interest. As more time course data are generated, our prediction approach will have wider applications.
Footnotes
*We wish to acknowledge the efforts of many individuals at participating institutions of the Glue Grant Program that generated the clinical and genomic data reported here. This study was supported by NIH U54 GM-062119 and P01-HG000105.
  • Aastveit AH, Martens H. “Anova interactions interpreted by partial least-squares regression” Biometrics. 1986;42(4):829–844. doi: 10.2307/2530697. [Cross Ref]
  • Akdeniz F, Tabakan G. “Restricted ridge estimators of the parameters in semiparametric regression model” Communications in Statistics-Theory and Methods. 2009;38(11):1852–1869.
  • Bair E, Hastie T, et al. “Prediction by supervised principal components” Journal of the American Statistical Association. 2006;101(473):119–137. doi: 10.1198/016214505000000628. [Cross Ref]
  • Bair E, Tibshirani R. “Semi-supervised methods to predict patient survival from gene expression data,” PLoS Biol. 2004;2(4):E108. doi: 10.1371/journal.pbio.0020108. [PMC free article] [PubMed] [Cross Ref]
  • Chun H, Keles S. “Sparse partial least squares regression for simultaneous dimension reduction and variable selection” Journal of the Royal Statistical Society Series B-Statistical Methodology. 2010;72:3–25. doi: 10.1111/j.1467-9868.2009.00723.x. [PMC free article] [PubMed] [Cross Ref]
  • Cox DR. “Regression models and life tables” Journal of the Royal Statistical Society Series B. 1972;34(2):187–220.
  • Cramer RD, Bunce JD, et al. “Cross-validation, bootstrapping, and partial least-squares compared with multiple-regression in conventional qsar studies” Quantitative Structure-Activity Relationships. 1988;7(1):18–25. doi: 10.1002/qsar.19880070105. [Cross Ref]
  • Desmedt C, Haibe-Kains B, et al. “Gene expression signatures can predict the efficacy of anthracyclines in her2-negative and her2-positive breast cancer (bc) patients: The results of the top trial and their validation in the big1-00 trial, on behalf of the top trial investigators” Annals of Oncology. 2009:45–45. 20.
  • Faigle JF, Poppi RJ, et al. “Multicomponent principal component regression and partial least-squares analyses of overlapped chromatographic peaks” Journal of Chromatography. 1991;539(1):123–132. doi: 10.1016/S0021-9673(01)95365-8. [Cross Ref]
  • Farkas O, Heberger KR. “Comparison of ridge regression, partial least-squares, pairwise correlation, forward- and best subset selection methods for prediction of retention indices for aliphatic alcohols” Journal of Chemical Information and Modeling. 2005;45(2):339–346. doi: 10.1021/ci049827t. [PubMed] [Cross Ref]
  • Forrester JB, Kalivas JH. “Ridge regression optimization using a harmonious approach” Journal of Chemometrics. 2004;18:7–8. 372–384. doi: 10.1002/cem.883. [Cross Ref]
  • Friedman J, Hastie T, et al. “Regularization paths for generalized linear models via coordinate descent. journal of statistical software. 2010;33(1) ”. [PMC free article] [PubMed]
  • Gana R. “Ridge-regression estimation of the linear probability model. journal of applied statistics. 1995;22(4):537–539. doi: 10.1080/757584790. ”. [Cross Ref]
  • Helland IS. “Partial least-squares regression and statistical-models” Scandinavian Journal of Statistics. 1990;17(2):97–114.
  • Hertz D. “Sequential ridge-regression” Ieee Transactions on Aerospace and Electronic Systems. 1991;27(3):571–574. doi: 10.1109/7.81440. [Cross Ref]
  • Hoerl AE, Kennard RW. “Ridge-regression - degrees of freedom in the analysis of variance” Communications in Statistics-Simulation and Computation. 1990;19(4):1485–1495. doi: 10.1080/03610919008812931. [Cross Ref]
  • Holcomb TR, Hjalmarsson H, et al. “Significance regression: A statistical approach to partial least squares” Journal of Chemometrics. 1997;11(4):283–309. doi: 10.1002/(SICI)1099-128X(199707)11:4<283::AID-CEM475>3.0.CO;2-3. [Cross Ref]
  • Hu HC. “Ridge estimation of a semiparametric regression model” Journal of Computational and Applied Mathematics. 2005;176(1):215–222. doi: 10.1016/j.cam.2004.07.032. [Cross Ref]
  • Jacobsen T, Kolset K, et al. “Partial least-squares regression and fuzzy clustering - a joint approach” Mikrochimica Acta. 1986;2:1–6. 125–138.
  • Jain RK. “Ridge regression and its application to medical data” Comput Biomed Res. 1985;18(4):363–368. doi: 10.1016/0010-4809(85)90014-X. [PubMed] [Cross Ref]
  • Karlsson E, Delle U, et al. “Gene expression variation to predict 10-year survival in lymph-node-negative breast cancer,” Bmc Cancer. 2008;8 doi: 10.1186/1471-2407-8-254. [PMC free article] [PubMed] [Cross Ref]
  • Korde LA, Lusa L, et al. “Gene expression pathway analysis to predict response to neoadjuvant docetaxel and capecitabine for breast cancer” Breast Cancer Research and Treatment. 2010;119(3):685–699. doi: 10.1007/s10549-009-0651-3. [PubMed] [Cross Ref]
  • Lecessie S, Vanhouwelingen JC. “Ridge estimators in logistic-regression” Applied Statistics-Journal of the Royal Statistical Society Series C. 1992;41(1):191–201.
  • Li C, Wong WH. “Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. proc natl acad sci u s a. 2001;98(1):31–36. doi: 10.1073/pnas.011404098. ”. [PubMed] [Cross Ref]
  • Lin YH, Friederichs J, et al. “Multiple gene expression classifiers from different array platforms predict poor prognosis of colorectal cancer” Clinical Cancer Research. 2007;13(2):498–507. doi: 10.1158/1078-0432.CCR-05-2734. [PubMed] [Cross Ref]
  • Ma P, Zhong W, Liu JS. “Identifying differentially expressed genes in time course microarray data” Statistics in Biosciences. 2009;1(2):144–159. doi: 10.1007/s12561-009-9014-1. [Cross Ref]
  • Martens IL, H, et al. “Partial least-squares regression on design variables as an alternative to analysis of variance” Analytica Chimica Acta. 1986;191:133–148. doi: 10.1016/S0003-2670(00)86303-5. [Cross Ref]
  • Maruyama Y, Strawderman WE. “A new class of generalized bayes minimax ridge regression estimators” Annals of Statistics. 2005;33(4):1753–1770. doi: 10.1214/009053605000000327. [Cross Ref]
  • Naderi A, Teschendorff AE, et al. “A gene-expression signature to predict survival in breast cancer across independent data sets” Oncogene. 2007;26(10):1507–1516. doi: 10.1038/sj.onc.1209920. [PubMed] [Cross Ref]
  • Nguyen DV, Rocke DM. “Partial least squares proportional hazard regression for application to dna microarray survival data” Bioinformatics. 2002;18(12):1625–1632. doi: 10.1093/bioinformatics/18.12.1625. [PubMed] [Cross Ref]
  • Nimeus-Malmstrom E, Ritz C, et al. “Gene expression profilers and conventional clinical markers to predict distant recurrences for premenopausal breast cancer patients after adjuvant chemotherapy” European Journal of Cancer. 2006;42(16):2729–2737. doi: 10.1016/j.ejca.2006.06.031. [PubMed] [Cross Ref]
  • Ozkale MR. “A jackknifed ridge estimator in the linear regression model with heteroscedastic or correlated errors” Statistics and Probability Letters. 2008;78(18):3159–3169. doi: 10.1016/j.spl.2008.05.039. [Cross Ref]
  • Pandit TS, Kennette W, et al. “Lymphatic metastasis of breast cancer cells is associated with differential gene expression profiles that predict cancer stem cell-like properties and the ability to survive, establish and grow in a foreign environment” International Journal of Oncology. 2009;35(2):297–308. [PubMed]
  • Park M, Hastie T. “L1-regularization path algorithm for generalized linear models,” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2007;69:659–677. doi: 10.1111/j.1467-9868.2007.00607.x. [Cross Ref]
  • Pliskin JL. “The mean squared error of the generalized ridge-regression estimator and the orientation of beta” Communications in Statistics-Simulation and Computation. 1990;19(4):1477–1484. doi: 10.1080/03610919008812930. [Cross Ref]
  • Ryan CM, Schoenfeld DA, et al. “Objective estimates of the probability of death from burn injuries. n engl j med. 1998;338(6):362–366. doi: 10.1056/NEJM199802053380604. ”. [PubMed] [Cross Ref]
  • Schneeweiss A, Lichter P, et al. “Gene expression profiling to predict chemotherapy response in primary breast cancer” Breast Care. 2006;1(6):362–367. doi: 10.1159/000097997. [Cross Ref]
  • Schorge JO, DRD, et al. “Gene expression profiling to predict early relapse in ovarian cancer,” Gynecologic Oncology. 2009;112(2):S105–S106.
  • Storey JD, Xiao W, et al. “Significance analysis of time course microarray experiments” Proc Natl Acad Sci U S A. 2005;102(36):2837–1284. doi: 10.1073/pnas.0504609102. 12. [PubMed] [Cross Ref]
  • Tai ST, YC “A multivariate empirical bayes statistic for replicated microarray time course data” Annals of Statistics. 2006:2387–2412. 34. doi: 10.1214/009053606000000759. [Cross Ref]
  • Tibshirani R. “Regression shrinkage and selection via the lasso” Journal of the Royal Statistical Society Series B-Methodological. 1996;58(1):267–288.
  • Vangaans PFM, Vriend SP. “Multiple linear-regression with correlations among the predictor variables - theory and computer algorithm ridge (fortran-77)” Computers and Geosciences. 1990;16(7):933–952. doi: 10.1016/0098-3004(90)90104-2. [Cross Ref]
  • Wakeling IN, Morris JJ. “A test of significance for partial least-squares regression. journal of chemometrics. 1993;7(4):291–304. doi: 10.1002/cem.1180070407. ”. [Cross Ref]
  • Yeh CH, Spiegelman CH. “Partial least-squares and classification and regression trees. chemometrics and intelligent laboratory systems. 1994;22(1):17–23. doi: 10.1016/0169-7439(93)E0045-6. ”. [Cross Ref]
  • Young PJ. “A reformulation of the partial least-squares regression algorithm” Siam Journal on Scientific Computing. 1994;15(1):225–230. doi: 10.1137/0915015. [Cross Ref]
  • Yuan KC, M “Hidden markov models for microarray time course data under multiple biological conditions (with discussion)” Journal of the American Statistical Association. 2006;101:1323–1340. doi: 10.1198/016214505000000394. [Cross Ref]
  • Yuan YY, Li CT, et al. “Partial mixture model for tight clustering of gene expression time-course,” Bmc Bioinformatics. 2008;9 doi: 10.1186/1471-2105-9-287. [PMC free article] [PubMed] [Cross Ref]
  • Zhou B, Xu W, Herndon D, Tompkins R, Davis R, Xiao W, Wong W. “Analysis of factorial time-course microarrays with application to a clinical study of burn injury,” Proceedings of the National Academy of Sciences. 2010;107:9923. doi: 10.1073/pnas.1002757107. [PubMed] [Cross Ref]