|Home | About | Journals | Submit | Contact Us | Français|
Current studies indicate that long non-coding RNAs (lncRNAs) are frequently aberrantly expressed in cancers and implicated with prognosis in gastric cancer (GC). We intended to generate a multi-lncRNA signature to improve prognostic prediction of GC. By analyzing ten paired GC and adjacent normal mucosa tissues, 339 differentially expressed lncRNAs were identified as the candidate prognostic biomarkers in GC. Then we used LASSO Cox regression method to build a 12-lncRNA signature and validated it in another independent GEO dataset. An innovative 12-lncRNA signature was established, and it was significantly associated with the disease free survival (DFS) in the training dataset. By applying the 12-lncRNA signature, the training cohort patients could be categorized into high-risk or low-risk subgroup with significantly different DFS (HR = 4.52, 95%CI= 2.49-8.20, P < 0.0001). Similar results were obtained in another independent GEO dataset (HR=1.58, 95%CI=1.05 - 2.38, P=0.0270). Further analysis showed that the prognostic value of this 12-lncRNA signature was independent of AJCC stage and postoperative chemotherapy. Receiver operating characteristic (ROC) analysis showed that the area under receiver operating characteristic curve (AUC) of combined model reached 0.869. Additionally, a well-performed nomogram was constructed for clinicians. Moreover, single-sample gene-set enrichment analysis (ssGSEA) showed that a group of pathways related to drug resistance and cancer metastasis significantly enriched in the high risk patients. A useful innovative 12-lncRNA signature was established for prognostic evaluation of GC. It might complement clinicopathological features and facilitate personalized management of GC.
GC is a common and highly lethal malignancy, being the fourth most common cancer and the second leading cause of cancer death in the world 1. Although the tendency of incidence rates declines, it is still concerned worldwide with the highest estimated mortality rates in Eastern Asian 2. Surgery is the only curative treatment strategy and conventional chemotherapy has shown limited efficacy. Despite the recent therapeutic advances, the overall outcome of GC remains undesirable 3, 4. For the risk stratification of GC, the TNM Staging System has been widely used, which is developed and maintained by American Joint Committee on Cancer (AJCC) and adopted by the Union International Committee on Cancer (UICC). Although TNM staging system is of great value clinically, it has not adequate prognostic and predictive capabilities to guide patient management 5, 6. Thus, new biomarkers are needed to discriminate the high-risk patients with GC and consequently improve personalized cancer care.
Currently, substantial studies have focused on the roles of dysregulated functional long non-coding RNAs (lncRNAs) in human cancers 7, 8. LncRNAs range from 200 nucleotides to multiple kilobases in length, but have no protein-coding capability 9. The aberrant expression of lncRNAs is implicated in diverse cancers and some of them act as biomarkers for diagnosis and prognostication 10, 11. Compared with single biomarker, integrating multiple biomarkers into a single model would be much better 12, 13, so it is of concrete predictive and prognostic value to identify a multi-lncRNA signature in GC.
Presently, a large group of lncRNA-specific probes were represented on the commonly used microarray platform (Affymetrix HG-U133 plus 2.0), we mined previously published gene expression microarray data from the Gene Expression Omnibus (GEO), and conducted lncRNA profiling on large cohorts of GC patients. The Cox proportional hazards regression analysis is one common approach to assess the prognostic factors in survival analysis, however, it is not suitable for high-dimensional microarray data when the ratio between sample size and variables is too low 14. The least absolute shrinkage and selection operator method (LASSO) can conquer this limitation and has been widely adopted for optimal selection of prognostic genes 15-17. By this way, we identified a 12-lncRNA signature in training set GSE62254 to predict survival probability for patients with GC. We validated it in another independent set GSE15459, and assessed the prognostic value and accuracy of this classifier in training set.
GC gene expression data and corresponding clinical information data used in this study were obtained from the publicly available GEO (http://www.ncbi.nlm.nih.gov/geo/). The gene expression data were from the same chip platform (Affymetrix Human Genome U133 Plus 2.0 chips). Dataset GSE79973 including ten paired GC and normal mucosa tissues was used to identify the differentially expressed lncRNAs. First, the training set (GSE62254) was used to screen out the prognostic multi-lncRNA signature from the differentially expressed lncRNAs by LASSO Cox regression model. Then the GC samples in GSE15459 were analyzed as an independent validation set. After filtering out samples without clinical survival information, there were a total of 491 samples, including 300 from GSE62254, 191 from GSE15459 (Table S1). Supplementary Fig. 1 depicts the schematic diagram of work flow.
The raw CEL files were downloaded from GEO database and background adjusted using Robust Multichip Average (RMA) 18, which was a potent measure tool for lncRNA profiling data 19. The lncRNA profile mining approach was mainly described by Zhang et al. 20. First, the Affymetrix HG-U133 Plus 2.0 probe set IDs was mapped to the NetAffx Annotation Files. Second, based on the Refseq transcript ID and/or Ensembl gene ID, non-coding protein genes were extracted and were further filtered through excluding pseudogenes. Finally, we produced the 2448 lncRNA transcripts annotated with corresponding Affymetrix probe IDs.
The nomogram and calibration plots were generate using “rms” package of R software (version 3.3.1). Calibration was used to assess the performance of the nomogram. Nomogram-predicted survival and observed outcome were plotted on the x-axis and y-axis respectively, and the 45-degree line represented the best prediction. ROC analysis was also performed to estimate the predictive accuracy of the DFS nomogram using the “pROC” package of R software. Additionally, decision curve analysis (DCA) was also performed to assess the clinical utility of the nomogram. The DCA could be used to assess and compare prediction models which incorporated clinical consequences 21, 22. The x-axis indicated the percentage of threshold probability, and the y-axis represented the net benefit.
Single-sample gene-set enrichment analysis (ssGSEA) was performed to identify the differentially expressed gene sets between the low and high- risk cohorts. The enrichment score stands for the degree of absolute enrichment of a gene set in each sample within a certain dataset 23, 24. Using GSVA package and its ssGSEA method (http://www.bioconductor.org), enrichment scores in each sample were calculated as the normalized difference in empirical cumulative distribution functions of gene expression ranks inside and outside the gene set 25. The most significantly differentially expressed gene sets (p-value <0.001) were generated for further analysis.
We used the R software version 3.0.2 and the “glmnet” package (R Foundation for Statistical Computing, Vienna, Austria) to perform the LASSO Cox regression model analysis. The risk scores were calculated according to the formula generated through LASSO Cox regression model. Using the median risk score as the cutoff point, patients in each dataset were divided into low-risk or high-risk group correspondingly. For the outcome analysis, five-year recurrence was the primary endpoint in GSE62254 26, and DFS was defined as the time of surgery to the first confirmed relapse; Overall survival (OS) was measured in GSE15459 27. Survival differences between the low-risk and high-risk groups in each dataset were assessed by the Kaplan-Meier estimator and the log-rank test. To test whether the risk score was independent of AJCC stage, or postoperative chemotherapy, we conducted multivariable Cox regression and stratification analysis. ROC analysis was introduced to assess the sensitivity and specificity of the survival prediction based on the risk score, AJCC stage, and the combined model of risk score and AJCC stage. To generate ROC curves, the patients whose durations were less than the median DFS needed to be excluded, if they still did not recur at last follow-up. The rest patients were classified as surviving either longer or shorter than the median DFS 28. During all the statistical analysis, including the log-rank test, Cox regression analysis and ROC analysis, P value being less than 0.05 was defined as the significant difference.
By using “limma” package, we identified a set of 339 differentially expressed lncRNAs whose parameter p-value was less than 0.01 from dataset GSE79973 (Supplementary Fig. 2). We further analyzed those 339 genes by LASSO Cox regression model in the training dataset GSE62254 (Supplementary Fig. 3). Consequently, we identified the 12-lncRNA signature that was significantly correlated with DFS in GC patients. Table Table11 showed a list of probes with their obtained coefficients which were derived from the LASSO analysis. The higher risk score indicated unfavorable prognosis in GC. Thus, the higher expression levels of seven genes with positive coefficients indicated (CHST9-AS1, TPT1-AS1, MIR100HG, LOC400043, LINC00340, LOC283174, LOC401093) meant higher risk score and accordingly worse outcome. The negative coefficients for the remaining five genes (ENSG00000251538, LOC100133985, Hs.93194, ENSG00000233236, ENSG00000229565) indicated that their higher levels of expression were associated with better prognosis.
A risk-score formula was created according to the expression of these 12 lncRNAs for DFS prediction, as follows: Risk score = (0.1243*expression level of CHST9-AS1) + (-0.4656*expression level of ENSG00000251538) + (0.2788*expression level of TPT1-AS1) + (0.0340*expression level of MIR100HG) + (0.1696*expression level of LOC400043) + (0.0243*expression level of LINC00340) + (0.0051*expression level of LOC283174) + (-0.5749*expression level of LOC100133985) + (-0.0659*expression level of Hs.93194) + (0.0008*expression level of LOC401093) + (-1.3684*expression level of ENSG00000233236) + (-0.0054*expression level of ENSG00000229565). We then calculated the 12-lncRNA signature risk score for each patient in the training dataset GSE62254. The patients were classified into low-risk group (n=150) and high-risk group (n=150) using the median risk score as the cutoff point. Patients in the high-risk group had significantly shorter median DFS than those in the low-risk group (HR = 4.52, 95%CI= 2.49-8.20, P < 0.0001) (Figure (Figure1A).1A). The association of the 12-lncRNA risk score with DFS was also significant when it was assessed by the multivariable Cox regression model as a continuous variable (HR = 6.9340, 95%CI= 3.655-13.156, P < 0.0001) (Table (Table2).2). As shown in Figure Figure2A,2A, there were apparently more recurred patients in high risk group, and the distribution of Z-score transformed risk score observably shifted to right in recurred patients compared with recurrence-free ones (Figure (Figure22B).
To verify the ability of the 12-lncRNA signature in predicting the survival of GC patients, we further validated our findings in another independent dataset GSE15459, which yielded the similar results as above. Through the same risk score-based classification, patients were categorized into high-risk group (N=95) and low-risk group (N=96). Patients with GC in high-risk group had significantly shorter median OS than those in low-risk group (HR=1.58, 95%CI=1.05 - 2.38, P=0.0270) (Figure (Figure1B).1B). The multivariable Cox regression analysis showed that the 12-lncRNA risk score also had statistical significance as a continuous variable in the GSE15459 cohorts (HR=1.476, 95%CI=1.071 - 2.037, P=0.0175) (Table (Table22).
We performed multivariable Cox regression analysis in the two datasets. The 12-lncRNA risk score and other clinicopathological factors, including age, gender, AJCC stage and postoperative chemotherapy were used as covariates. It showed that even adjusted by AJCC stage and other covariates in each dataset, the 12-lncRNA risk score remained to be significantly associated with patients' survival (P < 0.0001 in GSE62254, P = 0.0175 in GSE15459) (Table (Table2).2). Consistent with risk score, AJCC stage was also significantly associated with patients' survival (Table (Table2).2). In order to test whether the prognostic value of the 12-lncRNA signature was independent of AJCC stage, stratification analysis was performed. Patients in dataset GSE62254 (N=300) were factitiously stratified into early stage stratum (stage I&II) and late stage stratum (stage III&IV). The results showed that 12-gene risk score remained the ability of predicting the prognosis within each stage stratum. Figure Figure3A3A showed that high-risk patients in early stage stratum had significantly shorter median DFS than low-risk patients (HR = 2.22, 95%CI= 1.42-3.48, P = 0.0002), patients in late stage stratum yielded similar results (HR = 7.08, 95%CI= 1.65-30.32, P = 0.0004) (Figure (Figure3B),3B), indicating that the prognostic value of 12-lncRNA signature was independent of AJCC stage. We also investigated whether the 12-lncRNA signature was independent of postoperative chemotherapy. The same approaches were adopted as above. Multivariable Cox regression analysis showed that postoperative chemotherapy was also significantly associated with DFS in dataset GSE62254 (HR = 0.468, 95%CI= 0.271-0.809, P = 0.0065) (Table (Table2),2), indicating that postoperative chemotherapy was a protective factor. Figure Figure3C3C showed that patients with postoperative chemotherapy in the low-risk group had significantly longer median DFS than high-risk patients (HR = 2.25, 95%CI= 1.21-4.17, P = 0.0067), similar results was generated in patients without postoperative chemotherapy (HR = 3.58, 95%CI= 2.00-6.42, P < 0.0001) (Figure (Figure3D).3D). The results indicated that regardless of the postoperative chemotherapy status, the 12-lncRNA signature could discriminate high risk GC patients from low risk ones.
Additionally, ROC analysis was performed to demonstrate the sensitivity and specificity of survival prediction. AUC was evaluated and compared between the 12-lncRNA risk score model and AJCC stage. As shown in Figure Figure4,4, both AJCC stage and 12-lncRNA risk score model owned valuable predicted power to estimate the prognosis of GC patients, and there was no significant difference between them. If combined the 12-lncRNA risk score model with AJCC stage together, the AUC of combined model was significantly greater than that of AJCC stage alone (0.869 versus 0.758, 95%CI: 0.665-0.851, P =0.0152).
In order to provide a quantitative method for the clinicians to predict the probability of 3-year DFS in GC, a nomogram was constructed in GSE62254 dataset which integrated the 12-lncRNA signature, age, tumor stage and the Lauren classification (Figure (Figure5A).5A). Figure Figure5B5B showed that the nomogram did well in the calibration plots compared with the ideal model. ROC analysis was performed to calculate the predictive accuracy of the nomogram, and the AUC of nomogram is 0.8699(Figure 0.8699(Figure5C).5C). DCA was introduced to estimate the clinical utility of the nomogram, and it performed well as shown in Figure Figure55D.
To identify the 12-lncRNA associated pathways, we performed ssGSEA to analyze the GSE62254 dataset using risk score for classification. As shown in Figure Figure6,6, a group of pathways related to drug resistance and cancer metastasis significantly enriched in the high risk patients; however, some apoptosis-related pathways were up-regulated in the low risk cohorts. These pathways were found to be significantly associated with the risk score, which was validated through Pearson's correlation analysis (Figure (Figure7A).7A). As cancer metastasis is an important factor that exerts influence on the disease occurrence and patients' prognosis 29, the risk scores were further analyzed in patients with and without distant metastases in GSE62254 dataset (TNM stage IV patients were considered to be with distant metastases). In accordance with the results above, patients were inclined to get higher risk scores in the cohorts with distant metastases compared to the ones without distant metastases (Figure (Figure77B).
Numerous reports indicate that dysregulated lncRNA expression may be implicated in various aspects of tumor, including carcinogenesis, progression, and metastasis 30-32. Some lncRNAs are considered to be useful biomarkers to predict prognosis in GC patients, such as HULC and LINC00668 11, 33. However, several limits are still concerned including small number of lncRNAs screened, inadequate samples, and lack of independent validation, the reliability and utility of prognostic predication in GC need further investigation. To establish a prognostic multi-lncRNA signature, we mined the existing microarray gene expression data to profile lncRNAs. In our study, LASSO analysis was introduced, which was a popular tool for regression with high-dimensional predictors 17. By exploring the relevance between lncRNA expression profiles and clinical outcome of GC patients in dataset GSE62254, we constructed a 12-lncRNA signature that was significantly associated with patients' DFS.
In this study, a novel prognostic 12-lncRNA signature was developed and validated to improve the ability of predicting prognosis for GC patients. Our results revealed that this classifier could successfully classify GC patients into high-risk and low-risk groups with significant differences in DFS in training set. The prognostic value of this 12-lncRNA signature could be verified in another independent dataset GSE15459, indicating the reproducibility and utility of this multi-lncRNA signature for the prognostic prediction in GC.
Stratification analysis revealed that prognostic power of this 12-lncRNA signature was independent of AJCC stage, which was currently the most important prognostic factor for GC. AJCC staging system could provide effective prognostic information and contribute to the selection of proper therapeutic regimen. Our study revealed that AJCC stage was a strong prognostic factor through the multivariable Cox regression analysis, which was consistent with previous studies 34, 35. Therefore, it was necessary to further evaluate whether the prognostic value of 12-lncRNA signature was independent of AJCC stage. The patients were divided into early stage and late stage stratums factitiously to facilitate analysis. The 12-lncRNA signature could successfully divide the stratified patients into low risk and high risk subgroups in GSE62254, and there was a clear separation in the survival curves between them. Based on these results, we could conclude that the prognostic value of the 12-lncRNA signature was independent of AJCC stage in our study.
Moreover, 12-lncRNAs risk model remained strong prognostic ability when stratified by postoperative chemotherapy. In Asian countries, postoperative chemotherapy have been extensively used as the standard treatment, and two import clinic trials showed that patients could benefit from postoperative chemotherapy with prolonged survival 36, 37. It was consistent with our results of multivariable Cox regression analysis. Further stratification analysis demonstrated that the 12-lncRNA signature could also allow a discrimination of GC patients' prognosis, having nothing with its postoperative chemotherapy stratum. This further demonstrated that the 12-lncRNA signature might be an independent prognostic factor for GC.
In order to assess the predictive ability of the 12-lncRNA signature, ROC analysis was performed. An AUC was used as a measure of the accuracy in diagnostic test 38. ROC analysis revealed that the 12-lncRNA signature had a similar survival predictive power as AJCC stage. Interestingly, the prognostic power was superior to AJCC stage alone when we combined 12-lncRNAs risk model with AJCC stage together. Moreover, the AUC of combined model reached 0.869, indicating it might complement clinicopathological features and improve the accuracy of prognostic prediction in GC.
As the 12-lncRNA signature could discriminate the patients with high risk of recurrence from GC patients, we hypothesized that this gene signature might be associated with some signaling pathways that could impact the prognosis of GC. Currently, cancer metastasis and drug resistance were the main challenges in clinical practice and badly affected patients' prognosis 29, 39. Interestingly, according to the results of ssGSEA, these pathways were highly enriched in the high risk group. Furthermore, our finding showed that the risk score was closely related to these pathways, providing some insight into the molecular mechanisms that underlie the pathological process and cancer progression in GC.
Our study has showed that the 12-lncRNA signature was strongly associated with the prognosis of GC. However, the biological functions of 12 lncRNAs have not been clarified completely in GC. Some of the lncRNAs used in our signature have been reported in previous studies. MIR100HG acted as regulators of hematopoiesis and oncogenes in myeloid leukemia 40. LOC400043 was one of “miRNA sponges”, it controlled several biological functions via sequestering miR-28-3p and miR-96-5p 41. Interestingly, LINC00340 was found to act both as a tumor suppressor and pro-metastasis factor in cancer 42, 43. The reports with respect to the other lncRNAs have been extremely rare, further researches about the biological functions of the lncRNAs are needed. In our study, MIR100HG, LINC00340, LOC283174, LOC401093 are up-regulated in GC samples compared with their paired normal tissues and correlated with shorter survival, indicating a detrimental role in GC biogenesis. Contrariwise, ENSG00000251538, LOC100133985, Hs.93194, ENSG00000233236, ENSG00000229565 are down-regulated in GC and might be protective factors. There may be some biases in the course of selecting prognostic multi-lncRNA signature from view of biogenesis; however, because of its strong relevance with prognosis, the roles of these genes deserve further investigations, especially in GC.
Some limitations in our study need to be acknowledged. First, the primary endpoints in the two datasets are not exactly the same. DFS was used in the training set, but OS in the validation set, so the robustness of the 12-lncRNA signature on prognostic prediction requires further validation in large prospective clinic trials. Second, we have no experimental data about the lncRNAs in the signature, and most of which have been rarely reported, so it is in need of more evidence to elucidate the inherent association between the 12-lncRNA signature and prognosis in GC. Despite these drawbacks, our findings still showed the significant and consistent correlation of the 12-lncRNA signature with OS in independent dataset, implying it is a useful prognostic biomarker for GC.
In conclusion, we have generated an innovative prognostic 12-lncRNA signature in GC. It might complement clinicopathological features and facilitate personalized management of GC. In future, large-scale prospective researches are needed to further assess the robustness of this signature before clinical application, and the underlying biological mechanisms associated with this signature warrant further study.
This project was supported by grants from the National Natural Science Foundation of China (Grant No: 81421001, 81320108024, 31371420, 81530072, 81402347 and 31371273); The Program for Professor of Special Appointment (2015 Youth Eastern Scholar NO. QD2015003) at Shanghai Institutions of Higher Learning; Shanghai Municipal Education Commission-Gaofeng Clinical Medicine Grant Support (NO.20161309); and Chenxing Project of Shanghai Jiao-Tong University to H. Chen.