In cancer studies with high-throughput genetic and genomic measurements, integrative analysis provides a way to effectively pool and analyze heterogeneous raw data from multiple independent studies and outperforms “classic” meta-analysis and single-dataset analysis. When marker selection is of interest, the genetic basis of multiple datasets can be described using the homogeneity model or the heterogeneity model. In this study, we consider marker selection under the heterogeneity model, which includes the homogeneity model as a special case and can be more flexible. Penalization methods have been developed in the literature for marker selection. This study advances beyond published ones by introducing contrast penalties, which can accommodate the within- and across-dataset structures of covariates/regression coefficients and, by doing so, further improve marker selection performance. Specifically, we develop a penalization method that accommodates the across-dataset structures by smoothing over regression coefficients. An effective iterative algorithm, which calls an inner coordinate descent iteration, is developed. Simulation shows that the proposed method outperforms the benchmark with more accurate marker identification. The analysis of breast cancer and lung cancer prognosis studies with gene expression measurements shows that the proposed method identifies genes different from those identified using the benchmark and has better prediction performance.
Integrative analysis; Contrasted penalization; Marker selection; High-throughput cancer studies
In cancer diagnosis studies, high-throughput gene profiling has been extensively conducted, searching for genes whose expressions may serve as markers. Data generated from such studies have the “large d, small n” feature, with the number of genes profiled much larger than the sample size. Penalization has been extensively adopted for simultaneous estimation and marker selection. Because of small sample sizes, markers identified from the analysis of single datasets can be unsatisfactory. A cost-effective remedy is to conduct integrative analysis of multiple heterogeneous datasets. In this article, we investigate composite penalization methods for estimation and marker selection in integrative analysis. The proposed methods use the minimax concave penalty (MCP) as the outer penalty. Under the homogeneity model, the ridge penalty is adopted as the inner penalty. Under the heterogeneity model, the Lasso penalty and MCP are adopted as the inner penalty. Effective computational algorithms based on coordinate descent are developed. Numerical studies, including simulation and analysis of practical cancer datasets, show satisfactory performance of the proposed methods.
cancer diagnosis studies; composite penalization; gene expression; integrative analysis
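As a point of reference for the penalties named above, the MCP and its univariate thresholding rule, the basic step of a coordinate descent update, can be sketched as follows. This is a minimal illustration assuming a single standardized covariate and a regularization parameter gamma > 1. The function names `mcp_penalty`, `soft_threshold`, and `mcp_threshold` are hypothetical, and the sketch is not the composite-penalty algorithm of the article.

```python
def mcp_penalty(t, lam, gamma):
    """MCP value at t: lam*|t| - t^2/(2*gamma) for |t| <= gamma*lam, constant beyond."""
    a = abs(t)
    if a <= gamma * lam:
        return lam * a - a * a / (2.0 * gamma)
    return 0.5 * gamma * lam * lam


def soft_threshold(z, lam):
    """Lasso soft-thresholding operator."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def mcp_threshold(z, lam, gamma):
    """Minimizer over b of 0.5*(z - b)^2 + mcp_penalty(b, lam, gamma).

    Assumption: one standardized covariate and gamma > 1, so the
    univariate objective is convex and the minimizer is unique.
    """
    if gamma <= 1.0:
        raise ValueError("gamma must exceed 1 for a unique minimizer")
    if abs(z) <= gamma * lam:
        return soft_threshold(z, lam) / (1.0 - 1.0 / gamma)
    return z  # large signals are left unshrunk
```

Small inputs are set exactly to zero, moderate inputs are shrunk less than under the Lasso, and inputs beyond gamma*lam are not shrunk at all, which is the property that motivates using MCP for selection.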
Illness and the medical expenditure that follows have a profound impact on the well-being of individuals and households. China is a huge country with significant regional differences. The goal of this study is to investigate the associations of illness and medical expenditure with other categories of household expenditures, with special attention paid to the differences in observations between the western and eastern regions.
A survey was conducted in six major cities in China, three in the east and three in the west, in 2011. Data on demographics, illness conditions, and medical and other expenditures were collected from 12,515 households.
In the analysis of the associations of illness conditions and medical expenditure with demographics, multiple significant associations were observed, and there were differences between the eastern and western regions. In univariate analyses, illness conditions and medical expenditure were found to have significant associations with other categories of expenditures. In multivariate analyses adjusting for household and household head characteristics, few associations were observed, and there were differences between the regions.
This study has provided empirical evidence on the associations of illness/medical expenditure with demographics and with other categories of expenditures. Differences across regions were observed in multiple aspects. The reasons underlying such differences are worth investigating further.
Illness condition; Medical expenditure; Household expenditure; Cross-region difference; China
In modern statistical applications, the dimension of covariates can be much larger than the sample size. In the context of linear models, correlation screening (Fan and Lv, 2008) has been shown to reduce the dimension of such data effectively while achieving the sure screening property, i.e., all of the active variables can be retained with high probability. However, screening based on the Pearson correlation does not perform well when applied to contaminated covariates and/or censored outcomes. In this paper, we study censored rank independence screening of high-dimensional survival data. The proposed method is robust to predictors that contain outliers, works for a general class of survival models, and enjoys the sure screening property. Simulations and an analysis of real data demonstrate that the proposed method performs competitively on survival data sets of moderate size and high-dimensional predictors, even when these are contaminated.
High-dimensional survival data; Rank independence screening; Sure screening property
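The correlation screening step that the abstract takes as its starting point can be sketched as below: rank covariates by absolute marginal correlation with the response and keep the top few. This is a minimal sketch of the Pearson-based screening only, assuming an uncensored response; it is not the censored rank independence screening proposed in the paper, and the names `pearson` and `correlation_screen` are hypothetical.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # a constant column carries no marginal signal
    return sxy / math.sqrt(sxx * syy)


def correlation_screen(columns, y, keep):
    """Return the indices of the `keep` covariates with the largest |correlation|.

    Assumption: `columns` is a list of covariate columns and `y` is an
    uncensored numeric response.
    """
    scores = [abs(pearson(col, y)) for col in columns]
    order = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return sorted(order[:keep])
```

Because the Pearson correlation is sensitive to outliers and undefined for censored outcomes, a rank-based statistic would replace `pearson` in a robust variant.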
Nasopharyngeal carcinoma (NPC) is a malignant neoplasm arising from the mucosal epithelium of the nasopharynx. Different races can have different etiology, presentation, and progression patterns.
Data were analyzed on NPC patients in the United States reported to the SEER (Surveillance, Epidemiology, and End Results) database between 1973 and 2009. Racial groups studied included non-Hispanic whites, Hispanic whites, blacks, Asians, and others. Patient characteristics, age-adjusted incidence and mortality rates, treatment, and five-year relative survival rates were compared across races. Stratification by stage at diagnosis and histologic type was considered. Multivariate regression was conducted to evaluate the significance of racial differences.
Patient characteristics that were significantly different across races included age at diagnosis, histologic type, in situ/malignant tumors in lifetime, stage, grade, and regional nodes positive. Incidence and mortality rates were significantly different across races, with Asians having the highest rates overall and stratified by age and/or histologic type. Asians also had the highest rate of receiving radiation only. The racial differences in treatment were significant in the multivariate stratified analysis. When stratified by stage and histologic type, Asians had the best five-year survival rates. The survival experience of other races depended on stage and type. In the multivariate analysis, the racial differences were significant.
Analysis of the SEER data shows that racial differences exist among NPC patients in the U.S. This result can be informative to cancer epidemiologists and clinicians.
nasopharyngeal carcinoma; racial differences; SEER
MCL (mantle cell lymphoma) is a rare subtype of NHL (non-Hodgkin lymphoma) with mostly poor prognosis. Different races have different etiology, presentation, and progression patterns.
Data were analyzed on MCL patients in the United States reported to the SEER (Surveillance, Epidemiology, and End Results) database between 1992 and 2009. SEER contains the most comprehensive population-based cancer information in the U.S., covering approximately 28% of the population. Racial groups analyzed included non-Hispanic whites, Hispanic whites, blacks, and Asians/PIs (Pacific Islanders). Patient characteristics, age-adjusted incidence rate, and survival rate were compared across races. Stratification by age, gender, and stage at diagnosis was considered. Multivariate analysis was conducted on survival.
In the analysis of patients’ characteristics, distributions of gender, marital status, age at diagnosis, stage, and extranodal involvement were significantly different across races. For all three age groups and for both males and females, non-Hispanic whites had the highest incidence rates. In the analysis of survival, for cancers diagnosed in the period of 1992–2004, no significant racial difference was observed. For cancers diagnosed in the period of 1999–2004, significant racial differences existed for the 40–64 age group and stage III and IV cancers.
Racial differences exist among MCL patients in the U.S. in terms of patients’ characteristics, incidence, and survival. More extended data collection and analysis are needed to more comprehensively describe and understand the racial differences.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2407-14-764) contains supplementary material, which is available to authorized users.
Mantle cell lymphoma; Racial differences; SEER; Non-hodgkin lymphoma
High-throughput cancer studies have been extensively conducted, searching for genetic markers associated with outcomes beyond clinical and environmental risk factors. Gene–environment interactions can have important implications beyond main effects. The commonly-adopted single-marker analysis cannot accommodate the joint effects of a large number of markers. The existing joint-effects methods also have limitations. Specifically, they may suffer from high computational cost, do not respect the “main effect, interaction” hierarchical structure, or use ineffective techniques. We develop a penalization method for the identification of important G × E interactions and main effects. It has an intuitive formulation, respects the hierarchical structure, accommodates the joint effects of multiple markers, and is computationally affordable. In numerical study, we analyze prognosis data under the AFT (accelerated failure time) model. Simulation shows satisfactory performance of the proposed method. Analysis of an NHL (non-Hodgkin lymphoma) study with SNP measurements shows that the proposed method identifies markers with important implications and satisfactory prediction performance.
Gene–environment interaction; Penalized marker identification; Cancer prognosis
Penalized regression methods are becoming increasingly popular in genome-wide association studies (GWAS) for identifying genetic markers associated with disease. However, standard penalized methods such as LASSO do not take into account the possible linkage disequilibrium between adjacent markers. We propose a novel penalized approach for GWAS using a dense set of single nucleotide polymorphisms (SNPs). The proposed method uses the minimax concave penalty (MCP) for marker selection and incorporates linkage disequilibrium (LD) information by penalizing the difference of the genetic effects at adjacent SNPs with high correlation. A coordinate descent algorithm is derived to implement the proposed method. This algorithm is efficient in dealing with a large number of SNPs. A multi-split method is used to calculate the p-values of the selected SNPs for assessing their significance. We refer to the proposed penalty function as the smoothed MCP and the proposed approach as the SMCP method. The performance of the proposed SMCP method and its comparison with the LASSO and MCP approaches are evaluated through simulation studies, which demonstrate that the proposed method is more accurate in selecting associated SNPs. Its applicability to real data is illustrated using heterogeneous stock mice data and a rheumatoid arthritis dataset.
Genetic association; Feature selection; Linkage disequilibrium; Penalized regression; Single nucleotide polymorphism
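A penalty of the kind described above, MCP for selection plus a term penalizing differences between effects at adjacent correlated SNPs, might be sketched as follows. The quadratic form on differences of absolute effects, the use of a correlation-based weight per adjacent pair, and all names (`mcp`, `smoothed_mcp`, `lam`, `gamma`, `eta`, `weights`) are illustrative assumptions and not the paper's exact specification.

```python
def mcp(t, lam, gamma):
    """Minimax concave penalty evaluated at t."""
    a = abs(t)
    if a <= gamma * lam:
        return lam * a - a * a / (2.0 * gamma)
    return 0.5 * gamma * lam * lam


def smoothed_mcp(beta, weights, lam, gamma, eta):
    """Sparsity term plus a smoothing term over adjacent coefficients.

    Assumptions: weights[j] is a nonnegative measure of correlation (LD)
    between SNP j and SNP j+1, and the smoothing term is
    (eta/2) * sum_j weights[j] * (|beta[j]| - |beta[j+1]|)^2.
    """
    if len(weights) != len(beta) - 1:
        raise ValueError("need one weight per adjacent pair of coefficients")
    sparsity = sum(mcp(b, lam, gamma) for b in beta)
    smooth = 0.5 * eta * sum(
        w * (abs(beta[j]) - abs(beta[j + 1])) ** 2 for j, w in enumerate(weights)
    )
    return sparsity + smooth
```

Under this form, two adjacent SNPs in strong LD incur no smoothing cost when their effects have equal magnitude, and an increasing cost as the magnitudes diverge.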
In the analysis of cancer studies with high-dimensional genomic measurements, integrative analysis provides an effective way of pooling information across multiple heterogeneous datasets. The genomic basis of multiple independent datasets, which can be characterized by the sets of genomic markers, can be described using the homogeneity model or heterogeneity model. Under the homogeneity model, all datasets share the same set of markers associated with responses. In contrast, under the heterogeneity model, different studies have overlapping but possibly different sets of markers. The heterogeneity model contains the homogeneity model as a special case and can be much more flexible. Marker selection under the heterogeneity model calls for bi-level selection to determine whether a covariate is associated with response in any study at all as well as in which studies it is associated with responses. In this study, we consider two minimax concave penalty (MCP) based penalization approaches for marker selection under the heterogeneity model. For each approach, we describe its rationale and an effective computational algorithm. We conduct simulation to investigate their performance and compare with the existing alternatives. We also apply the proposed approaches to the analysis of gene expression data on multiple cancers.
Integrative analysis; Heterogeneity model; Marker selection
Impaired function of Janus kinase/signal transducer and activator of transcription (JAK/STAT) signaling pathway genes leads to immunodeficiency and various hematopoietic disorders. We evaluated the association between genetic polymorphisms (SNPs) in 12 JAK/STAT pathway genes (JAK3, STAT1, STAT2, STAT3, STAT4, STAT5a, STAT5b, STAT6, SOCS1, SOCS2, SOCS3, and SOCS4) and NHL risk in a population-based case-control study of Connecticut women. We identified three SNPs in STAT3 (rs12949918 and rs6503695) and STAT4 (rs932169) associated with NHL risk after adjustment for multiple comparisons. Our results suggest that genetic variation in JAK/STAT pathway genes may play a role in lymphomagenesis and warrants further investigation.
JAK/STAT signaling pathway; Non-Hodgkin Lymphoma; polymorphism; case-control study
To develop a reference of population-based gestational age-specific birth weight percentiles for contemporary Chinese.
Birth weight data were collected by the China National Population-based Birth Defects Surveillance System. A total of 1,105,214 live singleton births aged ≥28 weeks of gestation without birth defects during 2006–2010 were included. The lambda-mu-sigma method was utilized to generate percentiles and curves.
Gestational age-specific birth weight percentiles for male and female infants were constructed separately. Significant differences were observed between the current reference and other references developed for Chinese or non-Chinese infants.
There have been moderate increases in birth weight percentiles for Chinese infants of both sexes and most gestational ages since the 1980s, suggesting the importance of utilizing an updated national reference for both clinical and research purposes.
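For readers unfamiliar with the lambda-mu-sigma (LMS) method mentioned above, the standard conversion between LMS parameters and percentiles can be sketched as below. At a given gestational age the method summarizes the distribution by a Box-Cox power L, a median M, and a coefficient of variation S. The function names `lms_centile` and `lms_zscore` are hypothetical, and any parameter values passed to them are for the user to supply; none are taken from this study.

```python
import math
from statistics import NormalDist


def lms_centile(p, L, M, S):
    """Value at percentile p (0 < p < 1) given LMS parameters at one age."""
    z = NormalDist().inv_cdf(p)
    if abs(L) < 1e-12:
        return M * math.exp(S * z)  # limiting case L -> 0
    return M * (1.0 + L * S * z) ** (1.0 / L)


def lms_zscore(y, L, M, S):
    """Z-score of measurement y given LMS parameters at one age."""
    if abs(L) < 1e-12:
        return math.log(y / M) / S
    return ((y / M) ** L - 1.0) / (L * S)
```

The two functions are inverses of each other, so a birth weight placed at a given percentile maps back to the corresponding standard normal quantile.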
In cancer research, high-throughput profiling studies have been extensively conducted, searching for markers associated with prognosis. Because of the “large d, small n” characteristic, results generated from the analysis of a single dataset can be unsatisfactory. Recent studies have shown that integrative analysis, which simultaneously analyzes multiple datasets, can be more effective than single-dataset analysis and classic meta-analysis. In most existing integrative analyses, the homogeneity model has been assumed, which postulates that different datasets share the same set of markers. Several approaches have been designed to reinforce this assumption. In practice, different datasets may differ in terms of patient selection criteria, profiling techniques, and many other aspects. Such differences may make the homogeneity model too restrictive. In this study, we assume the heterogeneity model, under which different datasets are allowed to have different sets of markers. With multiple cancer prognosis datasets, we adopt the AFT (accelerated failure time) model to describe survival. This model may have the lowest computational cost among popular semiparametric survival models. For marker selection, we adopt a sparse group MCP (minimax concave penalty) approach. This approach has an intuitive formulation and can be computed using an effective group coordinate descent algorithm. A simulation study shows that it outperforms the existing approaches under both the homogeneity and heterogeneity models. Data analysis further demonstrates the merit of the heterogeneity model and the proposed approach.
Integrative analysis; Cancer prognosis; Heterogeneity model; Penalization
To study the association between serum C-reactive protein (CRP) and urinary albumin excretion in the Multi-Ethnic Study of Atherosclerosis and to assess whether the association is modified by ethnicity, sex, or systolic blood pressure.
This was a cross-sectional study of 6675 participants who were free from macroalbuminuria and clinical cardiovascular disease (mean age 62.1 years, 53% female; 39% White, 27% African American, 22% Hispanic, and 12% Chinese). Urinary albumin excretion was measured by spot urine albumin-to-creatinine ratio (ACR). Effect modifications were tested after adjusting for age, diabetes, body mass index, smoking, use of angiotensin-converting enzyme inhibitor or angiotensin-receptor blocker, other antihypertensive drugs, estrogens, statins, and high-density lipoprotein cholesterol and triglyceride levels.
The association between CRP and ACR was modified by ethnicity (P=.01) and sex (P<.001), but not by systolic blood pressure. After multivariate adjustment, the association remained in Chinese, African American, and Hispanic men and African American women (P<.02 for African American men, and P<.04 for the other subgroups).
The association between CRP and ACR was modified by ethnicity and sex; it was stronger in non-White men and African American women. These interactions have not been reported before, and future studies should consider them.
Albuminuria; C-Reactive Protein; Ethnicity; Gender
We conducted a population-based case-control study in Connecticut women to test the hypothesis that genetic variations in DNA repair pathway genes may modify the relationship between body mass index (BMI) and risk of non-Hodgkin lymphoma (NHL). Compared to those with BMI < 25, women with BMI ≥ 25 had significantly increased risk of NHL among women who carried BRCA1 (rs799917) CT/TT, ERCC2 (rs13181) AA, XRCC1 (rs1799782) CC, and WRN (rs1801195) GG genotypes, but no increase in NHL risk among women who carried BRCA1 CC, ERCC2 AC/CC, XRCC1 CT/TT, and WRN GT/TT genotypes. A significant interaction with BMI was only observed for WRN (rs1801195, P=0.004) for T-cell lymphoma and ERCC2 (rs13181, P=0.002) for diffuse large B-cell lymphoma. The results suggest that common genetic variation in DNA repair pathway genes may modify the association between BMI and NHL risk.
Non-Hodgkin lymphoma; BMI; polymorphisms; DNA repair genes
We consider analysis of the Genetic Analysis Workshop 18 data, which involve multiple longitudinal traits and dense genome-wide single-nucleotide polymorphism (SNP) markers. We use a multivariate linear mixed model to account for the covariance of random effects and multivariate residuals. We divide the SNPs into groups according to the genes they belong to and score them using weighted sum statistics. We propose a penalized approach for genetic variant selection at the gene level. The overall modeling and penalized selection method is referred to as the penalized multivariate linear mixed model. Cross-validation is used for tuning parameter selection. A resampling approach is adopted to evaluate the relative stability of the identified genes. Application to the Genetic Analysis Workshop 18 data shows that the proposed approach can effectively select markers associated with phenotypes at the gene level.
Thyroid cancers have increased dramatically over the past few decades. Comorbidities may be important, and previous studies have indicated elevated second cancer risk after initial primary thyroid cancers. This study examined the risk of second cancers after development of a thyroid cancer, primarily utilizing the Surveillance, Epidemiology, and End Results (SEER) program database.
The cohort consisted of men and women diagnosed with first primary thyroid cancer who were reported to a SEER database in 1973–2008 (n=52,103). Standardized incidence ratios (SIR) were calculated for all secondary cancers. Confidence intervals and p-values are two-sided, based on Poisson exact methods, at the 0.05 significance level.
In this cohort, 4457 individuals developed second cancers. The risk of developing second cancers after a primary thyroid cancer varied from 10% to 150% depending on different cancer types. Cancers in all sites, breast, skin, prostate, kidney, brain, salivary gland, second thyroid, lymphoma, myeloma, and leukemia were elevated. The magnitude of the risk varied by histology, tumor size, calendar year of first primary thyroid cancer diagnosis, and the treatment of the primary thyroid cancer. The risk of a second cancer was elevated in patients whose first primary thyroid carcinomas were small, or were diagnosed after 1994, or in whom some form of radiation treatment was administered.
This large population-based analysis of second cancers among thyroid cancer patients suggests that there was an increase of second cancers in all sites, and the most commonly elevated second cancers were the salivary gland and kidney. Additionally, the increase in second cancers in patients with recently diagnosed thyroid microcarcinomas (<10 mm) suggests that aggressive radiation treatment of the first primary thyroid cancer, the environment, and genetic susceptibility, may increase the risk of a second cancer.
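The standardized incidence ratio and exact Poisson interval referred to in the methods above can be sketched as follows. The SIR is the observed number of second cancers divided by the number expected from reference rates, and the exact two-sided interval is obtained by inverting the Poisson distribution of the observed count. This is a minimal standard-library sketch: the names `poisson_cdf`, `exact_poisson_ci`, and `sir_with_ci` are hypothetical, the expected count is assumed to be computed elsewhere from reference rates, and the bisection solver is a swapped-in simple technique rather than whatever routine the authors used.

```python
import math


def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu), summed in log space for stability."""
    if k < 0:
        return 0.0
    if mu == 0.0:
        return 1.0
    logs = [-mu + i * math.log(mu) - math.lgamma(i + 1) for i in range(k + 1)]
    m = max(logs)
    return min(1.0, math.exp(m) * sum(math.exp(v - m) for v in logs))


def _bisect_decreasing(f, lo, hi, iters=100):
    """Root of a decreasing function f on [lo, hi] by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def exact_poisson_ci(observed, alpha=0.05):
    """Exact two-sided confidence limits for a Poisson mean."""
    hi = observed + 10.0 * math.sqrt(observed + 1.0) + 10.0  # safe upper bracket
    if observed == 0:
        lower = 0.0
    else:
        # lower limit solves P(X >= observed | mu) = alpha/2
        lower = _bisect_decreasing(
            lambda mu: poisson_cdf(observed - 1, mu) - (1.0 - alpha / 2.0), 0.0, hi
        )
    # upper limit solves P(X <= observed | mu) = alpha/2
    upper = _bisect_decreasing(
        lambda mu: poisson_cdf(observed, mu) - alpha / 2.0, 0.0, hi
    )
    return lower, upper


def sir_with_ci(observed, expected, alpha=0.05):
    """SIR = observed / expected, with exact limits scaled by the expected count."""
    lower, upper = exact_poisson_ci(observed, alpha)
    return observed / expected, lower / expected, upper / expected
```

An SIR whose interval excludes 1 indicates that the observed number of second cancers differs from expectation at the chosen significance level.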
In genome-wide association studies, penalization is an important approach for identifying genetic markers associated with disease. Motivated by the fact that there exists natural grouping structure in single nucleotide polymorphisms and, more importantly, such groups are correlated, we propose a new penalization method for group variable selection which can properly accommodate the correlation between adjacent groups. This method is based on a combination of the group Lasso penalty and a quadratic penalty on the difference of regression coefficients of adjacent groups. The new method is referred to as smoothed group Lasso (SGL). It encourages group sparsity and smooths regression coefficients for adjacent groups. Canonical correlations are applied to the weights between groups in the quadratic difference penalty. We first derive a group coordinate descent (GCD) algorithm for computing the solution path under the linear regression model. The SGL method is further extended to logistic regression for binary response. With the assistance of the majorization–minimization algorithm, the SGL penalized logistic regression turns out to be an iteratively penalized least-squares problem. We also suggest conducting principal component analysis to reduce the dimensionality within groups. Simulation studies are used to evaluate the finite sample performance. Comparison with group Lasso shows that SGL is more effective in selecting true positives. Two datasets are analyzed using the SGL method.
Group selection; Regularization; SNP; Smoothing
In high-throughput cancer genomic studies, markers identified from the analysis of single datasets may have unsatisfactory properties because of low sample sizes. Integrative analysis pools and analyzes raw data from multiple studies, and can effectively increase sample size and lead to improved marker identification results. In this study, we consider the integrative analysis of multiple high-throughput cancer prognosis studies. In the existing integrative analysis studies, the interplay among genes, which can be described using the network structure, has not been effectively accounted for. In network analysis, tightly-connected nodes (genes) are more likely to have related biological functions and similar regression coefficients. The goal of this study is to develop an analysis approach that can incorporate the gene network structure in integrative analysis. To this end, we adopt an AFT (accelerated failure time) model to describe survival. A weighted least squares approach, which has low computational cost, is adopted for estimation. For marker selection, we propose a new penalization approach. The proposed penalty is composed of two parts. The first part is a group MCP penalty, and conducts gene selection. The second part is a Laplacian penalty, and smoothes the differences of coefficients for tightly-connected genes. A group coordinate descent approach is developed to compute the proposed estimate. Simulation study shows satisfactory performance of the proposed approach when there exist moderate to strong correlations among genes. We analyze three lung cancer prognosis datasets, and demonstrate that incorporating the network structure can lead to the identification of important genes and improved prediction performance.
Integrative analysis; Cancer prognosis; Gene network; Penalized selection; Laplacian shrinkage
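The smoothing effect of a Laplacian penalty on tightly-connected genes can be illustrated with the identity between the quadratic form in the graph Laplacian and a weighted sum of squared coefficient differences. The sketch below assumes an unnormalized Laplacian (degree matrix minus adjacency matrix) built from a symmetric, nonnegative adjacency matrix and a single coefficient per gene. The penalty in the article may use a normalized or sign-adjusted Laplacian and act on coefficients across multiple datasets, so the function names and this form are assumptions for illustration only.

```python
def laplacian_penalty(beta, adjacency):
    """Sum over gene pairs j < k of a_jk * (beta_j - beta_k)^2."""
    p = len(beta)
    total = 0.0
    for j in range(p):
        for k in range(j + 1, p):
            total += adjacency[j][k] * (beta[j] - beta[k]) ** 2
    return total


def laplacian_quadratic_form(beta, adjacency):
    """beta' L beta with L = D - A, where D holds the row sums of A.

    Assumption: `adjacency` is symmetric with nonnegative entries.
    """
    p = len(beta)
    degree = [sum(row) for row in adjacency]
    total = 0.0
    for j in range(p):
        for k in range(p):
            l_jk = (degree[j] if j == k else 0.0) - adjacency[j][k]
            total += beta[j] * l_jk * beta[k]
    return total
```

The two functions return the same value, which shows that the penalty is zero when connected genes share the same coefficient and grows with the squared difference, weighted by the strength of the connection.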
In cancer research, high-throughput profiling studies have been extensively conducted, searching for genes/SNPs associated with prognosis. Despite seemingly significant differences, different subtypes of the same cancer (or different types of cancers) may share common susceptibility genes. In this study, we analyze prognosis data on multiple subtypes of the same cancer, but note that the proposed approach is directly applicable to the analysis of data on multiple types of cancers. We describe the genetic basis of multiple subtypes using the heterogeneity model, which allows overlapping but different sets of susceptibility genes/SNPs for different subtypes. An accelerated failure time (AFT) model is adopted to describe prognosis. We develop a regularized gradient descent approach, which conducts gene-level analysis and identifies genes that contain important SNPs associated with prognosis. The proposed approach belongs to the family of gradient descent approaches, is intuitively reasonable, and has affordable computational cost. A simulation study shows that when prognosis-associated SNPs are clustered in a small number of genes, the proposed approach outperforms alternatives with significantly more true positives and fewer false positives. We analyze an NHL (non-Hodgkin lymphoma) prognosis study with SNP measurements, and identify genes associated with the three major subtypes of NHL, namely DLBCL, FL and CLL/SLL. The proposed approach identifies genes different from those identified using alternative approaches and has the best prediction performance.
Integrative analysis; Cancer Prognosis; Gradient descent; NHL; SNP
In China, despite a high coverage rate, health insurance is not used for all illness episodes. Our goal is to identify subjects’ characteristics associated with insurance utilization and the association between utilization and medical expenditure.
A survey was conducted in January and February of 2012. 2093 middle-aged and elderly subjects (45 years old and above) were surveyed.
Health insurance was not utilized for 12.6% (inpatient), 53.3% (outpatient), and 72.6% (self-treatment) of disease episodes. Subjects’ characteristics were associated with insurance utilization. Inpatient and outpatient treatments were expensive. In the multivariate analysis of outpatient treatment expenditure, insurance utilization was significantly associated with higher treatment cost, lost income, and gross total cost.
Utilization of health insurance may need to be improved. Insurance utilization can reduce out-of-pocket medical expenditure. However, the amount paid by the insured is still high. Policy intervention is needed to further improve the effectiveness of health insurance.
Genetic and other scientific studies routinely generate very many predictor variables, which can be naturally grouped, with predictors in the same groups being highly correlated. It is desirable to incorporate the hierarchical structure of the predictor variables into generalized linear models for simultaneous variable selection and coefficient estimation. We propose two prior distributions: hierarchical Cauchy and double-exponential distributions, on coefficients in generalized linear models. The hierarchical priors include both variable-specific and group-specific tuning parameters, thereby not only adopting different shrinkage for different coefficients and different groups but also providing a way to pool the information within groups. We fit generalized linear models with the proposed hierarchical priors by incorporating flexible expectation-maximization (EM) algorithms into the standard iteratively weighted least squares as implemented in the general statistical package R. The methods are illustrated with data from an experiment to identify genetic polymorphisms for survival of mice following infection with Listeria monocytogenes. The performance of the proposed procedures is further assessed via simulation studies. The methods are implemented in a freely available R package BhGLM (http://www.ssg.uab.edu/bhglm/).
Adaptive Lasso; Bayesian inference; Generalized linear model; Genetic polymorphisms; Grouped variables; Hierarchical model; High-dimensional data; Shrinkage prior
Grouping structures arise naturally in many statistical modeling problems. Several methods have been proposed for variable selection that respect grouping structure in variables. Examples include the group LASSO and several concave group selection methods. In this article, we give a selective review of group selection concerning methodological developments, theoretical properties and computational algorithms. We pay particular attention to group selection methods involving concave penalties. We address both group selection and bi-level selection methods. We describe several applications of these methods in nonparametric additive models, semiparametric regression, seemingly unrelated regressions, genomic data analysis and genome wide association studies. We also highlight some issues that require further study.
Bi-level selection; group LASSO; concave group selection; penalized regression; sparsity; oracle property
For censored survival outcomes, it can be of great interest to evaluate the predictive power of individual markers or their functions. Compared with alternative evaluation approaches, the time-dependent ROC (receiver operating characteristic) based approaches rely on much weaker assumptions, can be more robust, and hence are preferred. In this article, we examine evaluation of markers’ predictive power using the time-dependent ROC curve and a concordance measure which can be viewed as a weighted area under the time-dependent AUC (area under the ROC curve) profile. This study significantly advances beyond existing time-dependent ROC studies by developing nonparametric estimators of the summary indexes and, more importantly, rigorously establishing their asymptotic properties. It reinforces the statistical foundation of the time-dependent ROC based evaluation approaches for censored survival outcomes. Numerical studies, including simulations and application to an HIV clinical trial, demonstrate the satisfactory finite-sample performance of the proposed approaches.
time-dependent ROC; concordance measure; inverse-probability-of-censoring weighting; marker evaluation; survival outcomes
In breast cancer research, it is of great interest to identify genomic markers associated with prognosis. Multiple gene profiling studies have been conducted for such a purpose. Genomic markers identified from the analysis of single datasets often do not have satisfactory reproducibility. Among the multiple possible reasons, the most important one is the small sample sizes of individual studies. A cost-effective solution is to pool data from multiple comparable studies and conduct integrative analysis. In this study, we collect four breast cancer prognosis studies with gene expression measurements. We describe the relationship between prognosis and gene expressions using the accelerated failure time (AFT) models. We adopt a 2-norm group bridge penalization approach for marker identification. This integrative analysis approach can effectively identify markers with consistent effects across multiple datasets and naturally accommodate the heterogeneity among studies. Statistical and simulation studies demonstrate satisfactory performance of this approach. Breast cancer prognosis markers identified using this approach have sound biological implications and satisfactory prediction performance.
Breast cancer prognosis; Gene expression; Marker identification; Integrative analysis; 2-norm group bridge