

Cancer Epidemiol Biomarkers Prev. Author manuscript; available in PMC 2010 November 1.
Published in final edited form as:
PMCID: PMC2783293

Identifying Genes for Establishing a Multigenic Test for HCC Surveillance in HCV+ Cirrhotic Patients

Kellie J. Archer, Ph.D.,1,2, Valeria R. Mas, Ph.D.,3,4 Krystle David,3 Daniel G. Maluf, M.D.,3 Karen Bornstein, R.N.,3 and Robert A. Fisher, M.D.3


In this study, we used the Affymetrix HG-U133A version 2.0 GeneChips to identify genes capable of distinguishing cirrhotic liver tissues with and without hepatocellular carcinoma (HCC) by modeling the high-dimensional dataset using an L1 penalized logistic regression model, with error estimated using N-fold cross-validation. Genes identified by gene expression microarray included those that have important links to cancer development and progression, including VAMP2, DPP4, CALR, CACNA1C, and EGR1. In addition, the selected molecular markers in the multigenic gene expression classifier were subsequently validated using real-time PCR and an independently acquired gene expression microarray dataset downloaded from Gene Expression Omnibus. The multigenic classifier derived herein performed similarly to or better than standard abdominal ultrasonography and serum alpha-fetoprotein, which are currently used for HCC surveillance. Since early HCC diagnosis increases survival by increasing access to therapeutic options, these molecular markers may prove useful for early diagnosis of HCC, especially if prospectively validated and translated into gene products that can be reproducibly and reliably tested non-invasively.

Keywords: hepatocellular carcinoma, hepatitis C virus, hepatocarcinogenesis, microarray, gene expression


Surveillance for hepatocellular carcinoma includes following patients with chronic hepatitis or liver cirrhosis every six to 12 months (1) and monitoring them with abdominal ultrasonography (US), serum alpha-fetoprotein (AFP), and/or the protein induced by vitamin K absence (PIVKA-II) (2). Abdominal US has been described as highly user-dependent (3). Although AFP determination is less costly than US (4), it is a non-specific marker for HCC, especially among HCV cirrhotic patients. In fact, in a series of 606 HCC patients, normal AFP levels (<20 ng/ml) were observed in 40.4% of patients with small HCC (≤2 cm diameter), in 24.1% of patients with tumors 2–3 cm in diameter, and in 27.5% of patients with 3–5 cm tumors (5). Nevertheless, asymptomatic patients diagnosed with HCC in a screening program that included US and AFP monitoring had significantly smaller tumors, were significantly more often able to undergo treatment, and had significantly longer survival compared to patients presenting with symptomatic HCC (4).

Due to the poor clinical outcomes of patients with hepatitis-C induced cirrhosis who are diagnosed with advanced stage hepatocellular carcinoma, improved markers for early detection are needed. Markers useful for early diagnosis may reduce time to transplantation and thereby yield improved patient outcomes. In this study, a multigenic classifier was derived using gene expression microarray data that is capable of detecting the presence of HCC in cirrhotic tissues, since cirrhotic tissues have been described as a pre-malignant condition (6). Thereafter, the selected molecular markers in the multigenic gene expression classifier were validated using real-time PCR and an independently acquired gene expression microarray dataset.


Affymetrix HG-U133A 2.0 GeneChip Arrays were available for sixteen cirrhotic tissues from patients with HCV+HCC and 47 cirrhotic tissues from HCV+ patients who did not have concomitant HCC. The study was approved by the Institutional Review Board at Virginia Commonwealth University and informed consent was obtained from all patients. The sample preparation protocol followed the Affymetrix GeneChip® Expression Analysis Manual (Santa Clara, CA). Total RNA was extracted from tissue samples using TRIzol (Life Technologies, Rockville, MD). Integrity of RNA was checked using Agilent 2100 Bioanalyzer. Briefly, total RNA was reverse-transcribed using T7-polydT primer and converted into double-stranded cDNA (One-Cycle Target Labeling and Control Reagents, Affymetrix, Santa Clara, CA), with templates being used for an in vitro transcription reaction to yield biotin-labeled antisense cRNA. The labeled cRNA was chemically fragmented and made into the hybridization cocktail according to the Affymetrix GeneChip protocol, which was then hybridized to U133A 2.0 GeneChips. The array image was generated by the high-resolution GeneChip® Scanner 3000 by Affymetrix® (Affymetrix, Santa Clara, CA). The data are available from Gene Expression Omnibus1.


Patient demographics were examined for each group and continuous variables were compared using a two-sample t-test while categorical variables were compared using Fisher’s exact test. For the gene expression microarray data, the robust multiarray average method was used to obtain probe set expression summaries (7). Thereafter, control probe sets were removed leaving 22,215 probe sets for statistical analysis. Ordinarily when predicting a dichotomous class, such as HCV+ cirrhosis with and without HCC, logistic regression is commonly used. However, traditional logistic regression models cannot be estimated when the number of predictor variables (p) exceeds the sample size (n). Even if the gene expression dataset were filtered using the False Discovery Rate method, the number of predictors for this dataset still greatly exceeded the sample size, with 4,379 probe sets significant using an FDR of 10% and 2,386 probe sets significant using an FDR of 5%.
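The False Discovery Rate filtering mentioned above can be sketched as follows. The text does not name the specific FDR procedure, so this sketch assumes the Benjamini-Hochberg step-up rule; the function name and the example p-values are illustrative and are not taken from the study.

```python
def benjamini_hochberg(pvalues, fdr):
    """Return the indices of hypotheses rejected at the given FDR level.

    Assumption: Benjamini-Hochberg step-up procedure. The largest rank k
    with p_(k) <= k * fdr / m is found, and the k smallest p-values are
    declared significant.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * fdr / m:
            k_max = rank
    return sorted(order[:k_max])


# Illustrative p-values only (one per hypothetical probe set).
example_p = [0.01, 0.04, 0.03, 0.20]
significant = benjamini_hochberg(example_p, fdr=0.10)
print(significant)
```

Even with such a filter, the number of retained probe sets reported above (thousands) remains far larger than the 63 samples, which is what motivates the penalized approach described next.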

Penalized methods have been effectively used when modeling microarray data to identify important genes as well as gene groups associated with survival and dichotomous outcomes (8, 9). The least absolute shrinkage and selection operator (LASSO) is a penalized method for estimating a logistic regression model when p>n and when there is collinearity among the candidate predictors (10). The LASSO model is estimated using maximum likelihood with the additional constraint that the sum of the absolute values of the regression coefficients is less than some tuning parameter, t, which renders a sparse solution (10). The LASSO model was fit to predict class, where the final model selected was that having the minimum AIC, using the glmpath package in the R programming environment. The minimum AIC was selected over the minimum BIC based on a large simulation study in which it was concluded that while the BIC tends to select the right-sized model, the AIC more often includes a non-zero coefficient estimate for the true predictor (11). The predicted class was cirrhosis with HCC if the fitted probability was ≥0.50 and cirrhosis without HCC otherwise. To obtain an unbiased estimate of classification error, leave-one-out cross-validation, also referred to as N-fold cross-validation where N is the total sample size, was used. All analyses were conducted in the R programming environment (12) using appropriate Bioconductor packages (13). The LASSO models were fit using the glmpath package (14).
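The combination of an L1 penalty with leave-one-out cross-validation can be illustrated with a minimal sketch. It is not the glmpath path algorithm used in the study: it assumes a fixed penalty `lam` (the Lagrangian form of the constraint on the sum of absolute coefficients) rather than AIC-based selection along a regularization path, fits by proximal gradient descent with soft-thresholding, and leaves the intercept unpenalized. All function names and the toy data are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(v, t):
    """Proximal operator of the L1 penalty: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_l1_logistic(X, y, lam, step=0.1, n_iter=500):
    """Proximal-gradient fit of logistic regression with an L1 penalty."""
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        resid = [sigmoid(b0 + sum(b * x for b, x in zip(beta, row))) - yi
                 for row, yi in zip(X, y)]
        g0 = sum(resid) / n
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(p)]
        b0 -= step * g0  # intercept is not penalized
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return b0, beta


def predict(model, row, cut=0.5):
    """Class 1 if the fitted probability is at least the cutpoint."""
    b0, beta = model
    return 1 if sigmoid(b0 + sum(b * x for b, x in zip(beta, row))) >= cut else 0


def loocv_error(X, y, lam):
    """Leave-one-out (N-fold) error: refit without sample i, then predict it."""
    wrong = 0
    for i in range(len(X)):
        model = fit_l1_logistic(X[:i] + X[i + 1:], y[:i] + y[i + 1:], lam)
        wrong += predict(model, X[i]) != y[i]
    return wrong / len(X)


# Toy data: the first column separates the classes, the other two are noise.
X = [[-3.0, 0.3, -0.1], [-2.5, -0.2, 0.4], [-2.0, 0.1, 0.2], [-1.5, -0.4, -0.3],
     [1.5, 0.2, 0.1], [2.0, -0.1, -0.2], [2.5, 0.4, 0.3], [3.0, -0.3, -0.4]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

sparse_model = fit_l1_logistic(X, y, lam=0.1)
null_model = fit_l1_logistic(X, y, lam=10.0)  # penalty large enough to zero every slope
cv_error = loocv_error(X, y, lam=0.1)
```

Because the model is refit within each fold, the set of predictors with non-zero coefficients can differ from fold to fold, which is why the cross-validation procedure yields N potentially different models.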

For diagnostic purposes, to obtain a more parsimonious model, all probe sets included in the N different LASSO models from the N-fold cross-validation procedure were then subjected to a best subsets logistic regression modeling procedure. Best subsets identifies the best fitting model for each model size p=1,2,…,P, where p reflects the number of covariates in the model, by an exhaustive search whereby all possible combinations of models are fit and the optimal model for each size p is selected. The leaps package in the R programming environment was used for the best subsets procedure. Again, error was estimated using N-fold cross-validation. In addition, an independent gene expression microarray dataset was used as an independent test set for assessing error.
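The exhaustive search behind best subsets can be sketched as below. This is only an illustration of the enumeration idea, not the leaps implementation: `score` is a hypothetical stand-in for whatever model-fit criterion is minimized, and the gene names and gain values are invented for the example.

```python
from itertools import combinations
from math import comb


def best_subsets(candidates, score, max_size):
    """For each size 1..max_size, return the candidate subset with the lowest score."""
    best = {}
    for size in range(1, max_size + 1):
        best[size] = min(combinations(candidates, size), key=score)
    return best


# Hypothetical per-gene contributions; a real analysis would fit a
# logistic regression for each subset and score it by its fit.
toy_gain = {"geneA": 3.0, "geneB": 2.0, "geneC": 1.0}


def toy_score(subset):
    return -sum(toy_gain[g] for g in subset)


result = best_subsets(list(toy_gain), toy_score, max_size=2)

# Number of two-predictor models among 62 candidate probe sets.
n_pairs = comb(62, 2)
```

The count of candidate models grows combinatorially with model size, so in practice the search is restricted to a modest pool of candidates, as is done here by starting from the probe sets selected by the LASSO.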

RT-PCR was used to measure gene expression of CACNA1C, CALR, DPP4, EGR1, and VAMP2, with GAPDH profiled as the internal housekeeping gene. The dataset was first restricted to samples with HCV-cirrhosis (N=26), HCV-EtOH cirrhosis (N=14), and HCV-cirrhosis with concomitant HCC (N=23), and the two former groups were combined to form one group representing HCV-induced cirrhosis without HCC (N=40). For all RT-PCR analyses, the quantity used in the statistical analysis was the difference between the mean CT of the gene of interest and GAPDH CT, or μCT(gene) − μCT(GAPDH). For each of the five genes profiled using RT-PCR, a two-sample t-test was performed to compare the mean expression between the HCV-induced cirrhosis without HCC (N=40) and HCV-cirrhosis with HCC (N=23) (15). Thereafter, a logistic regression model was derived using a backward elimination method whereby genes remained in the model provided P<0.10.
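As a small worked illustration of the quantity μCT(gene) − μCT(GAPDH), the sketch below averages replicate CT values for a gene and for GAPDH and takes the difference. The replicate values and the function name are hypothetical and are not data from the study.

```python
from statistics import mean


def delta_ct(gene_ct, reference_ct):
    """Mean CT of the gene of interest minus mean CT of the reference gene."""
    return mean(gene_ct) - mean(reference_ct)


# Hypothetical replicate CT values for one sample.
example = delta_ct(gene_ct=[27.1, 27.3, 27.2], reference_ct=[18.4, 18.6, 18.5])
print(round(example, 2))
```

Since a higher CT corresponds to less starting template, a larger difference indicates lower expression of the gene relative to GAPDH.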


There were no significant differences between HCV+ patients with and without HCC when examining patient demographic characteristics with respect to patient age, gender, race, Albumin, and ALT, though the two groups differed with respect to INR and total bilirubin (Supplementary Table 1).

The best fitting LASSO model included 14 probe sets (Supplementary Table 2). The resubstitution error associated with this best fitting model was 1.6%, or in other words, 98.4% of the samples were correctly classified. The resubstitution error is known to be downwardly biased, and therefore N-fold cross-validation was used as an additional means to assess error. The N-fold cross-validation error was 9.5%, or the classifier was 90.5% accurate. Using N-fold CV, 5 HCV+HCC and 1 HCV+cirrhosis without concomitant HCC cases were misclassified, rendering N-fold estimates for sensitivity, specificity, positive predictive value, and negative predictive value of 68.8%, 97.9%, 91.7%, and 90.2% respectively.
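The cross-validated estimates above follow directly from the misclassification counts. The sketch below recomputes them from the 2×2 table implied by the text (16 HCV+HCC samples with 5 misclassified, 47 samples without HCC with 1 misclassified); the function name is illustrative.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return standard 2x2 diagnostic accuracy measures as proportions."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# 11 of 16 HCC samples and 46 of 47 non-HCC samples correctly classified.
lasso_cv = diagnostic_metrics(tp=11, fn=5, tn=46, fp=1)
for name, value in lasso_cv.items():
    print(f"{name}: {100 * value:.1f}%")
```

The same function can be applied to the other classifiers in this section by substituting their misclassification counts.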

From the N-fold CV procedure there were 63 different LASSO models, each of which included different probe sets having non-zero coefficient estimates. In fact, considering all 63 different models, there were 62 unique probe sets included. Best subsets logistic regression was performed using the 62 probe sets, yielding the sequence of models listed in Supplementary Table 3. Due to collinearities, the best fitting logistic regression model contained two genes, DPP4 and CALR. N-fold CV on this two-gene model resulted in a 6.3% error rate, corresponding to 93.7% accuracy, with two HCV+HCC and two HCV+ cirrhotic cases misclassified. One HCV+HCC misclassified sample was from a patient with one 2.6 cm T2N0M0 tumor, while the other HCV+HCC misclassified sample was from a patient with four T2N0M0 tumors of sizes 0.9, 1.0, 1.4, and 1.6 cm. This yielded N-fold CV estimates for sensitivity, specificity, positive predictive value, and negative predictive value of 87.5%, 95.7%, 87.5%, and 95.7%, respectively. A scatterplot of CALR against DPP4 demonstrates that the two groups are well separated by a straight line (Figure 1).

Figure 1
Scatterplot of CALR against DPP4 using Affymetrix GeneChip data with plotting symbol indicating whether the observation is HCV+cirrhosis with HCC or HCV+cirrhosis without HCC.

To test the molecular classifier using an independent set of samples, data from a previously published gene expression study that included 13 HCV cirrhotic non-tumor liver tissues and 33 HCV+HCC liver samples hybridized to an Affymetrix HG-U133 Plus 2.0 GeneChip were downloaded from Gene Expression Omnibus2 (16). All probe sets on the HG-U133A 2.0 array were also on the HG-U133 Plus 2.0 array. Among the 33 HCC tissues were 7 very early HCC, 9 early HCC, 7 advanced HCC, and 10 very advanced HCC. To mitigate any effect due to GeneChip type and center, prior to obtaining probe set expression summaries, the 63 HG-U133A 2.0 GeneChips and the 46 HG-U133 Plus 2.0 GeneChips were independently read into the R programming environment using the affy Bioconductor package (17). Thereafter, probe level data from the two GeneChip types were merged by probe sequence using the matchprobes package in R (18). Subsequently, the robust multiarray average method was used to obtain probe set expression summaries (7).

The logistic regression model including DPP4 and CALR that was fit to the VCU-acquired samples was applied to the independent set described by Wurmbach et al. (2007). The sensitivity for detecting HCC in the independent test set was 72.7%, with 77.4% positive predictive value, though the specificity was only 46.2%. Among the nine HCC tissues that were misclassified, 3 were very early HCC, 2 early HCC, 3 advanced HCC, and one very advanced HCC. Although 38 unique HCV-infected patients were included in the previously published study, the number and sample labels of non-tumor cirrhotic tissues procured from patients with concomitant HCC cannot be identified. Since our model was derived using cirrhotic tissues with and without concomitant HCC, the low specificity observed in this independent test set may indicate that many samples were from patients with concomitant HCC. We do note that other biomarkers used for cancer screening have also reported low specificities, including prostate specific antigen (PSA), which is used to screen subjects for prostate cancer (19).

Genes identified by the best fitting LASSO model and those derived in the cross-validation process had important links to cancer development and progression, including VAMP2 (20), DPP4 (21), CALR (22), CACNA1C (23), and EGR1 (24). These genes were then profiled using real-time PCR; the interrogated sequence for Taqman overlapped the Affymetrix probe set target sequence for all five genes. P-values from the two-sample t-test applied to each gene for comparing the mean expression between the HCV-induced cirrhosis without HCC (N=40) and HCV-cirrhosis with HCC (N=23) were CACNA1C (P=0.007), CALR (P=0.09), DPP4 (P=0.98), EGR1 (P=0.0003), and VAMP2 (P=0.009). The final logistic regression model included DPP4, VAMP2, and EGR1; the odds ratio corresponding to a one cycle change in the CT difference and 95% confidence intervals (CI) as well as the P-values for the genes in the final model are presented in Table 1. Thus, while DPP4 was not significant univariately, when collectively considered with other genes it was important for controlling confounding (25). The sensitivity, specificity, positive predictive value, and negative predictive value for the final model using a cutpoint of 0.50 on the predicted probabilities are 78.3%, 87.5%, 78.3%, and 87.5%, respectively. The area under the receiver operating characteristic curve was 87.6 (Supplementary Figure 1). To visualize how these three genes vary together by group, a conditioning plot (Figure 2) was constructed whereby VAMP2 by EGR1 is plotted in each panel and each panel represents a range of DPP4 expression as indicated in the barchart. Clearly, the two groups are separated when examining VAMP2 by EGR1 for the various levels of DPP4.

Figure 2
Conditioning plot whereby VAMP2 by EGR1 is plotted in each panel and each panel represents a range of DPP4 expression as indicated in the barchart with open circles representing HCV-cirrhosis without HCC and open triangles representing HCV-HCC. In reading ...
Table 1
Multivariable logistic regression model predicting HCV-cirrhosis with and without concomitant HCC using the RT-PCR data.
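For readers relating logistic regression coefficients to the odds ratios of the kind reported in Table 1, the following sketch shows the usual conversion. It assumes a Wald-type 95% interval (coefficient ± 1.96 standard errors, exponentiated), which the text does not explicitly state; the coefficient and standard error below are hypothetical and are not values from Table 1.

```python
import math


def odds_ratio_with_ci(beta, se, z=1.96):
    """Odds ratio for a one-unit increase in the predictor, with a Wald interval.

    Assumption: the interval is exp(beta - z*se) to exp(beta + z*se).
    """
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Hypothetical coefficient and standard error for a one-cycle change
# in the CT difference.
odds_ratio, lower, upper = odds_ratio_with_ci(beta=0.5, se=0.2)
print(f"OR = {odds_ratio:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

An interval that excludes 1 corresponds to a coefficient whose Wald test is significant at the matching level.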


Identification of prognostic factors is relevant if therapeutic options may be applied to newly diagnosed cases (26). Surveillance strategies for detecting HCC are important because early HCC diagnosis increases survival by increasing access to therapeutic options (6, 27). In fact, the 5 year survival of HCC detected at an early stage has been reported to exceed 50% (28). Using gene expression microarray data, we identified a multigenic classifier consisting of a small number of genes having good estimates of accuracy. For comparison, the sensitivity of abdominal US alone has been reported to be 81.9% and the sensitivity of AFP alone as 74.2% (4). The multigenic classifier derived herein performed similarly or better. These genes, if prospectively validated and translated into gene products reproducibly and reliably testable non-invasively, would be clinically useful as patients could be cost-effectively screened at more regular intervals than every 6 months.

Supplementary Material


This research was supported in part by National Institute of Diabetes & Digestive & Kidney Diseases R01DK069859.



1. Bruix J, Sherman M. Management of hepatocellular carcinoma. Hepatology. 2005;42:1208–36.
2. Llovet JM, Bruix J. Early diagnosis and treatment of hepatocellular carcinoma. Baillieres Best Pract Res Clin Gastroenterol. 2000;14:991–1008.
3. Beale G, Chattopadhyay D, Gray J, et al. AFP, PIVKAII, GP3, SCCA-1 and follisatin as surveillance biomarkers for hepatocellular cancer in non-alcoholic and alcoholic fatty liver disease. BMC Cancer. 2008;8:200.
4. Yuen MF, Cheng CC, Lauder IJ, Lam SK, Ooi CG, Lai CL. Early detection of hepatocellular carcinoma increases the chance of treatment: Hong Kong experience. Hepatology. 2000;31:330–5.
5. Nomura F, Ohnishi K, Tanabe Y. Clinical features and prognosis of hepatocellular carcinoma with reference to serum alpha-fetoprotein levels. Analysis of 606 patients. Cancer. 1989;64:1700–7.
6. McCaughan GW, Koorey DJ, Strasser SI. Hepatocellular carcinoma: current approaches to diagnosis and management. Intern Med J. 2002;32:394–400.
7. Irizarry RA, Hobbs B, Collin F, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003;4:249–64.
8. Gui J, Li H. Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics. 2005;21:3001–8.
9. Ma S, Song X, Huang J. Supervised group Lasso with applications to microarray data analysis. BMC Bioinformatics. 2007;8:60.
10. Tibshirani R. The lasso method for variable selection in the cox model. Statistics in Medicine. 1997;16:385–395.
11. Archer KJ. Identifying important predictors using L1 penalized models and random forests. JSM Proceedings; Alexandria, VA: American Statistical Association; 2009.
12. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing;
13. Gentleman RC, Carey VJ, Bates DM, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004;5:R80.
14. Park MY, Hastie T. L1-regularized path algorithm for generalized linear models. Journal of the Royal Statistical Society Series B. 2007;69:659–677.
15. Yuan JS, Reed A, Chen F, Stewart CN Jr. Statistical analysis of real-time PCR data. BMC Bioinformatics. 2006;7:85.
16. Wurmbach E, Chen YB, Khitrov G, et al. Genome-wide molecular profiles of HCV-induced dysplasia and hepatocellular carcinoma. Hepatology. 2007;45:938–47.
17. Gautier L, Cope L, Bolstad BM, Irizarry RA. affy--analysis of Affymetrix GeneChip data at the probe level. Bioinformatics. 2004;20:307–15.
18. Huber W, Gentleman R. matchprobes: a Bioconductor package for the sequence-matching of microarray probe elements. Bioinformatics. 2004;20:1651–2.
19. Pepe MS, Etzioni R, Feng Z, et al. Phases of biomarker development for early detection of cancer. J Natl Cancer Inst. 2001;93:1054–61.
20. Grabowski P, Schonfelder J, Ahnert-Hilger G, et al. Expression of neuroendocrine markers: a signature of human undifferentiated carcinoma of the colon and rectum. Virchows Arch. 2002;441:256–63.
21. Roesch A, Wittschier S, Becker B, Landthaler M, Vogt T. Loss of dipeptidyl peptidase IV immunostaining discriminates malignant melanomas from deep penetrating nevi. Mod Pathol. 2006;19:1378–85.
22. Li R, Wang H, Bekele BN, et al. Identification of putative oncogenes in lung adenocarcinoma by a comprehensive functional genomic approach. Oncogene. 2006;25:2628–35.
23. Iwasa K, Komai K, Yasukawa Y, Maruta T, Takamori M. [Molecular immunology of voltage-gated calcium channel and Lambert-Eaton myasthenic syndrome] Nippon Rinsho. 1997;55:3322–30.
24. Lee SH, Bahn JH, Choi CK, et al. ESE-1/EGR-1 pathway plays a role in tolfenamic acid-induced apoptosis in colorectal cancer cells. Mol Cancer Ther. 2008;7:3739–50.
25. Hosmer DW, Lemeshow S. Applied Logistic Regression. New York: John Wiley & Sons; 1989.
26. Simon R. Development and validation of therapeutically relevant multi-gene biomarker classifiers. J Natl Cancer Inst. 2005;97:866–7.
27. Stravitz RT, Heuman DM, Chand N, et al. Surveillance for hepatocellular carcinoma in patients with cirrhosis improves outcome. Am J Med. 2008;121:119–26.
28. Bruix J, Llovet JM. Hepatocellular carcinoma: is surveillance cost effective? Gut. 2001;48:149–50.