Glycosylation is the most complex and widespread post-translational modification of human proteins. Variation in glycosylation is closely related to oncogenic transformation. Therefore, profiling of glycans detached from proteins is a promising strategy to identify biomarkers for cancer detection. This study identified candidate glycan biomarkers associated with hepatocellular carcinoma by mass spectrometry. Specifically, mass spectrometry data were analyzed with a peak selection procedure that combines multiple random sampling with recursive feature selection based on support vector machines. Ten peak sets were obtained from different combinations of samples. Each peak set contained 7–12 peaks, and seven peaks were shared by all 10 peak sets, indicating that 58–100% of the peaks in each set were shared by the 10 peak sets. Support vector machines and hierarchical clustering were used to evaluate the performance of the peak sets. The predictive performance of the seven peaks was further evaluated using 19 newly generated MALDI-TOF spectra. Glycan structures were determined for four of the seven peaks. A literature search indicated that the structures of the four glycans can be found in some cancer-related glycoproteins. The method of this study is significant in deriving consistent, accurate, and biologically significant glycan marker candidates for hepatocellular carcinoma diagnosis.
As the most complex and widespread post-translational modification (PTM),1 glycosylation plays crucial roles during different oncogenetic processes.2-4 Many important tumor markers, such as CEA,5 CA125,6 and PSA,7-9 are glycoproteins with altered glycan profiles in cancer. As one of the most common types of malignant tumor, hepatocellular carcinoma (HCC) is difficult to diagnose due to the highly heterogenic nature of the disease and has a low survival rate once diagnosed.10 The popular method to diagnose HCC is to measure a serum glycoprotein marker alpha-fetoprotein (AFP). However, this marker has limited sensitivity (41–65%).11 This sensitivity could be improved by measuring several highly specific glycoprotein markers.11 Therefore, there is an urgent need to discover additional markers associated with HCC for the early diagnosis.
Currently, glycan marker discovery by analyzing mass spectrometry (MS) data presents great potential to identify a panel of biomarkers relevant for early diagnosis of heterogenic diseases with improved accuracies.12-15 However, MS data are characterized by high dimensionality and complex patterns with a substantial amount of noise arising from measurement deviation, disease heterogeneity, and biological variability. A robust computational method is required to identify markers relevant to a particular problem from the MS data sets. Machine learning methods, especially neural networks and support vector machines (SVM), have potential application in marker selection from MS data. Using a shallow feature selection method and a Bayesian neural network (BNN) classifier, 99% sensitivity and 98% specificity were reached on SELDI-TOF MS data to identify ovarian cancer using 2-fold cross validation (CV).12 Information gain and an SVM classifier were used for prion disease diagnosis from MALDI-FTMS data and yielded 72% sensitivity and 73% specificity by leave-one-out cross validation (LOOCV).13 The t-test and several classification methods such as discriminant analyses, k-nearest neighbor analysis, and SVM were used for distinguishing ovarian cancer cases from normal patients, and SVM resulted in the lowest error rates.15 These methods utilized a filter strategy, which identifies relevant peaks independently of the classifiers. Another popular feature selection strategy is the wrapper method, in which classifiers built from different peak subsets evaluate the goodness of those subsets by criteria such as CV error rate or accuracy on a validation data set. The wrapper method showed good performance and a stable feature subset when applied in microarray gene discovery16 and can be extended to MS peak selection. Mahadevan et al. built a feature selection method known as recursive feature elimination-support vector machine (RFE-SVM) to separate pneumonia patients from healthy people by mass spectrometry. They obtained an overall accuracy of 84–96% by 4-fold CV and 87–97% by LOOCV, providing much better predictive performance than multivariate analysis methods.17
In our previous studies, we utilized ant colony optimization combined with support vector machines (ACO-SVM) peak selection to identify biomarkers for HCC diagnosis14,18-21 using matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS). Six peptide markers were identified that yielded 100% sensitivity and 91% specificity in an independent test set,14,21 and two of them were found to be fragments of Complement C3 and C4.14,21 Six to 10 glycan markers were found to be associated with HCC with 87–93% sensitivity and 89–100% specificity.14,18-20 Two glycans with permethylated molecular weights of 2040 and 4502 were identified in most of these studies.14,19,20 In this study, we apply an integrated RFE-SVM approach to select glycan markers that correctly distinguish HCC cases from chronic liver disease (CLD) patients in MALDI-TOF MS data. This method combines RFE-SVM with a consensus scoring system, a multiple random sampling method, and a feature-ranking consistency evaluation system. We generated 10 discriminant peak sets from various combinations of MALDI-TOF spectra. Each set contained 7–12 peaks; 7 glycan peaks were present in all 10 peak sets, indicating that 58–100% of the peaks in each of the 10 sets were shared by all of them. These results suggest that the discriminant peak sets are insensitive to the sample combinations. Different evaluation methods and a newly generated data set were used to evaluate the performance of the selected peaks. Among the candidate markers selected in our previous studies,14,18-20 those with permethylated molecular weights of 2040, 2186, and 4502 were included in the 7 peaks selected by all 10 sets. The structures of 4 glycans of the 7 peaks were determined. A literature search indicated that the structures of the four glycans could be found in the cancer-related glycoproteins beta-glucuronidase, cytochrome, and immunoglobulin, suggesting their possible roles in cancer progression.
Two hundred and three serum samples were collected from 73 HCC cases, 52 CLD cases, and 78 healthy individuals from Cairo, Egypt.14 Diagnosis of HCC was confirmed by pathology, cytology, imaging, and serum AFP. CLD cases included fibrosis and cirrhosis patients. Controls were recruited among patients from an orthopedic and fracture clinic and were frequency-matched to cancer cases by gender, age, and smoking status.14
N-Glycans were released from glycoproteins, extracted, and permethylated by a solid-phase approach. The resulting permethylated glycans were spotted on a MALDI plate with DHB matrix, and the MALDI plate was dried under vacuum. Mass spectra were acquired using a 4800 MALDI TOF/TOF Analyzer (Applied Biosystems, Inc.) equipped with a Nd:YAG 355-nm laser. MALDI spectra were recorded in positive-ion mode, since permethylation eliminates the negative charge normally associated with sialylated glycans.
Two hundred and three mass spectra, including 73 HCC cases, 52 CLD cases and 78 healthy individuals, were acquired at the National Center for Glycomics and Glycoproteomics at Indiana University. The 73 HCC spectra and 52 CLD spectra were utilized for peak selection. Nineteen additional spectra were generated at the Proteomics and Metabolomics Shared Resource at Georgetown University Medical Center and used for peak evaluation. The 19 spectra were obtained from 9 CLD and 10 HCC samples, which are a subset of the previously analyzed 52 CLD and 73 HCC cases. The same sample preparation protocol14 was used in generating the original 203 spectra and the additional 19 spectra.
Structural assignment of the different N-glycans was achieved through high-energy collision-induced dissociation (CID) tandem MS analyses. Although tandem MS analysis utilizing high-energy CID currently allows comprehensive characterization of glycan structures, enzymatic sequencing is, in some cases, deemed necessary to confirm structural attributes such as linkages. This is accomplished by employing an array of exoglycosidases, including those specific for different sialic acid, fucose, and galactose linkages. For enrichment of minor glycans, we first used the quantification by TOF-MS to select samples with the highest intensity of a glycan of interest. The glycan is then further enriched from these samples by microfractionation, as described previously.22,23 Briefly, our method for separation of the permethylated glycans incorporates derivatization, C18 trapping of permethylated glycans, and C18 nano-LC separation.22 The chromatographic separation of permethylated glycans is attained initially using an isocratic condition of 32% phase B (acetonitrile containing 0.1% formic acid) for 40 min, followed by an increase of phase A to 55% over 15 min. The mobile phase composition is then held at 55% for 10 min. Phase A consists of 3% aqueous acetonitrile with 10 mM ammonium formate. The separation is achieved on a C18 nanocolumn (150 × 0.075 mm fused silica pulled-tip column).
Each spectrum of the 203 MALDI samples consisted of approximately 121 000 data points in the mass range of 1500–5500 Da. To reduce the dimension, the spectra were first binned with a bin size of 0.2 Da. The mean of the intensities within each bin was used as the intensity variable. Using this binning method, the dimension of each spectrum was reduced to 20 000. Baseline-corrected spectra were then used to identify peaks with a wavelet-based peak detection method.24 After peak identification, the spectra were normalized with the highest peak as 100 and the lowest peak as 0.
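The binning and rescaling steps can be illustrated with a minimal sketch. The function names `bin_spectrum` and `rescale_0_100` are hypothetical, empty bins are assigned an intensity of 0 (an assumption, since the handling of empty bins is not stated), and the baseline correction and wavelet-based peak detection steps are not reproduced here.

```python
def bin_spectrum(mz, intensity, lo=1500.0, hi=5500.0, width=0.2):
    """Average the intensities that fall inside fixed-width m/z bins."""
    n_bins = int(round((hi - lo) / width))  # 4000 Da / 0.2 Da = 20 000 bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for m, y in zip(mz, intensity):
        if lo <= m < hi:
            idx = min(int((m - lo) / width), n_bins - 1)
            sums[idx] += y
            counts[idx] += 1
    # Assumption: bins that received no data point get intensity 0.
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def rescale_0_100(values):
    """Rescale so that the lowest value maps to 0 and the highest to 100."""
    lo_v, hi_v = min(values), max(values)
    if hi_v == lo_v:
        return [0.0 for _ in values]
    return [100.0 * (v - lo_v) / (hi_v - lo_v) for v in values]


# Toy spectrum with three data points; the first two share a bin.
binned = bin_spectrum([1500.05, 1500.15, 1500.25], [1.0, 3.0, 5.0])
scaled_peaks = rescale_0_100([2.0, 5.0, 3.5])
```

With the default arguments the binned spectrum has 20 000 entries, which matches the dimension stated above.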
After the alignment of the 203 mass spectra from the peak lists,25 3397 peaks were identified. Since there are artifacts in the aligned intensity, the intensity of each aligned m/z value was treated as the maximum intensity in the original spectra within ±0.5 Da. The average intensities of the aligned peaks were used to determine possible clusters of isotopic distributions. With this approach, the number of peaks was further reduced to 447 assuming that each isotopic cluster contains only one glycan molecule. The 447 peaks were utilized to build classifiers and to select the most discriminant peaks.
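A rough sketch of these two reductions is given below. The ±0.5 Da window is taken from the text; the grouping rule for isotopic clusters (consecutive peaks no more than `max_gap` Da apart belong to one cluster) and the value `max_gap=1.5` are assumptions made for illustration, as are both function names.

```python
def max_within_window(mz, intensity, target_mz, tol=0.5):
    """Intensity of an aligned peak: the maximum raw intensity within +/- tol Da."""
    window = [y for m, y in zip(mz, intensity) if abs(m - target_mz) <= tol]
    return max(window) if window else 0.0


def cluster_representatives(peak_mz, mean_intensity, max_gap=1.5):
    """Keep one peak (the tallest on average) per putative isotopic cluster.

    Assumption: peaks separated by at most max_gap Da are isotopes of the
    same glycan; a larger gap starts a new cluster.
    """
    order = sorted(range(len(peak_mz)), key=lambda i: peak_mz[i])
    representatives = []
    best = None
    prev_mz = None
    for i in order:
        if prev_mz is not None and peak_mz[i] - prev_mz > max_gap:
            representatives.append(best)  # close the previous cluster
            best = None
        if best is None or mean_intensity[i] > mean_intensity[best]:
            best = i
        prev_mz = peak_mz[i]
    if best is not None:
        representatives.append(best)
    return [peak_mz[i] for i in representatives]


# Toy data: two clusters of isotope peaks with hypothetical intensities.
reps = cluster_representatives(
    [2850.4, 2851.4, 2852.4, 2880.1, 2881.1],
    [40.0, 100.0, 60.0, 30.0, 10.0],
)
aligned = max_within_window([2850.0, 2850.3, 2850.9], [1.0, 7.0, 9.0], 2850.2)
```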
Support vector machines (SVM), a supervised machine learning method proposed by Vapnik, was used in this study as the classification tool.26 SVM-based classifiers have been shown to provide excellent classification performance and have been successfully applied in microarray data analysis16 and mass spectrometry data analysis.27,28
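There are two types of SVM: linear and nonlinear. In linear SVM, a hyperplane is constructed directly in the feature space. This hyperplane, which separates two classes of feature vectors with a maximum margin, is found by determining a vector $\mathbf{w}$ and a scalar $b$ that minimize $\|\mathbf{w}\|^2$ subject to the following conditions: $\mathbf{w} \cdot \mathbf{x}_i + b \ge +1$ for $y_i = +1$ (cancer patients) and $\mathbf{w} \cdot \mathbf{x}_i + b \le -1$ for $y_i = -1$ (noncancer people). Here, $\mathbf{x}_i$ is a feature vector, $y_i$ is the group index, $\mathbf{w}$ is a vector normal to the hyperplane, $|b|/\|\mathbf{w}\|$ is the perpendicular distance from the hyperplane to the origin, and $\|\mathbf{w}\|$ is the Euclidean norm of $\mathbf{w}$. This optimization problem can be described as minimizing $\tfrac{1}{2}\|\mathbf{w}\|^2$ subject to $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1$. With the introduction of Lagrangian multipliers $\alpha_i$, this problem can be rewritten as
$$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right],$$
where $\alpha_i \ge 0$, and efficiently solved by maximizing
$$W(\boldsymbol{\alpha}) = \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j$$
under the constraints $\sum_{i=1}^{n} \alpha_i y_i = 0$ and $\alpha_i \ge 0$, $i = 1, 2, \ldots, n$. After the determination of $\mathbf{w}$ and $b$, a given vector $\mathbf{x}$ can be classified by using $\mathrm{sign}[(\mathbf{w} \cdot \mathbf{x}) + b]$. A positive or negative value indicates that the vector $\mathbf{x}$ belongs to the positive or negative class, respectively.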
Nonlinear SVM uses a kernel function to project both positive and negative examples into a higher-dimensional feature space, where the linear SVM procedure can then be applied to the feature vectors. In our experiments, the Gaussian kernel $K(\mathbf{x}, \mathbf{x}') = \exp\left(-\|\mathbf{x} - \mathbf{x}'\|^2 / (2\sigma^2)\right)$ was used, since this kernel performs similarly to the linear and sigmoid kernels under certain parameters,29-31 and fewer parameters need to be tuned than with the polynomial kernel.
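For concreteness, the Gaussian kernel in its standard form, with the squared Euclidean distance in the exponent, can be written as a short function. The name `gaussian_kernel` is ours and the input vectors are arbitrary toy values.

```python
import math


def gaussian_kernel(x, x_prime, sigma):
    """K(x, x') = exp(-||x - x'||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


k_same = gaussian_kernel([1.0, 2.0], [1.0, 2.0], sigma=1.0)  # identical vectors
k_far = gaussian_kernel([0.0, 0.0], [3.0, 4.0], sigma=5.0)   # distance 5
```

Identical vectors give a kernel value of 1, and the value decays toward 0 as the vectors move apart; σ controls how quickly it decays.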
The performance of SVM classification can be measured by true positives TP (number of cancer patients correctly predicted as cancer patients), false negatives FN (number of cancer patients incorrectly predicted as noncancer people), true negatives TN (number of noncancer people correctly predicted as noncancer people), and false positives FP (number of noncancer people incorrectly predicted as cancer patients). Three indicators, sensitivity Qp = TP/(TP + FN), specificity Qn = TN/(TN + FP), and overall accuracy Q = (TP + TN)/(TP + FN + TN + FP), were used to measure the predictive performance.
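The three indicators translate directly into code. The sketch below uses a hypothetical function name and, as a worked example, the counts reported later for the 19 additional spectra (10 HCC predicted correctly, none missed, 5 CLD predicted correctly, 4 CLD predicted as HCC).

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity Qp, specificity Qn, overall accuracy Q)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Counts for the 19 additional spectra: HCC is the positive class.
qp, qn, q = classification_metrics(tp=10, fn=0, tn=5, fp=4)
```

These counts give Qp = 100%, Qn = 5/9 ≈ 56%, and Q = 15/19 ≈ 79%.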
The recursive feature elimination-SVM (RFE-SVM) method has been successfully used for feature selection from high-dimensional microarray and mass spectrometry data.16,27,28 Briefly, RFE-SVM uses the prediction accuracy from SVM to assess the goodness of each peak and to determine a peak ranking. The peak ranking criterion of RFE-SVM is based on the change in the objective function upon removing each peak. This objective function can be represented by a cost function $J = \tfrac{1}{2}\boldsymbol{\alpha}^T H \boldsymbol{\alpha} - \boldsymbol{\alpha}^T \mathbf{1}$, where $H$ is the matrix with elements $H(i,j) = y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$, computed by using the training set only. When a given peak $k$ is removed or its weight $w_k$ is reduced to zero, the change in the cost function is given by $\Delta J(k) = \tfrac{1}{2} \left( \partial^2 J / \partial w_k^2 \right) (\Delta w_k)^2$. The change in weight $\Delta w_k = w_k - 0$ corresponds to the removal of peak $k$. Hence, the change in the cost function can be written as $\Delta J(k) = \tfrac{1}{2}\boldsymbol{\alpha}^T H \boldsymbol{\alpha} - \tfrac{1}{2}\boldsymbol{\alpha}^{*T} H(-k) \boldsymbol{\alpha}^{*}$, where $H(-k)$ is the matrix computed in the same way as $H$ but with the $k$th peak removed. For the sake of simplicity and to reduce computational complexity, $\boldsymbol{\alpha}^{*}$ is assumed to be equal to $\boldsymbol{\alpha}$, under the assumption that the removal of one peak will not significantly influence the values of the $\alpha$s. The change in the cost function indicates the contribution of the peak to the decision function and serves as an indicator of peak ranking position.
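The ranking criterion can be sketched as follows for a Gaussian kernel, holding the multipliers α fixed when a peak is removed. This is an illustration only: the function names are hypothetical, and in practice the α values come from a trained SVM, whereas here they are made-up numbers.

```python
import math


def kernel_matrix(X, y, sigma, skip=None):
    """H[i][j] = y_i * y_j * K(x_i, x_j), optionally ignoring feature `skip`."""
    n = len(X)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq = sum((a - b) ** 2
                     for k, (a, b) in enumerate(zip(X[i], X[j])) if k != skip)
            H[i][j] = y[i] * y[j] * math.exp(-sq / (2.0 * sigma ** 2))
    return H


def quad_form(alpha, H):
    """alpha^T H alpha."""
    n = len(alpha)
    return sum(alpha[i] * H[i][j] * alpha[j] for i in range(n) for j in range(n))


def rfe_ranking_scores(X, y, alpha, sigma):
    """Delta J(k) = 0.5 a^T H a - 0.5 a^T H(-k) a for every peak k (alpha fixed)."""
    base = quad_form(alpha, kernel_matrix(X, y, sigma))
    return [0.5 * base - 0.5 * quad_form(alpha, kernel_matrix(X, y, sigma, skip=k))
            for k in range(len(X[0]))]


# Two samples, two peaks; peak 1 is constant and so carries no information.
X = [[0.0, 1.0], [2.0, 1.0]]
y = [1, -1]
alpha = [0.5, 0.5]  # hypothetical multipliers, not from a trained model
scores = rfe_ranking_scores(X, y, alpha, sigma=1.0)
```

In the standard RFE procedure, the peak whose removal changes the cost function least is the one eliminated at each iteration; in this toy example that is the constant peak, whose score is zero.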
To obtain statistically meaningful results, peak selection is conducted based on a multiple random sampling strategy. Each random sampling divides all MALDI-TOF spectra into a training set that contains half of the samples and an associated test set that contains the remaining half. By using this random sampling strategy, 5000 training-test sets, each containing a unique combination of samples, are generated. These 5000 training-test sets are randomly divided into 10 sampling groups, with 500 training-test sets in each group (Figure 1). Every sampling group is then used to derive a signature by RFE-SVM.
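A minimal sketch of this sampling scheme is shown below. The function name and the fixed seed are illustrative assumptions, and the split is drawn over all samples without balancing the classes within each half, which may differ from the procedure actually used.

```python
import random


def make_sampling_groups(n_samples, n_splits=5000, n_groups=10, seed=0):
    """Draw unique half/half training-test splits and deal them into groups."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    half = n_samples // 2
    seen = set()
    splits = []
    while len(splits) < n_splits:
        train = frozenset(rng.sample(indices, half))
        if train in seen:  # keep every combination of samples unique
            continue
        seen.add(train)
        test = [i for i in indices if i not in train]
        splits.append((sorted(train), test))
    rng.shuffle(splits)
    size = n_splits // n_groups
    return [splits[g * size:(g + 1) * size] for g in range(n_groups)]


# 125 spectra (73 HCC + 52 CLD) give 10 groups of 500 training-test sets.
groups = make_sampling_groups(125)
first_train, first_test = groups[0][0]
```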
In every sampling group generated by multiple random sampling, each of the 500 training sets is used to train an SVM class-differentiation system, and the peaks are ranked by RFE according to their contribution to the SVM classifier. The performance of the peak subset selected at each iteration step is evaluated on the associated 500 test sets. Two parameters of the RBF-kernel SVM, σ and C, are kept constant during all iterations of the peak elimination procedure and across all 500 training sets in the group, in order to derive a peak ranking criterion that is consistent for all iterations and for different combinations of samples. Different combinations of σ and C are scanned. A combination of the two parameters is defined as the universal set of globally optimized parameters if an SVM class-differentiation system using these parameters, on the peak subset selected at a certain iteration step, achieves the best average class-differentiation accuracy over the 500 test sets in the group. That peak subset is then taken as one discriminant peak set.
To further reduce the chance of erroneous elimination of predictor-peaks, additional peak-ranking consistency evaluation steps are implemented on top of the normal RFE procedures in each group:
For each sampling group, different SVM parameters are scanned and various RFE iteration steps are evaluated to identify the globally optimal SVM parameters and RFE iteration step that give the highest average class-differentiation accuracy for the 500 test sets. The 10 discriminant peak sets derived from these sampling groups are then used to evaluate stability and performance.
The computational complexity of SVM is $O(nm^2)$, where $m$ is the number of training samples and $n$ is the number of peaks. In the feature selection procedure, SVM will be retrained many times with a decreasing number of features. The number of iterations is $n$ if we remove the peaks one by one. When the multiple random sampling strategy is applied, the RFE-SVM needs to be executed $l$ times, where $l$ is the number of training sets used in each sampling group. Hence the overall computational complexity is $O(ln^2m^2)$.
The 203 MALDI glycan spectra were obtained from 73 HCC cases, 52 CLD cases, and 78 healthy samples from Egyptian population.14 The 52 CLD cases included 21 fibrosis patients, 25 cirrhosis patients, and 6 patients with unknown clinical information. Each mass spectrum contained approximately 121 000 data points in the mass range of 1500–5500 Da. After the preprocessing procedure of binning, baseline correction, peak identification, and peak alignment, the dimension of glycan spectra was reduced to 3397 m/z peaks.
The average intensities of the 3397 peaks in 203 spectra were used to determine a cluster of peaks that can represent an isotopic distribution. Figure 2 shows the average peak intensities of the 203 spectra in the mass range from 2800 to 2900. A close examination of the peaks in the figure reveals that individual glycans are represented by a cluster of peaks. This is due to the high resolution MALDI-TOF instrument used in this study that has the capability to resolve isotopes of individual glycans. We assumed that each cluster contains only one glycan molecule in order to simplify the problem, and reduced the number of peaks to 447 by selecting the local maximum peaks.
Hierarchical clustering analysis32 based on the 447 peaks grouped the 203 samples into three clusters (Table 1). The first cluster was dominated by HCC samples; it included 67 HCC, 4 CLD, and 27 healthy samples. We call this cluster the HCC Cluster. The second cluster, named the CLD Cluster, included 3 HCC, 24 CLD, and 14 healthy samples. The third cluster, named the Healthy Cluster, contained 3 HCC, 24 CLD, and 37 healthy samples. Ninety-two percent of the HCC samples were found in the HCC Cluster, whereas 92% of the CLD samples and 65% of the healthy samples fell into either the CLD Cluster or the Healthy Cluster. These results suggest that the 447 peaks have a certain ability to differentiate cancer patients from noncancer individuals (CLD and healthy) through an unsupervised clustering method.
Using a supervised SVM classifier with the 447 peaks and LOOCV evaluation, 95% accuracy in separating HCC from CLD samples and 96% accuracy in separating HCC from noncancer samples (CLD and healthy) were achieved. These results suggest that the 447 peaks perform well in separating HCC from CLD or healthy individuals by supervised analysis.
To determine the peaks most pertinent to differentiating HCC spectra from CLD spectra, 10 peak subsets were obtained using integrated RFE-SVM from 5000 training-test data sets with 447 peaks. These 5000 training-test data sets were generated through a random sampling method, with each training set containing 36–37 HCC and 26–27 CLD samples, and the associated test set consisting of the remaining HCC and CLD samples (36–37 HCC and 25–26 CLD). These data sets were divided into 10 groups, with 500 training-test data sets in each group. One peak set was obtained for each group, giving 10 peak sets in total. Each peak set contained 7 to 12 peaks. The stability of the peak sets was estimated from the percentage of peaks shared by all 10 peak sets. Table 2 shows that seven peaks were selected in all 10 peak sets. This indicates that 58–100% of the peaks in each peak set were shared by all 10 sets, suggesting that our method is quite stable. Overall accuracies ranging from 99.79 to 99.99% were obtained on the associated test sets (Table 2). We utilized a backward elimination method, in which feature selection started with the full-dimensional peak set (447 peaks) and the least important peaks were deleted sequentially. We also applied a forward selection method, in which feature selection started with an empty peak set and the most important peaks were added sequentially, to the RFE-SVM with the multiple random sampling strategy, ranking consistency evaluation, and consensus scoring system. Similar results were obtained from both methods, and the overall accuracies and consistency rates from both methods were significantly higher than those obtained using a typical RFE-SVM procedure (Table 3).
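The stability estimate reduces to a set intersection. In the sketch below the peak identifiers are hypothetical integer labels rather than the m/z values of Table 2, and only two of the ten peak sets are mimicked: the smallest (7 peaks) and the largest (12 peaks).

```python
def shared_peak_fractions(peak_sets):
    """Return the peaks common to all sets and, per set, the shared fraction."""
    shared = set.intersection(*(set(s) for s in peak_sets))
    fractions = [len(shared) / len(set(s)) for s in peak_sets]
    return shared, fractions


# Hypothetical labels: peaks 1-7 are common, peaks 8-12 appear in one set only.
smallest = [1, 2, 3, 4, 5, 6, 7]
largest = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
shared, fractions = shared_peak_fractions([smallest, largest])
```

With seven shared peaks, a 7-peak set is 100% shared and a 12-peak set is 7/12 ≈ 58% shared, which reproduces the 58–100% range quoted above.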
Hierarchical clustering analysis was performed on the 10 peak sets and 203 samples. The 78 samples from healthy individuals were not used in the peak selection procedure. Instead, they were employed in the evaluation step, together with the CLD and HCC samples. Table 1 shows the cluster analysis results obtained using the 10 peak sets derived in this study. Each peak set has the ability to stratify the samples into HCC Cluster, CLD Cluster and Healthy Cluster. HCC Clusters included 66–72 HCC, 0 CLD and 7–27 healthy samples. CLD Clusters included 0 HCC, 25–48 CLD and 10–21 healthy samples. Healthy Clusters included 1–7 HCC, 4–27 CLD and 49–62 healthy samples. These results indicate that 90–99% of HCC, 48–92% of CLD and 63–80% of healthy samples were grouped into correct clusters, suggesting that the peaks for differentiating HCC and CLD can also help to stratify HCC, CLD, and healthy people.
Hierarchical clustering analysis on the 7 peaks that were shared by all 10 peak sets shows three clusters (Table 1): the HCC Cluster (66 HCC and 7 healthy), the CLD Cluster (48 CLD and 21 healthy), and the Healthy Cluster (50 healthy, 4 CLD, and 7 HCC). It can be concluded that 90% of the HCC samples were in the HCC Cluster, 92% of the CLD samples were in the CLD Cluster, and 64% of the healthy samples were in the Healthy Cluster. These results are encouraging since the 78 healthy samples were not involved in the peak selection procedure. Using the 7 peaks, 96% accuracy was obtained in distinguishing HCC from noncancer individuals (CLD and healthy), and 100% accuracy was obtained in separating HCC from CLD patients, by LOOCV evaluation of SVM classifiers. Only 8 healthy samples were misclassified, even though none of the 78 healthy samples were involved in the peak selection procedure. These results demonstrate that the selected 7 peaks can predict the sample groups accurately.
To further evaluate the predictive capability of the selected peaks, we generated 19 new MALDI-TOF spectra from the Egyptian population using a protocol similar to that described in previous studies.14,20 The 19 new spectra, which included 9 CLD and 10 HCC cases, were generated at a different institute, by different people, and at a different time from the spectra used for peak selection. The preprocessing steps, including binning, baseline correction, peak identification, peak alignment, and isotopic distribution analysis, were performed in the same way as for the previous 203 samples. A peak in the new data set was chosen if its m/z value was within ±0.5 Da of the m/z value of one of the seven peaks we derived. We built an SVM model using the seven peaks and the 73 HCC and 52 CLD spectra we previously generated,14 and used it to predict the 19 newly generated spectra. Five CLD and 10 HCC cases were predicted correctly. Four CLD cases were misclassified as HCC cases, while no HCC case was misclassified as a CLD case. Thus 79% of the newly generated samples were predicted correctly, suggesting good performance of the selected 7 peaks on this new data set. The four CLD cases may have been predicted as HCC, whereas all HCC cases were predicted correctly, because the data set used to build the SVM model was unbalanced: the training data contained spectra from 73 HCC and 52 CLD cases. To balance the classes, we reduced the HCC sample size in the training set to 52 and rebuilt the model. The HCC accuracy (sensitivity) decreased from 100% to 80%, and the CLD accuracy (specificity) increased from 56% to 78%. The overall accuracy was 79%, the same as that obtained with the first model. Another way to improve the predictive accuracy is to increase the CLD sample size in the training set.
The association of the glycans and covariates (gender, smoking, HCV infection, and HBV infection) with HCC was analyzed by univariate logistic regression using the 78 healthy and 73 HCC spectra (Table 4). The analysis shows that five of the seven selected glycan markers (m/z values of 1580, 1996, 2186, 4311, and 4502) and the serological markers of viral infection (HBV and HCV) were strongly associated with HCC. Table 4 also shows that the seven selected glycan peaks have no association with smoking or gender.
Figure 3 shows the glycan intensity analysis of the seven peaks among HCC patients, CLD patients, and healthy individuals. The structural compositions of 4 of the 7 glycans, with m/z values of 1580, 2040, 2187, and 2851, were characterized by a combination of enzymatic sequencing and MALDI-TOF/TOF tandem mass spectrometry18,33 (Table 5). The structural analysis of the glycan at m/z 4502 is underway. On the basis of its molecular weight and isotopic distribution, a possible structure match for this glycan was found in the database http://www.functionalglycomics.org/static/index.shtml. The glycan with the m/z value of 4502 was selected in all of our previous studies.14,18-20 The glycan with the m/z value of 2040 was chosen in two of our previous studies.18,19 The glycan with the m/z value of 2187 was selected in one of our previous studies.18 However, neither the glycan at m/z 1580 nor the glycan at m/z 2851 was found in our previous studies.
A literature search indicates that some cancer-related proteins carry glycans with the same compositions as the four structurally characterized glycans we selected. The high mannose glycan at MW 1580 could be enzymatically released from a research anticancer target, beta-glucuronidase,34,35 and a successful anticancer target, cytochrome P450.34,36 The glycan at MW 2187 could be released from a research anticancer target, chorionic gonadotrophin (hCG).34,37 The expression of these two glycans increased gradually from healthy people through CLD to HCC cases (Figure 3), implicating a possibly important involvement of these two glycans and their associated glycoproteins in HCC progression. The two fucosylated glycans at MW 2040 and 2851 could be enzymatically released from human immunoglobulin IgG.38-41 A comprehensive determination of the glycan composition at the glycosylation sites in these proteins is needed to reveal post-translational modifications of the glycoproteins, analyze the specific function of the carbohydrate moiety, and determine the potential clinical utility of the glycan markers for early diagnosis of HCC.
Changes in glycosylation are associated with physiological and pathophysiological conditions of cells. Glycan marker discovery would be helpful in determining the physiological state of patients. We identified 7 N-glycan serum markers associated with HCC from mass spectrometry data. The 7 glycans achieved good performance in distinguishing HCC patients from CLD patients and healthy people with unsupervised clustering and supervised classification methods, and on a newly generated data set. The structures of 4 of the glycans could be identified in important cancer-related glycoproteins. Further analysis and evaluation of these marker candidates are needed to validate their clinical utility for early diagnosis of HCC.
This work was supported in part by the National Science Foundation Grant IIS-0812246, the National Cancer Institute (NCI) R21CA130837 Grant, NCI R03CA119313 Grant, NCI Early Detection Research Network Associate Membership Grant, and the Prevent Cancer Foundation Grant awarded to H.W.R.