A more accurate taxonomy of small intestinal (SI) neuroendocrine tumors (NETs) is necessary to predict tumor behavior and prognosis and to define therapeutic strategy. We identified a panel of markers implicated in tumorigenicity, metastasis, and hormone production and hypothesized that transcript levels of MAGE-D2, MTA1, NAP1L1, Ki-67, Survivin, FZD7, Kiss1, NRP2, and CgA could be used to define primary SI NETs and predict the development of metastases.
Seventy-three clinically and World Health Organization (WHO) pathologically classified NET samples (primary: n=44; liver metastases: n=29) and 30 normal human enterochromaffin (EC) cell preparations were analyzed using real-time PCR. Transcript levels were normalized to three NET house-keeping genes, ALG9, TFCP2 and ZNF410, using geNorm. A predictive gene-based model was constructed from the transcript expression levels using supervised learning algorithms.
Primary SI NETs could be differentiated from normal human EC cell preparations with 100% specificity and 92% sensitivity. Well-differentiated NETs (WDNETs), well-differentiated neuroendocrine carcinomas (WDNECs), and poorly differentiated NETs (PDNETs) were classified with specificities of 78%, 78%, and 71% respectively, while poorly differentiated neuroendocrine carcinomas (PDNECs) were misclassified as either WDNETs or PDNETs. Metastases were predicted in all cases with 100% sensitivity and specificity.
Gene expression profiling and supervised machine learning can be used to classify SI NET subtypes and accurately predict metastasis. Application of this technique will facilitate accurate molecular pathological delineation of NET disease, better define its extent and facilitate the assessment of prognosis as well as providing a guide to identification of an appropriate strategy for individualized patient treatment.
The precise pathological classification of small intestinal (SI) neuroendocrine tumors (NETs) is important since individual histopathological subtypes are associated with distinct clinical behavior 1. However, definitive histopathological assessment of specific tumors is difficult and in some circumstances does not provide a completely reliable prediction of behavior. Nevertheless, current pathological classifications of lesions remain the basis for the identification of tumor type, prediction of biological behavior and delineation of treatment strategy. The classification criteria adopted by the World Health Organization (WHO) in 2000 for NETs utilize size, proliferative rate, localization, differentiation, and hormone production 2. A distinction is made between well differentiated NETs (WDNETs; benign behavior or uncertain malignant potential), well differentiated neuroendocrine carcinomas (WDNECs; low-grade malignancy), poorly differentiated neuroendocrine tumors (PDNETs; medium-grade malignancy), and poorly differentiated (usually small cell) neuroendocrine carcinomas (PDNECs; high-grade malignancy). This system has not been widely adopted and alternatives have been proposed, although all rely upon parameters that are open to observer variation and interpretation.
Diagnostic pathology has traditionally relied on macro- and microscopic histology and tumor morphology as the basis for tumor classification. Current classification frameworks, however, are not validated and are unable to accurately discriminate among tumors with similar histopathologic features, which may vary substantially in clinical course and in response to treatment 3. This reflects either incorrect classification or the fact that an individual lesion may evolve into a different type of NET pattern or that there may exist within one tumor more than one phenotype 4. The recognition of these limitations has led to an interest in modifying the basis of tumor classification from a purely morphologic stratification to a system that includes molecular parameters predictive of biological behavior. The myriad genes and their combinations in neoplastic lesions have, however, proved difficult to investigate, and with conventional analysis the relevance of individual changes is intricate and complex to quantify. The advent of machine learning offers a logical approach for developing sophisticated, automatic, and objective algorithms for the analysis of high-dimensional and multimodal biomedical data. In several studies, Support Vector Machines (SVM), a supervised learning algorithm, have been used to predict the grading of astrocytomas 5 with a >90% accuracy, and prostatic carcinomas with an accuracy of 74–80% 6. A similar predictive tool, Decision Trees, has also been demonstrated to have utility (70–90%) in predicting the prognosis of metastatic breast cancer 7 as well as colon cancer 8.
The effectiveness of such mathematical tools is, however, dependent upon selection of the appropriate variables (genes or biomedical data, e.g. serum hormone levels), which in any classification model is a crucial requirement that determines the success of a predictive algorithm. While microarray-based approaches have been employed in classifying leukemia subtypes 9 and endocrine tumors 10, identifying a reliable selection of gene candidates is difficult due to the high opportunity for errors arising from the intrinsic “curse of dimensionality”, the variability present in any microarray readout, and the difficulty of finding appropriate normalization standards 11. Real-time PCR is considered the gold standard for validating microarray data. Such an approach is highly reproducible, has a greater dynamic range than microarray technology, can detect small changes in expression, and is easy to introduce into a clinical laboratory 12. In this study, we hypothesized that transcript levels of a panel of markers that we and others have implicated in tumorigenicity, metastasis, and hormone production of NETs - MAGE-D2, MTA1, NAP1L1, Ki-67, Survivin, FZD7, Kiss1, NRP2, and CgA 13–19 - could be used to classify primary SI NETs and predict the development of metastases. We performed extensive transcript-level analysis and implemented supervised learning algorithms to construct a predictive gene-based model from the regulatory genotype (adhesion, migration, proliferation, apoptosis, metastasis and hormone secretion) that underlies SI NET subtypes.
A total of 53 clinically annotated SI NETs (frozen biobank, all tumors classified as functional; tissues microdissected, >80% pure neoplastic cells) and 13 normal human EC cell preparations (obtained by FACS sorting of normal mucosa; >98% pure EC cells 20) were collected for real-time PCR analysis. The clinical data set contained primary SI NETs (n=36), and liver metastases from corresponding primary tumor types (n=17). Tumors were pathologically classified according to the WHO standards as WDNET (n=18), WDNEC (n=9), PDNET (n=7), PDNEC (n=2). Metastatic (MET) tissue (all collected from liver specimens) from the corresponding tumor types was classified in a similar fashion: WDNET MET (n=6), WDNEC MET (n=8), and PDNEC MET (n=3). No samples of PDNET metastases were collected. All patients were enrolled according to protocols approved by the institutional review board of Yale University.
RNA was extracted (TRIZOL®, Invitrogen, USA) 21 from the training set consisting of 36 primary tumors, 17 liver metastases, and 13 normal human EC cell preparations. Additional RNA was extracted from the independent test set consisting of normal EC cell preparations (n=17), localized SI NETs (n=8), and malignant SI NETs (n=12). All tumor samples were described as functional NETs. Transcript levels of the 9 identified marker genes (MAGE-D2, MTA1, NAP1L1, Ki-67, Survivin, FZD7, Kiss1, NRP2, and CgA 13, 14) were measured using Assays-on-Demand™ products and the ABI 7900 Sequence Detection System according to the manufacturer’s suggestions 21. Cycling was performed under standard conditions (TaqMan® Universal PCR Master Mix Protocol) and raw ΔCT values were normalized using geNorm 22 and expression of the novel house-keeping genes ALG9, TFCP2 and ZNF410 23. Normalized data were natural log (ln)-transformed for compression.
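As a rough illustration of the normalization step, the sketch below divides a target gene's relative quantity by the geometric mean of three reference-gene quantities and then takes the natural log. It is not the geNorm implementation. The function names, the assumption of ideal PCR efficiency (a factor of 2 per cycle), and the CT values are all hypothetical.

```python
import math

def relative_quantity(ct, efficiency=2.0):
    """Convert a CT value to a relative quantity (assumes ideal doubling per cycle)."""
    return efficiency ** (-ct)

def normalized_ln_expression(target_ct, reference_cts):
    """ln of the target quantity divided by the geometric mean of reference quantities."""
    ref_q = [relative_quantity(ct) for ct in reference_cts]
    norm_factor = math.exp(sum(math.log(q) for q in ref_q) / len(ref_q))
    return math.log(relative_quantity(target_ct) / norm_factor)

# Hypothetical CT values: one target gene and three house-keeping genes
# (stand-ins for ALG9, TFCP2 and ZNF410).
value = normalized_ln_expression(28.0, [24.0, 25.0, 26.0])
```

With these made-up inputs the target sits three cycles below the reference geometric mean, so the ln-transformed value is negative, which matches the sign of the expression values reported in the results.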
Transformed data were imported into Partek® Genomic Suite 24. Principal Component Analysis (PCA) was used to visualize patterns present in the data and determine whether or not individual tumor subtypes as well as normal EC cell transcript profiles could be differentiated. Additionally, mean expression values (M) and Standard Deviations (SD) of each marker gene were measured in tumor subtypes and normal EC cell preparations. Subsequently, Analysis of Variance (ANOVA) was performed to identify gene expressions that were significantly (p<0.05) changed between normal EC cell preparations and primary tumor tissues, as well as between normal EC cell preparations and specific tumor types. Classification was performed using supervised learning algorithms: Support Vector Machines (SVM), Decision Tree, and Perceptron 25. Feature Selection (FS) was used to choose the best subset of features for robust learning models 26.
When applied to biology, Feature Selection removes the most redundant features from a data set, which enhances the generalization capability, accelerates the learning process and improves model interpretability 26. We used a “greedy forward” selection approach to select the most relevant subset 26.
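A minimal sketch of greedy forward selection follows, assuming a generic scoring function that rates a candidate subset (for example, cross-validated accuracy). The scoring function, gene weights and size penalty used here are hypothetical and serve only to show the mechanics; they are not the criteria used in the study.

```python
def greedy_forward_selection(features, score, max_features=None):
    """Add, one at a time, the feature that most improves score(subset)."""
    selected = []
    best_score = float("-inf")
    remaining = list(features)
    while remaining and (max_features is None or len(selected) < max_features):
        candidate_scores = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidate_scores, key=lambda p: p[0])
        if top_score <= best_score:
            break  # no candidate improves the current subset
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score

# Toy scoring function (hypothetical): rewards two informative genes
# and penalizes subset size.
def toy_score(subset):
    informative = {"NAP1L1": 0.5, "Ki-67": 0.3}
    return sum(informative.get(g, 0.0) for g in subset) - 0.05 * len(subset)

chosen, final = greedy_forward_selection(["CgA", "Ki-67", "NAP1L1", "MTA1"], toy_score)
```

The procedure is greedy: once a feature is added it is never removed, so the result is a good subset rather than a guaranteed optimum.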
PCA is an exploratory technique used to describe the structure of high-dimensional data by reducing its dimensionality to uncorrelated principal components (PCs) that explain most of the variation in the data 27. PCA mapping was visualized in a 3-dimensional space where the X-, Y-, and Z-axes represent the 1st, 2nd, and 3rd PCs respectively.
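To make the idea of "variance explained by a PC" concrete, the sketch below computes a sample covariance matrix and extracts the first principal component by power iteration. This is a simplified stand-in for the PCA performed in the statistical package, and the two-gene expression values are hypothetical.

```python
import math

def covariance_matrix(rows):
    """Sample covariance of column-centered data (rows = samples, columns = genes)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def first_principal_component(cov, iterations=200):
    """Power iteration: returns the leading eigenvalue and its unit eigenvector."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p))
    return eigenvalue, v

# Hypothetical ln-transformed expression of two genes in four samples.
data = [[-9.0, -4.1], [-8.0, -3.9], [-7.0, -3.2], [-6.0, -2.8]]
cov = covariance_matrix(data)
eigenvalue, loading = first_principal_component(cov)
# Fraction of total variance carried by PC#1 (total variance = trace of cov).
explained = eigenvalue / (cov[0][0] + cov[1][1])
```

Because the two toy genes are strongly correlated, nearly all of the variance falls on the first component; with nine less correlated genes the variance is spread over several PCs, as in the results.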
For ANOVA, a two-class unpaired algorithm was implemented with tumor types and normal EC cell preparations defining the two groups. There were no missing values in our dataset; therefore, imputation was unnecessary. Geometric Fold Change (FC) was calculated as the ratio of the geometric means of the Tumor Group and the Normal Group. Genes with expression p<0.05 (between normal and tumor) were considered significantly changed.
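Because the data are ln-transformed, the ratio of geometric means reduces to the exponential of the difference between group means. The sketch below shows that calculation together with the p<0.05 and absolute FC≥2.0 criteria applied in the results. Treating "absolute FC" as the inverse for ratios below 1 is an assumption, the p-value is taken as an input rather than computed, and the expression values are hypothetical.

```python
import math

def geometric_fold_change(ln_tumor, ln_normal):
    """Ratio of geometric means, computed from ln-transformed expression values."""
    mean_t = sum(ln_tumor) / len(ln_tumor)
    mean_n = sum(ln_normal) / len(ln_normal)
    return math.exp(mean_t - mean_n)

def is_differentially_expressed(fc, p_value, fc_cutoff=2.0, alpha=0.05):
    """Apply both criteria; a fold change below 1 is inverted (assumed convention)."""
    magnitude = fc if fc >= 1.0 else 1.0 / fc
    return p_value < alpha and magnitude >= fc_cutoff

# Hypothetical ln-transformed values for one gene.
fc = geometric_fold_change([-2.0, -3.0, -4.0], [-5.0, -5.0, -5.0])
```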
SVM attempts to classify the data by maximizing the margin between the n classes 28. A radial basis function was used as the kernel and 10-fold cross-validation was used to measure the sensitivity of classification 28. Previous studies have utilized this method to predict the grading of astrocytomas 5 with a >90% accuracy, and prostatic carcinomas with an accuracy of 74–80% 6.
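The two ingredients named above, a radial basis function kernel and a 10-fold partition of the samples, can be sketched as follows. This does not train an SVM. The kernel width `gamma` is a hypothetical parameter (no value is reported), and the simple interleaved split shown here does no shuffling or stratification.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2); equals 1 for identical points."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists; each sample is held out exactly once."""
    indices = list(range(n_samples))
    for fold in range(k):
        test = indices[fold::k]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

# 49 = 13 normal EC cell preparations + 36 primary SI NETs in the training set.
folds = list(k_fold_indices(49, k=10))
```

In each fold a classifier would be fitted on `train` and evaluated on `test`, and the per-fold results pooled to estimate sensitivity.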
A Decision Tree is a predictive model that maps observations about an item to a conclusion about its target value 29. The leaves of the tree represent classifications and branches represent conjunctions of features that devolve into the individual classifications. A 10-fold cross-validation was used to measure the efficiency of this technique 30. In previous studies this approach was effective (70–90%) in predicting prognosis of metastatic breast cancer 7 as well as colon cancer 8.
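A single node of such a tree can be illustrated by searching for the expression threshold that best separates two classes. The sketch below uses Gini impurity as the split criterion, which is an assumption, since the criterion used by the software is not stated. The expression values and labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 for a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best single-feature split."""
    ordered = sorted(set(values))
    best = (None, float("inf"))
    for low, high in zip(ordered, ordered[1:]):
        threshold = (low + high) / 2.0  # midpoint between neighbouring values
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best[1]:
            best = (threshold, impurity)
    return best

# Hypothetical ln-transformed expression of one gene in six tumors.
expr = [-6.0, -5.5, -5.0, -3.0, -2.5, -2.0]
subtype = ["WDNET", "WDNET", "WDNET", "PDNET", "PDNET", "PDNET"]
threshold, impurity = best_split(expr, subtype)
```

A full tree repeats this search recursively on each resulting subgroup, which is how a root gene and subsequent branch genes arise.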
The Perceptron is a linear classifier that forms a feed-forward neural network and maps input variables to a binary output 25. Three data scans were used to generate the decision boundaries that explicitly separate data into classes. A learning rate of 0.05 was used 31; this constant regulates the speed of learning, and a lower value improves the classification model at the expense of processing time. This algorithm was used to distinguish between localized tumors and the corresponding metastases. This methodology has previously been shown to be effective in predicting malignancy of breast cancer 31.
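The update rule behind this classifier is short enough to show in full. The sketch below is a textbook perceptron using the stated learning rate of 0.05 and three passes over the data. It is not the implementation used in the study, and the mean-centered expression values are hypothetical.

```python
def train_perceptron(samples, labels, learning_rate=0.05, scans=3):
    """Classic perceptron rule; labels are 0 or 1, one scan = one pass over the data."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(scans):
        for x, target in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = target - (1 if activation > 0 else 0)
            # Weights move only when the sample is misclassified (error != 0).
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias

def predict(weights, bias, x):
    """Return 1 if the sample falls on the positive side of the boundary, else 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

# Hypothetical mean-centered expression of two genes; 0 = primary, 1 = metastasis.
X = [[-2.0, -1.5], [-1.5, -1.0], [1.5, 1.0], [2.0, 1.5]]
y = [0, 0, 1, 1]
w, b = train_perceptron(X, y)
predictions = [predict(w, b, x) for x in X]
```

A perceptron converges only when the classes are linearly separable, so a small number of scans suffices for cleanly separated groups such as this toy example.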
To determine whether primary SI NETs, normal EC cell preparations, and respective metastases could be differentiated, PCA was used to reduce real-time PCR expressions of the 9 marker genes into three PCs that reflect most of the variance in the dataset. In primary SI NETs and normal EC cell preparations, 31.7% of the variance was represented by PC#1, 26.5% by PC#2, and 17.4% by PC#3; overall 75.6% of the variance was represented by all 3 PCs (Figure 1A). In metastases, 40.4% of the variance was represented by PC#1, 19.9% by PC#2, and 12.9% by PC#3; overall 73.2% of the variance in the data was represented by all 3 PCs (Figure 1C). The distance between the centroids (“centers of mass”) for each tumor subtype in the three dimensions represents the relative similarity of their regulatory signatures (transcript expression levels). As assessed by this method, the individual tumor subtypes, normal EC cell preparations, and SI NET metastases have distinct regulatory signatures. PCA analysis also identified clusters of related gene expressions, which are associated with the cosine of the angle between individual expression vectors 32. Thus, in primary SI NETs, related gene clusters are as follows: 1) CgA, NRP2, NAP1L1, FZD7; 2) MAGE-D2, MTA1, Kiss1; 3) Ki-67, Survivin (Figure 1B). In corresponding metastases, the related groups are: 1) NAP1L1, FZD7, CgA, Survivin, Ki-67, Kiss1; 2) MTA1, MAGE-D2, NRP2 (Figure 1D).
In the next analysis, the mean expression level of each marker gene in primary tumor subtypes and normal EC cell preparations was calculated. Mean transcript expressions in normal EC cell preparations of CgA (MNormal = −9.2, SDNormal = 4.2), Ki-67 (MNormal = −4.5, SDNormal = 1.1), Kiss1 (MNormal = −4.0, SDNormal = 3.2), NAP1L1 (MNormal = −8.3, SDNormal = 1.1), NRP2 (MNormal = −9.3, SDNormal = 3.8), and Survivin (MNormal = −6.0, SDNormal = 1.0) were significantly (p<0.01) different from primary tumors (Table 1). To evaluate the reproducibility of the real-time PCR approach, measurement of target transcript expression levels was re-evaluated in a subset of samples (n=35). The repeated measurements were highly correlated (R2=0.93, p=0.001), demonstrating that this approach was both highly reproducible and robust.
We then utilized ANOVA to measure the change in transcript levels across tumor subtypes and normal EC cell preparations (Table 1). Only transcripts with p<0.05 and absolute FC≥2.0 were considered differentially expressed. CgA, FZD7, Ki-67, NAP1L1, NRP2, and Survivin were significantly altered in WDNETs compared to normal EC cell preparations. Transcript levels of CgA, Ki-67, MAGE-D2, and NRP2 were significantly changed in WDNECs. PDNETs displayed altered expression levels of CgA, Ki-67, NAP1L1, NRP2, and Survivin. Finally, PDNECs differed only in expression of NAP1L1 and NRP2.
To determine if the distribution of the primary tumor subtypes and normal EC cell preparations could be linearly separated, correlation coefficients for each marker gene pair were computed (Figure 2). Of note, MTA1:MAGE-D2, MTA1:Kiss1, FZD7:NAP1L1, and Survivin:Ki-67 correlation pairs were determined to be highly linear (R2 > 0.50). Additionally, distribution of WDNETs, WDNEC, and PDNETs was linear as determined by the pair-wise expressions of Kiss1:Survivin, FZD7:NAP1L1, Survivin:MTA1, and MTA1:MAGE-D2. These findings indicate that a linear classifier could be further applied to the dataset.
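The pairwise screen described above amounts to computing a squared Pearson correlation for every gene pair and keeping those above the R2 cut-off of 0.50. A minimal sketch follows; the three-gene expression table is hypothetical and chosen so that only one pair passes.

```python
def r_squared(x, y):
    """Squared Pearson correlation coefficient between two expression vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

def correlated_pairs(expression, cutoff=0.50):
    """Return gene pairs (in sorted name order) whose R^2 exceeds the cutoff."""
    genes = sorted(expression)
    return [(g1, g2) for i, g1 in enumerate(genes) for g2 in genes[i + 1:]
            if r_squared(expression[g1], expression[g2]) > cutoff]

# Hypothetical ln-transformed expression in four samples.
toy = {
    "MTA1": [-5.0, -4.0, -3.0, -2.0],
    "MAGE-D2": [-6.1, -5.0, -3.9, -3.0],
    "CgA": [-2.0, -7.0, -1.0, -6.0],
}
pairs = correlated_pairs(toy)
```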
SVM performed best in differentiating normal EC cell preparations (n=13) and the primary SI NETs (n=36). Feature selection identified NAP1L1, FZD7, Kiss1 and MAGE-D2 as the best variables for the classification model. Scatter plots of the SI NETs and normal EC cells colorized to the density of the samples produced differential zones that depended on the individual gene expressions (Figure 3). SVM predicted SI NETs with 100% sensitivity and 92% specificity (Figure 3). 77% of normal EC cell preparations were predicted accurately and the class specificity was 100% (Table 2).
Decision Tree classification performed best using Ki-67 and NAP1L1 as identified by FS. WDNETs were predicted with 78% sensitivity, WDNECs with 78%, and PDNETs with 71%, while PDNECs were misclassified as either WDNETs or PDNETs (Figure 4). The prediction specificities for WDNETs, WDNECs, and PDNETs were 82%, 64%, and 63% respectively (Table 3).
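The percentages reported here are consistent with "specificity" being used per class as the share of predictions for that class that are correct. That reading is an assumption, and the counts in the sketch below are hypothetical: they were chosen only because they reproduce the reported figures for 36 tumors and 13 normal preparations, and they are not taken from Table 2.

```python
def class_metrics(true_positive, false_positive, false_negative):
    """Sensitivity (recall) and the share of predictions for a class that are correct."""
    sensitivity = true_positive / (true_positive + false_negative)
    correct_share = true_positive / (true_positive + false_positive)
    return sensitivity, correct_share

# Hypothetical counts: all 36 tumors called tumor, 10 normals called normal,
# 3 normals called tumor.
tumor = class_metrics(true_positive=36, false_positive=3, false_negative=0)
normal = class_metrics(true_positive=10, false_positive=0, false_negative=3)
```

Under these assumed counts the tumor class has sensitivity 36/36 and a correct share of 36/39, while the normal class has sensitivity 10/13 and a correct share of 10/10.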
ANOVA was performed to identify differentially expressed transcripts in primary SI NET subtypes and corresponding metastases (Table 4). Significant gain of Kiss1 (p<0.005) was associated with all tumor subtypes.
To predict the metastasis of primary WDNETs, MAGE-D2, NAP1L1, and Kiss1 (as identified by FS) were used in SVM to construct a classifier. WDNETs and WDNET METs were predicted with 100% sensitivity and specificity. To visualize metastatic potential of primary tumors, samples were plotted in correlation with the selected gene expression levels and distribution densities were colorized to outline the separation of primary and metastatic samples (Figure 5A). WDNET could be predicted to metastasize if transcript levels of 1) NAP1L1 > −2.71 and Kiss1 > −2.50; 2) NAP1L1 > −3.82 and MAGE-D2 > −4.42; 3) MAGE-D2 > −3.21 and Kiss1 > −2.12.
A Perceptron classifier was used to predict metastases of WDNECs and PDNECs. NAP1L1 and Kiss1 were specific to WDNEC METs and CgA was specific to PDNEC METs, as determined by the FS algorithm. Metastases of all primary tumors were predicted with the sensitivity of 100% and specificity of 100%. Metastatic potential of primary tumors was visualized by plotting expressions of featured genes and colorizing the distribution densities of primary tumors and their metastases (Figure 5B, 5C). WDNECs were predicted to metastasize with values of NAP1L1 > −5.28 and Kiss1 > −2.83, while PDNECs could be predicted to metastasize when CgA > −3.5.
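The expression cut-offs reported for WDNETs, WDNECs and PDNECs can be written as simple rules on ln-transformed, normalized transcript levels. In the sketch below the function names are hypothetical, and treating the three WDNET conditions as alternatives (any one suffices) is an assumption about how the list is meant to be read.

```python
def wdnet_predicted_to_metastasize(nap1l1, kiss1, mage_d2):
    """True if any of the three reported WDNET conditions holds (assumed alternatives)."""
    return ((nap1l1 > -2.71 and kiss1 > -2.50)
            or (nap1l1 > -3.82 and mage_d2 > -4.42)
            or (mage_d2 > -3.21 and kiss1 > -2.12))

def wdnec_predicted_to_metastasize(nap1l1, kiss1):
    """Reported WDNEC rule: both transcripts above their cut-offs."""
    return nap1l1 > -5.28 and kiss1 > -2.83

def pdnec_predicted_to_metastasize(cga):
    """Reported PDNEC rule: CgA above its cut-off."""
    return cga > -3.5
```

These rules summarize the decision boundaries visualized in Figure 5 and apply only to values normalized and transformed in the same way as the training data.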
To validate the predictive model, real-time PCR was used to measure the marker gene transcript expression in an independent set of samples (n=37) consisting of normal EC cell preparations (n=17), localized SI NETs (n=8), and malignant SI NETs (n=12). All WDNETs were considered as “localized” while other tumor subtypes were considered “malignant”. Assessment of linearly correlated transcript pairs identified a pattern similar to the training set, in which MTA1:MAGE-D2, MTA1:Kiss1, FZD7:NAP1L1, and Survivin:Ki-67 transcript pairs were highly correlated (R2>0.50). The trained SVM model differentiated normal EC cell preparations from neoplasia with 76% accuracy. The Decision Tree model predicted localized and malignant NETs with 63% and 83% accuracy respectively (Figure 6). Furthermore, the F-test statistic was computed to confirm that the classification results of the training and the independent sets were not significantly different. The p-values for normal, localized, and malignant subgroups were 0.84, 0.25, and 0.80 respectively.
The critical issue in SI NETs is the prediction of biological aggression of an individual tumor since this defines treatment and predicts prognosis. Classically, this “calculation” has been based upon an amalgam of tumor size, extent of spread, histological classification, genetic background and assessment of proliferative markers (Ki-67 and mitotic index) 33. This approach has yielded data that are often not reproducible and may be inaccurate in certain tumors, and it is based upon pattern recognition. Given the recent advances in the identification of the molecular signatures of individual types of neoplasia and the utility of novel artificial intelligence techniques for analyzing complex data, we sought to re-evaluate SI NET neoplasia since current strategies are notoriously difficult in this particular tumor type.
We analyzed transcript levels of genes with known or suspected roles in NET neoplasia (MAGE-D2, MTA1, NAP1L1, Ki-67, Survivin, FZD7, Kiss1, NRP2, and CgA) and applied supervised machine learning algorithms to assess their prognostic power in primary SI NET subtypes and corresponding metastases. Our analysis demonstrated that primary SI NETs could be differentiated from normal human EC cells with 100% specificity and 92.3% sensitivity. WDNETs, WDNECs, and PDNETs were differentiated with specificities of 77.8%, 77.8%, and 71.4% using the same methodology, while PDNECs were misclassified as either WDNETs or PDNETs. The likelihood of metastases was predicted with 100% sensitivity and specificity. Furthermore, we applied the model to an independent set of SI NETs and demonstrated that normal EC cell preparations could be differentiated from neoplasia with 76% accuracy, while localized and malignant SI NETs could be predicted with 63% and 83% accuracy respectively.
Current pathological classification frameworks are not validated and are unable to consistently discriminate among tumors with similar histopathologic features, which may vary substantially in either their clinical course or response to treatment 3. Although supervised machine learning algorithms incorporating clinical, histopathological, and molecular data have been used to classify unknown astrocytomas 5, prostatic carcinomas 6, metastatic breast cancer 7, and colon cancer 8, no attempts have been made to apply similar methodology to gastrointestinal NETs. We have determined that the SVM classifier performed best in differentiating normal EC cells and SI NETs. This presumably represents the clear demarcation of molecular signatures characteristic of the normal and neoplastic samples as measured by real-time PCR. The seemingly homogeneous distribution of primary SI NET histological subtypes, as defined by marker gene levels, did not present a clear distinction and the SVM classification performed poorly under these conditions. Decision Trees, on the other hand, performed well in classifying primary SI NET subtypes. In particular, only NAP1L1 (nucleosome assembly protein 1-like) and Ki-67 (marker of proliferation) were selected by the FS algorithm to be the most relevant transcripts for this classification model. In the Decision Tree, the root position of NAP1L1 and its subsequent properties in distinguishing WDNETs, WDNECs, PDNETs, and PDNECs suggest that it functions as the prime component of the decision process. This is consistent with our previous studies that identified this mitotic regulatory gene to be implicated in malignant progression of the SI NETs as well as appendiceal tumors 13, 34. Ki-67 levels were only indicative of PDNETs.
This observation is consistent with numerous reports that the expression of the human Ki-67 protein is associated with cell proliferation 35 and can be used as a malignancy marker in prostatic cancer 36, adult-type granulosa cell tumor of the ovary 15, and mesothelial proliferations 37. Immunohistochemical assessment of NETs has identified Ki-67 as the best predictor of survival in neuroendocrine tumors 38. The Decision Trees misclassified PDNECs as either WDNET or PDNET. This was most likely due to the fact that the genes we used as classifiers were largely NET-specific 13, 14 and many of these lesions have a variable phenotype with adenocarcinomatous elements and often lack overt or consistent neuroendocrine features 39. The small sample size of this group in our current study does not permit further analysis of this subgroup, which remains an important issue that requires further investigation.
The SVM classifier performed best in differentiating primary WDNETs and their metastases. However, WDNECs, PDNECs, and their metastases were homogeneously separated in two-dimensional space, suggesting a distinct gene transcription between the primary tumors and their metastases. Perceptron performed best under these conditions, which is consistent with the requirements (discrete homogeneity) for this neural network 25. The FS algorithm determined NAP1L1, MAGE-D2, and Kiss1 to be implicated in the metastasis of WDNETs, NAP1L1 and Kiss1 in the metastasis of WDNECs, and CgA in the metastasis of PDNECs. Transcript levels of Kiss1, the metastasis suppressing gene, were significantly elevated in all metastases of primary SI NETs. Although previous studies have noted a loss of Kiss1 in human breast cancer 40 and metastatic melanoma 41, there has been a report of a gain of Kiss1 in the highly metastatic human bladder T24T cells when compared to the poorly metastatic parental T24 line 42. Our observations suggest the possibility that Kiss1 has a function other than metastasis suppression or may in fact be a metastasis promoter in SI NETs. An alternative explanation may be contamination with hepatocytes during dissection of the hepatic metastases. A comparison of Kiss1 levels demonstrated these were ~20x higher (p=0.0013, 2-tailed Mann-Whitney test) in normal hepatic tissue compared to normal small intestinal mucosa. It is therefore possible that tissue from the surrounding environment (liver) with high Kiss1 transcripts may be included in the metastatic samples and contribute to the elevated Kiss1 levels noted. In our investigations of the neoplastic EC cell line KRJ-I 43, we noted a 4.5-fold down-regulation of Kiss1 transcript (unpublished observations). To examine the utility of the classification algorithm, we examined gene expression in KRJ-I and determined this was a WDNET, confirming the initial histology of the tumor 44.
Validation of the predictive model in an independent set of normal EC cell preparations and SI NETs confirmed that the model is robust and reproducible. Transcript pairs (MTA1:MAGE-D2, MTA1:Kiss1, FZD7:NAP1L1, and Survivin:Ki-67) in the independent set were linearly correlated (R2>0.50), as in the training set. Although there was no significant statistical difference (p>0.2) between the classification accuracy of the training and independent sets, localized tumors in the independent set were classified at a lower accuracy rate. This can be attributed to the smaller sample size of this subgroup within the independent set. It is important to note, however, that the model performed best for the neoplastic tissue, suggesting that the present marker gene panel is a good indicator of malignancy. This is supported by a subset analysis we performed where transcripts of genes shown to predict metastasis (MAGE-D2, NAP1L1, Kiss1, and CgA) were excluded from the training and independent sets. The SVM model was re-trained and the resulting classifier, applied to the independent set, correctly classified only 2/17 (12%) samples as normal EC cells. This indicates that the metastasis-predicting gene expression is crucial in differentiating between normal and tumor samples.
This study offers an illustration of how gene expression profiling and supervised machine learning can be used to classify SI NET subtypes and accurately predict metastasis. The use of novel mathematical techniques of gene analysis provides a new tool that can be used to increase the accuracy of prediction of tumor behavior and therefore allow for a rational assessment of treatment strategies and prognosis. This technique of gene expression analysis may also have application in the assessment and delineation of the specific cellular mechanisms involved in NET cell proliferation and provides the opportunity to identify key metabolic sites that can be therapeutically targeted. A large-sample, investigative analysis of tumor type- and tumor grade-specific markers is required to confirm these results. It is, however, likely that the application of this technique or modifications thereof will become a valuable adjunct to current pathological identification and staging techniques in facilitating accurate identification of the biological nature of NET disease and determining its prognosis. Such information will provide the basis for the identification of an appropriate therapeutic strategy for individual tumors (patients) as opposed to the ineffective “one treatment for all tumors” concept currently utilized.
Financial support: NIH R01-CA115285 (I. Modlin)
Financial Disclosure: There are no financial disclosures from any author