Toward our overall goal of identifying potential serum-based biomarkers to predict lung cancer in high-risk subjects, including patients with CT-detected pulmonary nodules, we applied bottom-up LC-MS/MS-based proteomic analyses to a large set of pooled, clinically ascertained NSCLC case sera and matched controls. Pooled sera were subjected to MARS14 depletion, followed by in-solution trypsin digestion and LC-MS/MS analysis. A large number of low-abundance proteins were identified, suggesting relatively broad coverage of the serum proteome. By comparing the proportions of the 14 MARS14-depleted serum proteins in serum protein mass versus total spectral counts, we found that the MARS14 spin cartridges depleted different proteins with differing effectiveness, with depletion efficiency highest for serum albumin. This differential effectiveness probably reflects the quality and quantity of the antibodies used in the MARS14 spin cartridges.
Relative quantitation of protein abundance in our study was achieved via spectral counting. To increase the reliability of the spectral counting, we analyzed our samples in triplicate and included only proteins with ≥2 different peptides and >15 total spectral counts. We recognize that this gain in reliability comes at a substantial cost, because a large number of proteins are excluded from quantitation. Of the 20,916 proteins identified from the combined 93 LC-MS/MS runs, 11,549 were identified with ≥2 different peptides; however, only 1127 of them had >15 total spectral counts.
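The two inclusion criteria above can be illustrated with a minimal sketch. The record layout, the accession labels, and the function names are hypothetical and are not taken from the study's actual pipeline; only the thresholds (≥2 different peptides, >15 total spectral counts) come from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProteinRecord:
    """Hypothetical per-protein summary across all LC-MS/MS runs."""
    accession: str              # illustrative identifier, not a real accession
    unique_peptides: int        # number of different peptides identified
    spectral_counts: List[int]  # one spectral count per run


def passes_quantitation_filter(protein: ProteinRecord,
                               min_peptides: int = 2,
                               min_total_counts: int = 15) -> bool:
    """Keep proteins with >= min_peptides peptides and > min_total_counts spectra."""
    return (protein.unique_peptides >= min_peptides
            and sum(protein.spectral_counts) > min_total_counts)


def filter_proteins(records: List[ProteinRecord]) -> List[ProteinRecord]:
    """Return only the proteins that satisfy both inclusion criteria."""
    return [p for p in records if passes_quantitation_filter(p)]
```

Note that the count threshold is strict (>15), so a protein with exactly 15 total spectral counts would be excluded under this reading of the criteria.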
Although spectral counting has gained popularity for relative quantitation of protein abundance in label-free LC-MS/MS workflows, statistical tools for comparative analysis based on spectral counting are still immature. Because spectral counts are count data by nature, some researchers have used Poisson regression for comparative analysis of spectral counting based data.18,19
However, Poisson regression assumes that the dependent variable (here, the spectral count) follows the Poisson distribution and therefore that the variance equals the mean. We tested the performance of Poisson regression on our dataset and found that about 10% of the resulting Poisson regression models had a deviance/degrees of freedom (df) ratio >2, indicating over-dispersion for some proteins (data not shown). We therefore decided to use a negative binomial regression model, a useful alternative for data that are over-dispersed relative to the Poisson model, which incorporates an additional parameter to adjust the variance independently of the mean.13
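The deviance/df diagnostic described above can be sketched for the simplest possible case. This sketch assumes a Poisson model with a single case/control indicator and no offset or other covariates, which may differ from the model actually fitted in the study. Under that assumption the fitted means are simply the group means, so no iterative fitting is needed. The function names and example counts are hypothetical.

```python
import math
from typing import List, Sequence


def poisson_deviance(counts: Sequence[float], fitted: Sequence[float]) -> float:
    """Poisson deviance: 2 * sum(y * ln(y / mu) - (y - mu)), with 0 * ln(0) = 0."""
    total = 0.0
    for y, mu in zip(counts, fitted):
        if y > 0:
            total += y * math.log(y / mu)
        total -= (y - mu)
    return 2.0 * total


def dispersion_ratio(case_counts: Sequence[float],
                     control_counts: Sequence[float]) -> float:
    """Deviance / residual df for a two-group Poisson model (assumed design)."""
    counts: List[float] = []
    fitted: List[float] = []
    for group in (case_counts, control_counts):
        group_mean = sum(group) / len(group)  # MLE of the group mean
        counts.extend(group)
        fitted.extend([group_mean] * len(group))
    residual_df = len(counts) - 2  # two estimated parameters
    return poisson_deviance(counts, fitted) / residual_df


# Hypothetical spectral counts for one protein across pools.
ratio = dispersion_ratio([1, 40, 2, 37], [0, 25, 1, 30])
is_overdispersed = ratio > 2  # threshold used in the text
```

A ratio near 1 is what the Poisson assumption predicts. Values well above 1 suggest that the variance exceeds the mean, which is the situation the negative binomial model accommodates, for example through a variance of the form μ + αμ² in the common NB2 parameterization.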
We applied negative binomial regression to the 1127 proteins with ≥2 different peptides and with >15 total spectral counts. A robust discovery set (N=49) of lung cancer selective proteins was identified. We also compared the negative binomial analysis results to those obtained using the non-parametric Wilcoxon rank sum test. The same feature selection criteria using the Wilcoxon rank sum test yielded a smaller set of significant features (N=40), with 28 of them also detected by negative binomial regression. A number of the selected proteins have been identified in the previously referenced proteomics studies,9,10
such as alpha-1-acid glycoprotein 1, N-acetylmuramoyl-L-alanine amidase, gelsolin, haptoglobin, ficolin-3, and beta-ala-his dipeptidase. Consistent with the important role of inflammation in tumor development,20
functional analysis by IPA showed a significant enrichment of proteins associated with inflammatory responses or acute phase reaction.
The robustness of the spectral counting derived abundance of these selected proteins in discriminating case pools from control pools was tested using unsupervised hierarchical clustering, a statistical method for grouping subjects based on the similarity of their measured characteristics. The analysis showed almost complete separation between the NSCLC case pools and the control pools, with the exception of one case pool (P15) and one clinical control pool (P18). However, pool P15, although clustered with the control pools, also shared substantial similarity in protein abundance with the sub-cluster (cluster #1b) consisting of 4 other squamous cell carcinoma pools. It is also worth noting that many of the clinical control subjects, although cancer free at the time of serum sample collection, had non-malignant lung disease, as demonstrated by the presence of CT-detected pulmonary nodules. The clustering of the clinical control pool P18 with the NSCLC case pools may therefore result from the contributions of the inflammatory proteins included in our list of significant features. Nonetheless, our analysis separated the majority of the clinical control pools (7 of the 8 pools) from the NSCLC case pools, suggesting that our selected features may be useful in increasing the specificity of thoracic CT scans for lung cancer screening, thus reducing the need for invasive thoracic procedures in patients with CT-detected indeterminate nodules.
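For readers unfamiliar with the technique, the clustering step can be illustrated with a small agglomerative example. The text does not state the distance metric or linkage rule used, so Euclidean distance and average linkage are assumptions here, and the pool labels and abundance values are invented for illustration only.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two abundance profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage_clusters(profiles: Dict[str, Sequence[float]],
                             n_clusters: int) -> List[List[str]]:
    """Merge the two closest clusters repeatedly until n_clusters remain.

    Cluster-to-cluster distance is the mean pairwise distance between
    members (average linkage, an assumed choice).
    """
    clusters: List[List[str]] = [[label] for label in profiles]
    while len(clusters) > n_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair
        for i, j in combinations(range(len(clusters)), 2):
            dists = [euclidean(profiles[a], profiles[b])
                     for a in clusters[i] for b in clusters[j]]
            mean_dist = sum(dists) / len(dists)
            if best is None or mean_dist < best[0]:
                best = (mean_dist, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return [sorted(c) for c in clusters]


# Hypothetical pools, each described by the abundance of two proteins.
toy_profiles = {
    "case1": [5.0, 1.0],
    "case2": [5.2, 0.9],
    "ctrl1": [1.0, 4.0],
    "ctrl2": [1.3, 4.1],
}
two_groups = average_linkage_clusters(toy_profiles, n_clusters=2)
```

Because the procedure never sees the case/control labels, any separation it recovers reflects the abundance profiles alone, which is why it serves as a check on the selected features.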
Adenocarcinoma and squamous cell carcinoma are the two most common NSCLC subtypes. These two NSCLC subtypes have been shown to differ significantly both in terms of their clinical behavior and molecular signatures.21,22,23
Interestingly, when we used the comparative functional analysis tools in IPA to compare the significant features selected from the comparison of adenocarcinoma case pools to all control pools with those selected from the comparison of squamous cell carcinoma case pools to all control pools, we found differential enrichment of several functional categories for these two NSCLC subtypes. Lung adenocarcinoma patients appear to experience a greater pro-inflammatory response, based on the observed alterations in the serum proteome. More inflammatory response related proteins are differentially abundant in adenocarcinoma case pools than in squamous cell carcinoma case pools. Significantly higher numbers of adenocarcinoma selective proteins have previously been associated with several inflammation-related conditions, such as infectious, respiratory, gastrointestinal, and dermatological diseases. Also, among the inflammation-related proteins differentially expressed in both NSCLC subtypes, such as AAG1, AAG2 and SAA, the magnitude of change appears to be larger for the adenocarcinoma case pools. In contrast, squamous cell lung carcinoma patients seem to experience a greater response in plasma lipid physiology, with more differentially abundant proteins involved in molecular transport and small molecule biochemistry. Both functional categories involve proteins responsible for the transport, secretion, clearance, and/or metabolism of lipids. Squamous cell carcinoma patients also seem to have greater deregulation of their endocrine system, with a greater number of differentially abundant proteins associated with endocrine system disorders. We also observed a greater magnitude of change in squamous cell lung carcinoma for endocrine system related features shared by both subtypes, such as neuregulin-2, apolipoprotein B-100, and beta-ala-his dipeptidase.
Interestingly, the majority of these proteins (18 out of 19) have been previously related to diabetes mellitus. Smoking has been shown to contribute to endocrine disorders including the development of insulin resistance and hence type 2 diabetes mellitus.24
Because squamous cell carcinoma is the NSCLC subtype more frequently diagnosed in heavy smokers, it is not clear whether our observation reflects heavier smoking among the squamous cell carcinoma cases in our study.
In summary, we have presented data from an initial comparative proteomics study for lung cancer serum biomarker discovery and generated a robust discovery set of candidate proteins with altered abundance in lung cancer. Functional analyses support the important roles of inflammation in cancer and also reveal differential physiological responses related to different NSCLC subtypes. The clinical utility of these candidate lung cancer serum biomarker proteins needs to be validated with additional analytical platforms as well as in independent case/control sample sets.