NIH Public Access Author Manuscript; accepted for publication in a peer-reviewed journal.
J Thorac Oncol. Author manuscript; available in PMC Apr 1, 2012.
Published in final edited form as:
PMCID: PMC3104087
NIHMSID: NIHMS266301
Lung Cancer Serum Biomarker Discovery Using Label Free LC-MS/MS
Xuemei Zeng, Ph.D.,* Brian L. Hood, Ph.D.,*Ұ Ting Zhao, Ph.D.,*Ұ Thomas P. Conrads, Ph.D.,*§ Mai Sun, M.S.,* Vanathi Gopalakrishnan, Ph.D.,$Λ Himanshu Grover, B.S.,$ Roger S. Day, Sc.D.,$ Joel L. Weissfeld, M.D., M.P.H., David O. Wilson, M.D., M.P.H., Jill M. Siegfried, Ph.D.,§ and William L. Bigbee, Ph.D.*§#
*Mass Spectrometry Platform, Cancer Biomarkers Facility, University of Pittsburgh Cancer Institute, Pittsburgh, PA
§Lung and Thoracic Malignancies Program, University of Pittsburgh Cancer Institute, Pittsburgh, PA
Cancer Epidemiology Program, University of Pittsburgh Cancer Institute, Pittsburgh, PA
Department of Epidemiology, Graduate School of Public Health, University of Pittsburgh School of Medicine, Pittsburgh, PA
$Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, PA
Department of Pulmonary Medicine, University of Pittsburgh School of Medicine, Pittsburgh, PA
#Department of Pathology, University of Pittsburgh School of Medicine, Pittsburgh, PA
Department of Pharmacology & Chemical Biology, University of Pittsburgh School of Medicine, Pittsburgh, PA
ΛDepartment of Computational Biology, University of Pittsburgh School of Medicine, Pittsburgh, PA
Address for correspondence: William L. Bigbee, Ph.D., Magee Womens Research Institute, Suite B411, 204 Craft Avenue, Pittsburgh, PA 15213. E-mail: bigbeewl/at/upmc.edu
ҰPresent address: Womens Health Integrated Research Center at Inova Health System, 3289 Woodburn Rd, Suite 375, Annandale, VA 22003.
Introduction
Lung cancer remains the leading cause of cancer-related death with poor survival due to the late stage at which lung cancer is typically diagnosed. Given the clinical burden from lung cancer, and the relatively favorable survival associated with early stage lung cancer, biomarkers for early detection of lung cancer are of important potential clinical benefit.
Methods
We performed a global lung cancer serum biomarker discovery study using liquid chromatography-tandem mass spectrometry (LC-MS/MS) in a set of pooled non-small cell lung cancer (NSCLC) case sera and matched controls. Immunoaffinity subtraction was used to deplete the top most abundant serum proteins; the remaining serum proteins were subjected to trypsin digestion and analyzed in triplicate by LC-MS/MS. The tandem mass spectrum data were searched against the human proteome database and the resultant spectral counting data were used to estimate the relative abundance of proteins across the case/control serum pools. The spectral counting derived abundances of some candidate biomarker proteins were confirmed with multiple reaction monitoring MS assays.
Results
A list of 49 differentially abundant candidate proteins was compiled by applying a negative binomial regression model to the spectral counting data (p<0.01). Functional analysis with Ingenuity Pathway Analysis tools showed significant enrichment of inflammatory response proteins, key molecules in cell-cell signaling and interaction network and differential physiological responses for the two common NSCLC subtypes.
Conclusions
We identified a set of candidate serum biomarkers with statistically significant differential abundance across the lung cancer case/control pools which, when validated, could improve lung cancer early detection.
Keywords: Lung cancer, serum biomarkers, LC-MS/MS
Lung cancer is the leading cause of cancer death in the United States. The majority of lung cancer patients are diagnosed at an advanced stage and have a 5-year survival rate of less than 15%. In contrast, early stage lung cancer has relatively favorable survival. A recent review reported that Stage 1A patients, with tumor size <10mm, experienced an 86% overall and 100% cancer specific 5-year survival with complete resection.1 Despite these statistics, lung cancer screening is currently not recommended because of the lack of good screening modalities.2 Both chest radiographs and sputum cytology have low sensitivity, and clinical trials failed to show a benefit for overall survival in the screened population. New computed tomography (CT) imaging technology is much more sensitive.3 However, CT scans suffer from low specificity due to frequent detection of benign pulmonary nodules.4,5 Although published studies have demonstrated elevated serum levels of some proteins, such as CYFRA 21-1, CEA, and TPA,6,7,8 none of them is sensitive and specific enough for clinical use.
Proteomics has advanced rapidly, fueled by increasingly robust technologies, along with revolutions in bioinformatics methods for analyzing the high-dimensionality data. With the coupling of advanced capillary-based LC-separation online with tandem mass spectrometry (MS/MS) analyses, proteomics biomarker discovery workflows now permit the identification and relative quantitation of thousands of proteins in complex biofluids such as serum. This enabling technology has been recently applied to the search for lung cancer serum biomarkers.9,10 A number of lung cancer altered serum proteins were reported from these studies. However both studies had relatively small sample size, ranging from 3 to 5 individual case or control subjects. In addition, both studies limited the comparison to lung cancer patients and healthy controls only, without inclusion of high risk subjects such as those with indeterminate CT detected nodules and/or compromised pulmonary function.
Aiming to identify serum based biomarkers that could predict lung cancer risk in high-risk subjects and particularly in patients with pulmonary nodules, we applied a label-free LC-MS/MS workflow to a large set of serum pools obtained from NSCLC patients, clinical controls with CT detected benign pulmonary nodules, and healthy controls. Relative quantitation of protein abundance was carried out using spectral counting, i.e. the number of MS/MS spectra resulting in the identification of each protein’s corresponding peptides. The rationale for spectral counting derived protein abundance is that proteins in higher abundance will result in more proteolytic peptides detected by tandem MS and subsequently identified by database searching.11,12 While intuitively attractive, spectral counting may be influenced by differential ionization and/or fragmentation efficiency of different peptides, under-representation of lower intensity peptides, and dynamic exclusion of high intensity peptides. Nevertheless it has been demonstrated that spectral counting linearly correlates with protein abundance over a dynamic range of 2 orders of magnitude11 and is capable of detecting relative changes in protein abundances, especially for proteins with higher spectral counts.12 Lung cancer selective proteins were identified using a negative binomial regression model,13 a relaxed Poisson regression model for count data over-dispersed in regard to the Poisson distribution. The spectral counting derived abundances of some candidate proteins were confirmed using quantitative mass spectrometry by multiple reaction monitoring (MRM).14,15
Blood Sample Collection, Processing, and Storage
Peripheral blood samples were obtained from NSCLC patients and control subjects recruited as part of UPCI Lung Nodule/Lung Cancer Proteomics/Genomics Research Registry, together with the Pittsburgh Lung Screening Study (PLuSS), supported by the UPCI Lung Cancer SPORE. The University of Pittsburgh Institutional Review Board (IRB) approved all aspects of the study. A total of 54 newly diagnosed NSCLC patients (31 adenocarcinoma and 23 squamous cell carcinoma), 54 clinical controls with a CT detected nodule but only with non-malignant lung disease as confirmed by biopsy, and 106 healthy PLuSS controls were selected for current study. The clinical and demographic characteristics of the participants are summarized in Table 1. The significances for the demographic differences between cases and controls were determined using Fisher’s exact test (age), or the Chi-square test (gender and smoking history). None of these demographic characteristics were significantly different between cases and controls, with p values of 0.058, 0.208, and 0.782 for age, gender, and smoking history, respectively. Blood samples were collected, processed, aliquoted, and stored using the same rigorously validated Lung Cancer SPORE protocol. Aliquoted serum samples were frozen at −80°C and were not thawed prior to the sample pooling procedures; the pooled samples were then refrozen at −80°C until use.
Table 1
Demographic and clinical characteristics of the subjects contributing to the pooled NSCLC case (N=54) and control (N=160) samples.
Lung Cancer Case/Control Discovery Serum Sample Pooling Design
Given the intensive preparative and analytical workflow of LC-MS/MS analysis, the individual serum samples were pooled into 9 adenocarcinoma pools (P01–P09), 6 squamous cell carcinoma pools (P11–P16), 8 clinical control pools (P17–P24), and 8 healthy control pools (P25–P32), according to histological subtypes, cancer stage, gender, and smoking status (active versus ex-smoker, with never smoker cases excluded). The overall goal of the pooling strategy was to construct case and matched control pools as homogenous as possible with respect to the most important clinical and demographic variables describing these samples (see Supplementary Table S1).
Immunoaffinity Removal of High Abundant Proteins
Agilent Human 14 Multiple Affinity Removal spin cartridges (MARS14, Agilent Technologies, Palo Alto, CA) were used to deplete the top 14 most abundant serum proteins (albumin, IgG, antitrypsin, IgA, transferrin, haptoglobin, fibrinogen, alpha2-macroglobulin, alpha1-acid glycoprotein, IgM, apolipoprotein AI, apolipoprotein AII, complement C3, and transthyretin). For each pool, 16 µl of serum was subjected to MARS14 depletion. The flow-through fractions were combined and desalted through buffer exchange with 50 mM NH4HCO3 using ultra-filtration with a 5 kDa molecular weight cutoff (Millipore, Bedford, MA). The Micro BCA™ protein assay kit (Thermo Scientific, IL) was used to determine the protein amount.
In Solution Trypsin Digestion
For each pool, 10 µg of desalted serum protein after MARS14 depletion was resuspended in 50 µl 100 mM NH4HCO3. After reduction with 10 minute boiling in the presence of 10 mM DTT and alkylation with 45 mM iodoacetamide (1 hour incubation in the dark at room temperature), 0.2 µg trypsin gold (Promega, Madison, WI) was added for overnight digestion at 37°C. The resulting tryptic peptides were desalted with PepClean™ C-18 Spin Columns (PIERCE, Rockford, IL), vacuum-dried and resuspended in 20 µl 0.1% TFA.
LC-MS/MS Analysis of Tryptic Peptides
Tryptic digests were analyzed in triplicate by reverse-phase LC-MS/MS using a nanoflow LC (Dionex Ultimate 3000, Dionex Corporation, Sunnyvale, CA) coupled online to a linear ion trap MS (LTQ-XL, ThermoFisher Scientific, San Jose, CA). Separations were performed using 75 µm inner diameter × 360 µm outer diameter × 15 cm long fused silica capillary columns (Polymicro Technologies, Phoenix, AZ) slurry packed in house with 5 µm, 300 Å pore size C-18 silica-bonded stationary phase (Jupiter, Phenomenex, Torrance, CA). Following injection of 2 µg of peptides onto a C-18 trap column (Dionex), the LC column was washed for 3 min with mobile phase A (2% acetonitrile, 0.1% formic acid) at a flow rate of 30 µl/min. Peptides were eluted using a linear gradient of 0.30% mobile phase B (0.1% formic acid in acetonitrile)/minute for 130 min, then to 95% B in an additional 10 min, all at a constant flow rate of 250 nL/min. Each full MS scan (m/z 375–1800) was followed by seven MS/MS scans (normalized collision energy of 35%) for the 7 most abundant ions. Dynamic exclusion was enabled to minimize redundant selection of peptides previously selected for MS/MS analysis.
Peptide Identification via Database Search for MS/MS
MS/MS spectra were searched against the UniProt human proteome database (10/08 release) from the European Bioinformatics Institute (http://www.ebi.ac.uk/integr8) using SEQUEST (ThermoFisher Scientific) with two variable modifications, +16 Da for methionine oxidation, and +57 for carboxyamidomethylation of cysteine. Peptides were considered legitimately identified if they achieved specific charge state and proteolytic cleavage-dependent cross-correlation (Xcorr) scores of 1.9 for [M+H]1+, 2.2 for [M+2H]2+, and 3.5 for [M+3H]3+, and a minimum delta correlation score (ΔCn) of 0.08. A false peptide discovery rate of approximately 4% was determined by searching the primary tandem MS data using the same criteria against a decoy database wherein the protein sequences are reversed.
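The score cutoffs and the decoy-based error estimate described above can be sketched in a few lines of Python. This is an illustration only, not the filtering software used in the study: the function names are hypothetical, the cutoffs are treated as inclusive minimums, charge states other than 1+, 2+ and 3+ are rejected, and the false discovery rate is taken as decoy hits divided by target hits, which is one common convention among several.

```python
# Charge-state-dependent Xcorr cutoffs and the delta-Cn cutoff quoted in the text.
XCORR_MIN = {1: 1.9, 2: 2.2, 3: 3.5}
DELTA_CN_MIN = 0.08


def passes_filter(charge, xcorr, delta_cn):
    """Return True if a peptide-spectrum match meets the stated score cutoffs."""
    cutoff = XCORR_MIN.get(charge)
    if cutoff is None:
        # Assumption: charge states not listed in the text are rejected.
        return False
    # Assumption: the cutoffs are inclusive minimums.
    return xcorr >= cutoff and delta_cn >= DELTA_CN_MIN


def decoy_fdr(n_target_hits, n_decoy_hits):
    """Estimate the peptide false discovery rate as decoy hits / target hits."""
    if n_target_hits == 0:
        return 0.0
    return n_decoy_hits / n_target_hits
```

Under this convention, 40 accepted matches in the reversed database against 1,000 accepted matches in the forward database would correspond to an estimated rate of 4%; these counts are invented for illustration.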
LC-MRM/MS/MS Assays for Selected Peptides
Nano LC-MRM/MS/MS experiments were performed on a TSQ Quantum Ultra (ThermoFisher Scientific) coupled with a nanoflow Dionex Ultimate 3000 liquid chromatography system. Solvent A (2% acetonitrile in 0.1% formic acid) and Solvent B (100% acetonitrile in 0.1% formic acid) were used as the mobile phase. Tryptic digests from 10 µg post-MARS14 depletion serum proteins were resuspended in 20 µl 0.1% TFA and analyzed in triplicate (6 µl for each injection). Nano LC separation was performed using a homemade Jupiter C18 fused silica capillary column (75 µm I.D. × 360 µm O.D. × 20 cm long) (Polymicro Technologies, Phoenix, AZ). The peptides were eluted at 350 nL/min with a gradient of 0–35% solvent B for 35 min, 35–95% solvent B for 3 min, and 95% solvent B for 10 min. The collision energies were calculated using the standard equation CE = 0.034 × m/z + 3.314. The FWHM (full width at half maximum) was set to 0.7 Da for Q1 and Q3. The dwell time at each transition was 10 ms and the width of the detection window for each transition was 1 m/z.
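The collision energy relation quoted above is a simple linear function of the precursor m/z, and a short Python sketch makes the arithmetic explicit. The function name and the example m/z values are hypothetical; the text does not give the unit of CE, so none is assumed here.

```python
def collision_energy(mz):
    """Collision energy from the relation quoted in the text: CE = 0.034 * m/z + 3.314."""
    return 0.034 * mz + 3.314


# Hypothetical precursor m/z values, for illustration only.
for mz in (500.0, 750.0, 1000.0):
    print(f"m/z {mz:7.1f} -> CE {collision_energy(mz):.3f}")
```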
Data Analysis
The spectral counting data were analyzed using a negative binomial regression model to identify proteins with differential abundance. To account for the variation in total spectral counting among different samples, the logarithm of total spectral count was included as the offset in the model as shown in Equation 1:
$$\log(\mu) = \beta_0 + \beta X + \log(T)$$
(Equation 1)
where μ is the expected spectral count of a protein in a sample, T is the total spectral count of that sample (entering the model as the offset), β0 is the intercept, and X is a dummy variable for the case/control status. The significance of features was based on the p value for the coefficient β. To increase the confidence for the legitimacy of spectral counting based relative quantitation, only proteins with a minimum of 2 identified peptides and more than 15 total spectral counts were included for statistical analysis. A SAS (version 9.2; SAS Institute Inc., Cary, NC) macro utilizing the GENMOD procedure was written to facilitate batch-processing of the described negative binomial regression. The false discovery rate was estimated based on the sequential p-value method proposed by Benjamini and Hochberg16 using a MATLAB® (MathWorks Inc., Natick, MA) script. In brief, the observed p values were ordered from minimum (p1) to maximum (pm), and the false discovery rate was estimated to be the α associated with the pk closest to the p value cutoff used for feature selection, where k = max {k: pk ≤ α × k/m}, and m is the total number of proteins.
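The false discovery rate estimate described above can be illustrated with a short Python sketch. It is one reading of the description, not the MATLAB script used in the study: the function name is hypothetical, and the estimate is taken as the Benjamini-Hochberg adjusted value at the largest p value that passes the cutoff, that is, the smallest α for which that ordered p value still satisfies pk ≤ α × k/m.

```python
def bh_fdr_at_cutoff(p_values, cutoff):
    """Estimate the Benjamini-Hochberg false discovery rate for a p value cutoff.

    Returns the smallest alpha at which every p value <= cutoff would be
    selected by the Benjamini-Hochberg step-up rule.
    """
    ordered = sorted(p_values)
    m = len(ordered)
    # k is the rank of the largest p value that passes the cutoff.
    k = sum(1 for p in ordered if p <= cutoff)
    if k == 0:
        return 0.0
    # m * p_j / j is the alpha needed to select rank j; taking the minimum
    # over ranks j >= k follows the step-up nature of the procedure.
    alpha = min(m * ordered[j - 1] / j for j in range(k, m + 1))
    return min(1.0, alpha)
```

As a rough check on scale, with m = 1127 proteins, k = 49 selected features, and a largest selected p value near 0.01, the ratio m × pk / k is about 0.23.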
Hierarchical clustering was carried out using MATLAB®. The values for spectral counts were standardized for each protein so that each had a mean of 0 and a standard deviation of 1. Both sample distance and protein feature distance were calculated using Pearson’s correlation and average linkage was used for the clustering of both samples and protein features. Bioinformatics analysis was carried out using Ingenuity Pathways Analysis (IPA; Ingenuity Systems Inc., Redwood City, CA) tools.17
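The two preprocessing steps named above, per-protein standardization and a Pearson correlation based distance, can be sketched in plain Python as follows. The function names are hypothetical, and two details are assumptions not stated in the text: the standard deviation uses the n − 1 denominator, and the distance is defined as one minus the correlation coefficient.

```python
import math


def standardize(values):
    """Scale one protein's spectral counts to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: sample standard deviation (n - 1 denominator).
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if sd == 0:
        raise ValueError("cannot standardize a constant profile")
    return [(v - mean) / sd for v in values]


def pearson_distance(x, y):
    """Distance between two profiles, defined here as 1 - Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1.0 - sxy / math.sqrt(sxx * syy)
```

With this definition, perfectly correlated profiles have distance 0 and perfectly anti-correlated profiles have distance 2; average linkage would then merge clusters according to the mean of these pairwise distances.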
LC-MS/MS Analysis of MARS14 Depleted Serum Proteins
A total of 31 different serum pools (9 from adenocarcinoma cases, 6 from squamous cell carcinoma cases, 8 from clinical controls, and 8 from healthy controls) were subjected to MARS14 depletion of highly abundant serum proteins and tryptic peptides of the post-depleted serum proteins were analyzed with LC-MS/MS. A total of 56,314 peptides belonging to 20,916 proteins were identified from the combined 93 LC-MS/MS runs (triplicate runs for each of the 31 pooled samples) after filtering with our criteria specified in the “Materials and Methods” section. Identification of 11,549 proteins (55%) was based on more than one peptide. The total number of spectral counts for each sample after combining triplicate runs varied from 6455 to 9238. There was no significant difference for the total number of spectral counts among the four different pooled sample groups (Kruskal-Wallis test p value of 0.88).
To evaluate the performance of the MARS14 spin cartridge in the removal of the top 14 proteins, we calculated the proportions of the spectral counts for these proteins over the total spectral counts of all identified proteins combining all 93 LC-MS/MS runs. These proportions were compared to the proportion of serum protein mass contributed by these proteins. The contributions to the total spectral counts (Figure 1B) by the majority of these proteins were much lower than their contributions for the total serum protein mass (Figure 1A). The most dramatic performance is for the depletion of serum albumin, which constitutes more than half of the total serum mass but only contributed to 0.3% of the total spectral counts in the depleted samples. However the performance of the MARS14 spin cartridge in the depletion of apolipoprotein A-I and apolipoprotein A-II was very poor; their contributions to the total spectral counts were essentially the same (apolipoprotein A-I) or higher (apolipoprotein A-II) in the depleted samples compared to their contributions to the total serum mass. Overall, we observed that the total proportion of spectral counts for all of these 14 proteins was about 7% in the depleted samples, much lower than the proportion of these proteins in the total serum mass (>95%). These results indicated that, although improvement is needed for some proteins, overall the MARS14 spin cartridge is efficient in the removal of high abundance serum proteins. A number of very low abundant proteins, including several interleukins, tumor necrosis factor, and troponin T were detected during our analyses, probably as a result of the removal of high abundance proteins by the MARS14 spin cartridges.
Figure 1
Effectiveness of MARS14 depletion of high abundance serum proteins. Pie charts show the proportions of serum protein mass (A), or spectral counts (B) contributed by high abundance proteins subjected to MARS14 depletion. The observed spectral counts of (more ...)
Comparative Data Analysis for Identification of Lung Cancer Selective Serum Proteins
In order to identify serum proteins with differential abundance in NSCLC case sera versus controls, we combined the total MS/MS spectra that resulted in legitimate identification of peptides for a given protein accession across all of the pooled samples (spectral counting) and used the resultant spectral counting for the relative quantitation of protein abundance. To increase the confidence of using spectral counting data for relative quantitation, we only included proteins identified based on a minimum of 2 different peptides and with >15 total spectral counts for the combined 93 LC-MS/MS analyses. A total of 1127 proteins remained after this selection. A negative binomial model was used to evaluate whether case/control status (as the independent variable) was a significant predictor for differential protein spectral counts (as the dependent variable). For each protein, 3 negative binomial model based comparisons were made between: 1) all case pools vs. all control pools; 2) all adenocarcinoma pools vs. all control pools; and 3) all squamous cell carcinoma pools vs. all control pools. The rationale for making these three comparisons is to identify not only features that show differential abundance in both subtypes of case pools, but also features that show differential abundance in only one histological subtype of NSCLC (adenocarcinoma or squamous cell carcinoma). Applying a p value cutoff of ≤0.01 for at least one of these comparisons yielded a total of 49 candidate proteins (Table 2), with 35 of them up and 14 down in observed abundance for the pooled lung cancer samples comparing the average of all cases to all controls. The false discovery rate was estimated to be 25% based on the sequential p-value method proposed by Benjamini and Hochberg (1995), using the minimum p values of the three comparisons as the input p values.16
Table 2
List of differentially expressed protein features among adenocarcinoma (ADC), squamous cell carcinoma (SCC), and control pools (p≤0.01).
We also tested the discriminatory power of these selected proteins using unsupervised hierarchical clustering. As shown in Figure 2, the spectral counts for these proteins resulted in near complete separation of the case pools from the control pools with only two exceptions, case pool P15 and clinical control pool P18. The selected features are also robust in separating adenocarcinoma pools from squamous cell carcinoma pools. Within the cluster of case pools (cluster #1 in Figure 2), all 9 adenocarcinoma pools are clustered together (cluster #1a) and 4 out of 5 squamous cell carcinoma pools within cluster #1 are clustered together (cluster #1b). Only one squamous cell carcinoma pool (P11) is clustered together with the adenocarcinoma pools.
Figure 2
Hierarchical clustering based on spectral counting of the selected significant proteins (p≤0.01). Both sample distance (column distance) and protein feature distance (row distance) were based on Pearson’s correlation and the linkage of (more ...)
Bioinformatics Analysis of Selected Features
Ingenuity Pathways Analysis (IPA) tools17 were used to identify the functional attributes of the 49 potential lung cancer selective proteins. The analysis showed significant enrichment of proteins related to a number of different biological functions such as neurological disease, cellular movement, organismal survival, embryonic development, renal and urological disease, cancer, and inflammatory responses, in the list of proteins with differential abundance in lung cancer (Table 3). A total of 19 molecules are associated with organismal survival, cell cycle, and cancer pathways, with ERK1/2, p38 MAP kinase, and NFκB as the key molecules in the network.
Table 3
Functional enrichment analysis for NSCLC selective proteins using the Ingenuity Pathway Analysis tools.
Adenocarcinoma and squamous cell carcinoma are two major subtypes of NSCLC. To see whether these two subtypes are associated with different physiological responses, we used IPA tools for a comparative functional enrichment analysis, comparing the features that were differentially abundant in the adenocarcinoma case pools with those that were differentially abundant in the squamous cell carcinoma case pools. The selection of features was based on a p value ≤0.05. A total of 79 and 68 significant features were selected for adenocarcinoma and squamous cell carcinoma, respectively, with 14 of them common to both subtypes. Results showed differential enrichment of proteins for several different functional categories for these two NSCLC subtypes (Table 4). In particular, a larger set of adenocarcinoma selective proteins are involved in inflammatory response or have been previously implicated in inflammatory related diseases such as infectious, respiratory, gastrointestinal, and dermatological diseases, suggesting that lung adenocarcinoma patients may experience a greater pro-inflammatory response. In contrast, more squamous cell carcinoma selective features are involved in molecular transport, small molecule biochemistry, endocrine system disorders, and vitamin and mineral metabolism. These results suggest that these two NSCLC subtypes are associated with different physiological responses.
Table 4
Comparative functional enrichment analysis for proteins with altered expression in different NSCLC subtypes (adenocarcinoma versus squamous cell carcinoma) using the Ingenuity Pathway Analysis tools.
Verification of Selected Features Using Quantitative Mass Spectrometry by Multiple Reaction Monitoring (MRM)
To verify the legitimacy of using spectral counting data for relative quantitation of protein abundance, we utilized MRM assays to confirm the spectral counting derived abundance of four selected proteins from our study: serum amyloid A (SAA) and alpha-1-acid glycoproteins 1 and 2 (AAG1 and AAG2), which were observed to have significantly higher spectral counts in the lung cancer versus control serum pools, and clusterin (CLU), which had equivalent spectral counts across the case and control pools. One unique surrogate peptide was selected for each of these proteins and 3 transition ions were monitored for each peptide (Table 5). The MRM assays were carried out in triplicate for 4 selected adenocarcinoma lung cancer serum pools and 5 selected PLuSS control pools, and the mass chromatogram peak areas from triplicate runs were averaged for each peptide/sample pair. Figure 3 shows the scatterplot distribution of the logarithm of average MRM peak areas (base 10) according to the case/control status of the pool. The mean spectral counts and MRM peak areas are listed in Table 5. Similar to what we observed based on spectral counting derived abundance, the MRM peak areas for SAA, AAG1 and AAG2 were much higher in lung cancer serum pools compared to control pools, whereas similar MRM peak areas were seen for the selected CLU peptide. Comparative analysis utilizing Student’s t test for MRM peak areas (after log transformation) indicated that the difference between the selected case and control pools remains statistically significant for SAA and AAG1 and of marginal significance for AAG2 (Table 5). Overall, these results indicate strong correlation between spectral counting derived abundance and MRM peak area derived abundance, supporting the validity of using spectral counting derived protein abundance to identify differentially abundant proteins.
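The comparison of log-transformed MRM peak areas can be sketched in Python as a two-sample t statistic. This is an illustration under stated assumptions rather than the analysis code used in the study: the function name is hypothetical, the equal-variance (pooled) form of Student's t test is assumed because the text does not specify the variant, and the p value lookup from the t distribution is left out to keep the sketch free of external libraries.

```python
import math


def pooled_t_statistic(case_areas, control_areas):
    """Student's t statistic (pooled variance) on log10-transformed peak areas.

    Returns the t statistic and its degrees of freedom.
    """
    x = [math.log10(a) for a in case_areas]
    y = [math.log10(a) for a in control_areas]
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    ssx = sum((v - mx) ** 2 for v in x)
    ssy = sum((v - my) ** 2 for v in y)
    df = nx + ny - 2
    # Assumption: both groups share a common variance.
    pooled_var = (ssx + ssy) / df
    se = math.sqrt(pooled_var * (1.0 / nx + 1.0 / ny))
    return (mx - my) / se, df
```

For the design described above (4 case pools and 5 control pools per peptide) the statistic would be referred to a t distribution with 7 degrees of freedom.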
Table 5
Comparative analysis of protein abundance measurements by spectral counting and MRM.
Figure 3
MRM analysis and quantitation of clusterin (CLU), serum amyloid A (SAA), alpha-1-acid glycoprotein 1 (AAG1), and alpha-1-acid glycoprotein 2 (AAG2). Spectral counting data indicated that the SAA, AAG1 and AAG2 were differentially abundant between the (more ...)
Toward our overall goal to identify potential serum based biomarkers to predict lung cancer in high risk subjects, including patients with CT detected pulmonary nodules, we applied bottom-up LC-MS/MS based proteomic analyses to a large set of pooled clinically ascertained NSCLC case sera and matched controls. Pooled sera were subjected to MARS14 depletion, followed by in solution trypsin digestion, and LC-MS/MS analysis. A large number of low abundant proteins were identified, suggesting relatively broad coverage of the serum proteome. By comparing the proportions of the 14 MARS14 depleted serum proteins in serum protein mass versus total spectral counts, we found differential effectiveness of MARS14 spin cartridges in depletion of different proteins, with depletion efficiency highest for serum albumin. The differential effectiveness probably reflects the quality and quantity of antibodies utilized in the MARS14 spin cartridges.
The relative quantitation of protein abundance in our study was achieved via spectral counting. To increase reliability of the spectral counting, we analyzed our samples in triplicate and only included proteins with ≥2 different peptides and with >15 total spectral counts. We recognize that this increase in reliability comes with a substantial sacrifice with a large number of proteins excluded from quantitation. Of the total of 20,916 proteins identified from the combined 93 LC-MS/MS runs, 11,549 proteins were identified with ≥2 different peptides; however only 1127 of them had >15 total spectral counts.
Although spectral counting has gained increased popularity for relative quantitation of protein abundance in label-free LC-MS/MS workflows, statistical tools for spectral counting based comparative analysis are still immature. Since spectral counting data are count data in nature, Poisson regression has been utilized by some researchers for comparative analysis of spectral counting based data.18,19 However, Poisson regression assumes that the dependent variable (herein spectral counting) follows the Poisson distribution and therefore that the variance is equal to the mean. We tested the performance of Poisson regression on our dataset and found that about 10% of the resultant Poisson regression models had deviance/degree of freedom (df) >2, indicating the presence of over-dispersion for some proteins (data not shown). Therefore, we decided to use a negative binomial regression model, a useful alternative for data that are over-dispersed relative to the Poisson model, which incorporates an additional parameter to adjust the variance independently of the mean.13 We applied negative binomial regression to the 1127 proteins with ≥2 different peptides and with >15 total spectral counts. A robust discovery set (N=49) of lung cancer selective proteins was identified. We also compared the negative binomial analysis results to those obtained using the non-parametric Wilcoxon rank sum test. The same feature selection criteria using the Wilcoxon rank sum test yielded a smaller set of significant features (N=40), with 28 of them also detected by negative binomial regression. A number of the selected proteins have been identified in the previously referenced proteomics studies,9,10 such as alpha-1-acid glycoprotein 1, N-acetylmuramoyl-L-alanine amidase, gelsolin, haptoglobin, ficolin-3, and beta-ala-his dipeptidase.
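The over-dispersion check mentioned above (deviance divided by degrees of freedom) can be sketched in Python for the simplest case of a Poisson model with one case/control indicator and the log of the total spectral count as offset. This is an illustrative sketch, not the SAS analysis used in the study, and the function name is hypothetical. It relies on the fact that, for a single binary covariate with an offset, the maximum likelihood fitted rate in each group is the group's summed counts divided by its summed totals, so no iterative fitting is needed.

```python
import math


def poisson_dispersion(counts, totals, is_case):
    """Deviance / df for the Poisson model log(mu) = b0 + b*X + log(total).

    counts  : spectral counts of one protein, one value per sample
    totals  : total spectral count of each sample (the offset)
    is_case : 0/1 case/control indicator per sample (both groups must be present)
    """
    # Closed-form maximum likelihood rate for each group.
    rate = {}
    for g in (0, 1):
        y_sum = sum(y for y, x in zip(counts, is_case) if x == g)
        t_sum = sum(t for t, x in zip(totals, is_case) if x == g)
        rate[g] = y_sum / t_sum
    deviance = 0.0
    for y, t, x in zip(counts, totals, is_case):
        mu = rate[x] * t
        # The y * log(y / mu) term is taken as 0 when y == 0.
        term = y * math.log(y / mu) if y > 0 else 0.0
        deviance += 2.0 * (term - (y - mu))
    df = len(counts) - 2  # two fitted parameters: intercept and case effect
    return deviance / df
```

Values well above 1 indicate more variability than the Poisson model allows; the text uses a ratio above 2 as its working threshold for flagging over-dispersion.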
Consistent with the important role of inflammation in tumor development,20 functional analysis by IPA showed a significant enrichment of proteins associated with inflammatory responses or acute phase reaction.
The robustness of the spectral counting derived abundance for these selected proteins in discriminating case pools and control pools was tested using unsupervised hierarchical clustering, a statistical method for grouping subjects based on similarity of measured characteristics. The resulting analysis showed almost complete separation between the NSCLC case pools and control pools with the exception of only one case pool (P15) and one clinical control pool (P18). However it is worth mentioning that pool P15, although being clustered together with the control pools, also shared substantial similarity in terms of protein abundance with the sub-cluster (cluster #1b) consisting of 4 other squamous cell carcinoma pools. Also worth noting is the fact that many of the clinical control subjects, although cancer free at the time of serum sample collection, had non-malignant lung disease as demonstrated by the presence of CT detected pulmonary nodules. The mixed-clustering of the clinical control pool P18 with the NSCLC case pools may therefore be the result of the contributions of inflammatory proteins included in our list of significant features. Nonetheless, our analysis was able to separate the majority of clinical control pools (7 of the 8 pools) from NSCLC case pools, suggesting that our selected features may potentially be useful in increasing the specificity of thoracic CT scans for lung cancer screening, thus reducing the need of invasive thoracic procedures for patients with CT detected indeterminate nodules.
Adenocarcinoma and squamous cell carcinoma are the two most common NSCLC subtypes. They have been shown to differ significantly in both clinical behavior and molecular signatures.21,22,23 Interestingly, we found differential enrichment of several functional categories for these two subtypes when we used comparative functional analysis with IPA tools to compare the significant features selected for adenocarcinoma case pools versus all control pools with those selected for squamous cell carcinoma case pools versus all control pools. Lung adenocarcinoma patients appear to experience a greater pro-inflammatory response based on the observed alterations in the serum proteome. More inflammatory response-related proteins are differentially abundant in adenocarcinoma case pools than in squamous cell carcinoma case pools. Significantly higher numbers of adenocarcinoma selective proteins have been previously associated with several inflammation-related conditions such as infectious, respiratory, gastrointestinal, and dermatological diseases. Also, among the inflammation-related proteins differentially expressed in both of these NSCLC subtypes, such as AAG1, AAG2 and SAA, the magnitude of change appears to be larger for the adenocarcinoma case pools (Table 2). In contrast, squamous cell lung carcinoma patients seem to experience a greater response in plasma lipid physiology, with more differentially abundant proteins involved in molecular transport and small molecule biochemistry. Both of these functional categories involve proteins responsible for transport, secretion, clearance, and/or metabolism of lipids. Squamous cell carcinoma patients also seem to have greater deregulation of their endocrine system, with a greater number of differentially abundant proteins associated with endocrine system disorders.
We also observed a greater magnitude of change in squamous cell lung carcinoma for endocrine system-related features shared by both subtypes, such as neuregulin-2, apolipoprotein B-100, and beta-ala-his dipeptidase (Table 2). Interestingly, the majority of these proteins (18 out of 19) have been previously related to diabetes mellitus. Smoking has been shown to contribute to endocrine disorders including the development of insulin resistance and hence type 2 diabetes mellitus.24 Since squamous cell carcinoma is the NSCLC subtype more frequently diagnosed in heavy smokers, it is not clear whether our observation reflects heavier smoking among the squamous cell carcinoma cases in our study.
In summary, we have presented data from an initial comparative proteomics study for lung cancer serum biomarker discovery and generated a robust discovery set of candidate proteins with altered abundance in lung cancer. Functional analyses support the important roles of inflammation in cancer and also reveal differential physiological responses related to different NSCLC subtypes. The clinical utility of these candidate lung cancer serum biomarker proteins needs to be validated with additional analytical platforms as well as in independent case/control sample sets.
Supplementary Material
Acknowledgment
Disclosure of funding: This research was supported by a Hirtzel Foundation Postdoctoral Fellowship to XZ, P50 CA90440 NCI SPORE in Lung Cancer to JMS, and NCI Early Detection Research Network Biomarker Discovery Laboratory, U01 CA084968 to WLB.
1. Okada M, Nishio W, Sakamoto T, et al. Effect of tumor size on prognosis in patients with non-small cell lung cancer: the role of segmentectomy as a type of lesser resection. J Thorac Cardiovasc Surg. 2005;129:87–93.
2. Smith RA, Cokkinides V, Brooks D, et al. Cancer screening in the United States, 2010: a review of current American Cancer Society guidelines and issues in cancer screening. CA Cancer J Clin. 2010;60:99–119.
3. Kaneko M, Eguchi K, Ohmatsu H, et al. Peripheral lung cancer: screening and detection with low-dose spiral CT versus radiography. Radiology. 1996;201:798–802.
4. Welch HG, Woloshin S, Schwartz LM, et al. Overstating the evidence for lung cancer screening: the International Early Lung Cancer Action Program (I-ELCAP) study. Arch Intern Med. 2007;167:2289–2295.
5. Wilson DO, Weissfeld JL, Fuhrman CR, et al. The Pittsburgh Lung Screening Study (PLuSS): outcomes within 3 years of a first computed tomography scan. Am J Respir Crit Care Med. 2008;178:956–961.
6. Buccheri G, Torchio P, Ferrigno D. Clinical equivalence of two cytokeratin markers in non-small cell lung cancer: a study of tissue polypeptide antigen and cytokeratin 19 fragments. Chest. 2003;124:622–632.
7. Pastor A, Menéndez R, Cremades MJ, et al. Diagnostic value of SCC, CEA and CYFRA 21.1 in lung cancer: a Bayesian analysis. Eur Respir J. 1997;10:603–609.
8. Rapellino M, Niklinski J, Pecchio F, et al. CYFRA 21-1 as a tumour marker for bronchogenic carcinoma. Eur Respir J. 1995;8:407–410.
9. Okano T, Kondo T, Kakisaka T. Plasma proteomics of lung cancer by a linkage of multi-dimensional liquid chromatography and two-dimensional difference gel electrophoresis. Proteomics. 2006;6:3938–3948.
10. Heo SH, Lee SJ, Ryoo HM, et al. Identification of putative serum glycoprotein biomarkers for human lung adenocarcinoma by multilectin affinity chromatography and LC-MS/MS. Proteomics. 2007;7:4292–4302.
11. Liu H, Sadygov RG, Yates JR 3rd. A model for random sampling and estimation of relative protein abundance in shotgun proteomics. Anal Chem. 2004;76:4193–4201.
12. Old WM, Meyer-Arendt K, Aveline-Wolf L, et al. Comparison of label-free methods for quantifying human proteins by shotgun proteomics. Mol Cell Proteomics. 2005;4:1487–1502.
13. Gardner W, Mulvey EP, Shaw EC. Regression analyses of counts and rates: Poisson, overdispersed Poisson, and negative binomial models. Psychol Bull. 1995;118:392–404.
14. Anderson L, Hunter CL. Quantitative mass spectrometric multiple reaction monitoring assays for major plasma proteins. Mol Cell Proteomics. 2006;5:573–588.
15. Kuzyk MA, Smith D, Yang J, et al. Multiple reaction monitoring-based, multiplexed, absolute quantitation of 45 proteins in human plasma. Mol Cell Proteomics. 2009;8:1860–1877.
16. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Statist Soc B. 1995;57:289–300.
17. The networks and functional analyses were generated through the use of Ingenuity Pathways Analysis (Ingenuity® Systems, www.ingenuity.com).
18. Chourey K, Thompson MR, Shah M, et al. Comparative temporal proteomics of a response regulator (SO2426)-deficient strain and wild-type Shewanella oneidensis MR-1 during chromate transformation. J Proteome Res. 2009;8:59–71.
19. Sprung RW, Jr, Brock JW, Tanksley JP, et al. Equivalence of protein inventories obtained from formalin-fixed paraffin-embedded and frozen tissue in multidimensional liquid chromatography-tandem mass spectrometry shotgun proteomic analysis. Mol Cell Proteomics. 2009;8:1988–1998.
20. Grivennikov SI, Greten FR, Karin M. Immunity, inflammation, and cancer. Cell. 2010;140:883–899.
21. Ginsberg MS, Grewal RK, Heelan RT. Lung cancer. Radiol Clin North Am. 2007;45:21–43.
22. Hou J, Aerts J, den Hamer B, et al. Gene expression-based classification of non-small cell lung carcinomas and survival prediction. PLoS One. 2010;5:e10312.
23. Landi MT, Zhao Y, Rotunno M, et al. MicroRNA expression differentiates histology and predicts survival of lung cancer. Clin Cancer Res. 2010;16:430–441.
24. Kapoor D, Jones TH. Smoking and hormones in health and endocrine disorders. Eur J Endocrinol. 2005;152:491–499.