We used a lectin chromatography/MS-based approach to screen conditioned medium from a panel of luminal (less aggressive) and triple negative (more aggressive) breast cancer cell lines (n = 5/subtype). The samples were fractionated using the lectins Aleuria aurantia (AAL) and Sambucus nigra agglutinin (SNA), which recognize fucose and sialic acid, respectively. The bound fractions were enzymatically N-deglycosylated and analyzed by LC-MS/MS. In total, we identified 533 glycoproteins, ~90% of which were components of the cell surface or extracellular matrix. We observed 1011 glycosites, 100 of which were solely detected in ≥3 triple negative lines. Statistical analyses suggested that a number of these glycosites were triple negative-specific and thus potential biomarkers for this tumor subtype. An analysis of RNAseq data revealed that approximately half of the mRNAs encoding the protein scaffolds that carried potential biomarker glycosites were upregulated in triple negative vs. luminal cell lines, and that a number of genes encoding fucosyl- or sialyltransferases were differentially expressed between the two subtypes, suggesting that alterations in glycosylation may also drive candidate identification. Notably, the glycoproteins from which these putative biomarker candidates were derived are involved in cancer-related processes. Thus, they may represent novel therapeutic targets for this aggressive tumor subtype.
The intense interest in biomarker discovery is a reflection of the clinical need for tests with a high degree of sensitivity and specificity for diagnosing diseases, predicting their courses, as well as monitoring responses to therapy and disease recurrence. Technological breakthroughs in separation strategies and mass spectrometry (MS) have enabled rapid identification and quantification of large numbers of proteins in biological samples 1. Nonetheless, their complexity requires extensive fractionation to access low abundance proteins, such as those released from nascent tumors. Alternatively, and technically less challenging, is the design of capture approaches that exploit disease biology for the purpose of biomarker identification 2. For many reasons, glycosylation is an attractive target. First, the biology allows for the rational design of discovery efforts. For example, changes in the glycosylation machinery can be identified from microarray data and translated in structural terms, providing a compelling rationale for designing lectin-based strategies to enrich glycopeptides carrying disease-related carbohydrate motifs. Second, one protein can carry many copies of an altered glycan, which may also be added to other scaffolds. Thus, there is an important amplification effect, which could enable the detection of many fewer abnormal cells than would otherwise be possible. Finally, glycosylation acts to shield the peptide backbone from proteolytic degradation 3. Thus, in theory, glycan-based biomarkers are likely to be more stable in a variety of disease settings than unmodified proteins, which are often more labile.
Glycosylation is altered in a number of pathologies, but its relationship to cancer is particularly well-defined at phenotypic and, to a lesser degree, functional levels. For example, many of the most widely used clinical tests detect glycoproteins and carbohydrate structures. These include carcinoembryonic antigen (CEA), commonly used as a marker of colorectal cancer; CA-125, frequently employed to diagnose ovarian cancer; CA 19-9, the most commonly used biomarker for diagnosing pancreatic cancer; CA 15-3, used to monitor the metastasis of breast cancer 4; and prostate-specific antigen (PSA) 5–8. In addition, glycan-specific antibodies and lectins are used for the cytological and histological evaluation of glycosylation for the purpose of guiding diagnoses and enabling more accurate prognoses, e.g., anti-Lewis (Le)x antibodies for bladder cancer, and the lectins Helix pomatia agglutinin (HPA) and Ulex europaeus I agglutinin (UEA 1) for breast cancer 1. This is due to the fact that increases in fucosylation and sialylation of N-linked structures and truncation of O-linked oligosaccharides occur in many tumor types. The expression of Le antigens, such as sialyl Lex, can also be indicative of disease progression, as these structures play important roles in promoting metastasis by virtue of their well-known ability to mediate cell trafficking and extravasation 9, 10.
Breast cancer is now recognized to be a collection of distinct neoplastic diseases with different molecular and clinical attributes. Breast tumors can be stratified into five intrinsic subtypes and a “normal-like” group according to features such as mRNA expression 11. Interestingly, these molecularly-defined cohorts, which include luminal, basal-like, and claudin-low, are also predictive of clinical outcomes such as disease severity and treatment response 12–14. Specifically, luminal tumors tend to be less aggressive with better survival rates, while basal-like and claudin-low lesions have generally worse prognoses 15. Additionally, the expression of a therapeutic target such as the estrogen receptor (ER) or human epidermal growth factor receptor 2 (HER2/ErbB2) determines tumor susceptibility to drugs that interact with these molecules 16, 17. Triple negative breast cancers (TNBC) express neither ER nor the progesterone receptor (PR) and moderate levels of HER2. This clinically important, heterogeneous category includes most basal-like and claudin-low tumors 18, 19. TNBCs have poor survival rates and lack specific therapeutic targets, limiting treatment options and making early detection a priority.
We hypothesized that biomarkers specific for these tumors could be identified by a comparative analysis of the repertoire of secreted or shed glycoproteins in a panel of breast cancer cell lines that have been extensively characterized at genomic and transcriptional levels 20–22. Based on gene expression, the lines can be clustered into subsets that mirror the molecular characteristics of primary breast tumors. Thus, these panels are useful tools for studying subtype-specific behavior, such as drug responses and alternative splicing 20, 23. Here, we used a subset of cells from this collection for biomarker discovery. Specifically, we analyzed conditioned medium (CM) from 5 luminal and 5 triple negative cell lines. The samples were distributed to three laboratories: University of California San Francisco (UCSF), the Buck Institute for Research on Aging, and Purdue University. Each group analyzed the samples using our recently published method for lectin affinity chromatographic enrichment and LC-MS/MS analyses 24. Overall, we identified 533 glycoproteins, including 1011 N-linked glycosylation sites (glycosites). Of these, 100 were solely detected in ≥3 triple negative lines. Interestingly, many in the latter category were from glycoproteins that are upregulated in the claudin-low subtype 21 and involved in cancer progression (e.g., epithelial to mesenchymal transition) and/or metastasis 25.
All cells were cultured as described in Neve et al. 21. To generate the CM, we cultured 10 breast cancer cell lines (Table 1) that were derived from 5 luminal (SKBR3, SUM52 PE, MDAMB175, UACC 812, and MDAMB361) and 5 triple negative tumors (MDA468, BT549, HS578T, MDAMB231, and HCC38). CM was prepared and trypsin digested at Site M. The lines were grown to 75–80% confluence in the appropriate culture medium 21. Then they were washed with fresh medium without fetal calf serum (FCS) or phenol red and incubated for 10 min at 37 °C. This process was repeated twice before the cells were incubated in fresh medium (without FCS and phenol red) for 18–20 h. At the end of the culture period, the cells retained their original morphologies with no evidence of apoptosis. The CM was harvested and centrifuged at 2000 × g for 10 min. The supernatant was concentrated using Millipore centrifugal filter units (MWCO 3K) and dialyzed against phosphate buffered saline (PBS).
Biotinylated and fluoresceinated lectins were purchased from Vector Laboratories. Blotting: Cell lysates were separated by SDS-PAGE (4–12% gels) and transferred to nitrocellulose membranes. Unless otherwise indicated, the following buffer was used for all steps, including blocking, washing, and reagent dilution/incubation: 0.25 M Tris-Cl, pH 8.0, 0.5 M NaCl, 0.5% NP-40. Blots were incubated in buffer for 1 h to block non-specific binding, then exposed to a solution of ~5 μg/mL of biotinylated lectin for 2 h. Blots were washed 3 × 5 min with copious amounts of buffer. Then, membranes were reacted with ABC reagent (Vector Laboratories) for 1 h and washed again as before. Finally, bound lectin was detected using 3,3′-diaminobenzidine (DAB, Vector Laboratories) prepared in water according to the manufacturer’s instructions. Staining: Cell surface labeling of non-permeabilized cells was performed as described 26, except that fluoresceinated lectins, rather than antibodies, were used.
First, protein concentrations of the CM samples were determined by amino acid analysis. Then, CM samples were digested and desalted using a published method that incorporates denaturation with 6 M urea 27. As previously described 24, samples were spiked with 25 and 50 pmol of trypsin-digested control glycopeptides from commercial yeast invertase and human lactoferrin (Sigma, St. Louis, MO), respectively. Peptides were stored at −80 °C prior to analyses.
The columns were prepared at Site M from a single batch of lectin-conjugated beads and distributed to all the laboratories. Briefly, Sambucus nigra agglutinin (SNA) and Aleuria aurantia lectin (AAL) were purchased from Vector Laboratories (Burlingame, CA). Lectins (20 mg) were suspended at 5–10 mg/mL in PBS and conjugated to 330 mg of POROS-AL beads (Applied Biosystems, Foster City, CA) as previously described 24. Unconjugated protein was removed by washing the beads (5 × 5 mL of 1 M sodium chloride) before they were packed into 3 individual 4.6 × 50 mm PEEK HPLC columns. Routine storage was in PBS with 0.02% sodium azide at 4 °C for up to 6 months. Columns were reused for up to 75 affinity separations without degradation of the performance characteristics as assessed by glycopeptide enrichment and total number of glycopeptides recovered from digested human plasma.
The HPLC systems employed were standardized in terms of injection volume, transfer line lengths, dead volume minimization, and common UV elution profiles. Site M used a Paradigm MG4 HPLC system equipped with a CTC PAL robot configured as an autosampler and fraction collector (Michrom Bioresources). At Site X, a Waters system including 1525 Binary HPLC equipped with a 717 plus Autosampler and a Fraction Collector III was employed. Site S used a Shimadzu 20AD HPLC system equipped with a SIL-20AC autosampler; fractions were collected manually. Mobile phases: Buffer A was 25 mM Tris buffer, pH 7.4, 50 mM sodium chloride, 10 mM calcium chloride, and 10 mM magnesium chloride; Buffer B was 0.5 M acetic acid. Affinity separation: Routinely, ~100 μg of digested protein was diluted into Buffer A, applied to the lectin column, and separated using the following 3 step gradient: 1) Sample load: Buffer A for 9.0 min at 80 μL/min; 2) Sample elution: Buffer B for 4.8 min at 500 μL/min; and 3) Re-equilibration: Buffer A for 6.0 min at 3000 μL/min. The bound fraction, collected from 9.0 to 14.25 min, was desalted using Oasis HLB cartridges as described above. Eluted samples were neutralized by the addition of 0.5 M ammonium bicarbonate and concentrated to <100 μL by vacuum centrifugation. Further details are described in the accompanying SOP (Supplementary Document 1).
N-linked glycopeptides in the bound fractions were deglycosylated by treatment with PNGase F (Glycerol-free, New England Biolabs; Ipswich, MA) as previously described 24. Following deglycosylation, samples were desalted and concentrated using C18 ZipTips® (Millipore; Billerica, MA) or MicroSpin Columns, 5–200 μL (The Nest Group, Inc.; Southborough, MA).
The peptides were separated using an Eksigent nano-LC 2D HPLC system (Eksigent, Dublin, CA), which was directly connected to a quadrupole time-of-flight (QqTOF) QSTAR Elite mass spectrometer (AB Sciex, Foster City, CA). We injected 33% (vol/vol) of the bound material per run. Briefly, peptides were applied to a guard column (C18 Acclaim PepMap100, 300 μm I.D. × 5 mm, 5 μm particle size, 100 Å pore size; Dionex, Sunnyvale, CA) and washed with the aqueous loading solvent (2% solvent B in A, flow rate: 20 μL/min) for 10 min prior to separation on a C18 Acclaim PepMap100 column (75 μm I.D. × 15 cm, 3 μm particle size, 100 Å pore size; Dionex, Sunnyvale, CA). Bound material was eluted at a flow rate of 300 nL/min using the following gradients: 2–40% solvent B in A (from 0–60 min), 40–90% solvent B in A (from 60–75 min), and at 90% solvent B in A (from 75–85 min), with a total runtime of 120 min (including column equilibration). Solvent A consisted of 0.1% formic acid in 98% H2O/2% acetonitrile and solvent B was 0.1% formic acid in 98% acetonitrile/2% H2O. Spectra were calibrated using MS/MS fragment-ions of a Glu-Fibrinogen B peptide standard. Advanced information dependent acquisition was employed for MS/MS data collection using QSTAR Elite (Analyst QS 2.0) specific features, including “Smart Collision” (fragment intensity multiplier set to 2.0) and “Smart Exit” (maximum accumulation time of 2.5 sec) to obtain MS/MS spectra for the six most abundant precursor ions following each survey scan. To increase overall sampling efficiencies, two replicate nano-HPLC-MS/MS analyses were performed per sample.
The peptide mixtures were separated as described above using an Agilent nanoflow 1100 HPLC system (Agilent, Santa Clara, CA) connected to a hybrid linear ion trap Orbitrap mass spectrometer (LTQ Orbitrap XL, Thermo Fisher Scientific). The electrospray ionization emitter tip (Pico-tip emitter, F360-75-15-N-5-C10.5) was purchased from New Objective (Woburn, MA). The mass spectrometer, which was calibrated with a solution of caffeine, MRFA and Ultramark 1621 according to the manufacturer’s instructions, was operated in the data-dependent mode. Full MS scans from m/z 350 to 1600 with a full width at half maximum resolution of 30,000 were acquired as profile data, followed by MS/MS scans of the six most abundant ions in the linear trap. Singly charged ions were excluded. A dynamic mass exclusion time was applied for 120s with a repeat count of 1 and a repeat duration time of 30s. In all scan modes, one micro scan was applied.
Mass spectrometric data from all laboratories were analyzed at Site M using two bioinformatics database search engines with integrated peak picking, ProteinPilot™ (AB Sciex) version 4.0.8085 (revision 148085) using the Paragon Algorithm 22.214.171.124, 148083 28, and Mascot version 2.2.04 using Mascot Daemon version 2.2.2 (both Matrix Science). For the latter, the following (default) data import filter options were used: precursor charge state +2 to +4; reject spectrum if < 7 peaks or if precursor is < 400 or > 10000 m/z; remove peaks with intensity < 0.001% of the highest peak; centroid all MS/MS data, percentage height 50, and merge distance 0.1 atomic mass units. Peak lists for the Orbitrap LC-MS/MS data sets were generated using Mascot Distiller 126.96.36.199 (Matrix Science) with the supplied processing parameter file Orbitrap_low_res_MS2_4.opt. The Orbitrap peak lists were saved in MGF format with Distiller preferences set to save MS/MS peaks as MH+ for input into the Mascot and ProteinPilot search engines. All data were searched using a merged database of 20,293 protein sequences, which comprised all 20,286 reviewed human entries of the publicly available SwissProt/UniProt release 2010_09 plus 7 other proteins: PNGase F (Q9XBM8|Q9XBM8_FLAME, P21163|PNGF_ELIMR) and yeast invertase (P10594|INV1_YEAST, P00724|INV2_YEAST, P10595|INV3_YEAST, P10596|INV4_YEAST, P10597|INV5_YEAST). ProteinPilot searches were performed as previously described 24. A ProteinPilot peptide confidence cut-off value of 98.8 was chosen, corresponding to a local FDR of 5%. For Mascot searches, the following parameters were used: trypsin enzyme specificity with up to two missed tryptic cleavages, carbamidomethyl (Cys) as a fixed modification, and the following variable modifications: deamidation of asparagine and glutamine residues, oxidation of methionines, acetylation at the protein N-terminus, and cyclization of N-terminal glutamines.
For QSTAR Elite data a mass tolerance of 100 ppm and 0.4 Da was set for the precursor and product ions, respectively; whereas values of 10 ppm and 0.8 Da were applied to Orbitrap data. Peptide-spectral matches with expectation values <0.026 were accepted. FDR analysis was performed using the Mascot automatic decoy search. In all cases, the peptide false-positive identification rate was <3%.
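To make the precursor tolerances above concrete, the following minimal sketch converts a parts-per-million window into daltons and checks whether an observed m/z falls inside it. The function names `ppm_to_da` and `within_ppm` are hypothetical helpers for illustration only; they are not part of Mascot or ProteinPilot, and the search engines apply their own internal logic.

```python
def ppm_to_da(mz, tol_ppm):
    """Width (in m/z units) of a ppm tolerance at a given m/z."""
    return mz * tol_ppm / 1e6


def within_ppm(observed_mz, theoretical_mz, tol_ppm):
    """True if the observed m/z lies within +/- tol_ppm of the theoretical value."""
    error_ppm = abs(observed_mz - theoretical_mz) / theoretical_mz * 1e6
    return error_ppm <= tol_ppm


# Illustrative values only: a 50 ppm error at m/z 1000
observed, theoretical = 1000.05, 1000.0
passes_qstar = within_ppm(observed, theoretical, 100)    # 100 ppm window (QSTAR Elite)
passes_orbitrap = within_ppm(observed, theoretical, 10)  # 10 ppm window (Orbitrap)
```

At m/z 1000, a 100 ppm window corresponds to about 0.1 m/z units and a 10 ppm window to about 0.01, which is why the same precursor can pass one setting and fail the other.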
Deglycosylated peptides were identified as previously described 24, on the basis of several criteria including the motif NxS/T, x ≠ proline, in which Asn was converted to Asp (reported by the search engine as Asn deamidation), and the presence of at least one fragment ion encompassing the glycosite. To ensure inclusion of glycosites containing Lys and/or Arg in the X position (e.g., NKT), which were likely to have been cleaved by trypsin, the amino-acid residue following the carboxy-terminal cleavage site was also considered. Peptides containing the motif NGS or NGT were excluded due to the fact that asparagine residues in that sequence are prone to chemical deamidation during overnight trypsin digestion 29. For all deglycosylated peptides the corresponding MS/MS spectra were manually examined using an adaptation of previously published criteria to ensure correct assignment 24, 30.
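The sequence-motif part of these criteria can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pipeline used in the study: the function name `find_glycosites` and its `next_residues` argument are hypothetical, the input is the unmodified database sequence of the peptide, and the deamidation and fragment-ion checks described above are not modeled.

```python
import re

# Asn followed by any residue except Pro, then Ser or Thr.
# The lookahead consumes only the Asn, so overlapping sequons are all found.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_glycosites(peptide, next_residues=""):
    """Return 0-based positions of Asn residues in N-x-S/T sequons (x != Pro).

    next_residues holds the residue(s) that follow the peptide's C-terminus in
    the parent protein, so a sequon split by tryptic cleavage (e.g. NK|T) is
    still recognized. NGS and NGT are skipped because Asn in those sequences
    is prone to chemical deamidation.
    """
    context = peptide + next_residues
    sites = []
    for match in SEQUON.finditer(context):
        pos = match.start()
        if pos >= len(peptide):
            continue  # Asn lies outside the observed peptide
        if context[pos + 1] == "G":
            continue  # exclude NGS / NGT
        sites.append(pos)
    return sites


# Toy peptides (not from the study)
print(find_glycosites("AANKSLR"))    # sequon fully inside the peptide
print(find_glycosites("LVNK", "T"))  # sequon completed by the next residue
print(find_glycosites("VNGTK"))      # NGT is excluded
```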
The selection criteria for triple negative-specific glycosites were subjected to a resampling, non-parametric statistical test in which no knowledge about the data’s distribution is necessary, e.g., the “bootstrap” technique 31. The basic premise of this approach is to consider the null hypothesis that there is statistically no difference between the luminal and triple negative data sets, i.e., that the two are random selections from the same population. To determine the expected FDRs, we applied 20,000 random permutations to selection criteria of the form:
Criterion n-m: A glycosite satisfies criterion n-m if it is identified in ≤ n Luminal cell lines and in ≥ m TN cell lines.
The results are shown in Supplementary Table 3.
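One way to read this resampling test is sketched below. The sketch assumes that each permutation shuffles the subtype labels across the ten cell lines and recounts the glycosites that satisfy criterion n-m; the text does not spell out the exact procedure, so this is an interpretation rather than the authors' code. The function names and the toy glycosite identifiers are hypothetical.

```python
import random


def count_hits(detections, luminal, tn, n, m):
    """Count glycosites seen in <= n luminal lines and >= m TN lines.

    detections maps a glycosite identifier to the set of cell lines in which
    it was identified.
    """
    lum, tneg = set(luminal), set(tn)
    return sum(
        1
        for lines in detections.values()
        if len(lines & lum) <= n and len(lines & tneg) >= m
    )


def expected_by_chance(detections, luminal, tn, n, m, n_perm=20000, seed=0):
    """Mean number of glycosites meeting criterion n-m when labels are shuffled."""
    rng = random.Random(seed)
    lines = list(luminal) + list(tn)
    total = 0
    for _ in range(n_perm):
        rng.shuffle(lines)  # random reassignment of subtype labels
        total += count_hits(
            detections, lines[: len(luminal)], lines[len(luminal):], n, m
        )
    return total / n_perm


luminal = ["SKBR3", "SUM52PE", "MDAMB175", "UACC812", "MDAMB361"]
tn = ["MDA468", "BT549", "HS578T", "MDAMB231", "HCC38"]
# Toy detections for illustration only (not data from the study)
toy = {
    "site_a": {"BT549", "HS578T", "MDAMB231"},
    "site_b": {"SKBR3", "BT549"},
    "site_c": set(luminal + tn),
}
observed = count_hits(toy, luminal, tn, n=0, m=3)
chance = expected_by_chance(toy, luminal, tn, n=0, m=3)
```

Comparing `chance` with `observed` for each candidate criterion gives the kind of expected-versus-observed table that the text attributes to Supplementary Table 3.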
An interactive Skyline spectral library file that contains all MS/MS spectra of deglycosylated peptides identified in this study has been submitted as Supplemental Material. Skyline is an open source program 32 available for free download at http://proteome.gs.washington.edu/software/skyline.
Whole transcriptome shotgun sequencing (RNAseq) was completed on nine of the ten breast cancer cell lines (BT549, HCC38, HS578T, MDAMB231, MDAMB175VII, MDAMB361, SKBR3, SUM52PE and UACC812). Expression analysis was performed with the ALEXA-seq software package as previously described 33. On a per sample basis, an average of 58.7 million (76 bp paired-end) reads passed quality control, and 37.6 million mapped to the transcriptome, which resulted in coverage of 40x across all known genes. Log2 transformed estimates of gene-level expression were extracted for fucosyl- and sialyltransferase genes, and for the triple negative candidate biomarker targets that emerged from the N-glycosite workflow. Corresponding values indicating whether expression of a transcript was detected above background were also extracted. A 2-sided Student’s t-test was used to compare log2 transformed gene expression levels between the five luminal and the four triple negative cell lines. This comparison generated raw p-values, which were then adjusted for multiple comparisons using the Benjamini-Hochberg method for controlling FDRs 34. The adjustment was achieved with the p.adjust(pvals, "fdr") function in R version 2.12.1 (2010-12-16). Adjusted p-values lower than 10% (0.1) were considered significant.
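For readers unfamiliar with the Benjamini-Hochberg adjustment, the following is a minimal pure-Python sketch of the calculation that R's p.adjust performs for the "fdr" method (an alias for "BH"). The function name `bh_adjust` and the example p-values are illustrative assumptions; the study itself used R, and the t-test that produces the raw p-values is not reproduced here.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvals)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity and a cap of 1
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Illustrative raw p-values (not from the study)
raw = [0.01, 0.04, 0.03, 0.005]
adj = bh_adjust(raw)
# Genes whose adjusted p-value falls below the 0.1 threshold used in the text
significant = [i for i, p in enumerate(adj) if p < 0.1]
```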
These experiments utilized a lectin chromatography, MS-based approach that we recently optimized and published to identify candidate cancer biomarkers 24. Initially, we probed nitrocellulose transfers of electrophoretically-separated cell lysates of breast cancer lines established from triple negative and luminal tumor subtypes with a panel of nine lectins (SNA, AAL, Vicia villosa, Phaseolus vulgaris leukoagglutinating and erythroagglutinating, Galanthus nivalis, Euonymus europaeus, Lycopersicon esculentum, and Arachis hypogaea) that recognized either internal saccharide motifs or terminal sugars. The results showed that SNA (Fig. 1a) and AAL (data not shown), which bind motifs with sialic acid and fucose, respectively, reacted with a wide array of glycoproteins. Additionally, some glycoforms were enriched in lines that were derived from the tumors of the same subtype. Staining of intact non-permeabilized cells with fluorescein-conjugated SNA revealed strong surface labeling (Fig. 1b). Together, these results suggested that the breast cancer cell lines produced a large repertoire of glycoproteins that reacted with SNA or AAL, including cell-surface molecules poised to be shed or released.
Next, we used this workflow to compare CM samples from 5 luminal and 5 triple negative breast cancer cell lines to identify subtype-specific glycosites. The cells, listed in Table 1, are members of a well-annotated collection that have been used to define the gene expression profiles, drug sensitivities, and protein splicing patterns of the tumor types from which they were derived 20, 21, 23. Contrary to many other lectin-based approaches, the affinity capture step was performed at the glycopeptide, rather than the protein level, which decreased non-specific binding due to hydrophobic interactions, a phenomenon that we previously observed between lectins and intact proteins. Thus, the samples were trypsin-digested prior to HPLC separation on lectin-conjugated POROS. Then, the bound fraction was treated with peptide N-glycosidase F (PNGase F) to remove N-linked glycans prior to LC-MS/MS analyses. The results were analyzed using two search engines, ProteinPilot and Mascot, to identify peptides and their corresponding proteins 28. N-glycosites were identified as described in the methods 29. Finally, each MS/MS spectrum was manually inspected for the presence of at least one fragment ion that encompassed an N-glycosylation site. Thus, this method identified the glycosite that carries an oligosaccharide with a lectin-binding motif and the corresponding protein. These rigorous criteria were key to making this method highly reproducible 24.
We know from our participation in the Clinical Proteomic Technologies for Cancer (CPTAC) network that analysis of the same sample at multiple sites on different platforms is one way to maximize identifications and test the robustness of a workflow 35, 36. The experimental strategy we used, which exploited this observation, is depicted in Fig. 2. CM samples were trypsin-digested and aliquoted at a single site (Fig. 2A). Lectin enrichment and LC-MS/MS analyses were carried out according to a Standard Operating Procedure (SOP, Supplemental Document 1) at each of three locations—University of California San Francisco, Buck Institute for Research on Aging, and Purdue University (Fig. 2B). Prior to initiating the study, each group evaluated the lectin capture step using a National Institute of Standards and Technology (NIST) human pooled plasma sample, which we have extensively characterized with respect to the SNA and AAL chromatographic profiles and the glycosite composition of the bound fractions 24. MS analyses yielded glycosite identifications and percent enrichment values (total glycopeptides/total peptides) within the expected range 24.
Two groups, M and X, acquired data using a QSTAR Elite QqTOF (AB Sciex), while the third, S, used an LTQ-Orbitrap (Thermo Fisher Scientific). The datasets were submitted to Site M, where all the searches and bioinformatic analyses were completed (Fig. 2C). As the work progressed, two changes to the protocol were implemented. First, due to technical problems encountered during the initial analysis, a second preparation of CM samples was analyzed at two of the three locations (M and S). Second, sites M and S replaced ZipTips® with spin-cartridges for the desalting step that followed PNGase F digestion. This change was made in response to the fact that, in initial experiments, Site S routinely identified significantly more glycosites using this desalting method. All peptides and glycopeptides observed in these experiments are presented as supplemental data (Supplementary Table 1).
We tabulated the MS identifications according to the CM samples in which they were detected. Summaries of the data, including the number of glycoproteins, glycopeptides and N-glycosites observed in each CM sample, and the percent glycopeptide enrichment, are shown in Figs. 3 and 4, and in Supplementary Table 2. Overall the three groups identified a total of 1011 distinct N-glycosites from 533 glycoproteins. Of these, 945 and 641 were observed following AAL and SNA chromatography, respectively. Interestingly, the same workflow applied to pooled healthy human plasma resulted in many fewer identifications. Approximately half the species captured from CM bound to both lectins; the remainder preferentially interacted with either AAL or SNA (Fig. 3A). A similar phenomenon was observed when the N-glycosites were grouped according to tumor subtype (Fig. 3B and C). Thus, it was clear that employing multiple lectins in our workflow resulted in a greater number of identifications. Furthermore, the data showed that the luminal and triple negative samples contained substantially different lectin-reactive species.
An overall comparison of the data obtained for luminal and triple negative samples across the three sites showed relatively high levels of enrichment in both cases (Fig. 4). Importantly, very few intracellular proteins were identified, additional evidence that the cells were not undergoing apoptosis. Approximately 90% of the glycoproteins observed reside either at the cell surface (59%) or in the extracellular matrix (29%), suggesting that our strategy of using CM as a source of secreted and/or shed glycoproteins was successful (Fig. 5). Since we wanted to identify candidate cancer biomarkers, we were interested to find that a number of the identified species have functions that are relevant to tumor biology. For example, we observed proteinases, including cathepsins and ADAM family members; adhesion molecules, including cadherins and integrins; extracellular matrix components, including decorin and SPARC; and cytokines, including leukemia inhibitory factor and vascular endothelial growth factor C. Furthermore, some of the glycoproteins had been previously identified as putative breast cancer biomarkers, including CD44, galectin-3 binding protein, insulin-like growth factor binding protein 3, and tissue inhibitor of metalloproteinase 1 37–39. We also identified clinically useful markers, such as HER2/ErbB2, and the CA-125 antigen, MUC16, which is commonly used to screen for ovarian cancer but can also be upregulated in breast tumors 40, 41.
Next, we used statistical analyses to generate a list of putative triple negative-specific glycosites. Specifically, we performed a statistical analysis using resampling methods that tested 20,000 random permutations of the data. This process generated a table (Supplementary Table 3) with the number of “triple negative-specific” glycosites expected at random for any given set of selection criteria (e.g., observed in “≥1 triple negative and 0 luminal” or “≥4 triple negative and 1 luminal”). This analysis allowed us to select parameters that maximized the identification of putative triple negative specific glycosites while controlling the FDR. In this context, we required that a glycosite be identified at least once in CM samples from ≥3 triple negative cell lines and not observed in luminal CMs. Using these criteria, the computed FDR for both lectin capture steps was ~15%. This yielded 49 candidates that bound to SNA and 76 that bound to AAL (Fig. 6). Of these, we removed glycosites from highly polymorphic HLA class I histocompatibility antigens, which are variably expressed in the population. The final list of 100 glycosites, from 83 glycoproteins, that were putative triple negative-specific candidates is shown in Table 2.
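The trade-off between stringency and FDR described above can also be illustrated analytically. The sketch below is not the authors' permutation procedure: it assumes that subtype labels are assigned at random to the ten cell lines (5 luminal, 5 triple negative), in which case the chance that a glycosite detected in k lines meets a given criterion follows a hypergeometric distribution, and it assumes that the FDR is estimated as the expected number of chance hits divided by the number of observed hits. The function names and the example counts are hypothetical.

```python
from math import comb


def p_random_hit(k, n, m, n_lum=5, n_tn=5):
    """Probability that a site seen in k lines falls in <= n luminal and
    >= m TN lines when subtype labels are assigned at random."""
    total = comb(n_lum + n_tn, k)
    # t = number of the k detecting lines that are labeled TN
    favorable = sum(
        comb(n_tn, t) * comb(n_lum, k - t)
        for t in range(max(m, k - n), min(k, n_tn) + 1)
    )
    return favorable / total


def estimated_fdr(line_counts, observed_hits, n, m):
    """Expected chance hits divided by observed hits for criterion n-m.

    line_counts lists, for every glycosite, the number of cell lines in which
    it was detected.
    """
    expected = sum(p_random_hit(k, n, m) for k in line_counts)
    return expected / observed_hits if observed_hits else float("nan")


# Illustrative numbers only: 12 sites each seen in 3 lines, 4 observed hits
fdr = estimated_fdr([3] * 12, observed_hits=4, n=0, m=3)
```

Evaluating `estimated_fdr` over a grid of (n, m) values would produce a table analogous in spirit to the one used to choose the "≥3 triple negative and 0 luminal" criterion.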
Next, we asked whether the glycosites we identified could have been predicted from transcriptome analyses. To answer this question, we used existing exon expression array profiles for all of the cell lines and RNAseq data for 9 of the 10. Since the two platforms identified similar sets of differentially expressed genes, we performed statistical analyses using values from the RNAseq experiments, which are better able to differentiate signal from noise (Supplementary Table 4). These analyses showed that 46 of the 83 mRNAs encoding the protein scaffolds that carried biomarker glycosites were upregulated ≥ 2-fold in triple negative vs. luminal cells. This suggested that the differential detection of these glycosites in triple negative CM samples may have been attributable to differences in relative protein abundances. In contrast, nearly half of the triple negative-specific candidates could not have been predicted from the mRNA expression data, as there was no difference in mRNA abundances between the luminal and triple negative subsets. The identification of these glycosites may have been driven by alterations in the protein glycosylation machinery of triple negative cell lines. To address this possibility, we looked for differences in mRNA levels of the transferases that add fucose (recognized by AAL) and sialic acid (recognized by SNA). The results are shown in Supplementary Table 5. Two fucosyltransferases and 8 sialyltransferases were differentially expressed, either up- or downregulated, in triple negative vs. luminal cell lines. Given that we observed both increases and decreases in expression of these enzymes, it is difficult to predict, in structural terms, the net consequences of these changes. However, our glycosite data are empirical evidence of subtype-specific glycosylation patterns in breast cancer.
Initial inspection of the 100 triple negative-specific candidates showed that many targets were derived from glycoproteins that are involved in cancer-relevant processes. To more fully explore this correlation, we performed pathway analyses using two bioinformatics resources: Kyoto Encyclopedia of Genes and Genomes (KEGG) and Ingenuity (IPA). However, the programs recognized only small portions of the dataset, together matching 38% of the total proteins (Supplementary Tables 6 and 7), and most of the results were driven by only a few molecules, e.g., integrins. As an alternative, literature searches enabled assignment of biological functions to 90% of the putative triple negative-specific glycoproteins. Three prominent, interrelated themes emerged: 38% of the targets were up- or downstream components of the TGFβ pathway; 21% were involved in ECM remodeling; and at least 18% were proteinases or proteolytic targets. Minor recurring associations included the epithelial to mesenchymal transition (EMT, 9%) and bone morphogenetic protein signaling (6%).
TGFβ signaling governs important aspects of ECM remodeling and proteinase activities. Through the synthesis, cross-linking, and degradation of a variety of protein and carbohydrate matrix components, the composition and tensile strength of the ECM are modulated, both of which dramatically influence the behavior of surrounding cells 42, 43. With respect to cancer, these activities are strongly associated with increased migration and invasion. TGFβ is also considered to be a central mediator of EMT, through both canonical (i.e., Smad-dependent) and non-canonical (e.g., PI3K and MAPK) pathways 44. Cells undergoing EMT lose apical-basal polarity and stabilizing adhesive epithelial interactions in exchange for the acquisition of a more migratory mesenchymal phenotype. These changes can lead to cell invasion and metastasis, functions that have been linked to TGFβ activity 45, 46. Thus, as a group, the putative triple negative-specific targets we identified were derived from proteins with striking functional similarities and disease relevance 47. It is possible that these biomarker candidates may also suggest subtype-specific clinical targets, which currently do not exist for triple negative breast cancer 18, 19.
The heterogeneous nature of breast cancer is widely accepted 13. Tumor subtyping is commonly based on immunohistochemical analyses of tissue sections cut from biopsies to profile expression of a marker panel—ER, PR, HER2, cytokeratin 5/6 and epidermal growth factor receptor. Increasingly, clinicians are using this information to determine prognoses and optimize treatment 48. For example, the risk prediction tool Adjuvant!Online (www.adjuvantonline.com) can be used to identify the patients who will benefit most from postoperative treatment(s). Although immunohistology-based diagnoses are changing the clinical oncology landscape and improving patient outcomes, there remains much room for advancement. Currently, subtype diagnoses require identification of a lesion, and an invasive procedure to obtain a biopsy. Therefore, the need for circulating biomarkers that serve as sentinels of breast cancer and enable subtyping remains great.
In this context, our biomarker discovery method used cancer cell line CM, i.e., the secretome, as the starting material to identify candidate glycoproteins that carried putative subtype-specific N-glycosites. For the enrichment step, we used lectin capture at the glycopeptide, rather than glycoprotein, level. This approach gives more information, in terms of glycan composition and location along the peptide backbone, than other commonly used related methods (e.g., lectin chromatography at the glycoprotein level, and hydrazide- or boronic acid-mediated chemical capture of glycoproteins/glycopeptides) 24. Accordingly, we interrogated a largely unexplored biomarker discovery space. This conclusion is supported by the fact that only four of the targets that we identified were among the 150 most abundant plasma proteins as described by Hortin et al. 49. Furthermore, only 52 of the targets were present in the recently published high-confidence human plasma proteome that included estimated protein concentrations 50. Of those found in this dataset, 73% were predicted to be present at <50 ng/mL and 40% at <10 ng/mL, reasonably low background levels against which to observe circulating disease-derived signals. As additional support for this concept, only six of the putative triple negative-specific N-glycosites, from five glycoproteins, were found in a previous study in which we used the same workflows and AAL or SNA chromatography to analyze a sample of NIST pooled human plasma from 100 healthy individuals 24. These included glycosites from CD109, CD44, clusterin, extracellular matrix protein 1, and pigment epithelium-derived factor.
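As an illustration of the plasma comparison described above, the following minimal Python sketch shows how a candidate list might be cross-referenced against a reference table of estimated plasma concentrations and then summarized at the 50 and 10 ng/mL cutoffs. The protein names, concentration values, and function names are hypothetical placeholders, not data or code from this study.

```python
from typing import Dict, Iterable, List


def plasma_overlap(candidates: List[str],
                   plasma_reference: Dict[str, float]) -> Dict[str, float]:
    """Return the candidates that also appear in the plasma reference,
    mapped to their estimated concentration (ng/mL)."""
    return {name: plasma_reference[name]
            for name in candidates if name in plasma_reference}


def fraction_below(concentrations: Iterable[float], threshold: float) -> float:
    """Fraction of concentrations strictly below the threshold (ng/mL)."""
    values = list(concentrations)
    if not values:
        raise ValueError("no concentrations supplied")
    return sum(1 for c in values if c < threshold) / len(values)


# Hypothetical reference concentrations in ng/mL (illustration only).
plasma_reference = {"PROT_A": 5.0, "PROT_B": 30.0, "PROT_C": 800.0, "PROT_D": 8.0}
# Hypothetical candidate list; PROT_E is absent from the reference.
candidates = ["PROT_A", "PROT_B", "PROT_C", "PROT_D", "PROT_E"]

matched = plasma_overlap(candidates, plasma_reference)
below_50 = fraction_below(matched.values(), 50.0)
below_10 = fraction_below(matched.values(), 10.0)

print(f"{len(matched)} of {len(candidates)} candidates found in plasma reference")
print(f"{below_50:.0%} below 50 ng/mL, {below_10:.0%} below 10 ng/mL")
```

The two cutoffs are cumulative, so every protein counted below 10 ng/mL is also counted below 50 ng/mL, which matches how the percentages in the text are reported.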
In summary, the workflow that we developed could serve as a blueprint for biomarker discovery. In this paradigm, an initial candidate list is developed using an easily obtained renewable material, such as cell line CM, rather than valuable, and often difficult to obtain, clinical samples such as plasma or serum. As studies that employ targeted enrichment strategies are considerably more sensitive than shotgun proteomics methods, the ability to generate a candidate biomarker list from a biologically relevant source significantly improves the chances of success during the subsequent verification stage 51. This method may be especially useful for diseases, such as ovarian cancer, for which the cell type of origin is uncertain and, consequently, it is difficult to choose control samples 52, 53. A limitation of the method is that O-linked and intact N-linked glycopeptides are not analyzed, due to the absence of universal enzymes to remove carbohydrates and the lack of sufficiently powerful software for rapid identifications, respectively. However, we do not view this as a liability. This workflow was designed as a high-throughput platform to generate biomarker candidates for subsequent verification by MRM. In general, due to heterogeneity, endogenous glycopeptides make poor MRM targets. By contrast, our method yielded a list of putative biomarker targets for direct follow-up in clinical samples, and is easily accessible to any laboratory performing proteomics. Indeed, several groups have recently employed similar methods to identify candidate biomarkers of various cancers including prostate, colon, thyroid and breast 54–57. Interestingly, a few of the biomarkers that we identified were also observed in the latter study, suggesting that this general approach is reproducible and robust 54.
Finally, this workflow is well suited to the development of a multiplexed clinical assay, analogous to a reverse protein array approach, with antibody capture as the first step and lectin binding as the second.
We used a lectin chromatography/MS-based approach to screen conditioned medium from a panel of luminal (less aggressive) and triple negative (more aggressive) breast cancer cell lines. The samples were fractionated using lectins that recognize fucose and sialic acid. In total, we identified 1011 glycosites from 533 glycoproteins. Statistical analyses suggested that a number of these glycosites were triple negative-specific and thus potential biomarkers for this tumor subtype.
We thank Ms. Tiffany Sham for excellent assistance with formatting tables. This work was supported by an NCRR shared instrumentation grant S10 RR024615 (BWG) and by grants from the National Cancer Institute, U24 CA126477 (SJF) and a U24 Subcontract (BWG), that are part of the NCI Clinical Proteomic Technologies for Cancer initiative (http://proteomics.cancer.gov). Additional support was provided by the Director, Office of Science, Office of Biological & Environmental Research, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231, by National Institutes of Health, National Cancer Institute grants P50 CA 58207, U54 CA 112970, and U24 CA 126477, and by NIH NHGRI grant U24 CA 126551, for JWG. A portion of the mass spectrometric analyses was performed in the UCSF Sandler-Moore Mass Spectrometry Core Facility, which acknowledges support from the Sandler Family Foundation, the Gordon and Betty Moore Foundation, and NIH/NCI Cancer Center Support Grant P30 CA082103. OLG is supported by the Canadian Institutes of Health Research and the Stand Up To Cancer-American Association for Cancer Research Dream Team Translational Cancer Research Grant SU2C-AACR-DT0409.