The proteome of human salivary fluid has the potential to open new doors for disease biomarker discovery. A recent study to comprehensively identify and catalog the human ductal salivary proteome led to the compilation of 1166 proteins. The protein complexity of both saliva and plasma is large, suggesting that a comparison of these two proteomes will provide valuable insight into their physiological significance and an understanding of the unique and overlapping disease diagnostic potential that each fluid provides. To create a more comprehensive catalog of human salivary proteins, we have first compiled an extensive list of proteins from whole saliva (WS) identified through MS experiments. The WS list is thereafter combined with the proteins identified from the ductal parotid, and submandibular and sublingual (parotid/SMSL) salivas. In parallel, a core dataset of the human plasma proteome with 3020 protein identifications was recently released. A total of 1939 nonredundant salivary proteins were compiled from a total of 19 474 unique peptide sequences identified from whole and ductal salivas; 740 out of the total 1939 salivary proteins were identified in both whole and ductal saliva. A total of 597 of the salivary proteins have been observed in plasma. Gene ontology (GO) analysis showed similarities in the distributions of the saliva and plasma proteomes with regard to cellular localization, biological processes, and molecular function, but revealed differences which may be related to the different physiological functions of saliva and plasma. The comprehensive catalog of the salivary proteome and its comparison to the plasma proteome provides insights useful for future study, such as exploration of potential biomarkers for disease diagnostics.
Saliva is produced by the three major paired salivary glands (parotid, submandibular (SM), and sublingual (SL)) as well as by numerous minor salivary glands. Besides water, salivary fluid contains proteins, post-translationally modified proteins (e.g., glycoproteins, phosphoproteins), peptides, lipids, minerals, and other small compounds [1, 2]. Upon release of glandular secretions into the oral cavity, the fluid is mixed with a variety of exocrine, nonexocrine, cellular, and exogenous components to ultimately form whole saliva (WS). Through its various components, saliva participates in maintenance of homeostasis in the oral cavity, lubrication of oral tissues, and facilitation of chewing, speaking, and swallowing. Furthermore, saliva protects the oral cavity from foreign invaders, such as bacteria and viruses, by digestion and inhibition of their growth.
Qualitative and quantitative salivary alterations in secretion or composition, induced by either systemic or oral conditions, can cause functional deficiency of the saliva [4–6]. Sjögren’s syndrome, an autoimmune disease, causes reduction in saliva volume, which leads to dry mouth, difficulties in swallowing and speaking, increased caries and periodontal diseases, and infection of the salivary gland. Saliva from Sjögren’s syndrome subjects contains increased levels of a few major salivary proteins [7–9]. We recently found 42 proteins to be significantly elevated in saliva from primary Sjögren’s syndrome subjects. Oral cancers are also associated with significant changes of the salivary proteins. Using LC-MS/MS, we found five salivary proteins to be significantly elevated in oral cancer patients. Also, changes in the salivary protein composition have been observed in systemic diseases. Alterations in glycosylation of salivary mucins have been associated with cystic fibrosis. Increased levels of amylase and IgA are observed in diabetic patients [12, 13]. A number of salivary components, including cortisol, amylase, and lysozyme, are altered under stress conditions. These alterations suggest that analysis of saliva, especially its protein components and carbohydrate PTMs, may have potential for disease diagnosis and health monitoring. The relatively simple, noninvasive collection procedures and its constant availability make saliva an attractive biofluid for disease detection.
A key initial step for saliva to be of practical use for disease diagnosis and health monitoring is the cataloging of its protein components. However, because of its complexity and its variation in protein abundance and PTMs, a comprehensive characterization of the protein composition of salivary fluid could not be achieved through traditional biochemical approaches until the introduction of MS-based, high-throughput proteomics technologies. Recently, several reports aiming to comprehensively catalog the salivary proteome have been published [8, 10, 15–21], with numbers of proteins identified ranging from hundreds to over 1000. A project to catalog the proteomes from salivary gland fluids of parotid and SM/SL glands identified 1166 proteins, with 914 identified in parotid and 917 in SM/SL fluids, and 665 in common (www.hspp.ucla.edu).
To appreciate the unique utility of the salivary proteome in the context of its function and potential diagnostic value, it is important to compare the saliva protein composition with other established proteomes, such as plasma. Overlap in protein content between saliva and plasma may indicate that saliva could be used as a diagnostic alternative to blood tests. Over many decades, numerous studies have uncovered how changes in the concentrations of specific plasma proteins have been associated with disease processes, leading to well-accepted clinical applications. Moreover, the plasma proteome is perhaps the most extensively studied human proteome to date. The international HUPO Human Plasma Proteome Project, a collaboration of many laboratories using MS technology, compiled a core dataset of 3020 distinct proteins (with a minimum of two unique peptides per protein) [24–26]; 889 proteins were confirmed as high-confidence identifications through a rigorous statistical approach adjusting for protein length and multiple comparisons testing.
In the present study, we have attempted to construct a comprehensive catalog of the human salivary proteome by integrating protein identifications from both whole and ductal salivary fluids. The salivary proteome was analyzed and compared among whole and ductal saliva as well as to the human plasma proteome. These analyses should greatly facilitate the characterization of these two human body fluid proteomes and should facilitate the discovery and development of diagnostic disease biomarkers.
The proteome of WS was contributed by datasets from four research groups: the University of Minnesota (UMN), Research Triangle Institute (RTI), Calibrant Biosystems/University of Maryland (CB/UM), and the University of California-Los Angeles (UCLA). The datasets include newly acquired data from WS as well as previously published data [17–20, 28]. The experimental methods described below are primarily for the new experiments performed to supplement the list of WS proteins. The lists of salivary protein identifications from ductal saliva, i.e., parotid and SMSL, are the result of a consortium effort by three National Institute of Dental and Craniofacial Research (NIDCR)-supported research groups (Scripps Research Institute, UCLA, and University of California-San Francisco); the methods used by each of the three groups have been described . For the comparison of the salivary proteome to the plasma proteome, the published HUPO plasma proteome dataset was used . The 3020 plasma protein identifications with two or more peptides were obtained from http://www.bioinformatics.med.umich.edu/hupo/ppp. The dataset is available also at the European Bioinformatics Institute (http://www.ebi.ac.uk/pride), and it has been incorporated into the Peptide Atlas at the Institute for Systems Biology (http://www.peptideatlas.org).
Whole, unstimulated saliva was collected from four healthy individuals using a previously described protocol. WS (1 mL) was removed and centrifuged at 25 000 × g at 4°C for 30 min. The supernatant was collected and quantified by using the BCA protein assay with BSA as a standard control (Pierce), giving 1.05 mg of total soluble protein per mL. Equal amounts of soluble saliva (200 µL) were combined from the four individuals. The combined saliva was brought to 100 mM with HEPES, pH 8.0 and 5 mM with Tris-(2-carboxyethyl)-phosphine (TCEP) and incubated overnight with 20 µg of trypsin (Promega, Madison, WI) at 37°C. The resulting peptides were concentrated and desalted using an RP Sep-Pak cartridge (Waters, Milford, MA) and dried by vacuum centrifugation.
Preparative IEF of the tryptic peptide mixture was performed using a commercially available ProTeam free-flow electrophoresis (FFE) system (BD Biosciences, Franklin, NJ), as described previously . Approximately 50% of each FFE fraction was taken from each of the microtiter plate wells containing peptides and processed as described , and a second step of fractionation was performed using a PolySULFOETHYL strong cation exchange (SCX) guard column (Javelin guard column, 1.0 mm id × 10 mm, 5 µm, 300 Å, PolyLC) using an automated syringe pump capable of highly accurate sub-microliter per minute flow rates (Harvard Apparatus). Each peptide fraction was re-dissolved in 200 µL of SCX loading buffer (10 mM KH2PO3 containing 20% ACN, pH 3.0) and loaded onto a preconditioned SCX column at a flow rate of 50 µL/min. Peptides were eluted with step-gradient chromatography, using steps with increasing KCl concentration, at a flow rate of 50 µL/min. Eluted fractions from salt steps of 20, 25, 50, and 200 mM KCl in loading buffer were collected (200 µL total volume); each collected fraction was concentrated by vacuum centrifugation, and reconstituted in 30 µL of HPLC loading buffer.
All online µLC separations were done on an automated Paradigm MS4 system (Michrom Bioresources, Auburn, CA), coupled with an LTQ linear IT mass spectrometer (ThermoFisher Scientific, San Jose, CA) as described previously [17, 18]. Acquired MS/MS spectra were searched using SEQUEST (Bioworks version 3.2, Thermo Finnigan, San Jose, CA) against a nonredundant human protein sequence database from the European Bioinformatics Institute (ipi.HUMAN.v3.18.fasta, containing 62 000 entries). A reversed-sequence version of the same database was appended to the end of the forward version for the purpose of false positive rate estimation. Differential amino acid mass shifts for oxidized methionine (+16 Da) were also included. Precursor peptide mass tolerance was ±2.0 Da with no tryptic specificity. Fragment ion tolerance was set to ±1.0 Da. A predicted pI, calculated with the Shimura algorithm, was automatically assigned to each matched peptide sequence using an in-house developed script. The search results were validated using the peptide validation program PeptideProphet. The peptide sequence match results were organized and interpreted using the software tool Interact. To maximize the number of high-confidence matches, peptide matches (regardless of assigned P score) were kept for further consideration only if their predicted pI was within ±0.5 U of the average pI value for the FFE fraction from which they were identified and the peptide sequence was at least partially tryptic. The estimated false positive rate for our protein catalog was 1%.
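The pI and tryptic-specificity filter described above can be illustrated with a minimal sketch. It is not the in-house script used in the study: the function names, the use of "-" to mark a protein N-terminus, and the example pI values are all assumptions made for illustration, and the sketch ignores refinements such as the proline rule and protein C-termini.

```python
def is_partially_tryptic(peptide: str, prev_residue: str) -> bool:
    """Return True if at least one peptide terminus is consistent with trypsin.

    Trypsin cleaves after K or R, so the N-terminus is consistent when the
    residue preceding the peptide is K or R (or the peptide starts the protein,
    marked here by the hypothetical symbol "-"), and the C-terminus is
    consistent when the peptide itself ends in K or R.
    """
    n_term_ok = prev_residue in ("K", "R", "-")
    c_term_ok = peptide[-1] in ("K", "R")
    return n_term_ok or c_term_ok


def keep_match(peptide: str, prev_residue: str, predicted_pi: float,
               fraction_mean_pi: float, window: float = 0.5) -> bool:
    """Keep a match only if it passes both the pI window and the tryptic check."""
    within_window = abs(predicted_pi - fraction_mean_pi) <= window
    return within_window and is_partially_tryptic(peptide, prev_residue)
```

For example, a peptide ending in K with a predicted pI of 4.1 would be kept in a fraction whose average pI is 4.3, whereas the same peptide with a predicted pI of 5.2 would be discarded.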
Whole, unstimulated saliva was collected from a healthy individual into a 50 mL conical centrifuge tube and stored at −80°C until use. Prior to trypsin digestion, the saliva was centrifuged at 5000 × g for 5 min to remove debris. Total protein content of the supernatant was quantified using a Bradford protein assay, with BSA as a reference standard (Pierce), and a total of 1 mg of protein was digested overnight with modified trypsin (Promega) at a ratio of 50:1 (sample/protease) at 37°C. Digests were desalted using a C18-“light” Sep-Pak (Waters).
Salivary peptides were focused on IPG-IEF strips, as previously reported [36–38]. Briefly, a 24 cm pH 3.5–4.5 IPG strip (GE Healthcare) was rehydrated overnight with 1 mg of peptides re-suspended in 8 M urea, 0.5% carrier ampholytes. The strip was subsequently focused using an IPGPhor II (GE Healthcare) according to the manufacturer’s provided protocol. The strip was manually cut into 60 fractions of ~4 mm width. Each fraction was sequentially extracted with 200 µL of 0.1% TFA, 200 µL of 0.1% TFA/50% ACN, and 200 µL of 0.1% TFA/100% ACN. The pooled peptide extracts were dried, resuspended in 0.1% TFA, and then further purified using an Oasis HLB SPE (Waters) resin in a 96-well plate format. Vacuum-dried (Speed-Vac) peptide extracts were subsequently resuspended in 40 µL of 0.1% TFA.
Extracted peptide fractions were subjected to LC-MS/MS analysis on a ThermoFisher Scientific LTQ Classic quadrupole IT equipped with a New Objective (Woburn, MA) Picoview nanospray source coupled to an Eksigent (Dublin, CA) Nano-2-D LC System equipped with an integrated Valco 10-port switching valve and Peltier-cooled microautosampler. The column, which was integral with the nanospray tip, consisted of a 100 µm id × 360 µm od × 10 cm piece of fused silica packed with a monodisperse 5 µm polymeric packing material (5RPC, gift from GE Healthcare, Piscataway, NJ). Three microliters of each dried peptide fraction was loaded onto a capillary sample trap (packed with the same material as the column) and washed briefly with 0.1% aqueous formic acid (FA) (5 min) before switching in-line with the analytical column. The HPLC gradient was 80 min in length and progressed from 15 to 50% B (A: aqueous 0.1% FA, B: 70% ACN with 0.1% FA) at a flow rate of 250 nL/min.
The mass spectrometer was programmed to take sequential scans of the following mass ranges (400–600, 600–700, 700–800, 800–900, and 900–1300 m/z) followed by data-dependent MS/MS of the three most intense ions in each mass range, except in the case of 400–600 m/z where only the two most intense ions were analyzed. Dynamic exclusion was enabled with a repeat count of 2, repeat duration of 60 s, and an exclusion duration of 120 s.
The database employed was the International Protein Index (IPI), human version 3.19. A reversed version of the same database was indexed for tryptic peptides and searched against MS/MS spectra using TurboSEQUEST (ThermoFisher Scientific). Data were subjected to reverse-database and pI filtering using in-house developed software (IDSieve) as previously reported. Actual SEQUEST cross-correlation score (Xcorr) cutoffs were determined for each fraction based on the Xcorr of the highest scoring reverse database hit as a function of charge state for an empirical peptide false discovery rate of ~1%.
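One way to picture a charge-state-specific score cutoff is the sketch below. It is not IDSieve: the function name, the input layout, and the estimate of the false discovery rate as the ratio of reverse to forward hits above the cutoff are assumptions chosen for illustration.

```python
from collections import defaultdict


def xcorr_cutoffs(psms, target_fdr=0.01):
    """Pick, per charge state, the lowest Xcorr whose running FDR meets the target.

    psms: iterable of (charge, xcorr, is_reverse) tuples for one fraction.
    The FDR at a given score is estimated here as (reverse hits) / (forward hits)
    among all matches scoring at or above that score (an assumed convention).
    Returns {charge: cutoff}, with None when no score satisfies the target.
    """
    by_charge = defaultdict(list)
    for charge, xcorr, is_reverse in psms:
        by_charge[charge].append((xcorr, is_reverse))

    cutoffs = {}
    for charge, hits in by_charge.items():
        hits.sort(key=lambda h: h[0], reverse=True)  # best score first
        forward = reverse = 0
        cutoff = None
        for xcorr, is_reverse in hits:
            if is_reverse:
                reverse += 1
            else:
                forward += 1
            # Remember the lowest score at which the running estimate still passes.
            if forward and reverse / forward <= target_fdr:
                cutoff = xcorr
        cutoffs[charge] = cutoff
    return cutoffs
```

With a handful of matches the 1% target effectively places the cutoff just above the best reverse hit, which mirrors the description of basing the cutoff on the highest scoring reverse database hit.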
Whole, unstimulated saliva was collected from a healthy male volunteer. One milliliter of saliva was placed in a tube containing a mixture of protease inhibitors (1 µg aprotinin, 1 µg pepstatin A, and 1 µg leupeptin) and centrifuged at 20 000 × g for 30 min. The supernatant was collected and placed in a dialysis cup (Pierce, Rockford, IL) and dialyzed overnight at 4°C against 100 mM Tris, pH 8.2. Urea and DTT were added to the sample with final concentrations of 8 M and 1 mg/mL, respectively, and incubated at 37°C for 2 h under nitrogen. Iodoacetamide was added to a concentration of 2 mg/mL and kept at room temperature for 1 h in the dark. Trypsin was added at a 1:20 w/w enzyme-to-substrate ratio and incubated overnight at 37°C. The protein digest was desalted using an RP trap column (Michrom Bioresources), eluted with a peptide concentration of 2.0 µg/µL, and lyophilized to dryness using a Speed-Vac (ThermoSavant, San Jose, CA), and then stored at −80°C.
Transient capillary isotachophoresis/CZE (CITP/CZE) was the basis of the multidimensional separations strategy employed. The CITP apparatus was constructed in-house using a CZE 1000R high-voltage power supply (Spellman High-Voltage Electronics, Plainview, NY). An 80 cm long CITP capillary was initially filled with a background electrophoresis buffer of 0.1 M acetic acid at pH 2.8. The sample containing saliva protein digests was prepared in a 2% pharmalyte solution and was hydrodynamically injected into the capillary. A positive electric voltage of 24 kV was then applied to the inlet reservoir, which was filled with a 0.1 M acetic acid solution. The cathodic end of the capillary was housed inside a stainless steel needle using a coaxial liquid sheath flow configuration. A sheath liquid composed of 0.1 M acetic acid was delivered at a flow rate of 1 µL/min using a syringe pump (Harvard Apparatus 22, South Natick, MA). The stacked and resolved peptides in the CITP/CZE capillary were sequentially fractionated and loaded into individual wells on a moving microtiter plate.
Each peptide fraction was analyzed by nano-RP LC equipped with an Ultimate dual-quaternary pump (Dionex, Sunnyvale, CA) and a dual nano-flow splitter connected to two pulled-tip fused-silica capillaries. These two 15 cm long capillaries were packed with 3 µm Zorbax Stable Bond (Agilent, Palo Alto, CA) C18 particles. Nano-LC separations were performed in parallel in which a dual-quaternary pump delivered two identical 2 h organic solvent gradients with an offset of 1 h. Peptides were eluted at a flow rate of 200 nL/min using a 5–45% linear ACN gradient over 100 min with the remaining 20 min for column regeneration and equilibration. The peptide eluents were monitored using a linear IT mass spectrometer (LTQ, ThermoFisher Scientific) operated in a data-dependent mode.
Raw LTQ data were converted to peak list files by msn_extract.exe (Thermo Fisher Scientific). The program OMSSA was used to search the peak list files against a decoyed Swiss-Prot human protein sequence database. This database was constructed by reversing all 12 484 real sequences and appending them to the end of the sequence library. Searches were performed with the following parameters: fully tryptic, 1.5 Da precursor ion mass tolerance, 0.4 Da fragment ion mass tolerance, one missed cleavage, alkylated cysteine as a fixed modification, and variable modification of Met oxidation. The false positive rate for peptide identifications was determined as 1%.
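The construction of such a decoyed database can be sketched in a few lines. The function name and the `REV_` accession prefix are hypothetical; the study does not state how decoy entries were labeled.

```python
def append_reversed_decoys(entries, prefix="REV_"):
    """Return the forward entries followed by their reversed-sequence decoys.

    entries: list of (accession, sequence) pairs.
    prefix:  hypothetical tag that marks a decoy accession.
    """
    decoys = [(prefix + accession, sequence[::-1]) for accession, sequence in entries]
    return list(entries) + decoys


# Tiny illustration with made-up accessions and sequences.
library = append_reversed_decoys([("P1", "MKTAYR"), ("P2", "ACDEK")])
```

Applied to the 12 484 real sequences, this scheme would give a searched library of 2 × 12 484 = 24 968 entries, half of which can only produce false matches.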
WS was obtained from healthy nonsmoking subjects in the morning prior to eating and after rinsing the mouth with water. To minimize protein degradation, protease inhibitor cocktail (Sigma Chemical, 1 µL/mL of WS) and 1 mM of sodium orthovanadate were added immediately to the saliva after sample collection. All samples were kept on ice during the entire process. Roughly 5 mL of clear WS was obtained from pooled individuals after centrifuging at 1300 × g for 5 min. A further centrifugation at 14 000 × g at 37°C for 15 min was performed to remove debris. Protein concentration was determined to be 0.4–1.0 mg/mL (BioRad Protein Assay). The samples were divided into 1 mL aliquots and stored at −80°C.
Ultracentrifugation filters (Microcon YM-10K and YM-3K, Millipore, Billerica, MA, USA) were used to prefractionate the WS into three fractions according to molecular weight: less than 3 kDa, 3–10 kDa, and greater than 10 kDa. Sample processing and trypsin digestion followed protocols described previously.
Additional saliva samples were prefractionated by solution IEF [22, 41, 42]. Proteins in WS were precipitated by mixing with four times the volume of 100% cold ethanol and then incubated overnight at −20°C. The mixture was centrifuged at 13 000 rpm for 15 min at 4°C. The pellet was resuspended in lysis buffer (Zoom 2D protein solubilizer, Invitrogen, Carlsbad, CA) containing Complete Protease Inhibitor (Roche Diagnostic, Indianapolis, IN), Tris base, DTT, and water and sonicated on ice. The pH of the lysate was adjusted to pH 8.5–8.7 with 1 M Tris base and then incubated for 15 min at room temperature with shaking. Sample lysate was alkylated for 30 min with 99% dimethylacrylamide (DMA) at room temperature. To quench any excess of DMA, DTT was added and incubated for 5 min at room temperature. After centrifuging the sample for 30 min at 13 400 rpm at 4°C, the supernatant was collected. The protein concentration was determined by the Non-Interfering Protein Assay (Geno Technology, St. Louis, MO) to be approximately 1.5 mg/mL.
Protein lysate (1.5 mg/mL, 400 µL) was diluted to a final concentration of 0.6 mg/mL in dilution buffer consisting of Zoom IEF denaturant, Zoom focusing buffer pH 3–7 (Invitrogen), Zoom focusing buffer, pH 7–12, and 5 µL 2 M DTT. Solution IEF separation with a Zoom IEF Fractionator (Invitrogen) was performed in the standard format (pH 3.0–10). Diluted sample was loaded into each of the five chambers of the fractionator. Five fractions (pI 3–4.6, 4.6–5.4, 5.4–6.2, 6.2–7.0, and 7.0–10.0) were obtained after fractionation. Proteins from each fraction were precipitated by mixing with 70% acetone, incubating at −20°C for 3–4 h and centrifuging at 13 000 rpm for 30 min.
LC-MS/MS was performed on an Applied Biosystems (Foster City, CA) QSTAR Pulsar XL (QqTOF) mass spectrometer equipped with a nanoelectrospray interface (Protana, Odense, Denmark) and an LC Packings (Sunnyvale, CA) nano-LC system. The nano-LC was equipped with a homemade precolumn (75 µm × 10 mm) and an analytical column (75 µm × 150 mm) packed with Jupiter Proteo C12 resin (particle size 4 µm, Phenomenex, Torrance, CA). The released peptides were dried and dissolved in 0.1% FA solution. For each LC-MS/MS run, typically 6 µL of sample solution was loaded to the precolumn. The precolumn was washed with the loading solvent (0.1% FA) for 4 min before the sample was injected onto the LC column. The eluents used for the LC were 0.1% FA (solvent A) and 95% ACN containing 0.1% FA (solvent B). The flow was 200 nL/min, and the following gradient was used: 3% B to 35% B in 72 min, 35% B to 80% B in 18 min, and maintained at 80% B for the final 9 min. The column was equilibrated with 3% B for 15 min prior to the next run.
For online LC-MS/MS analyses, a Proxeon (Odense, Denmark) nanobore stainless steel online emitter (30 µm id) was used for spraying with the voltage set at 1900 V. Peptide product ion spectra were recorded automatically during the LC-MS/MS runs by information-dependent analysis (IDA) on the mass spectrometer. Argon was employed as the collision gas. Collision energies for maximum fragmentation efficiencies were calculated using empirical parameters based on the charge and m/z of the peptide precursor ion.
Proteins were identified by using the Mascot database search engine (Matrix Science, London, UK). All searches were performed against the EBI human IPI database (version 3.03; release date February 5, 2005). For saliva samples prefractionated by in-solution IEF, DMA modification of cysteines was added to the variable modification list. In all searches, one missed tryptic cleavage was allowed, and a mass tolerance of 0.3 Da was set for the precursor and product ions. A MASCOT score of >25 with a p-value of <0.05 was considered a significant match. False-positive rates were determined to be ~2% by using the method described by Matrix Science (www.matrixscience.com).
Protein and peptide identifications collected from WS, parotid, SMSL, and plasma by the participating groups were imported into a relational database designed specially for storage of proteomics experimental data generated by the NIDCR-supported salivary proteome consortium project.
The list of protein and peptide identifications from WS was derived from several protein database sources. To create a consensus list of protein identifications for each biological sample source (i.e., WS, parotid, SM/SL, or plasma) and to make an effective comparison among the sample sources, the mandatory first step is to standardize the protein identifications in reference to the same protein database through a reproducible algorithm. Therefore, we reassembled the protein identifications based on peptide sequences and chose protein database IPI v3.32 (released in August 2007) as the reference database. The strategy of reassembly (inference) of protein identifications from the peptide level was used previously in both plasma and brain proteome studies [26, 43] and also in the integration of the human peptide sequences with the human genome . The algorithm we used seeks to find the minimum protein identification in a given sample source by the following steps:
Protein identifications were classified based on whether they contained sequence features of secreted signal sequence, transit sequence, or transmembrane domain. The sequence features of the protein identifications were either extracted from the protein annotation file obtained from UniProt/Swiss-Prot database or predicted using the sequence feature prediction programs, SignalP for secretion signal sequences , TargetP for organelle presequences , and TMHMM for transmembrane helix sequences . These programs were obtained from the Center for Biological Sequence Analysis, Technical University of Denmark DTU (http://www.cbs.dtu.dk/services).
IPI protein sequence database and its crossreferences file released in August, 2007 were obtained from ftp://ftp.ebi.ac.uk/pub/databases/IPI/. A flat file format of GO for biological process, molecular function, and cellular component were obtained from the GO database website http://www.geneontology.org/Go.downloads.shtml. A gene map of the online Mendelian inheritance in man (OMIM) was obtained from ftp://ftp.ncbi.nih.gov/repository/OMIM/genemap. Biological pathway information was obtained from the KEGG database (http://www.genome.jp/kegg/).
The significance of comparisons of GO distributions was estimated using the χ2 test. The χ2 test was performed using the statistical package SAS. The adjustments for protein length and multiple comparisons testing reported for the Plasma Proteome Project were not applied to the salivary proteome results.
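As an illustration of the kind of comparison involved, the sketch below computes a Pearson χ2 statistic for a single GO category laid out as a 2 × 2 table (proteins in or out of the category, in each of two proteomes). The analysis itself was run in SAS; the table layout, the function name, the absence of a continuity correction, and the example counts are assumptions.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 degree of freedom, no continuity correction).

    Table layout (assumed for illustration):
                    in category   not in category
        proteome 1       a              b
        proteome 2       c              d
    Returns (chi2, p_value).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    expected = [row1 * col1 / n, row1 * col2 / n, row2 * col1 / n, row2 * col2 / n]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts, not taken from the study.
chi2, p = chi_square_2x2(20, 10, 10, 20)
```

Because many GO categories are tested, a multiple-comparison adjustment would normally be considered as well; as stated above, no such adjustment was applied to the salivary results.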
The WS peptide and protein identifications and its comparison to the human plasma proteome were stored in a relational database. The details of the relational database can be accessed through the http://www.hspp.ucla.edu/. Briefly, the database was implemented using the open source relational database package MySQL. The database has web interface features that allow users to search and query the database through a variety of parameters including saliva source, protein accession numbers, and keywords.
In parallel to the analysis of ductal salivary proteomes recently reported , the present study reports the characterization of the human WS proteome. The WS protein and peptide identifications include those derived from the high-throughput MS-based experiments performed independently by four research groups reported here, as well as results from previous efforts [16, 18–20, 28]. In total, the four groups submitted 12 679 distinct peptide identifications with a false positive rate of less than 2% and 3196 distinct protein identifications. The four groups implemented diversified protocols for protein fractionation, peptide separation, MS, and database searching algorithms and databases (Table 1).
To create a consensus comprehensive list of WS protein components, we integrated and standardized the heterogeneous protein identifications to the IPI database (IPI v3.32, August 2007 release). The integration process started at the peptide level and resolved a nonredundant minimal set of protein identifications, defined such that within a group of proteins that contain the sequences with 100% identity to a set of peptides, one of them was selected to represent the group of proteins and reported. The computational approach for the integration and standardization was similar to the method introduced previously, and the selection of a representative protein from a group of proteins was similar to that used for the HUPO Plasma Proteome Project. A total of 12 602 of the 12 679 originally submitted peptides exactly matched sequences in IPI v3.32; these peptides were used to infer 2158 distinct proteins. Within the 2158 WS proteins, 702 resulted from single-peptide-based identifications, which were subsequently excluded. We utilized the remaining 1456 identifications, derived from 2 or more peptides, as high confidence identifications for further analyses and comparisons.
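The grouping and filtering just described can be sketched as follows. This is a simplified illustration, not the algorithm used in the study: it only collapses proteins supported by exactly the same peptide set and then drops single-peptide identifications, and the rule for choosing the representative (the alphabetically first accession) is an assumption.

```python
from collections import defaultdict


def infer_proteins(peptide_to_proteins, min_peptides=2):
    """Collapse indistinguishable proteins and keep multi-peptide identifications.

    peptide_to_proteins: dict mapping each peptide sequence to the accessions
    of all database proteins that contain it with 100% identity.
    Returns {representative_accession: sorted list of supporting peptides}.
    """
    # Invert the mapping: which peptides support each protein?
    protein_peptides = defaultdict(set)
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            protein_peptides[protein].add(peptide)

    # Proteins with identical peptide evidence form one group.
    groups = defaultdict(list)
    for protein, peptides in protein_peptides.items():
        groups[frozenset(peptides)].append(protein)

    inferred = {}
    for peptides, proteins in groups.items():
        if len(peptides) >= min_peptides:
            representative = min(proteins)  # assumed tie-breaking rule
            inferred[representative] = sorted(peptides)
    return inferred


# Made-up accessions and peptides for illustration.
example = infer_proteins({
    "PEPTIDEA": ["IPI001", "IPI002"],
    "PEPTIDEB": ["IPI001", "IPI002"],
    "PEPTIDEC": ["IPI003"],
})
```

In the example, the two proteins sharing both peptides are reported once, and the protein supported by a single peptide is excluded, in the same way that the 702 single-peptide identifications were removed from the 2158 inferred proteins to leave 1456.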
Besides proteins from human sources, proteins derived from bacterial sources found in the oral cavity were observed within WS. To exclude these bacterial contaminant proteins, the peptides used to derive the 1456 WS identifications were searched against bacterial protein databases. Only 12 out of the 1456 WS identifications contained peptides that matched also to bacterial proteins; these proteins were excluded from the WS identifications, reducing the number of WS protein identifications to 1444.
A total of 233 out of the 1444 WS proteins were confirmed by all four collaborating laboratories and approximately one half of the proteins (756) were supported by at least two laboratories (Fig. 1A). An approximate relative abundance of the WS proteins was estimated by the number of unique peptides used to derive the identifications and by sequence coverage (Fig. 1). The concordance among the groups increased with proteins having increased number of unique peptides per protein identification (Fig. 1B). Similarly, protein identifications by multiple groups were related to the sequence coverage of the protein (Fig. 1C). The number of identifications confirmed by any two groups reached a maximum when the coverage was 40–50%, while three and four group matches dominated at higher sequence coverage (Fig. 1C).
To study the origin of the salivary proteins, we compared the WS proteome to the ductal parotid/SMSL saliva proteome. Similarly, to examine the common nature of saliva and blood, we compared the saliva proteins to the plasma proteome. To make the comparison effective, the parotid/SMSL proteomes derived from IPI v3.24, and the plasma proteome from IPI v2.23 were integrated and standardized to the reference database IPI v3.32 following the same procedures as implemented for WS. As shown in Fig. 2, 34 and 10% of the distinct peptides identified in WS overlap with the peptides identified in the parotid/SMSL proteome and plasma proteome, respectively. At the protein level, 51% of the 1444 WS proteins overlap with the 1235 parotid/SMSL proteins and 33% overlap with the plasma proteins. The higher overlap observed at the protein level indicates that the same proteins found in the two proteomes do not necessarily depend on the same overlapped peptides. A similar phenomenon was noted in a comparison of brain, plasma, and platelet proteomes.
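Once both lists refer to the same database version, the overlap figures reduce to set operations on accession numbers. The sketch below shows the calculation on made-up accessions and then checks the reported protein counts against each other; the function name and the dictionary keys are illustrative choices.

```python
def overlap_summary(list_a, list_b):
    """Summarize the overlap between two lists of standardized accessions."""
    a, b = set(list_a), set(list_b)
    shared = a & b
    return {
        "shared": len(shared),
        "union": len(a | b),
        "pct_of_a": 100 * len(shared) / len(a),
    }


# Made-up accessions for illustration.
toy = overlap_summary(["A", "B", "C", "D"], ["C", "D", "E"])

# Consistency check on the reported counts: 1444 WS proteins, 1235 parotid/SMSL
# proteins, and 740 proteins found in both whole and ductal saliva.
combined = 1444 + 1235 - 740
pct_ws_in_ductal = 100 * 740 / 1444
```

The reported counts agree with one another: the union of the two lists is 1939 proteins, and 740 of 1444 corresponds to the 51% overlap quoted for WS.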
To create a comprehensive catalog of the human salivary proteome, the proteins found in WS and ductal saliva were combined, resulting in a total of 1939 proteins. This combined WS/ductal salivary proteome was compared to the plasma proteome with regard to their theoretical molecular weight and pI (Fig. 3). The salivary proteome contains a large proportion (20%) of low molecular weight proteins (<20 kDa) in contrast to only 7% for the plasma proteome. In total, 68% of the saliva proteins have molecular weight less than 60 kDa compared to 37% of the plasma proteins. With regard to the proteins found in common between saliva and plasma, the molecular weight distributions show similarity to the distributions of the salivary proteome with a tendency toward the low molecular weight end, except in the highest MW range (≥200 kDa). A pI comparison of the saliva and plasma proteomes revealed that saliva contains more proteins at the lower (≤5) and higher (≥11) ends of the pI scale (Fig. 3B), with an average protein pI of 7.03 and 7.13 for saliva and plasma, respectively. The trend toward a higher proportion of proteins with MW less than 20 kDa observed in the saliva proteome is further manifested in the ductal parotid/SMSL proteome. Compared to 17% in WS, 26% of the parotid/SMSL proteins are less than 20 kDa in size (Fig. 3C). In contrast to the difference in the pI distribution of saliva and plasma, the parotid/SMSL and WS proteomes show very similar pI distributions (Fig. 3D).
The salivary and plasma proteomes were further compared based on their annotation in the GO terms of cellular component, biological process, and molecular function (Fig. 4). As expected, compared to the total human proteome, the salivary and plasma proteomes are over-represented in the extracellular component, an indication of secretion (p<0.001). The level of over-representation in the extracellular component is further enhanced in the proteins that coexist in saliva and plasma. The salivary and plasma proteins are also over-represented in the cytoplasmic and cytoskeleton components (p<0.001). In contrast, intracellular components are under-represented in saliva and plasma. With regard to biological processes (Fig. 4B), compared to the human proteome, saliva and plasma are over-represented in the categories of response-to-stimulus, response-to-stress, and cell organization and biogenesis, but are under-represented in cell communication and other primary metabolic processes. Interestingly, the distributions of the salivary proteins are significantly enhanced in protein metabolic and catabolic processes compared to plasma (p<0.001). In the GO molecular function categories, the salivary and plasma proteomes are significantly over-represented in protein binding but are under-represented in nucleic acid binding, transporter activity, and signal transducer activity (p<0.001) (Fig. 4C). In general, the salivary and plasma proteomes showed similar distributions in the GO molecular function categories. However, exceptions were found in the structural, transcription regulator, and antioxidant functions. Compared to plasma, saliva is significantly over-represented in structural molecule and antioxidant functions but under-represented in the transcription regulator function (p<0.001). The proteins common to saliva and plasma generally show an enhanced tendency in the over-represented and under-represented categories of the salivary and plasma proteins.
The distributions of the overlapping proteins are significantly enhanced in the extracellular and cytoplasm categories of cellular component; in the response-to-stimulus, response-to-stress, and protein metabolic and catabolic categories of biological process; and in the protein binding, motor, structural molecule, antioxidant, and enzyme regulator categories of molecular function. They are under-represented in the organelle and intracellular categories of cellular component; in the cell communication and other primary metabolic categories of biological process; and in the nucleic acid binding, signal transducer, catalytic, and transcription regulator categories of molecular function.
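The text reports p-values for over-representation but does not name the statistical test used. One common choice for GO category enrichment is the one-sided hypergeometric test, sketched below as an assumption rather than as the authors' method. The function name `hypergeom_upper_tail` and the counts in the example are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) for a hypergeometric draw.

    N: proteins in the background proteome
    K: background proteins annotated with the GO category
    n: proteins in the body-fluid proteome (the sample)
    k: sampled proteins annotated with the GO category
    """
    upper = min(K, n)
    total = comb(N, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / total


# Hypothetical toy counts: a background of 20 proteins, 5 of them annotated,
# and a sample of 4 proteins containing 2 annotated ones.
p = hypergeom_upper_tail(k=2, K=5, n=4, N=20)
print(f"P(X >= 2) = {p:.4f}")
```

A small p-value would indicate that the category occurs in the sample more often than expected from the background alone; testing under-representation would use the lower tail instead.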
To test our hypothesis that the body fluids are enriched with proteins that contain secretion signal sequences, we examined the sequence features present in the salivary and plasma proteomes, based on the sequence categories of signal sequence (prepeptide), transit peptide, glycosylation site, and transmembrane region. The sequence annotations were obtained either from the UniProt/Swiss-Prot protein knowledgebase or through the sequence feature prediction programs SignalP, TargetP, and TMHMM. As shown in Table 2, 1436 out of 1939 salivary proteins and 1966 out of 2720 plasma proteins have corresponding entries in the UniProt/Swiss-Prot database. A large portion of the salivary proteins (27%) and plasma proteins (23%) are annotated with a signal sequence at the N-terminus. Consistent with the observations that many salivary and plasma proteins can be glycosylated, the sequence feature annotation shows that 24% of the salivary proteins and 26% of the plasma proteins contain N-linked glycosylation site(s). Both saliva and plasma contain putative transmembrane proteins (11% in saliva and 18% in plasma). In contrast to the high percentage of proteins with a signal sequence, the proportion of proteins with a transit peptide sequence, required for protein transport across organelle membranes, is low in both saliva and plasma (3.3% in WS and 1.3% in plasma). Considering that part of WS derives from the secretion of the ductal fluids, we also compared the sequence features of WS to those of parotid/SMSL saliva. The result shows that 37% of the proteins in the ductal saliva proteome carry a secretion signal peptide, in contrast to 21% in WS.
We examined the distinct salivary proteins that are observed in saliva but not in plasma. Because some plasma proteins can be expected to be present at very low abundance in saliva, we compared the salivary proteome compiled from WS, parotid, and SMSL to the plasma protein list that includes single-peptide-based identifications (9555 total proteins). The proteins unique to saliva include those with well-known salivary functions, such as proline-rich protein isoforms, amylase, cystatin isoforms, lactoperoxidase, and statherin. The antioxidant proteins peroxiredoxin-4 and -6, the proteinase kallikrein-1, and myeloperoxidase are also identified as proteins unique to saliva (Table 3). The abundance of these distinct salivary proteins is ranked based on the number of unique peptides used to derive their identifications from WS (Table 3).
To further examine the biological roles of the salivary and plasma proteins, we classified the proteins based on their biological pathways extracted from the KEGG pathway database (Table 4). Saliva and plasma proteins show the highest pathway representation in cell communication, carbohydrate and amino acid metabolism, immune system, and signal transduction. An exception was found in the signaling molecule and interaction pathway, to which 108 of the 1415 total KEGG entries of the plasma proteins were matched, in contrast to only 47 of the 1527 total entries for the saliva proteins.
Besides the KEGG pathways shared by many salivary and plasma proteins, a detailed examination revealed that these two proteomes are enriched with Igs. Consistent with the previous report that Igs make up about 5–15% of the total number of salivary proteins, the saliva proteome from the integrated results of WS and parotid/SMSL shows that 219 (11.3%) of the 1939 salivary protein components are Igs. Interestingly, a majority of these Igs (141 out of 219) were shown to overlap with the plasma Igs, even though the specimens of saliva and plasma are not from the same individuals (Fig. 5). We also compared the salivary and plasma proteins participating in the KEGG carbohydrate metabolism, immune system, and cell communication pathways. In contrast to the strikingly high overlap (61%, 141 out of 230) of Igs found in saliva and plasma, only 18% (24 out of 132) overlap was found in the carbohydrate metabolism pathway, 22% (42 out of 188) in immune system, and 27% (65 out of 239) in cell communication. When the comparisons are performed between the WS and ductal parotid/SMSL proteomes, higher overlaps are observed in these KEGG pathways, with 57% in carbohydrate metabolism, 46% in immune system, and 43% in cell communication (Fig. 5). This higher overlap between WS and ductal parotid/SMSL is consistent with the closer biological and physiological similarity of these two fluids than between saliva and plasma.
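The saliva/plasma overlap percentages quoted above follow directly from the reported counts, as the short check below shows. The dictionary layout and variable names are illustrative only. The denominators are used as given in the text, which does not state how they were assembled.

```python
# (shared proteins, total proteins considered), as reported in the text
# for the saliva versus plasma comparison.
counts = {
    "immunoglobulins": (141, 230),
    "carbohydrate metabolism": (24, 132),
    "immune system": (42, 188),
    "cell communication": (65, 239),
}

percent = {
    name: round(100 * shared / total)
    for name, (shared, total) in counts.items()
}

for name, value in percent.items():
    print(f"{name}: {value}%")

# Share of the 219 salivary Igs that are also found in plasma.
salivary_ig_in_plasma = round(100 * 141 / 219)
print(f"salivary Igs also in plasma: {salivary_ig_in_plasma}%")
```

The two Ig figures differ only in the denominator: 141 out of 230 gives 61%, whereas 141 out of the 219 salivary Igs gives 64%.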
The relative abundance of these Igs varied greatly. The Igs in saliva are identified with 2 to 171 unique peptides and with sequence coverages from below 0.1% to as high as 91%. In plasma, the Igs are derived from 2 to 137 unique peptides. Figure 6 demonstrates that a linear correlation in the number of unique peptides observed for the Igs exists between WS and plasma, parotid and plasma, and SMSL and plasma.
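A linear relationship of this kind is commonly summarized with the Pearson correlation coefficient. The text does not say which statistic underlies Fig. 6, so the sketch below is an assumption, and the peptide counts in it are hypothetical values chosen for illustration, not data from the study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sxx = sum(d * d for d in dx)
    syy = sum(d * d for d in dy)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    sxy = sum(a * b for a, b in zip(dx, dy))
    return sxy / sqrt(sxx * syy)


# Hypothetical unique-peptide counts for four Igs seen in both fluids.
saliva_peptides = [2, 5, 20, 60]
plasma_peptides = [3, 6, 18, 55]

r = pearson_r(saliva_peptides, plasma_peptides)
print(f"Pearson r = {r:.3f}")
```

Unique-peptide counts are only a rough proxy for abundance, so a coefficient computed this way indicates a trend rather than a quantitative concentration relationship.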
To achieve a comprehensive human salivary proteome, we began with construction of the WS proteome. As with other large-scale proteome projects, such as the HUPO Plasma and Brain Proteome Projects and the most recent ductal parotid and SMSL proteome, the intrinsic complexity of the WS proteome makes its characterization challenging, and the outcome is influenced by sample source and collection process, sample preparation, and the protein identification process. Combining datasets from different experimental approaches yields a more comprehensive proteome than any single approach can achieve. A core dataset with 1444 WS proteins was assembled from the integration process. Similar to the reports from the plasma proteome, brain proteome, and parotid/SMSL proteome studies, a large portion (52%) of the WS proteins are identifications measured by only one laboratory. Besides the various sample preparation and experimental measurement approaches employed, variations in saliva sample source and protein concentration may also induce differences in protein identification. The WS proteins confirmed by the four groups (233 proteins) should represent the essential salivary components that are least susceptible to differences in methodology and sample source. These protein identifications include such well-known salivary proteins as Igs, amylases, cystatins (D, S, C, SA, and SN), proline-rich protein 3, keratins, and mucin-5B.
Together with our previous characterization of the ductal parotid and SMSL proteomes, our present study showed that the salivary protein components vary with the source and origin of the fluid, i.e., WS or ductal salivas. Although 740 out of 1939 salivary proteins coexist in WS and ductal saliva, 563 are specific to WS and 369 are specific to parotid/SMSL. It is known that proteins in WS originate not only from the secretion of salivary glands (i.e., SM, SL, parotid, and minor glands) but also from leakage of plasma, secretion of bronchial and nasal sources, gingival crevicular fluid, bacteria, food debris, and epithelial or other cell debris.
The functions of saliva include lubrication, antimicrobial activity, protection of mucosal integrity, and digestion. Proteins that participate in one or more of these salivary functions include mucins, amylases, defensins, cystatins, histatins, proline-rich proteins, statherin, lactoperoxidase, lysozyme, lactoferrin, and Igs. The functions of these proteins can be redundant and overlapping. Our study indicates that all of these protein families/isoforms were shared between WS and the parotid and SM/SL fluids, although one or more proteins in a family can be specific to the WS, parotid, or SM/SL proteome. These observations support the previous hypothesis that a specific protein may not be critical for a specific salivary function because other protein families can maintain its function [50, 51].
Except for Igs, proteins with known salivary functions were commonly, but not always, absent in the plasma proteome. For example, the statherin and histatin protein families are specific to saliva. The number of isoforms and the abundance of the mucin, cystatin, and proline-rich protein families were significantly lower in plasma than in the WS and parotid/SMSL proteomes.
Similarities and distinctions between the salivary and plasma proteomes were also revealed through analysis of their cellular components, molecular functions, biological processes, sequence features, and biological pathways. As expected for body fluids, the GO study of cellular components showed that both saliva and plasma are over-represented with extracellular proteins when compared with the overall human proteome. Surprisingly, saliva and plasma are also enriched with cytoplasmic proteins, which could result from cell death. However, specific transport pathways may also exist. A recent study of the tear proteome revealed that cytoplasmic proteins are enriched; a few intracellular proteins were demonstrated as originating from cellular shedding of the epithelium. Tears are produced by the lacrimal gland, which has a structure similar to the serous acini of the salivary gland. Whether the cytoplasmic proteins in saliva and plasma also originate from cellular shedding of the epithelium, as in tear fluid, remains to be determined. The GO analysis demonstrated that saliva and plasma are over-represented in response-to-stimulus and response-to-stress processes, presumably reflecting the functions of these two body fluids in the body’s defense system. Saliva is over-represented in catabolic and protein metabolic processes, which may reflect its major physiologic function in food digestion. As expected, the sequence feature analysis indicated that saliva and plasma contain high proportions of proteins with a signal peptide sequence required for targeting proteins to the ER for subsequent transport through the secretory pathway.
Glycosylation of salivary proteins is believed to play a role in the salivary protective functions. Characterizations of the glycosylated proteins in saliva and plasma have reported 45 proteins in saliva and 303 in plasma as glycosylated proteins [20, 49]. The annotation information extracted from the UniProt knowledgebase indicates that potentially more glycosylated proteins exist in saliva and plasma.
Several sources can contribute to the overlap of protein identifications of saliva and plasma: (i) leakage of plasma into saliva through intracellular or extracellular routes, including outflow of gingival crevicular fluid; (ii) plasma and saliva may share essential proteins needed to maintain their physiological functions as body fluids; (iii) proteins derived from cell debris may be in close contact with either fluid. We expected that the overlapping proteins from different sources would show different abundance patterns. Classification of the salivary and plasma proteins based on their function in the KEGG pathways revealed that the abundance correlations of the overlapping proteins of saliva and plasma vary with their biological functions. Previous estimates established that Igs contribute 5–15% of total salivary proteins. In the present study, 11% of total salivary proteins identified were Igs, and 64% of these were found in plasma. The source of the Igs in saliva was previously proposed as either from salivary gland secretions or from crevicular fluid [50, 54]. Our study reveals that there is a high correlation between the abundance of the overlapping Igs in saliva and plasma, suggesting that these overlapping Igs could result from leakage from plasma.
The ultimate goal of cataloging the proteins found in body fluids is to use the information for health screening and disease detection. To that end, plasma proteins have proved their value as clinical analytes. Saliva has attracted increased attention because it offers advantages over other body fluids: noninvasive collection, constant availability, little need for special equipment, and cost-effectiveness. Diseases such as Sjögren’s syndrome, bacterial and viral infectious diseases, and oral cancer cause alterations of salivary protein expression. Comparison of the salivary proteome with the plasma proteome helps to identify saliva-specific biomarkers as well as plasma-derived biomarkers that have been used in the diagnostics of a variety of human diseases.
Our search of the OMIM database indicated that the salivary and plasma proteomes contain a large number of proteins associated with genetic disorders, some of which have known phenotypes. Table 5 shows the gene entries of the salivary and plasma proteomes in OMIM. The saliva proteins were matched to 1183 entries in OMIM; 1089 are disease genes with known sequences and 91 are associated with diseases with known phenotypes. Similar distributions are observed for plasma proteins. Proteins present in both saliva and plasma were matched to 310 entries for diseases with known gene sequence and 47 entries for diseases with known phenotype. Interestingly, a few plasma proteins that are used in clinical diagnostics [55, 56] are also identified in saliva, including creatine kinase B-type, fibrinogen, hemoglobin, rheumatoid factor, and Igs. These results enhance the potential value of salivary proteins as biomarkers for diagnostics. However, it remains to be determined qualitatively and quantitatively whether these proteins associated with genetic disorders, alone or in combination with the diagnostic plasma proteins, can be used as disease biomarkers.
This work was supported by the National Institutes of Health (U01 DE16275 to D.T.W. and J.A.L., U01DE016267 to J.R.Y. and J.E.M., U54DA021519 to G.O., RO1DE17734 to T.J.G.), MEDC grant GR687 (to G.O.), and SAIC/NCI contract SAIC/NCI 23X110A. J.L.B., J.R.S., B.J.C., and J.L.S. acknowledge funding from the National Institute of Allergy and Infectious Disease, National Institutes of Health under contract No. HHSN266200400067C. J.A.L. also acknowledges support from the W. M. Keck Foundation for the establishment of the UCLA Functional Proteomics Center.
The authors have declared no conflict of interest.