This invited paper reviews the study of protein glycosylation, commonly known as glycoproteomics, beginning with the origins of the subject area in the early 1970s shortly after mass spectrometry was first applied to protein sequencing. We go on to describe current analytical approaches to glycoproteomic analyses, with exemplar projects presented in the form of the complex story of human glycodelin and the characterisation of blood group H epitopes on the O-glycans of gp273 from Unio elongatulus. Finally, we present an update on the latest progress in the field of automated and semi-automated interpretation and annotation of these data in the form of GlycoWorkbench, a powerful informatics tool that provides valuable assistance in unravelling the complexities of glycoproteomic studies.
Glycoproteomics, as distinct from proteomics or glycomics, is the study of the glycosylation of proteins, a covalent modification which confers altered physico-chemical properties and functional activity on the nascent protein chain. There are two broad classes of protein glycosylation in nature: those ‘O-linked’ to Serine or Threonine residues in the protein backbone, and those ‘N-linked’ to Asparagine residues. Mass Spectrometry (MS) has played a key and irreplaceable role in defining the structures of glycoproteins over the past 30 years [1,2], using methods developed from the earlier studies on Antarctic fish blood “antifreeze” and prothrombin glycoproteins [3,4] together with the general ‘mass mapping’ strategy of determining and screening the masses of peptides/glycopeptides produced from specific proteolytic or chemical digests, which itself evolved from earlier ‘mixture analysis’ approaches to protein and glycoprotein sequencing [6,7]. The concept of mapping (sometimes called fingerprinting) derives from the realisation in the late 1970s that the data set comprising the peptide molecular ions M+. or quasimolecular ions [M+H]+ produced by digesting any given protein is likely to be unique (especially if more than one digest is used). It therefore provides a reasonable diagnostic for characterising or identifying the protein and distinguishing it from others, importantly without the need for sequencing. From 1981, early research applications of mass mapping ranged from the screening of recombinant proteins and glycoproteins for the biotech industry, detecting errors of translation or confirming mass matching and thus identity using a software mass-search aid, ProtMap, through to assisting in the structural characterisation of new peptide hormones and the detection and characterisation of glycosylation in human Interleukin 2.
With the later advent of comprehensive computerised protein databases, the peptide maps could then be used to interrogate those databases for matches to, and thus identification of, unknown proteins, which in turn has stimulated the general development of the field of proteomics.
Two unique strengths of mass mapping were, and still are, the ability to map (and therefore visualise) the N-terminal and C-terminal domains of a protein with equal probability and, most importantly, the ability to discover post-translational modifications (PTMs) including glycosylation, either by detecting mass shifts in component peptides in the mass map or by locking on to sugar mass differences in the map created by facile glycosidic bond cleavage. Once detection is achieved in this way, a whole battery of techniques including MS/MS can then be applied to determine even the most complex structures. This laboratory has reported many such novel glycosylation studies over the past twenty years, including defining the glycosylation of tissue plasminogen activator, of pro-opiomelanocortin (POMC), of glycodelins A and S, cytoplasmic glycosylation of Skp 1 [16,17], multiple ‘O-linked’ glycosylation of CD8 and an unexpected novel ‘N-linked’ glycosylation in C. jejuni glycoproteins.
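The idea of "locking on" to sugar mass differences in a mass map can be sketched in a few lines of Python. This is an illustrative sketch only, not the procedure used in the studies cited above: the function name, the tolerance and the example masses are all assumptions made for the illustration, and the residue masses are the standard monoisotopic values for underivatised monosaccharides.

```python
from itertools import combinations

# Monoisotopic residue masses (Da) of common underivatised monosaccharides.
SUGAR_RESIDUES = {
    "dHex": 146.0579,    # e.g. fucose
    "Hex": 162.0528,     # e.g. galactose, mannose
    "HexNAc": 203.0794,  # e.g. GlcNAc, GalNAc
    "NeuAc": 291.0954,   # N-acetylneuraminic acid
}


def find_sugar_ladders(masses, tol=0.02):
    """Return (lighter, heavier, sugar) for every pair of masses whose
    difference matches a monosaccharide residue mass within tol Da."""
    hits = []
    for lighter, heavier in combinations(sorted(masses), 2):
        delta = heavier - lighter
        for sugar, residue in SUGAR_RESIDUES.items():
            if abs(delta - residue) <= tol:
                hits.append((lighter, heavier, sugar))
    return hits


# Hypothetical neutral masses from a mass map (invented for illustration).
peak_masses = [1500.700, 1703.779, 1865.832, 2011.890, 2100.000]

for lighter, heavier, sugar in find_sugar_ladders(peak_masses):
    print(f"{lighter:.3f} -> {heavier:.3f}: +{sugar}")
```

Peaks linked by such differences are only candidates for glycoforms of the same peptide; as the text notes, MS/MS and other techniques are still needed to confirm the assignment.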
Despite those advances, the field of glycoproteomics remains a difficult one to enter for the new researcher, largely due to the sheer complexity and variability of the protein glycosylation we observe in most areas of biological research. In this paper, we attempt to demonstrate the further refinement of the strategies outlined above, with the aim of defining a generic approach to glycoproteomics. We illustrate this with advanced studies, and describe the interactive informatic tools which we are currently using to assist in the detailed interpretation of MS and MS/MS data.
A number of reviews have been published which document the historical perspectives, principles and practice of glycoproteomic analysis [1, 2, 20-25]. The aims of this section are to highlight general issues and to suggest where efforts need to be focused to enable glycoproteomics research to be carried out more effectively. Firstly, of course, a key basic requirement is a well-found laboratory. At Imperial, for example, this includes 3 electrospray (ES) Q-TOF type instruments (including a Q-Star) and 2 matrix-assisted laser desorption ionisation (MALDI) 4800 TOF-TOFs, plus a range of ancillary equipment such as gas chromatography (GC)-MS for composition and linkage analysis and nano-liquid chromatography (LC) for sample presentation to both the Q-TOFs and the TOF-TOFs, in the latter case via a Probot auto-spotter. Broadly speaking, the majority of laboratories engaged in glycoproteomic analyses employ all or part of what has become a generic workflow, as illustrated in Figure 1, with specific methodologies being dictated by available infrastructure, instrumentation and expertise. The type of sample being analysed will also influence the choice of methodology. For example, although not always applicable to large and/or highly heterogeneous glycoproteins, molecular weight profiling of intact glycoproteins (purple arrows in Figure 1) can sometimes provide very useful information on the type and extent of glycosylation. Such “top-down” methods have proven especially powerful in bacterial glycoproteomics, where novel glycans are frequently observed [26-28], and in studies on intact antibodies for the biopharmaceutical industry, where M-Scan routinely screens intact masses at around 150 kilodaltons by both MALDI-TOF and ES-Q-TOF to give confirmatory total mass analysis when reconstructing the detailed protein and carbohydrate profiles from mass mapping studies.
Central to all general glycoproteomic strategies is the mass spectrometric analysis of glycopeptides, usually after chromatographic separation, either on-line (red arrows) or off-line (blue arrows), to simplify the maps produced. Glycopeptides are normally obtained by specific proteolytic or chemical digestion of glycoproteins present in gel bands, immunoprecipitates or tissue extracts.
“Bottom-up” analysis, as the name suggests, begins with the analysis of individual glycans, building back through an analytical tree towards the intact glycoprotein, gathering information such as glycan structure, glycan repertoire, heterogeneity and sites of attachment. The off-line approach, in which chromatographic fractions are collected for individual analysis, is generally more time-consuming and less suited to automation, but it allows a wider array of mass spectrometric equipment and methods to be applied, with far fewer issues than on-line analysis, where the compatibility of the eluents with mass spectrometric analysis and other limitations arising from the coupling of the chromatographic output to the ionisation source may restrict the applications somewhat.
Parallel glycomic analyses (illustrated by the green arrow) are also invaluable to glycoproteomic studies, providing information concerning the specific glycans present and the relative levels of the individual structures. Typically, specific glycan populations are released by enzymatic and chemical digestion of the peptide/glycopeptide mixture, derivatised by permethylation in order to enhance the separation, fragmentation, detection and stability of the constituent structures and then subjected to mass spectrometric analysis. A variety of enzymatic and chemical digestions of the released glycans can be employed prior to analysis if more specific structural information is desired.
Information gathered from these analytical approaches, the quantity of which can be vast, can then be used to search for and identify putative glycopeptides via characteristic fragment ions, with compositions being assigned through the use of both biosynthetic information and MS and MS/MS data.
Individual glycopeptide glycoforms are often very minor constituents compared with the peptides derived from the proteolytic digestion. Hence enrichment of glycoproteins and/or glycopeptides may be essential prior to analysis to ensure that glycoproteomic information is not obscured by vast quantities of proteomic data. Lectins and hydrophilic affinity gels are useful tools for glycoprotein/glycopeptide enrichment. Lectins with relatively broad specificity such as concanavalin A (conA), which recognises high mannose, hybrid and biantennary complex-type N-glycans, are showing considerable promise for enriching serum glycoproteins, whilst more specific lectins such as Vicia villosa lectin (VVL), which preferentially binds to alpha- or beta-linked terminal GalNAc, are valuable tools for purifying glycoproteins that carry the cognate structure. For example, bovine pregnancy-associated glycoproteins, which are rich in glycans carrying the Sda epitope (NeuAcα2-3(GalNAcβ1-4)Gal-), can be efficiently purified from placental tissue using VVL affinity columns. Despite many examples of successful glycopeptide enrichment, the development of such methodologies that are applicable to a wide spectrum of samples is still a major challenge for glycoproteomics.
An even greater challenge occurs at the end of the workflow due to the paucity of informatic tools that are available to aid data interpretation (see below). To illustrate the data handling issues that need to be addressed in glycoproteomic experiments, exemplar data from on-line nanoLC-ES-MS analysis of a tryptic digest of a sample of human glycodelin are shown in Figure 2. The upper panel shows the total ion chromatogram for the complete LC-MS run, whilst the middle panel shows a summation of the mass spectra corresponding to the part of the chromatogram that is highlighted in yellow. This region of the chromatogram has been chosen for scrutiny because the presence of characteristic sugar fragment ions (m/z 204, 366, 407 and 512; see annotations on Figure 2B) indicates that glycopeptides are present. Putative glycopeptide molecular ions are observed at low abundance throughout the green shaded region, which is magnified in the bottom panel (Figure 2C) for clarity. Parts of the spectrum are expanded in the inserts to illustrate the richness of the data and to give insights into the assignment process (see legend). Interpretation of such complex data is greatly facilitated by prior knowledge of the compositions of the glycans in the glycoprotein sample being studied. For this reason it is often advisable for the glycoproteomic workflow to include glycomics experiments which define the total glycan repertoire of the samples under study (see Figure 1, green arrows) [21,34,35]. A recent study reporting on the glycoproteome of mouse uterine luminal fluid is a useful example of how to optimise the complementarity of the pathways shown in blue, red and green in the Figure 1 workflow.
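Screening an LC-MS run for the characteristic sugar fragment ions mentioned above lends itself to a simple automated filter. The Python sketch below is an assumption-laden illustration rather than the procedure used to produce Figure 2: the accurate masses and labels are the commonly used assignments of m/z 204, 366, 407 and 512 to singly protonated oxonium ions (the text itself gives only nominal masses), and the scan identifiers, peak lists, tolerance and function names are hypothetical.

```python
# Diagnostic ions quoted in the text (nominal m/z 204, 366, 407, 512) with
# commonly used assignments as singly protonated oxonium ions.
# The labels and accurate masses are assumptions, not taken from the paper.
DIAGNOSTIC_IONS = {
    204.087: "HexNAc",
    366.140: "Hex-HexNAc",
    407.166: "HexNAc2",
    512.197: "dHex-Hex-HexNAc",
}


def diagnostic_hits(peaks, tol=0.05):
    """peaks: iterable of (m/z, intensity) pairs.
    Return the set of diagnostic-ion labels found within tol of a peak."""
    found = set()
    for mz, _intensity in peaks:
        for ref_mz, label in DIAGNOSTIC_IONS.items():
            if abs(mz - ref_mz) <= tol:
                found.add(label)
    return found


def flag_glycopeptide_scans(scans, min_ions=2, tol=0.05):
    """scans: dict of scan id -> peak list.
    Return ids of scans containing at least min_ions diagnostic ions."""
    return [
        scan_id
        for scan_id, peaks in scans.items()
        if len(diagnostic_hits(peaks, tol)) >= min_ions
    ]


# Hypothetical scans, invented for illustration.
scans = {
    "scan_101": [(204.09, 1200.0), (366.14, 800.0), (785.42, 300.0)],
    "scan_102": [(175.12, 500.0), (262.15, 400.0)],
    "scan_103": [(204.08, 90.0), (407.17, 60.0), (512.20, 45.0)],
}

print(flag_glycopeptide_scans(scans))
```

Requiring at least two diagnostic ions per scan is an arbitrary choice made here to reduce chance matches; a real analysis would tune this threshold and the mass tolerance to the instrument.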
Automated identification of proteins from MS and MS/MS spectra is now almost routine via the usage of informatic tools such as Mascot (http://www.matrixscience.com/). A major factor restricting progress in the glycomics/glycoproteomics field is the lack of rapid, accurate and flexible automated tools capable of retrieving structural information from MS data. The complexity of the glycan structures and the variety of techniques that are used for their study pose additional obstacles to the development of a single automated tool that could have the same impact on glycoproteomics as tools such as Mascot have had for proteomics. Library-based sequencing tools for MS data interpretation, similar to the methods now commonly used for proteins, are limited by the lack of availability of comprehensive and well-curated collections of glycan sequences. De novo sequencing tools and composition analysis tools are not restricted to previously characterized structures, but expert knowledge is fundamental to restrict the number of solutions matching experimental data and to obtain reasonable results. Probably the most successful of these tools is Cartoonist, which has been designed to incorporate the same assumptions used by human expert annotators. Information about biosynthetic pathways is encoded in Cartoonist as a library of several hundred archetype glycans and a set of rules to modify these structures. Additional constraints are enforced to further limit the number of possible structures. The peaks are then annotated by searching for all the structures that can be generated from the archetypes that match the given mass. A software calibration is performed for each spectrum to match observed and predicted masses. Calibration results together with isotope envelope shapes are then used to assign confidence scores to peak annotations. Finally, a graphical output is created by superimposition of the assigned structures on the actual spectra.
The Cartoonist algorithms continue to be developed with the objective of reducing the amount of prior expertise required. An additional development has been to expand the tool’s use to glycoproteomics: Peptoonist now automatically identifies the glycans present at each N-glycosylation site of a glycoprotein. Similar objectives have been achieved by combining generated glycopeptide MS/MS data with in silico workflows. These consist of comparing the original spectra to an N-glycopeptide library to assign the peptide sequence, and predicting the N-glycan composition. Both processes are statistically validated to obtain best-fit glycopeptides.
One of the major advances in our efficiency has come from the development of GlycoWorkbench. The aim of this tool is to provide complete support for the routine interpretation of glycomic mass spectrometric data and to form the basis for the development of fully automated assignment software. GlycoWorkbench comprises several features designed to help users annotate their MS data. For example, the visual editor of glycan structures, GlycanBuilder, enables rapid assembly of graphical representations of structural models. With GlycanBuilder, a glycan can be rapidly specified starting from the reducing end by sequentially adding monosaccharides, modifications, or reducing-end markers (for example 2-aminopyridine, 2-aminobenzamide and 2-aminobenzoic acid) to the already drawn structure, with the corresponding theoretical m/z value computed simultaneously. Different possible chemical derivatisations of the glycan structures, such as permethylation and acetylation, can also be taken into account when calculating theoretical m/z values.
When the structure is too complex and encompasses too many possible arrangements to be easily drawn, GlycoWorkbench allows the user to enter the m/z value of the unknown species and then define a set of parameters such as the presence of chemical modification(s), the nature of the reducing end or the set of possible monosaccharides that could be biosynthetically utilised by the organism studied. These parameters limit the number of possible arrangements. GlycoWorkbench will then compute the theoretical m/z values of the various possible compositions, given the restrictions defined by the user, match them with the experimental m/z value and create a report in which compositions are listed together with the m/z accuracy.
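The composition search described above can be illustrated with a short, self-contained Python sketch. It is not GlycoWorkbench code and makes no claim about how that tool is implemented; the function names, the upper limit on residue counts and the tolerance are assumptions chosen for the example. It also assumes permethylated glycans released as reduced alditols and observed as [M+Na]+ ions, using standard monoisotopic increments.

```python
from itertools import product

# Monoisotopic masses (Da) of permethylated monosaccharide residues.
PERMETHYLATED_RESIDUES = {
    "Hex": 204.0998,
    "HexNAc": 245.1263,
    "dHex": 174.0892,
    "NeuAc": 361.1737,
}
TERMINAL_METHYLS = 46.0419  # CH3 (non-reducing end) + OCH3 (reducing end)
REDUCED_END = 16.0313       # extra increment for a reduced, permethylated end
SODIUM = 22.9892            # Na+ adduct


def theoretical_mz(composition, reduced=True):
    """[M+Na]+ of a permethylated glycan given {residue: count}."""
    mass = sum(PERMETHYLATED_RESIDUES[s] * n for s, n in composition.items())
    mass += TERMINAL_METHYLS + (REDUCED_END if reduced else 0.0)
    return mass + SODIUM


def find_compositions(observed_mz, allowed, max_count=6, tol=0.3, reduced=True):
    """Enumerate compositions of the allowed residues (0..max_count each)
    whose theoretical m/z lies within tol of observed_mz.
    Returns (composition, theoretical m/z, error), best match first."""
    results = []
    for counts in product(range(max_count + 1), repeat=len(allowed)):
        composition = {s: n for s, n in zip(allowed, counts) if n}
        if not composition:
            continue
        mz = theoretical_mz(composition, reduced)
        error = observed_mz - mz
        if abs(error) <= tol:
            results.append((composition, round(mz, 4), round(error, 4)))
    return sorted(results, key=lambda r: abs(r[2]))


# Example query with an invented observed value near nominal m/z 912.
for composition, mz, error in find_compositions(
    912.48, ["Hex", "HexNAc", "dHex", "NeuAc"]
):
    print(composition, mz, error)
```

Under these assumptions Hex2HexNAc1dHex1 and Hex2HexNAc2dHex4 give theoretical values of about m/z 912.48 and 1679.87, consistent with the nominal m/z 912 and 1680 discussed for gp273 later in this paper. Restricting the allowed residues and the maximum counts plays the same role as the user-defined parameters described above: it keeps the list of candidate compositions short.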
GlycoWorkbench can also assist the annotation of MS/MS data in an interactive manner. Various ways of predicting fragments, from the drawn structures or from given m/z values, are offered to the user. Spectra can also be downloaded and a list of peaks manually selected. If two possible structures are predicted to be present, then both can be matched against the list of observed fragment ions. The in silico fragmentation engine computes a complete list of theoretical fragments (multiple glycosidic cleavages and all the possible ring fragments). The annotation engine automatically matches the theoretical list of fragment masses with the manually defined experimental peak-list. The proposed annotations are presented using comprehensive and easily understandable reports that allow the comparison of the different annotations from the structure candidates.
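Matching candidate structures against a fragment peak-list can be sketched as follows. This is a deliberately reduced illustration and not the GlycoWorkbench fragmentation engine: it handles only linear sequences and single glycosidic cleavages (B and Y ions), whereas the engine described above also covers branched structures, multiple cleavages and ring fragments. It assumes permethylated, reduced glycans observed as sodiated ions; the candidate sequences, peak list and function names are hypothetical.

```python
# Monoisotopic masses (Da) of permethylated monosaccharide residues.
RESIDUES = {"Hex": 204.0998, "HexNAc": 245.1263, "dHex": 174.0892}
SODIUM = 22.9892
B_CAP = 14.0157        # CH2: methyl on the non-reducing terminus, less H
Y_CAP = 32.0262        # CH3OH: methylated reducing end plus H at the cleavage
REDUCED_END = 16.0313  # extra increment for a reduced, permethylated end


def glycosidic_fragments(sequence, reduced=True):
    """sequence: residues listed from non-reducing to reducing end.
    Return {label: m/z} for sodiated B and Y ions from single cleavages."""
    fragments = {}
    n = len(sequence)
    for i in range(1, n):
        b_mass = sum(RESIDUES[r] for r in sequence[:i]) + B_CAP
        y_mass = sum(RESIDUES[r] for r in sequence[i:]) + Y_CAP
        if reduced:
            y_mass += REDUCED_END
        fragments[f"B{i}"] = b_mass + SODIUM
        fragments[f"Y{n - i}"] = y_mass + SODIUM
    return fragments


def annotate(peaks, candidates, tol=0.3):
    """Match each candidate's theoretical fragments against observed m/z
    values. Returns {candidate: [(label, theoretical m/z), ...]}."""
    report = {}
    for name, sequence in candidates.items():
        fragments = glycosidic_fragments(sequence)
        report[name] = [
            (label, round(mz, 2))
            for label, mz in fragments.items()
            if any(abs(mz - peak) <= tol for peak in peaks)
        ]
    return report


# Two hypothetical linear isomers sharing the composition Hex2HexNAc1dHex1.
candidates = {
    "A": ["dHex", "Hex", "HexNAc", "Hex"],
    "B": ["Hex", "dHex", "HexNAc", "Hex"],
}
observed = [211.09, 415.19, 520.27, 724.37]  # invented peak list

for name, matches in annotate(observed, candidates).items():
    print(name, len(matches), matches)
```

Comparing the number of matched fragments per candidate mirrors, in a very simplified way, the side-by-side reports described above; in this invented example candidate A explains more of the peak list than candidate B because the terminal dHex fragment is specific to it.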
The user can create and maintain multiple sets of candidate structures, peak-lists, mass spectra and annotated peak-lists in a single workspace so that all the information generated in an experiment can be organized and stored in a file. The software is publicly available for download from the EUROCarbDB Web site.
The capabilities of this tool are exemplified by our current glycoproteomic studies of a glycoprotein called gp273, which is found in the extracellular coat of the freshwater bivalve mollusc Unio elongatulus and is believed to be the ligand for sperm-egg interaction during fertilization. This glycoprotein has attracted the attention of the mammalian developmental biology community because it has been shown to bind to human sperm and to induce the acrosomal reaction. Fucosylated epitopes on the O-glycans of gp273 appear to play a role in this binding. Moreover, it has been suggested that an anti-gp273 IgG antibody recognizes a Lewis a (Galβ1-3(Fucα1-4)GlcNAc)-like epitope on gp273. The binding of human sperm to gp273 was found to be reversed by solubilised human zona pellucidae, suggesting that gp273 might have structural features in common with the genuine sperm receptor ligands on human eggs, whose identities remain a mystery. Characterization of gp273 glycosylation could, therefore, provide the first clues as to their likely structures. Previous structural studies of gp273 have been confined to defining its approximate molecular weight by MALDI-MS (273 kDa) and showing by MALDI-MS and nuclear magnetic resonance (NMR) that its major N-glycans are Glc1Man9GlcNAc2 and Man9GlcNAc2. Nothing has so far been documented concerning its O-glycosylation other than the aforementioned possibility that Lewis a-like structures are present. Information on the nature of the O-glycome is now beginning to emerge from our glycomic analyses exploiting high sensitivity MALDI-TOF/TOF instrumentation and the GlycoWorkbench informatics tool. This research is revealing that gp273 carries a diverse repertoire of O-glycans, many of which are rich in fucose. This is illustrated by the MS and MS/MS data in Figures 3 and 4, respectively.
As shown in the exemplar annotations in Figure 3, the GlycoWorkbench tool facilitates the assignment of possible sugar compositions to each of the molecular ions in the MALDI-MS profile. Putative compositions can then be ruled in or out by information provided by MS/MS experiments. For example, the GlycoWorkbench-annotated MS/MS data in Figure 4 reveal that m/z 912 and 1680 have the compositions Hex2HexNAc1dHex1 and Hex2HexNAc2dHex4 respectively, and not the other compositions shown in Figure 3. Note that the structures shown in the Figure 4 annotations take into account information from linkage analysis experiments (data not shown) which defined the types of monosaccharides in the gp273 glycans. Significantly, the MS/MS data showed unequivocally that the fucose residues are mostly found in the context of the blood group H-antigen. Work is in progress to establish whether any Lewis a-like structures are actually present in gp273.
The advances in glycoproteomics over the past four decades have been very significant, in no small part due to key advances in mass spectrometric instrumentation both with regard to ionisation techniques and instrument geometries, but also to the development of strategies, both chemical and biochemical, for dealing with complex glycoproteomic problems.
These advances, building upon the early concepts of biomolecular mass spectrometric strategy and tactics, have been illustrated in the present paper which outlines current glycoproteomic research projects aimed at gaining a better understanding of the first stages of life itself, including sperm-egg binding. We have stressed the important role of new glycoinformatic tools in this work, and the further development of Cartoonist, GlycoWorkbench and other similar software, together with the associated databases, will in future provide a goldmine of expert interpretive knowledge available to all. In summary, we can conclude that in perhaps only one or two more decades at most, glycoproteomic analysis will be carried out as effectively and efficiently and in as automated a fashion as is currently achievable in the related field of proteomics!
This research was supported by the Biotechnology and Biological Sciences Research Council (BBSRC) grant numbers BBF0083091 and B19088 and the Analytical Glycotechnology Core of the Consortium for Functional Glycomics (GM62116). A.C. was supported by the sixth European Union Research Framework Programme (EUROCarbDB RIDS Contract Number 011952). A.D. was a BBSRC Professorial Research Fellow.