Our results indicate that incorporating structural information can improve the prediction of glycan occupancy of N-X-T/S sequons. In our study, the RF predictors generated using structural information, with or without additional pattern information, outperformed the predictors trained on sequence information with statistical significance, as well as a number of other sequence-based servers that were evaluated on our dataset (Supplementary Table S10). Overall, a comparison between different predictors and their underlying algorithms is complicated by the fact that the predictors (and their respective servers) were developed and optimized on different datasets. Nevertheless, analyses based on our dataset clearly demonstrated that predictors generated using structural properties could give better predictions than those generated solely from sequence information. Differences in local structural features between glycosylated and non-glycosylated sequons were observed not only in our dataset but also in previous studies (Petrescu et al., 2004). Moreover, recent evidence has shown that specific amino acid side chains could directly stabilize the first N-acetylglucosamine of the glycan (Culyba et al., 2011), suggesting that, in addition to sequence, structural features could directly affect glycan occupancy.
A comparison of the best predictors built on all four and on three of the four structural properties (the table below and Supplementary Table S9) suggested that the SS property might be the most important factor in improving prediction accuracy. Surface accessibility played a lesser role: although sequons with SA below the 3.7 Å² threshold had a much higher tendency to be non-glycosylated (13 of the 97 non-glycosylated sequons compared with 2 of the 382 glycosylated sequons), the SA of >96% of the entries (464 out of 479) was above the threshold, and thus the property was less effective for improving prediction accuracy. The reason for the lesser effect of local contact order on prediction accuracy could be similar: for example, in the case of L = 13, non-glycosylated entries were enriched in the set with CO >0.3 (4 of the 97 non-glycosylated sequons compared with 4 of the 382 glycosylated sequons); the CO of >98% of the entries (471 out of 479), however, was below the 0.3 threshold.
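The arithmetic behind this argument can be checked with a short sketch. It uses only the counts quoted above; the function name `flagged_summary` and the notion of an "enrichment" ratio (the flagged rate among non-glycosylated sequons divided by the flagged rate among glycosylated sequons) are illustrative assumptions, not quantities defined in the study.

```python
# Counts quoted in the text: 382 glycosylated and 97 non-glycosylated sequons.
N_GLYC, N_NONGLYC = 382, 97
TOTAL = N_GLYC + N_NONGLYC  # 479 entries

def flagged_summary(nonglyc_flagged, glyc_flagged):
    """Summarize a threshold that flags a small subset of sequons.

    Hypothetical helper: 'flagged' means SA below 3.7 A^2, or CO above 0.3.
    """
    flagged = nonglyc_flagged + glyc_flagged
    nonglyc_rate = nonglyc_flagged / N_NONGLYC
    glyc_rate = glyc_flagged / N_GLYC
    return {
        "flagged": flagged,
        # Fraction of the dataset the threshold says nothing about.
        "unflagged_fraction": (TOTAL - flagged) / TOTAL,
        "nonglyc_rate": nonglyc_rate,
        "glyc_rate": glyc_rate,
        # How much more often a non-glycosylated sequon is flagged.
        "enrichment": nonglyc_rate / glyc_rate,
    }

sa = flagged_summary(13, 2)  # SA below the 3.7 A^2 threshold
co = flagged_summary(4, 4)   # CO above 0.3 at L = 13

for name, summary in (("SA", sa), ("CO", co)):
    print(name, summary["flagged"], round(summary["unflagged_fraction"], 3),
          round(summary["enrichment"], 1))
```

Under these assumptions both thresholds are informative where they apply, but each one flags only a handful of the 479 entries, which is consistent with their limited contribution to overall prediction accuracy.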
The performances of the top predictors encoded by three or four structural properties, ranked by balanced accuracy
Several limitations are inherent to our current approach. The 154 protein structures used to generate the dataset encompass only a very small subset of the eukaryotic proteome. Although we only selected crystal structures with resolution better than 2.5 Å to generate the dataset, the presence or the conformation of specific amino acid or sugar residues could still be ambiguous in some cases. To increase the reliability of the computational predictions, the dataset used in our study was extensively curated. Since the eukaryotic and prokaryotic N-linked glycosylation schemes are different (Kowarik et al., 2006), sequons were only selected from PDB files where the proteins were expressed in eukaryotic systems. To reduce the incidence of false negatives in the dataset, sequons without an ASN-NAG linkage in the PDB files were considered glycosylated if they were annotated as such in UniProt (Apweiler et al., 2004). The incidence of false negatives was further reduced by also considering as glycosylated those sequons for which the ASN-NAG linkage could be modeled in the electron densities from the PDB Structure Factor file. Nevertheless, false negative sequons could still be present in the dataset, in cases where the glycosylation site is occupied but the glycan electron densities were absent and the site was not annotated as glycosylated in UniProt. Furthermore, the ratio of glycosylated to non-glycosylated entries in the dataset was roughly 4:1 (382 to 97), while the actual ratio of glycosylated to non-glycosylated sites is unknown. Additional curation of the dataset could thus further improve the accuracy of the predictions. In addition, tuning of some of the RF input parameters, such as the maximum depth of each tree and the number of input variables to be randomly selected at each node, could further optimize the performance of the different predictors.
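Because the dataset contains far more glycosylated than non-glycosylated entries (382 and 97, the counts quoted earlier), raw accuracy can look good for an uninformative predictor, which is one reason to rank predictors by balanced accuracy. The sketch below assumes the standard definition of balanced accuracy, the mean of sensitivity and specificity; the exact formula used in the study is not restated here, and the "always glycosylated" predictor is purely hypothetical.

```python
def accuracy(tp, fn, tn, fp):
    """Fraction of all entries that are classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)

def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity and specificity (assumed standard definition)."""
    sensitivity = tp / (tp + fn)  # recall on glycosylated sequons
    specificity = tn / (tn + fp)  # recall on non-glycosylated sequons
    return (sensitivity + specificity) / 2

# Hypothetical predictor that labels every sequon as glycosylated,
# scored against 382 glycosylated and 97 non-glycosylated entries.
always_glyc = dict(tp=382, fn=0, tn=0, fp=97)

print(round(accuracy(**always_glyc), 3))
print(balanced_accuracy(**always_glyc))
```

Under this assumption the trivial predictor reaches a raw accuracy of about 0.8 but a balanced accuracy of only 0.5, so balanced accuracy does not reward a predictor for simply favoring the majority class.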
The structural features chosen for the NGlycPred algorithm are less sensitive to the exact coordinates of the protein, and therefore should be suitable for use with homology models. In our analysis, we noticed that the knowledge of side chain torsion angles improved the prediction of N-linked glycan occupancy (data not shown). However, since side chain torsions are more difficult to predict for homology models and might differ dramatically before/after glycosylation, we chose not to include this feature in our models so that the NGlycPred algorithm would be applicable to both crystal structures and homology models. Nonetheless, it should be noted that the accuracy of the predictions may be affected by the quality of the homology models. Also, as NGlycPred uses structural properties of the protein as input, different predictions would be generated for sequons on sequence-identical domains if the tertiary/quaternary context is different (for example, sequons on the outer-domain of HIV-1 gp120 monomer versus sequons on an outer-domain-only construct).
Finally, we note that differences in glycosylation do occur between different eukaryotic species (e.g. mammalian versus insect) as well as in different tissues of the same organism; further improvements in N-glycan prediction may be needed to incorporate these variables. Furthermore, as the addition of N-linked glycosylation typically occurs during protein translation, with the N-X-T/S sequon recognized by the glycosylation machinery as the nascent polypeptide is synthesized and extruded into the endoplasmic reticulum, the theoretical link between structure-based information and N-glycan prediction is unclear: the protein is not yet folded when N-glycans are incorporated. Differences in N-glycan prediction accuracy between artificially incorporated sites and naturally evolved sites may provide insight into this conundrum.
The NGlycPred algorithm described here should provide better prediction of glycan occupancy, and such a prediction is likely to have a number of applications. For example, the ability to silence immunodominant epitopes reliably, through targeted addition of N-glycans, should contribute to immunogen design. N-glycans can also affect the half-life and trafficking of protein therapeutics, so correct prediction of glycan occupancy would be of utility there as well. Thus, despite recent advances in experimental technology to detect N-linked glycosylation (Kaji et al., 2007; Zielinska et al., 2010), a computational algorithm that can quickly identify glycosylation sequons with higher probabilities of glycan occupancy should be of use.