PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of bmcbioiBioMed Centralsearchsubmit a manuscriptregisterthis articleBMC Bioinformatics
 
BMC Bioinformatics. 2010; 11(Suppl 1): S33.
Published online 2010 January 18. doi:  10.1186/1471-2105-11-S1-S33
PMCID: PMC3009505

A weighted q-gram method for glycan structure classification

Abstract

Background

Glycobiology pertains to the study of carbohydrate sugar chains, or glycans, in a particular cell or organism. Many computational approaches have been proposed for analyzing these complex glycan structures, which are chains of monosaccharides. The monosaccharides are linked to one another by glycosidic bonds, which can take on a variety of comformations, thus forming branches and resulting in complex tree structures. The q-gram method is one of these recent methods used to understand glycan function based on the classification of their tree structures. This q-gram method assumes that for a certain q, different q-grams share no similarity among themselves. That is, that if two structures have completely different components, then they are completely different. However, from a biological standpoint, this is not the case. In this paper, we propose a weighted q-gram method to measure the similarity among glycans by incorporating the similarity of the geometric structures, monosaccharides and glycosidic bonds among q-grams. In contrast to the traditional q-gram method, our weighted q-gram method admits similarity among q-grams for a certain q. Thus our new kernels for glycan structure were developed and then applied in SVMs to classify glycans.

Results

Two glycan datasets were used to compare the weighted q-gram method and the original q-gram method. The results show that the incorporation of q-gram similarity improves the classification performance for all of the important glycan classes tested.

Conclusion

The results in this paper indicate that similarity among q-grams obtained from geometric structure, monosaccharides and glycosidic linkage contributes to the glycan function classification. This is a big step towards the understanding of glycan function based on their complex structures.

Background

Glycobiology pertains to the study of carbohydrate sugar chains, or glycans, in a particular cell or an organism. Glycans consist of monosaccharides, which are linked to other monosaccharides by a glycosidic bond, which can take on a variety of conformations. Thus a single monosaccharide may be linked to at most four other monosaccharide (children), in addition to its parent monosaccharide. Glycans are usually found on proteins and lipids, linked from its root monosaccharide and branching out. The structure at the leaves are understood to be important for various biological functions [1]. These often consist of many branches and can become rather complex. Although major glycan motifs (often found at the leaves) are known to be biomarkers for specific diseases, the mechanism of this glycan recognition is still not well understood. Bioinformatics methods for glycobiology have recently developed rapidly due to the availability of glycan structure databases provided by major institutions including KEGG and the CFG (Consortium for Functional Glycomics). One of the important bioinformatics techniques applied to glycans is support vector machines (SVMs) for the extraction of species-specific glycan substructures [2]. This method has been further applied to the extraction of leukemia-specific glycan motifs in human [3]. The application of tree kernels for glycan classification [4] was then developed at the same time as the q-gram distribution kernel for disease-specific glycan motif extraction [5].

These methods were all groundbreaking in the sense that glycan structures have complex tree structures and that no one had ever tried, let alone were successful, in finding biologically important glycan motifs from glycan structure data alone. However, glycobiology is still a difficult topic, since these methods have not been further improved nor applied to actual problems in glycobiology. One of the reasons for this lay in the fact that, just as amino acids have physico-chemical properties that allow them to be grouped with one another, the components of glycans (monosaccharides and glycosidic linkages) also have such properties which have been ignored in the previous methods.

Thus, in this paper, we attempt to incorporate such similarity measures in glycan kernels in order to increase the classification accuracy of glycan structures, and thus improve glycan function prediction. Namely, we focused on the q-gram kernel due to the large variety of features that can be handled with a fairly simple kernel. We incorporated similarity measures as resolved by the KCaM (KEGG Carbohydrate Matcher) algorithm [6], a dynamic programming method for tree structure alignment. We then compared the accuracy of our new method compared to the previous methods and show that the prediction can be improved by using our proposed weighted q-gram method.

Methods

We apply the Support Vector Machine (SVM) to classify glycans based on different kernels constructed from their tree structures. We first describe the original q-gram method and then our proposed weighted q-gram method. We constructed three kernels for the weighted q-gram method: Linkage (LK) kernel, KCaM (KM) kernel and Linkage KCaM (LKM) kernel. Linkage kernel considers the similarity of glycosidic linkage, their layers and the tree structure among q-grams. In previous work, the concept of layers was used to indicate the distance of a particular monosaccharide from the root. We also make use of this concept in our kernels by defining the layer of a monosaccharide and its linkage towards its parent as the number of linkages between the monosaccharide and the root. KCaM kernel considers the similarity of tree structure among different q-grams. Linkage KCaM kernel further improves KCaM kernel by incorporating glycosidic linkage and layer similarity of q-grams.

The q-gram method

Suppose we are given n glycans G = {g1,(...), gn}. A q-gram is defined as a subtree with q nodes isomorphic to a path, in which every node has no more than two adjacent nodes, for q ≥ 1. We denote the set of all q-grams existing in these n glycans to be q-gram set An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-S1-S33-i1.gif where dq is the total number of q-grams in the set Φq. Let Q-gram set Φ = {Φq}g[set membership]Q, where Q is a set of integers, and we usually take Q = {2,(...),9}. A q-gram is also called as Q-gram if q [set membership] Q. Then Q-gram set Φ is the set of all the Q-grams existing in the n glycans. For each glycan gs, the q-gram representation is a column vector An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-S1-S33-i2.gif, where An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-S1-S33-i3.gif means the number of ith q-gram An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-S1-S33-i4.gif in the glycan gs, while the Q-gram representation is the column vector xs, which is the concatenation of the vectors An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-S1-S33-i5.gif, where q [set membership] Q. Let An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-S1-S33-i6.gif and X = [x1,(...),xn] [set membership] Rd×n, where d is the total number of Q-grams in the set Φ. We note that d = ∑q[set membership]Qdq. The feature space Γqusing q-grams is spanned by the column vectors of Xq, and the feature space ΓQ using Q-grams is spanned by the column vectors of X. The feature space ΓQ is the direct sum of feature spaces Γq for all q [set membership] Q, i.e., ΓQ = [plus sign in circle]q[set membership]Q Γq. The q-gram similarity between glycan gs and gt can be represented as the inner product of their feature vectors,

equation image

Thus the q-gram kernel Kq can be simply obtained by An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-S1-S33-i8.gif The Q-gram kernel KQ can then be obtained by KQ = XTX. We note that it is easy to see that KQ = ∑q[set membership]QKq.

Weighted q-gram method

We note that the q-gram method does not consider the similarity among the q-grams in Φq. Suppose the similarity among the q -grams in Φq is represented as a weight matrix Wq, where An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-S1-S33-i9.gif is the similarity score between the ith q-gram and jth q-gram. We represent the similarity between the two glycan gs and gt as follows:

equation image

Thus the weighted q-gram kernel can be obtained by An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-S1-S33-i11.gif. The weighted q-gram method is indeed a generalization of q-gram method. The q-grams and p-grams are considered to have no similarity if p is not equal to q. One can then obtain the weighted Q-gram kernel by

equation image

If we add a weight αq on each q-gram, we can get double weighted Q-gram kernel KαwQ = ∑q[set membership]QαqKwq. Here Wq is required to be a positive definite matrix since the kernel matrix should be positive definite. Let An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-S1-S33-i13.gif. The feature space Γwq for weighted q-gram method is spanned by the column vectors of SqXq, and the feature space for the weighted Q-gram method is ΓwQ = [plus sign in circle]q[set membership]Q Γwq.

q-gram similarity

In this section, we discuss some methods to obtain the similarity among the q-grams.

Linkage kernel

Each q-gram is composed of several monosaccharides and bonds. We represent the ith q-gram by An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-S1-S33-i14.gif, where li is the layer of this q-gram, σi represents the structure shape of this q-gram, An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-S1-S33-i15.gif is an ordered set of monosaccharides, An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-S1-S33-i16.gif is an ordered set of bonds.

For each q, the q-grams with different structure shape is considered to be totally different.

Now we consider two q-grams, An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-S1-S33-i14.gif and An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-S1-S33-i17.gif. We denote the similarity between them by Sq(i, j). As mentioned above, if σi and σj is not same, then Sq(i, j) = 0. The similarity between the q-grams with the same structure shape depends on the similarity of the layers, monosaccharides, and bonds between them. We represent the similarity between q-grams An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-S1-S33-i4.gif and An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-S1-S33-i18.gif by

equation image

where Sl(li, lj) is the similarity of the layers of the two q-grams, An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-S1-S33-i20.gif is the similarity of the monosaccharides An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-S1-S33-i21.gif and An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-S1-S33-i22.gif, and An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-S1-S33-i23.gif is the similarity of the bonds An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-S1-S33-i24.gif and An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-S1-S33-i25.gif.

For the similarity between the layers, it is easy to obtain the similarity function by using the distance of the layers. We then define the similarity

equation image

After the linkage similarity Sq is constructed for all q-grams, the linkage kernel can be obtained similarly as discussed in the previous section. This method of measuring the similarity among q-grams is denoted as the LK method. For the monosaccharides and bonds, since there are too many possibilities of combinations, in the database, we only choose the most frequent ones, which appear most in mammals. In the KEGG glycan database, the most frequent monosaccharides (including other chemical compounds) and the bonds are shown in Table Table1.1. The similarity among monosaccharides can be obtained from the chemical structure comparison method SIMCOMP developed by Hattori et al. [7,8]. The bonds similarity is set by their chemical meanings and is shown in Table Table22.

Table 1
Statistics about the chemical bonds and monosaccharides in glycan database.
Table 2
Bond similarity. The matrix gives the similarity among chemical bonds in glycan database. Higher score indicates that the two chemical bonds are more similar to each other.

KCaM kernel and Linkage KCaM kernel

KCaM is a tool implementing a polynomial-time dynamic programming algorithm to align glycan tree structures. From KCaM we can obtain the similarity score for two glycan structures in the range of 0 to 100. Here we apply KCaM to the q-gram structures to obtain the similarity scores among the q-grams. We denote this method of measuring q-gram similarity as the KM method.

We note that KCaM also does not consider the similarity of the monosaccharides and bonds among the glycan structures and the layers of q-grams. However, a method of computing the score matrices of monosaccharides and linkages have been developed, which can be utilized by KCaM [9]. Thus we incorporated the linkage similarity and layers into the score matrix and developed a new KCaM method called Linkage KCaM (LKM) method. We then use the alignment scores of all q-grams obtained by the Linkage KCaM as the their similarity scores.

Let the alignment score matrix obtained by KCaM or Linkage KCaM be Sq. Here Sq(i, j) is the alignment score between the ith and jth q-grams. Then the weight matrix Wq can be represented using Sq. Since the kernel An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-S1-S33-i11.gif should be positive definite, we revise Wq to An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-S1-S33-i27.gif Thus we have the kernel An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-S1-S33-i28.gif

Results and discussion

Glycan data

We used two sets of glycan data to evaluate the classification performance using different kernels we constructed. Glycan structure data was retrieved from the KEGG/GLYCAN database [10] and their annotations are from CarbBank/CCSD database [11]. The first data set includes glycan structures that are known to be related to leukemic cells, as was originally analyzed by [3]. We also tested our method on another data set of glycans related to cystic fibrosis as was previously used by [5]. The data used here is summarized in Table Table3.3. The statistical analysis for all q-grams in the glycan database shows the distribution of each monosaccharide and chemical bond in Table Table1.1. The most frequent chemical bond in the glycan database is b1 - 4, and the most frequent monosaccharide is GlcNAc. The similarity among monosaccharides is calculated by the software SIMCOMP [7,8], and the similarity among chemical bond is obtained by their chemical meanings, which is shown in Table Table22.

Table 3
Glycan data. The data labels, the number of each class and the total number of each data.

Results

We evaluated the performance of our weighted q-gram method by comparing it with the original q-gram method on glycan data. SVM was performed in the environment of Matlab version 2007a. Table Table44 shows the Area Under the ROC curve(AUC) using SVM classifier for the two datasets. The AUC values in Table Table44 are the average of 5-fold cross validation performed 50 times. We used two ways to combine the information from different q-gram representations. First is the Q-gram method, which represents each glycan by the vector of {2,(...),9}-grams. The other is the recently developed multiple kernel learning [12], which generates an optimal weighting of each representation for each classification task. In Table Table4,4, for the traditional q-gram method, KCaM kernel method and Linkage KCaM kernel method, the results using Q-gram and multiple kernel method are both reported. For the linkage kernel method, we only reported the results for Linkage 2-gram and Linkage 3-gram since when q is larger than 3, the q-gram similarity calculation becomes exponential.

Table 4
Results. The results for four methods are reported: traditional q-gram, Linkage (LK) kernel method, KCaM (KM) kernel method and Linkage KCaM (LKM) kernel method.

Note that Linkage kernel considers the linkage similarity among q-grams, KCaM kernel considers the structure similarity among q-grams, while linkage KCaM kernel considers both. The left of Table Table44 reports the results of the classification for leukemic cells. The result shows that all the three kernels for weighted q-gram methods preforms better than traditional q-gram. In particular, for smaller values of q, the weighted kernels obtain higher accuracy, supporting the fact that it has been shown that trimer (tri-saccharide) structures have been most effective in discriminating between glycans related to leukemic cells compared to non-leukemic blood components. The right of Table Table44 reports the results of cystic fibrosis. The AUCs of all the methods indicate that KM kernel performs best for all the methods. This implies that q-gram structure similarity contributes to cystic fibrosis-related glycan classification. In contrast to the leukemic cell data set, the result that linkage and Linkage KCam method performs worse than traditional q-gram method for lower values of q indicates that the q-gram similarity obtained from comparing their glycosidic linkages may not be helpful for the classification of cystic fibrosis. This result is supported by the fact that it has been shown in previous work that the sulfate-GlcNAc dimeric structure, which does not contain any glycosidic bond data, is a potential glycan biomarker. Interestingly, the best performance of KCaM kernel especially for higher values of q indicate that larger structures may be overlooked as potential biomarkers, thus illustrating the effectiveness of weighted q-gram method in this classification task. In conclusion, the results for two glycan data sets both show the improved prediction accuracy of the weighted q-gram method for glycan biomarker prediction.

Discussion

In this paper, we focused on glycan classification by considering the similarity among q-gram. In the traditional q-gram method, similarity among sub-structures of glycans (q-gram) is not considered. Thus it assumes that q-gram similarity contributes nothing in glycan classification. We proposed a weighted q-gram method based on three kernels including linkage kernel, KCaM kernel and linkage KCaM kernel, and then compared the performance of weighted q-gram method with that of the traditional q-gram method using two glycan data sets. The results show that the consideration of similarity among q-grams contributes to higher accuracy of classification. Further research may focus on the biological properties of the glycan structure features that contribute to the increase in performance of these kernels, thus aiding the understanding of the mechanisms behind glycan structure recognition.

Conclusion

We proposed a new approach in this paper to classify glycans based on their tree structures. The method attempts to involve similarity among q-grams in glycan classification. Three kernels (Linkage, KCaM and Linkage KCaM) are constructed for weighted q-gram methods. The experimental results showed its effective role in classification of glycan functions.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

LL and KFA conceived the idea and designed the research. TY processed the data. LL performed the research and analyzed the results. LL, KFA and WC wrote the paper. All authors read and approved the final manuscript.

Acknowledgements

The authors would like to thank the two anonymous referees and the editor for their helpful suggestions and corrections. Research supported in part by HKRGC Grant No. 7017/07P, HKUCRGC Grants, HKU Strategy Research Theme fund on Computational Sciences, Hung Hing Ying Physical Research Sciences Research Grant, National Natural Science Foundation of China Grant No. 10971075 and Guangdong Provincial Natural Science Grant No. 9151063101000021.

This article has been published as part of BMC Bioinformatics Volume 11 Supplement 1, 2010: Selected articles from the Eighth Asia-Pacific Bioinformatics Conference (APBC 2010). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S1.

References

  • Varki A, Cummings R, Esko J, Freeze H, Hart G, Etzler M. Essentials of Glycobiology. second. New York: Cold Spring Harbor Laboratory Press; 2008. http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=glyco2
  • Hizukuri Y, Yamanishi Y, Hashimoto K, Kanehisa M. Extraction of Species-Specific Glycan Substructures. Genome Informatics. 2004;15:69–81. [PubMed]
  • Hizukuri Y, Yamanishi Y, Nakamura O, Yagi F, Goto S, Kanehisa M. Extraction of leukemia specific glycan motifs in humans by computational glycomics. Carbohydr Res. 2005;340:2270–2278. doi: 10.1016/j.carres.2005.07.012. [PubMed] [Cross Ref]
  • Yamanishi Y, Bach F, Vert JP. Glycan classification with tree kernels. Bioinformatics. 2007;23(10):1211–1216. doi: 10.1093/bioinformatics/btm090. [PubMed] [Cross Ref]
  • Kuboyama T, Hirata K, Aoki-Kinoshita K, Kashima H, Yasuda H. A Gram Distribution Kernel Applied to Glycan Classification and Motif Extraction. Genome Informatics. 2006;17(2):25–34. [PubMed]
  • Aoki K, Yamaguchi A, Okuno Y, Akutsu T, Ueda N, Kanehisa M, Mamitsuka H. Efficient Tree-Matching Methods for Accurate Carbohydrate Database Queries. Genome Informatics. 2003;14:134–143. [PubMed]
  • Hattori M, Okuno Y, Goto S, Kanehisa M. Development of a chemical structure comparison method for integrated analysis of chemical and genomic information in the metabolic pathways. Journal of the American Chemical Society. 2003;125:11853–11865. doi: 10.1021/ja036030u. [PubMed] [Cross Ref]
  • Hattori M, Okuno Y, Goto S, Kanehisa M. Heuristics for Chemical Compound Matching. Genome Informatics. 2003;14:144–153. [PubMed]
  • Aoki K, Mamitsuka H, Akutsu T, Kanehisa M. A score matrix to reveal the hidden links in glycans. Bioinformatics. 2005;21(8):1457–1463. doi: 10.1093/bioinformatics/bti193. [PubMed] [Cross Ref]
  • Hashimoto K, Goto S, Kawano S, Aoki-Kinoshita K, Ueda N, Hamajima M, Kawasaki T, Kanehisa M. KEGG as a glycome informatics resource. Glycobiology. 2006;16(5):63R–70R. doi: 10.1093/glycob/cwj010. [PubMed] [Cross Ref]
  • Doubet S, Albersheim P. CarbBank. Glycobiology. 1992;2(6):505–507. doi: 10.1093/glycob/2.6.505. [PubMed] [Cross Ref]
  • Bach F, Thibaux R, Jordan M. Computing regularization paths for learning multiple kernels. Adv Neural Inform Process Syst. 2005;17:73–80.

Articles from BMC Bioinformatics are provided here courtesy of BioMed Central