
J Biomed Biotechnol. 2005; 2005(2): 80–86.

doi: 10.1155/JBB.2005.80

PMCID: PMC1184055

*Abdelali Haoudi: Email: haoudia@evms.edu

Received 2004 September 9; Revised 2005 February 10; Accepted 2005 February 14.

Copyright Hindawi Publishing Corporation

This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Clustering proteomics data is a challenging problem for any traditional clustering algorithm. Usually, the number of samples is much smaller than the number of protein peaks. A clustering algorithm that does not depend on the number of features or variables (here the number of peaks) is needed. An innovative hierarchical clustering algorithm may be a good approach. We propose here a new dissimilarity measure for hierarchical clustering combined with functional data analysis. We present a specific application of functional data analysis (FDA) to a high-throughput proteomics study. The performance of the proposed algorithm is compared to that of two popular dissimilarity measures in the clustering of samples from normal and human T-cell leukemia virus type 1 (HTLV-1)-infected patients.

A variety of mass spectrometry-based platforms are currently available for providing information on both protein patterns and protein identity [1, 2]. The first widely used such technique is surface-enhanced laser desorption ionization (SELDI) coupled with time-of-flight (TOF) mass spectrometric detection [3, 4, 5]. The SELDI approach relies on an energy-absorbing matrix such as sinapinic acid (SPH): in its presence, large molecules such as peptides ionize instead of decomposing when subjected to a nitrogen UV laser. Thus, partially purified serum is crystallized with an SPH matrix and placed on a metal slide. Depending upon the range of masses the investigator wishes to study, there are a variety of possible slide surfaces, for example, the strong anion exchange (SAX) or the weak cation exchange (WCX) surface. The peptides are ionized by the pulsed laser beam and then traverse a magnetic-field-containing column. Masses are separated according to their TOFs, as the latter are proportional to the square root of the mass-to-charge (m/z) ratio. Since nearly all of the resulting ions have unit charge, the mass-to-charge ratio is in most cases a mass. The spectrum (intensity level as a function of mass) is recorded, so the resulting data obtained on each serum sample are a series of intensity levels at each mass value on a common grid of masses (peaks).

Proteomic profiling is a new approach to clinical diagnosis, and many computational challenges still exist. Not only are the platforms themselves still improving, but the methods used to interpret the high-dimensional data are developing as well [6, 7].

A variety of clustering approaches has been applied to high-dimensional genomics and proteomics data [8, 9, 10, 11]. Hierarchical clustering methods give rise to nested partitions, meaning the intersection of a set in the partition at one level of the hierarchy with a set of the partition at a higher level of the hierarchy will always be equal to the set from the lower level or the empty set. The hierarchy can thus be graphically represented by a tree.
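The nesting property described above can be checked on a toy example. The following is a minimal sketch, assuming single-linkage agglomeration on one-dimensional points; the function names and the data are hypothetical and are not taken from the study.

```python
from itertools import combinations


def single_linkage_merges(points):
    """Agglomerative single-linkage clustering of 1-D points.

    Returns every partition visited, from n singletons down to one cluster.
    Each partition is a list of frozensets of point indices.
    """
    clusters = [frozenset([i]) for i in range(len(points))]
    history = [list(clusters)]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-linkage distance.
        best = None
        for a, b in combinations(clusters, 2):
            dist = min(abs(points[i] - points[j]) for i in a for j in b)
            if best is None or dist < best[0]:
                best = (dist, a, b)
        _, a, b = best
        # Replace the two closest clusters by their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        history.append(list(clusters))
    return history


def is_nested(lower, higher):
    """True if each set of `lower` meets each set of `higher` in itself or in nothing."""
    return all((lo & hi) in (lo, frozenset()) for lo in lower for hi in higher)


points = [1.0, 1.2, 5.0, 5.1, 9.0]  # toy data
levels = single_linkage_merges(points)
nested = all(is_nested(levels[k], levels[k + 1]) for k in range(len(levels) - 1))
print(nested)
```

Because every level is obtained by merging two sets of the previous level, the sequence of partitions can be drawn as a tree, which is what a dendrogram displays.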

Functional data analysis (FDA) is the statistical analysis of data represented by smooth curves or continuous functions *μ _{i}*(*t*), observed with noise at a finite number of points:

$$\begin{array}{c}y_{ij}=\widehat{\mu}_{i}\left(t_{ij}\right)=\mu_{i}\left(t_{ij}\right)+\epsilon_{i}\left(t_{ij}\right),\\ i=1,\dots ,n,\quad j=1,\dots ,T_{i},\end{array}\qquad \left(1\right)$$

where *t _{ij}* is the mass value at which the *j*th measurement of the *i*th curve is taken.

We propose to implement a hierarchical clustering algorithm for proteomics data using FDA. We use functional transformation to smooth and reduce the dimensionality of the spectra and develop a new algorithm for clustering high-dimensional proteomics data.
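To illustrate the smoothing idea behind model (1), the sketch below applies a discrete penalized least squares smoother (a Whittaker-type smoother with a first-difference penalty). This is a deliberately simpler stand-in for the P-spline fit used in the study, not the authors' procedure; the function names, the penalty weight `lam`, and the toy spectrum are all assumptions.

```python
def whittaker_smooth(y, lam):
    """Minimize sum (y_j - z_j)^2 + lam * sum (z_{j+1} - z_j)^2 over z.

    The normal equations (I + lam * D'D) z = y are tridiagonal and are
    solved here with the Thomas algorithm.
    """
    n = len(y)
    if n < 2:
        return list(y)
    # Main diagonal of I + lam * D'D (end points have one neighbour only).
    diag = [1.0 + lam * (1.0 if j in (0, n - 1) else 2.0) for j in range(n)]
    off = -lam  # constant sub- and super-diagonal
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = off / diag[0]
    d[0] = y[0] / diag[0]
    for j in range(1, n):  # forward elimination
        denom = diag[j] - off * c[j - 1]
        c[j] = off / denom
        d[j] = (y[j] - off * d[j - 1]) / denom
    z = [0.0] * n
    z[-1] = d[-1]
    for j in range(n - 2, -1, -1):  # back substitution
        z[j] = d[j] - c[j] * z[j + 1]
    return z


def roughness(z):
    """Sum of squared first differences, the quantity being penalized."""
    return sum((z[j + 1] - z[j]) ** 2 for j in range(len(z) - 1))


# Toy "spectrum": one peak plus small wiggles (hypothetical values).
y = [0.1, -0.2, 0.3, 2.9, 5.2, 3.1, 0.2, -0.1, 0.1]
z = whittaker_smooth(y, lam=2.0)
print(roughness(y), roughness(z))
```

Larger values of `lam` give smoother curves at the cost of fidelity to the raw intensities; the trade-off plays the same role as the penalty in a P-spline criterion.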

Protein expression profiles generated through SELDI analysis of sera from human T-cell leukemia virus type 1 (HTLV-1)-infected individuals were used to determine the changes in the cell proteome that distinguish adult T-cell leukemia (ATL), an aggressive lymphoproliferative disease, from HTLV-1-associated myelopathy/tropical spastic paraparesis (HAM/TSP), a chronic progressive neurodegenerative disease. Both diseases are associated with the infection of T cells by HTLV-1. The HTLV-1 virally encoded oncoprotein Tax has been implicated in the retrovirus-mediated cellular transformation and is believed to contribute to the oncogenic process through induction of genomic instability affecting both DNA repair integrity and cell cycle progression [14, 15]. Serum samples were obtained from the Virginia Prostate Center Tissue and Body Fluid Bank. All samples had been procured from consenting patients according to protocols approved by the Institutional Review Board and stored frozen. None of the samples had been thawed more than twice.

Triplicate serum samples (*n* = 68) from healthy or normal
(*n _{1}* = 37), ATL (

Serum samples were analyzed by SELDI mass spectrometry as described earlier [16]. The spectral data generated were used in this study for the development of the novel FDA-based method.

We propose to implement a hierarchical clustering algorithm for proteomics data using FDA, which consists of detecting hidden group structures within a functional dataset. We apply a new dissimilarity measure to the smoothed (transformed) proteomics functions ${\widehat{\mu}}_{i}$. Then we develop a new metric that calculates the dissimilarity between different curves produced by protein expression. The development of metrics for curve and time-series models was first addressed by Piccolo [17] and Corduas [18]. Heckman and Zamar proposed a dissimilarity measure *δ _{HZ}* for clustering curves [19]. Their dissimilarity measure considers curve invariance under monotone transformations. Let ${\Lambda}_{i}=\left\{{\lambda}_{1}^{\left(i\right)},{\lambda}_{2}^{\left(i\right)},\dots ,{\lambda}_{{m}_{i}}^{\left(i\right)}\right\}$ be the collection of the estimated points where the curve

$$\delta_{HZ}\left(i,l\right)=\frac{\sum_{j=1}^{m_{i}}\left(r\left(\lambda_{j}^{\left(i\right)}\right)-\overline{r\left(\lambda^{\left(i\right)}\right)}\right)\left(r\left(\lambda_{j}^{\left(l\right)}\right)-\overline{r\left(\lambda^{\left(l\right)}\right)}\right)}{\sum_{j=1}^{m_{i}}\left(r\left(\lambda_{j}^{\left(i\right)}\right)-\overline{r\left(\lambda^{\left(i\right)}\right)}\right)^{2}\sum_{j=1}^{m_{l}}\left(r\left(\lambda_{j}^{\left(l\right)}\right)-\overline{r\left(\lambda^{\left(l\right)}\right)}\right)^{2}},\qquad \left(2\right)$$

where

$$\begin{array}{c}r\left(\lambda_{j}^{\left(i\right)}\right)=k_{j}^{\left(i\right)}+\frac{u_{j}^{\left(i\right)}}{2},\quad k_{j}^{\left(i\right)}=\#\left\{s:\lambda_{s}^{\left(i\right)}<\lambda_{j}^{\left(i\right)}\right\},\\ u_{j}^{\left(i\right)}=\#\left\{s:\lambda_{s}^{\left(i\right)}=\lambda_{j}^{\left(i\right)}\right\},\qquad \overline{r\left(\lambda^{\left(i\right)}\right)}=\frac{1}{m_{i}}\sum_{j=1}^{m_{i}}r\left(\lambda_{j}^{\left(i\right)}\right).\end{array}\qquad \left(3\right)$$
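The rank in (3) can be computed directly from its definition. The sketch below is one reading of that definition, assuming that the tie count *u* includes the point itself; the function name `ranks_eq3` and the input values are hypothetical.

```python
def ranks_eq3(landmarks):
    """Rank of each landmark as in (3): r = k + u / 2.

    k counts the landmarks strictly smaller than the current one;
    u counts the landmarks equal to it (assumed here to include itself).
    """
    ranks = []
    for value in landmarks:
        k = sum(1 for other in landmarks if other < value)
        u = sum(1 for other in landmarks if other == value)
        ranks.append(k + u / 2)
    return ranks


def mean_rank(landmarks):
    """Average rank, the overlined quantity in (3)."""
    r = ranks_eq3(landmarks)
    return sum(r) / len(r)


print(ranks_eq3([3, 1, 3]))     # tied values share one rank
print(ranks_eq3([10, 20, 30]))
print(mean_rank([10, 20, 30]))
```

Since only the ordering of the values enters the computation, applying any strictly increasing transformation to the landmarks leaves the ranks unchanged, which is the invariance property the measure relies on.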

This measure is powerful for regression curves which are mainly monotone. On the other hand, Cerioli et al [20] propose a dissimilarity measure *δ _{C}* extending the one proposed by Ingrassia et al [21]. Cerioli's dissimilarity is defined as

$$\begin{array}{c}d_{il}=\sum_{j=1}^{m_{i}}\frac{\left|\lambda_{j}^{\left(i\right)}-\lambda_{\ast j}^{\left(l\right)}\right|}{m_{i}},\\ \lambda_{\ast j}^{\left(l\right)}=\left\{\lambda_{j^{\prime}}^{\left(l\right)}:\left|\lambda_{j}^{\left(i\right)}-\lambda_{j^{\prime}}^{\left(l\right)}\right|=\min\right\},\quad i=1,\dots ,n,\\ \delta_{C}\left(i,l\right)=\frac{d_{il}+d_{li}}{2}.\end{array}\qquad \left(4\right)$$
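A direct reading of (4) is sketched below: each landmark of curve *i* is matched to the nearest landmark of curve *l*, the absolute gaps are averaged, and the two directed values are then symmetrized. The function names and the toy landmark sets are hypothetical.

```python
def directed_distance(landmarks_i, landmarks_l):
    """d_il in (4): mean gap between each landmark of curve i
    and its nearest landmark on curve l."""
    total = sum(min(abs(x - y) for y in landmarks_l) for x in landmarks_i)
    return total / len(landmarks_i)


def delta_c(landmarks_i, landmarks_l):
    """Symmetrized dissimilarity in (4): average of d_il and d_li."""
    return (directed_distance(landmarks_i, landmarks_l)
            + directed_distance(landmarks_l, landmarks_i)) / 2


a = [1.0, 5.0]        # toy landmark locations for curve i
b = [2.0, 5.0, 9.0]   # toy landmark locations for curve l
print(directed_distance(a, b), directed_distance(b, a), delta_c(a, b))
```

In this example the two directed values differ because the curves have different numbers of landmarks, which is why the final step averages them.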

Both dissimilarity measures show good performance for time-series
data. Dissimilarity *δ _{C}* does not involve all the indices

A flexible dissimilarity measure is one that combines the characteristics of both measures *δ _{HZ}* and *δ _{C}*.

In this sense, we propose a functional-based dissimilarity measure *δ _{B}* which uses the rank of the curve proposed by Heckman and Zamar and generalizes the dissimilarity measure of Cerioli et al as follows:

$$\begin{array}{c}d_{il}=\sum_{j=1}^{m_{i}}\frac{\left|r\left(\lambda_{j}^{\left(i\right)}\right)-r\left(\lambda_{\ast j}^{\left(l\right)}\right)\right|}{m_{i}},\\ r\left(\lambda_{\ast j}^{\left(l\right)}\right)=\frac{\sum_{h=1}^{m_{l}}\left|r\left(\lambda_{j}^{\left(i\right)}\right)-r\left(\lambda_{h}^{\left(l\right)}\right)\right|}{m_{l}},\\ r\left(\lambda_{j}^{\left(i\right)}\right)=k_{j}^{\left(i\right)}+\frac{u_{j}^{\left(i\right)}}{2},\quad k_{j}^{\left(i\right)}=\#\left\{s:\lambda_{s}^{\left(i\right)}<\lambda_{j}^{\left(i\right)}\right\},\\ u_{j}^{\left(i\right)}=\#\left\{s:\lambda_{s}^{\left(i\right)}=\lambda_{j}^{\left(i\right)}\right\},\quad \overline{r\left(\lambda^{\left(i\right)}\right)}=\frac{1}{m_{i}}\sum_{j=1}^{m_{i}}r\left(\lambda_{j}^{\left(i\right)}\right).\end{array}\qquad \left(5\right)$$

Obviously, *d _{ii}* = 0 and, in general, *d _{il}* ≠ *d _{li}*, so we define the symmetric dissimilarity

$${\delta}_{B}\left(i,l\right)=\left(\frac{{d}_{il}+{d}_{li}}{2}\right).\phantom{\rule{2em}{0ex}}\left(6\right)$$

We used three powerful hierarchical methods to derive clusters or patterns using *δ _{B}* and we compare the performance of

The spectral data were collected from proteomics analysis of a total number of serum samples (*n* = 68) including healthy or normal (*n _{1}* = 37), ATL (20), and HAM/TSP (11) patients.

To reduce the dimensionality of the spectral data, we applied FDA by fitting a P-spline curve ${\widehat{\mu}}_{i}\left(t\right)$ to each sample **y**_{i}. P-splines satisfy a penalized residual sum of squares criterion, where the penalty involves a derivative of specified order of *μ _{i}*(*t*).

The next step performed on the smoothed curves is to find the landmarks or indices *T _{i}*. We collected the first derivative of ${\widehat{\mu}}_{i}\left(t\right)$, say ${\widehat{\mu}}_{i}^{\prime}\left(t\right)$, using a smoothing P-spline function available in

The application of the functional data transformation halved the dimensionality of the spectra: the number of mass indices became 12,598. To cluster the reduced data, we calculated the three dissimilarity matrices *M _{δC}*, *M _{δHZ}*, and *M _{δB}*.

When we removed observation 11, we detected fewer fuzzy patterns with *δ _{C}* (Figure 4),

For *δ _{B}*, we provided the dendrogram of the data using the Diana approach (Figure 7). Three clusters were apparent: one well-separated cluster and two overlapping ones. For

To check the performance of our method, we calculated the confusion matrix between the predicted clusters and the clinical clusters using Diana (Table 1) and Clara (Table 2). We find that 3 patients out of 11 were misclassified for cluster 1 (HAM), 6 out of 20 were misclassified for cluster 2 (ATL), and 3 out of 37 were misclassified for cluster 3 (normal). HAM and ATL shared the majority of the misclassified observations, which makes sense since both groups gather patients with a disease caused by the same retrovirus. The error rate of misclassification for both clusters (HAM and ATL) is about 20%. For normal patients, the error rate of misclassification is about 8%. The total rate of misclassification is about 16%.

When we used the Clara-based clustering algorithm with *δ _{B}*, the classification result improved markedly (Figure 8). The error rate of misclassification is reduced to 7%. The error rate of misclassification between HAM and ATL is about 9%, and 5% of normal patients were misclassified. This result shows that a hierarchical

Cancer biomarkers can be used to screen asymptomatic individuals in the population, assist diagnosis in suspected cases, predict prognosis and response to specific treatments, and monitor patients after primary therapy. The introduction of new technologies to the proteome analysis field, such as mass spectrometry, has sparked new interest in cancer biomarkers, allowing for more effective diagnosis of cancer using complex proteomic patterns and for better classification of cancers based on molecular signatures. These technologies provide a wealth of information and rapidly generate large quantities of data.

Processing the large amounts of data will lead to useful predictive mathematical descriptions of biological systems, which will permit rapid identification of novel therapeutic targets and disease biomarkers.

Clustering and analyzing proteomics data has proven to be a challenging task.

Proteomics data are usually provided as curves or spectra with thousands of peaks. A clustering algorithm based on a matrix of *n* observations (*n* samples, usually small) and *p* peaks (*p* variables, usually large) will be unsuccessful. A matrix of size (*n* × *p*) with *n* much smaller than *p* has rank at most *n*, so the associated (*p* × *p*) cross-product matrix is singular; any method based on a matrix *M* (*n* × *p*) will not be robust enough and will induce errors. A clustering algorithm based on a well-chosen dissimilarity matrix (*n* × *n*) is more appropriate and more robust given the relatively moderate size of the matrix.
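The point about matrix sizes can be made concrete with a small sketch: whatever the number of peaks *p*, the pairwise dissimilarity matrix has only *n* × *n* entries. The helper names and the placeholder dissimilarity below (a mean absolute difference between curves of equal length) are assumptions chosen for illustration and do not correspond to any of the measures discussed in this article.

```python
def mean_abs_difference(curve_a, curve_b):
    """Placeholder dissimilarity (hypothetical): mean absolute gap
    between two curves sampled on the same grid."""
    return sum(abs(a - b) for a, b in zip(curve_a, curve_b)) / len(curve_a)


def dissimilarity_matrix(curves, dissim):
    """Build the symmetric n x n matrix of pairwise dissimilarities."""
    n = len(curves)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for l in range(i + 1, n):
            value = dissim(curves[i], curves[l])
            matrix[i][l] = value
            matrix[l][i] = value  # symmetry: only the upper triangle is computed
    return matrix


# n = 3 toy samples, p = 8 intensities each (hypothetical values).
curves = [
    [0.0, 1.0, 4.0, 1.0, 0.0, 2.0, 0.0, 0.0],
    [0.0, 1.0, 5.0, 1.0, 0.0, 2.0, 0.0, 0.0],
    [3.0, 0.0, 0.0, 0.0, 6.0, 0.0, 1.0, 0.0],
]
M = dissimilarity_matrix(curves, mean_abs_difference)
print(len(M), len(M[0]))
```

Any clustering method that accepts a precomputed dissimilarity matrix can then operate on `M` without ever handling the *p* columns directly.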

The use of a smoothing function for the spectra performs better for time series or for monotonic curves. We have previously successfully applied this smoothing function to large-scale proteomics data [25].

The application of Euclidean or Mahalanobis distances, for instance, may not perform well for this proteomics dataset, since those distances are usually successful on data with a specific structure, spherical or ellipsoidal (normally distributed data). A new dissimilarity measure has to involve other criteria, such as the wealth of data points for each observation and the parallel nature expressed by the proteomics curves (or time series). On the other hand, a robust dissimilarity measure may perform badly on a curve with many data points or peaks.

Functional smoothing of proteomics expression profiles or spectra has proven to be very helpful. It has allowed us to reduce the number of peaks, retaining only those that survived the FDA smoothing. In this study, after using FDA, we retained 50% of the peaks. The FDA with the dissimilarity measure *δ _{B}* shows better performance by comparison to *δ _{HZ}* and *δ _{C}*.

The two remaining difficulties that naturally arose are (1) to find meaningful peaks that can be used to provide better discrimination between the clusters, and (2) to propose the optimal number of clusters instead of choosing it a priori. Model selection criteria might be useful to answer those questions. In fact, model selection scores use two components for selecting the number of variables and the number of clusters in a given density-based cluster analysis. The first term is the lack of fit, generally proportional to the likelihood function. The second term is the penalty term (complexity term). For such a proteomics dataset, we propose to use the sum of the negative *δ _{B}* dissimilarity measure between all the observations and their closest medoids as a lack of fit function. The penalty term might be simple to derive but biased using AIC and BIC, for example, or it can be more difficult to derive if one uses a more robust method such as information complexity-based criteria.
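The lack of fit term suggested above can be sketched as follows, assuming a precomputed (*n* × *n*) dissimilarity matrix and a list of medoid indices. The function name, the toy matrix, and the choice to return the plain unsigned sum are assumptions; no penalty term is included because its form is not specified here.

```python
def lack_of_fit(dissim, medoids):
    """Sum, over all observations, of the dissimilarity to the closest medoid.

    `dissim` is an n x n matrix (list of lists); `medoids` holds row indices.
    The sign convention is left to the caller.
    """
    return sum(min(row[m] for m in medoids) for row in dissim)


# Toy 4 x 4 dissimilarity matrix with two obvious groups: {0, 1} and {2, 3}.
D = [
    [0, 1, 4, 5],
    [1, 0, 3, 4],
    [4, 3, 0, 1],
    [5, 4, 1, 0],
]
one_cluster = lack_of_fit(D, medoids=[0])
two_clusters = lack_of_fit(D, medoids=[0, 3])
print(one_cluster, two_clusters)
```

Adding medoids can never increase this quantity, which is why a complexity penalty is needed before it can be used to choose the number of clusters.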

This work was supported by the SRGP Award by the College of Business, University of Tennessee in Knoxville, by the Leukemia Lymphoma Society, and the National Institutes of Health.

1. Aebersold R, Mann M. Mass spectrometry-based proteomics. *Nature*. 2003;422(6928):198–207.

2. Steen H, Mann M. The ABC's (and XYZ's) of peptide sequencing. *Nat Rev Mol Cell Biol*. 2004;5(9):699–711.

3. Wright Jr G.L. SELDI proteinchip MS: a platform for biomarker discovery and cancer diagnosis. *Expert Rev Mol Diagn*. 2002;2(6):549–563.

4. Reddy G, Dalmasso E.A. SELDI protein chip(R) array technology: protein-based predictive medicine and drug discovery applications. *J Biomed Biotechnol*. 2003;2003(4):237–241.

5. Tang N, Tornatore P, Weinberger S.R. Current developments in SELDI affinity technology. *Mass Spectrom Rev*. 2004;23(1):34–44.

6. Espina V, Mehta A.I, Winters M.E, et al. Protein microarrays: molecular profiling technologies for clinical specimens. *Proteomics*. 2003;3(11):2091–2100.

7. Zhang H, Yan W, Aebersold R. Chemical probes and tandem mass spectrometry: a strategy for the quantitative analysis of proteomes and subproteomes. *Curr Opin Chem Biol*. 2004;8(1):66–75.

8. Vazquez A, Flammini A, Maritan A, Vespignani A. Global protein function prediction from protein-protein interaction networks. *Nat Biotechnol*. 2003;21(1):697–700.

9. Bensmail H, Haoudi A. Postgenomics: proteomics and bioinformatics in cancer research. *J Biomed Biotechnol*. 2003;2003(4):217–230.

10. Somorjai R.L, Dolenko B, Baumgartner R. Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions. *Bioinformatics*. 2003;19(12):1484–1491.

11. Schwartz S.A, Weil R.J, Johnson M.D, Toms S.A, Caprioli R.M. Protein profiling in brain tumors using mass spectrometry: feasibility of a new technique for the analysis of protein expression. *Clin Cancer Res*. 2004;10(3):981–987.

12. Ramsay J.O, Silverman B.W. *Functional Data Analysis*. New York, NY: Springer; 1997.

13. Ramsay J.O, Silverman B.W. *Applied Functional Data Analysis: Methods and Case Studies*. New York, NY: Springer; 2002.

14. Haoudi A, Semmes O.J. The HTLV-1 tax oncoprotein attenuates DNA damage induced G1 arrest and enhances apoptosis in p53 null cells. *Virology*. 2003;305(2):229–239.

15. Haoudi A, Daniels R.C, Wong E, Kupfer G, Semmes O.J. Human T-cell leukemia virus-I tax oncoprotein functionally targets a subnuclear complex involved in cellular DNA damage-response. *J Biol Chem*. 2003;278(39):37736–37744.

16. Adam B.L, Qu Y, Davis J.W, et al. Serum protein fingerprinting coupled with a pattern-matching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men. *Cancer Res*. 2002;62(13):3609–3614.

17. Piccolo D. A distance measure for classifying ARIMA models. *Journal of Time Series Analysis*. 1990;11:153–164.

18. Corduas M. La metrica autoregressiva tra modelli ARIMA: una procedura in linguaggio GAUSS. *Quaderni di statistica*. 2000;2:1–37.

19. Heckman N, Zamar R. Comparing the shapes of regression function. *Biometrika*. 2000;87(1):135–144.

20. Cerioli A, Laurini F, Corbellini A, editors. Functional cluster analysis of financial time series. In: Proceedings of the Meeting of Classification and Data Analysis Group of the Italian Statistical Society (CLADAG 2003); Bologna, Italy: CLUEB. 2003. pp. 107–110.

21. Ingrassia S, Cerioli A, Corbellini A. Some issues on clustering of functional data. In: Schader M, Gaul W, Vichi M, editors. *Between Data Science and Applied Data Analysis*. Berlin, Germany: Springer; 2003. pp. 49–56.

22. Kaufman L, Rousseeuw P.J. *Finding Groups in Data. An Introduction to Cluster Analysis*. New York, NY: John Wiley & Sons; 1990.

23. Hastie T.J, Tibshirani R.J. *Generalized Additive Models*. London, UK: Chapman & Hall; 1990.

24. Silverman B.W. Some aspects of the spline smoothing approach to nonparametric regression curve fitting. *J Roy Statist Soc B*. 1985;47:1–52.

25. Bensmail H, Semmens J, Haoudi A. Bayesian fast-Fourier transform based clustering method for proteomics data. *Journal of Bioinformatics*. In press.

Articles from Journal of Biomedicine and Biotechnology are provided here courtesy of **Hindawi**
