PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of bioinfoLink to Publisher's site
 
Bioinformatics. 2011 January 1; 27(1): 118–126.
Published online 2010 October 26. doi:  10.1093/bioinformatics/btq569
PMCID: PMC3008636

Optimized data fusion for K-means Laplacian clustering

Abstract

Motivation: We propose a novel algorithm to combine multiple kernels and Laplacians for clustering analysis. The new algorithm is formulated on a Rayleigh quotient objective function and is solved as a bi-level alternating minimization procedure. Using the proposed algorithm, the coefficients of kernels and Laplacians can be optimized automatically.

Results: Three variants of the algorithm are proposed. The performance is systematically validated on two real-life data fusion applications. The proposed Optimized Kernel Laplacian Clustering (OKLC) algorithms perform significantly better than other methods. Moreover, the coefficients of kernels and Laplacians optimized by OKLC show some correlation with the rank of performance of individual data source. Though in our evaluation the K values are predefined, in practical studies, the optimal cluster number can be consistently estimated from the eigenspectrum of the combined kernel Laplacian matrix.

Availability: The MATLAB code of algorithms implemented in this paper is downloadable from http://homes.esat.kuleuven.be/~sistawww/bioi/syu/oklc.html.

Contact: ude.ogacihcu@uyihs

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Clustering is a fundamental problem in unsupervised learning and a number of different algorithms and methods have emerged over the years. K-means (KM) and spectral clustering are two popular methods for clustering analysis. K-means is proposed to cluster attribute-based data into K numbers of clusters with the minimal distortion (Bishop, 2006; Duda et al., 2001). Another well-known method, spectral clustering (SC) (Ng et al., 2001; Shi and Malik, 2000), is also widely adopted in many applications. Unlike KM, SC is specifically developed for graphs, where the data samples are represented as vertices connected by non-negatively weighted undirected edges. The problem of clustering on graphs belongs to another paradigm than the algorithms based on the distortion measure. The goal of graph clustering is to find partitions on the graph such that the edges between different groups have a very low weight (von Luxburg, 2007). To model this, different objective functions are adopted and the typical criteria include the RatioCut (Hagen and Kahng, 1992), the normalized cut (Shi and Malik, 2000) and many others. To solve these objectives, the discrete constraint of the clustering indicators is usually relaxed to real values; thus, the approximated solution of spectral clustering can be obtained from the eigenspectrum of the graph Laplacian matrix. Many investigations (e.g. Dhillon et al., 2004) have shown the connection between KM and SC. Moreover, in practical applications, the weighted similarity matrix is often used interchangeably as the kernel matrix in KM or the adjacency matrix in SC.

Recently, a new algorithm, Kernel Laplacian (KL) clustering , is proposed to combine a kernel and a Laplacian simultaneously in clustering analysis (Wang et al., 2009). This method combines the objectives of KM and SC in a quotient trace maximization form and solves the problem by eigen-decomposition. KL is shown to empirically outperform KM and SC on real datasets. This straightforward idea is useful to solve many practical problems, especially those pertaining to combine attribute-based data with interaction-based networks. For example, in web analysis and scientometrics, the combination of text mining and bibliometrics has become a standard approach in clustering science or technology fields toward the detection of emerging fields or hot topics (Liu et al., 2010). In bioinformatics, protein–protein interaction network and expression data are two of the most important sources used to reveal the relevance of genes and proteins with complex diseases. Conventionally, the data are often transformed into similarity matrices or interaction graphs, then consequently clustered by KM or SC. In KL, the similarity-based kernel matrix and the interaction- based Laplacian matrix are combined, which provides a novel approach to combine heterogeneous data structures in clustering analysis.

Our preliminary experiments show that when using KL to combine a single kernel and a single Laplacian, its performance strongly depends on the quality of the kernel and the Laplacian, which results in a model selection problem to determine the optimal settings of the kernel and the Laplacian. To perform model selection on unlabeled data is non-trivial because it is difficult to evaluate the models. To tackle the new problem, we propose a novel algorithm to incorporate multiple kernels and Laplacians in KL clustering. Our recent work proposes a method to integrate multiple kernel matrices in kernel k-means clustering (Yu,S. et al. Optimized data fusion for kernel K-means clustering, submitted for publication). The main contribution of the present work lies in the additive combination of multiple kernels and Laplacians; moreover, the coefficients assigned to the kernels and the Laplacians are optimized automatically. This article presents the mathematical derivations of the additive integration form of kernels and Laplacians. The optimization of coefficients and clustering are achieved via a solution based on bi-level alternating minimization (Csiszar and Tusnady, 1984). We validate the proposed algorithm on heterogeneous datasets taken from two real applications, where the advantage and reliability of the proposed method are systematically compared and demonstrated.

2 METHODS

2.1 Combine kernel and Laplacian as generalized Ralyeigh quotient for clustering

We first briefly review the KL algorithm proposed by Wang et al. (2009). All the mathematical symbols used in the article are consistent and their representations are listed in Supplementary Material 1. Let us denote X as an attribute dataset and W as a graph affinity matrix, both of them are representations of the same sets of samples. The objective of the KL integration to combine X and W for clustering can be defined as

equation image
(1)

where JSC and JKM are, respectively, the objectives of SC and KM clustering, κ [set membership] [0, 1] is a coefficient adjusting the effect of the two objectives. Let us denote A [set membership] RN×K as the weighted scalar cluster membership matrix, given by

equation image
(2)

where nb is the number of data points belonging to cluster Cb and ATA = IK, where IK denotes a K × K identity matrix. Let us denote D as the diagonal matrix whose (a, a) entry is the sum of the entries of row a in the affinity matrix W. The normalized Laplacian matrix (von Luxburg, 2007) is given by

equation image
(3)

The objective of normalized cut-based SC is formulated as

equation image
(4)

As discussed in the literature (Bishop, 2006; Duda et al., 2001; Hastie et al., 2009), if the data X has zero sample means, the objective of the KM is given by

equation image
(5)

We further generalize (5) by applying the feature map ϕ(·):RF on X, then the centered data in F is denoted as XΦ, given by

equation image
(6)

where An external file that holds a picture, illustration, etc.
Object name is btq569i1.jpg is the feature map applied on the column vector of the i-th data point in F, An external file that holds a picture, illustration, etc.
Object name is btq569i2.jpg is the global mean in F (Girolami, 2002). The inner product XT X in (5) can be combined using the kernel trick An external file that holds a picture, illustration, etc.
Object name is btq569i3.jpg, where G(·, ·) is a Mercer kernel. We denote Gc as the centered kernel matrix as Gc = PGP, where P is the centering matrix An external file that holds a picture, illustration, etc.
Object name is btq569i4.jpg, G is the kernel matrix, IN is the N × N identity matrix, An external file that holds a picture, illustration, etc.
Object name is btq569i5.jpg is a column vector of N ones. Without loss of generality, the KM objective in (5) can be equivalently written as

equation image
(7)

Then the objective of KL integration becomes

equation image
(8)

To solve the optimization problem without tuning the ad hoc parameter κ, Wang et al. formulate it as a trace quotient of the two components (Wang et al., 2009). The trace quotient is then further relaxed as a maximization of quotient trace, given by

equation image
(9)

The problem in (9) is a generalized Rayleigh quotient and the optimal solution A* is obtained in the generalized eigenvalue problem. To maximize this objective, A* is approximated as the largest K eigenvectors of An external file that holds a picture, illustration, etc.
Object name is btq569i6.jpg, where An external file that holds a picture, illustration, etc.
Object name is btq569i7.jpg is the pseudo inverse of An external file that holds a picture, illustration, etc.
Object name is btq569i8.jpg (Wang et al., 2009).

2.2 Combine kernel and Laplacian as additive models for clustering

As discussed, the original KL algorithm is proposed to optimize the generalized Rayleigh quotient objective. In this article, we propose an alternative integration method using a different notation of Laplacian (von Luxburg, 2007), An external file that holds a picture, illustration, etc.
Object name is btq569i9.jpg, given by

equation image
(10)

where D and W are defined the same as in (3). The objective of spectral clustering is equivalent to maximizing the term as

equation image
(11)

Therefore, the objective of the KL integration can be rewritten in an additive form, given by

equation image
(12)

where A, Gc are defined the same as in (8), κ is the free parameter to adjust the effect of kernel and Laplacian in KL integration. If κ is pre−defined, (12) is a Rayleigh quotient problem and the optimal A* can be obtained from eigenvalue decomposition, known as the spectral relaxation (Ding and He, 2004). Therefore, to maximize this objective, we denote An external file that holds a picture, illustration, etc.
Object name is btq569i10.jpg thus A* is solved as the dominant K eigenvectors of Ω.

In Sections 2.1 and 2.2, two different methods are investigated to integrate a single Laplacian matrix with a single kernel matrix for clustering, where the main difference is to either optimize the cluster assignment affinity matrix A as a generalized Rayleigh quotient (ratio model) or as a Rayleigh quotient (additive model). The main advantage of the ratio-based solution is to avoid tuning the parameter κ. However, since the main contribution of this article is to optimize the combination of multiple kernels and Laplacians, the coefficients assigned on each kernel and Laplacian still need to be optimized. Moreover, the optimization of the additive integration model is computationally simpler than optimizing the ratio-based model. Therefore, in the following sections we will focus on extending the additive KL integration to multiple sources.

2.3 Clustering by multiple kernels and Laplacians: an additive model solved with bi-level optimization

Let us denote a set of graphs as Hi, i [set membership] {1,…, r}, all having N vertices, and a set of Laplacians An external file that holds a picture, illustration, etc.
Object name is btq569i11.jpg constructed from Hi as (10). Let us also denote a set of centered kernel matrices as Gcj, j [set membership] {1,…, s} with N samples. To extend (12) by incorporating multiple kernels and Laplacians for clustering, we propose a strategy to learn their optimal-weighted convex linear combinations. The extended objective function is then given by

equation image
(13)

where θ1,…, θr and θr+1,…, θr+s are, respectively, the optimal coefficients assigned to the Laplacians and the kernels. G and An external file that holds a picture, illustration, etc.
Object name is btq569i12.jpg are, respectively, the combined kernel matrix and the combined Laplacian matrix. The κ parameter in (12) is replaced by the coefficients assigned on each individual data sources.

To solve Q1, in the first phase we maximize JQ1 with respect to A, keeping An external file that holds a picture, illustration, etc.
Object name is btq569i13.jpg fixed (initialized by random guess). In the second phase, we maximize JQ1 with respect to An external file that holds a picture, illustration, etc.
Object name is btq569i14.jpg, keeping A fixed. The two phases optimize the same objective and repeat until convergence locally. When An external file that holds a picture, illustration, etc.
Object name is btq569i15.jpg is fixed, denoting An external file that holds a picture, illustration, etc.
Object name is btq569i16.jpg, Q1 is exactly a Rayleigh quotient problem and the optimal A* can be solved as a eigenvalue problem of Ω. When A is fixed, the problem reduces to the optimization of the coefficients θl with given cluster memberships. In Supplementary Material 2, we show that when the A is given, Q1 can be formulated as Kernel Fisher Discriminant (KFD) in the high-dimensional feature space F. We introduce An external file that holds a picture, illustration, etc.
Object name is btq569i17.jpg, a projection matrix determining the pairwise discriminating hyperplane. Since the discriminant analysis is invariant to the magnitude of An external file that holds a picture, illustration, etc.
Object name is btq569i18.jpg, we assume that WTW = IK, thus Q1 can be equivalently formulated as

equation image
(14)

The bi-level optimization to solve Q1 corresponds to two steps to solve Q2. In the first step (clustering), we set W = Ik and optimize A, which is exactly the additive kernel Laplacian integration as (12); in the second step (KFD), we fix A and optimize W and An external file that holds a picture, illustration, etc.
Object name is btq569i19.jpg. Therefore, the two components optimize toward the same objective as a Rayleigh quotient in F so the iterative optimization converges to a local optimum. Moreover, in the second step, we are not interested in the separating hyperplane defined in W, instead, we only need the optimal coefficients θl assigned on the Laplacians and the kernels. It is known that Fisher discriminant analysis is related to the least squares approach (Duda et al., 2001), and the KFD (Mika et al., 1999) is related to and can be solved as a least squares support vector machine (LS-SVM), proposed by (Suykens et al., 2002). The problem of optimizing multiple kernels for supervised learning (MKL) has been studied by Lanckriet et al. (2004) and Bach et al. (2004). In our recent work Yu et al. (2010b), we derive the MKL extension for LSSVM and propose some efficient solutions to solve the problem. In this article, the KFD problems are formulated as LSSVM MKL and solved by semi-infinite programming (SIP; Sonnenburg et al., 2006). The concrete solutions and algorithms are presented in Yu et al. (2010b).

An external file that holds a picture, illustration, etc.
Object name is btq569i25.jpg

2.3.1 Optimize A with given θ

When θ are given, the kernel-Laplacian combined matrix Ω is also fixed; therefore, the optimal A can be found as the dominant K number of eigenvectors of Ω.

2.3.2 Optimize θ with given A

When A is given, the optimal θ assigned on Laplacians can be solved via the following KFD problem

equation image
(15)

In our recent work, we have found that the δ parameter controls the sparseness of source coefficients θ1,…, θr (Yu et al., 2010b). The issue of sparseness in MKL is also addressed by Kloft et al. (2009). When δ is set to 1, the optimized solution is sparse, which assigns dominant values to only one or two Laplacians (kernels) and zero values to the others. The sparseness is useful to distinguish relevant sources from a large number of irrelevant data sources. However, in many applications, there are usually a small number of sources and most of these data sources are carefully selected and preprocessed. Thus, they often are directly relevant to the problem. In these cases, a sparse solution may be too selective to thoroughly combine the complementary information in the data sources. We may thus expect a non-sparse integration method which smoothly distributes the coefficients on multiple kernels and Laplacians and, at the same time, leverages their effects in the objective optimization. We have proved that when δ is set to 2, the KFD step in (15) optimizes the L2-norm of multiple kernels, which yields a non-sparse solution. If we set δ to 0, the cluster objective is simplified as to averagely combine multiple kernels and Laplacians. In this article, we set δ to three different vales (0, 1, 2) to, respectively, optimize the sparse, average and non-sparse coefficients on kernels and Laplacians.

When δ is set to 1, the KFD problem in Q3 is solved as LSSVM MKL (Yu et al., 2010b), given by

equation image
(16)

where An external file that holds a picture, illustration, etc.
Object name is btq569i20.jpg is the vector of dual variables, t is a dummy variable in optimization, a is the index of data samples, b is the cluster label index of the discriminating problem in KFD, Yb is the diagonal matrix representing the binary cluster assignment, the vector on the diagonal of Yb is equivalent to the b-th column of an affinity matrix Fab using {+1, −1} to discriminate the cluster assignments, given by

equation image
(17)

The problem presented in Q4 has an efficient solution based on SIP, which is presented in Equation forty-one of (Yu et al., 2010b). The optimal coefficients θi correspond to the dual variables bounded by the quadratic constraint An external file that holds a picture, illustration, etc.
Object name is btq569i21.jpg in (16). When δ is set to 2, the solution to Q3 is given by

equation image
(18)

where An external file that holds a picture, illustration, etc.
Object name is btq569i22.jpg, other variables are defined the same as (16). The problem Q5 also has an efficient solution presented in Equation forty-two in our recent work (Yu et al., 2010b). The main difference between Q4 and Q5 is that Q4 optimizes the L norm of multiple kernels, whereas Q5 optimizes the L2 norm. The optimal coefficients solved by Q4 are more likely to be sparse; in contrast, the ones obtained by Q5 are non-sparse. The algorithm to solve Q4 and Q5 is concretely explained in Algorithm 0.2 in Yu et al. (2010b).

Analogously, the coefficients assigned on kernels can also be obtained in the similar formulation, given by

equation image
(19)

where most of the variables are defined in the similar way as Q3 in (15). The main difference is that the Laplacian matrices An external file that holds a picture, illustration, etc.
Object name is btq569i23.jpg and An external file that holds a picture, illustration, etc.
Object name is btq569i24.jpg are replaced by the centered kernel matrices G and Gcj. The solution of Q6 is exactly the same as Q3, depending on the δ value, it can be solved either as Q4 or Q5.

2.3.3 Algorithm: optimized kernel Laplacian clustering

As discussed, the proposed algorithm optimizes A and θ iteratively to convergence. The coefficients assigned to the Laplacians and the kernels are optimized in parallel. Putting all the steps together, the pseudocode of the proposed optimized kernel Laplacian clustering (OKLC) is presented in Algorithm 2.1.

The iterations in Algorithm 2.1 terminate when the cluster membership matrix A stops changing. The tolerance value ϵ is a constant value as the stopping rule of OKLC, and in our implementation it is set to 0.05. In our implementation, the final cluster assignment is obtained using the KM algorithm on A(γ). In Algorithm 2.1, we consider the δ as predefined values. When δ is set to 1 or 2, the SIP-LSSVM-MKL function optimizes the coefficients as the formulation in (16) or (18), respectively. It is also possible to combine Laplacians and kernels in an average manner. In this article, we compare all these approaches and implement three different OKLC models. These three models are denoted as OKLC model 1, OKLC model 2 and OKLC model 3 which respectively correspond to the objective Q2 in (14) when δ = 1, average combination, δ = 2.

2.4 Datasets and experimental setup

The proposed OKLC models are validated in two real applications to combine heterogeneous datasets in clustering analysis. The datasets in the first experiment is taken from the work of multi-view text mining for disease gene identification (Yu et al., 2010a). The datasets contain nine different gene-by-term text profiles indexed by nine controlled vocabularies. The original disease relevant gene dataset contains 620 genes which are known to be relevant to 29 diseases. To avoid the effect of imbalanced clusters which may affect the evaluation, we only keep the diseases that have 11–40 relevant genes. This results in 14 genetic diseases and 278 genes. Because the present article is focused on non-overlapping (‘hard’) clustering, we further remove 16 genes which are relevant to multiple diseases. The remaining 262 disease-relevant genes are clustered into 14 clusters and evaluated biologically by their disease labels. For each vocabulary-based gene-by-term data source, we create a kernel matrix using the linear kernel function and the kernel normalization method proposed by (Shawe-Taylor and Cristianini, 2004), (Chapter 5). An element in the kernel matrix is then equivalent to the value of cosine similarity of two vectors (Baeza-Yates and Ribeiro-Neto, 1999). This kernel is then regarded as the weighted adjacency matrix to create the Laplacian matrix. In total, nine kernels and nine Laplacian matrices are combined in clustering.

The datasets in the second experiment are taken from Web of Science (WOS) database provided by Thomson Scientific (Liu et al., 2010). After preprocessing, the dataset contains 8305 journals categorized in 22 scientific fields. To create a balanced benchmark data for evaluation, we select seven fields consisting 1421 journals. The titles, abstracts and keywords of the journal publications are indexed by a text mining program using no controlled vocabulary. The weights of terms are calculated using four weighting schemes: TF-IDF, IDF, TF and binary. The citations among journals are also investigated from four different aspects: cross-citation, co-citation, bibliographic coupling and binary cross-citation. The lexical similarities are represented as normalized linear kernel matrices (using the same methods applied on the disease data) and the citation metrics are regarded as weighted adjacency matrices to create the Laplacians. Totally, four kernels and four Laplacians are combined on journal data. The details about the two datasets are presented in Supplementary Material 3.

The datasets used in our experiments are provided with labels; therefore, the clustering performance is evaluated as comparing the automatic partitions with the labels using adjusted rand index (ARI; Hubert and Arabie, 1985) and normalized mutual information (NMI; Strehl and Ghosh, 2002). To evaluate the ARI and NMI performance, we set K = 14 on disease data and K = 7 on journal data. We also tune the OKLC model using different K values.

3 RESULTS

We implement the proposed OKLC models to integrate multiple kernels and Laplacians on disease data and journal set data. To compare the performance, we also apply six popular ensemble clustering methods mentioned in relevant work (Yu et al., 2010a) to combine the partitions of individual kernels and Laplacians as a consolidated partition. These six methods are CSPA (Strehl and Ghosh, 2002), HGPA (Strehl and Ghosh, 2002), MCLA (Strehl and Ghosh, 2002), QMI (Topchy et al., 2005), EACAL (Fred and Jain, 2005) and AdacVote (Ayad and Kamel, 2008). As shown in Tables 1 and and2,2, the performance of OKLC algorithms is better than all the compared methods and the improvement is significant. On disease data, the best performance is obtained by OKLC model 1, which uses sparse coefficients to combine nine text mining kernels and nine Laplacians to identify disease-relevant clusters (ARI: 0.5859, NMI: 0.7451). On journal data, all three OKLC models perform comparably well. The best one seems coming from OKLC model 3 (ARI: 0.7336, NMI: 0.7758), which optimizes the non-sparse coefficients on the four kernels and four Laplacians.

Table 1.
Performance on disease dataset
Table 2.
Performance on journal dataset

To evaluate whether the combination of kernel and Laplacian indeed improve the clustering performance, we first systematically compared the performance of all the individual data sources using KM and SC. As shown in Supplementary Material 4, on disease data, the best KM performance (ARI 0.5441, NMI 0.7099) and SC (ARI 0.5199, NMI 0.6858) performance are obtained on LDDB text mining profile. Next, we enumerate all the paired combinations of a single kernel and a single Laplacian for clustering. The integration is based on Equation (12) and the κ value is set to 0.5 so the objectives of KM and SC are combined averagely. The performance of all 45 paired combinations is presented in Supplementary Material 5. As shown, the best KL clustering performance is obtained by integrating the LDDB kernel with KO Laplacian (ARI 0.5298, NMI 0.6949). Moreover, we also found that the integration performance varies significantly by the choice of kernel and Laplacian, which proves our previous point that the KL performance is highly dependent on the quality of kernel and Laplacian. Using the proposed OKLC algorithm, there is no need to enumerate all the possible paired combinations. OKLC combines all the kernels and Laplacians and optimizes their coefficients in parallel, yielding a comparable performance with the best paired combination of a single kernel and a single Laplacian.

In Figure 1, two confusion matrices of disease data for a single run are depicted. The values on the matrices are normalized according to Rij = Cj/Ti, where Ti is the total number of genes belonging in disease i and Cj is the number of these Ti genes that were clustered to belong to class j. First, it is worth noting that OKLC reduces the number of misclustered genes on breast cancer (Nr.1), cardiomyopathy (Nr.2) and muscular dystrophy (Nr.11). Among the misclustered genes in LDDB, five genes (TSG101, DBC1, CTTN, SLC22A18, AR) in breast cancer, two genes in cardiomyopathy (COX15, CSRP3) and two genes in muscular dystrophy (SEPN1, COL6A3) are correctly clustered in OKLC model 1. Second, there are several diseases where consistent misclustering occurs in both methods, such as diabetes (Nr.6) and neuropathy (Nr.12). The intuitive confusion matrices correspond to the numerical evaluation results; as shown, the quality of clustering obtained by OKLC model 1 (ARI = 0.5898, NMI = 0.7429) is higher than LDDB.

Fig. 1.
Confusion matrices of disease data obtained by kernel KM on LDDB (A) and OKLC model 1 integration (B). The numbers of cluster labels are consistent with the numbers of diseases presented in Supplementary Material 3. In each row of the confusion matrix, ...

The performance of individual data sources of journal data is shown in Supplementary Material 6. The best KM (ARI 0.6482, NMI 0.7104) is obtained on the IDF kernel and the best SC (ARI 0.5667, NMI 0.6807) is obtained on the cross-citation Laplacian. To combine the four kernels with four Laplacians, we evaluate all the 10 paired combinations and show the performance in Supplementary Material 7. The best performance is obtained by integrating the IDF kernel with the cross-citation Laplacian (ARI 0.7566, NMI 0.7702). As shown, the integration of lexical similarity information and citation-based Laplacian indeed improves the performance.

In Figure 2, the confusion matrices (also normalized) of journal data for a single run are illustrated. We compare the best individual data source (IDF with kernel KM, figure on the left) with the OKLC model 1. In the confusion matrix of IDF KM, 79 journals belonging to agriculture science (Nr.1) are misclustered to environment ecology (Nr.3), 9 journals are misclustered to pharmacology and toxicology (Nr.7). In OKLC, the number of agriculture journals misclustered to environment ecology is reduced to 45, and the number to pharmacology and toxicology is reduced to 5. On other journal clusters, the performance of the two models is almost equivalent.

Fig. 2.
Confusion matrices of journal data obtained by kernel KM on IDF (A) and OKLC model 1 integration (B). The numbers of cluster labels are consistent with the numbers of ESI journal categories presented in Supplementary Material 3. In each row, the diagonal ...

We also investigated the performance of combining only multiple kernels or multiple Laplacians. On the disease dataset, we combined the nine kernels and the nine Laplacians for clustering, respectively, using all the compared methods in Tables 1 and and2.2. On the journal dataset, we combine the four text mining kernels and the four citation Laplacians. The proposed OKLC method is simplified as only optimizing coefficients on Laplacians (step 2 in Algorithm 2.1) or kernels (step 3). As shown in Supplementary Material 8, the performance of OKLC is also comparable to the best performance obtained either by kernel combination or Laplacian combination. In particular, of all the methods we compared, the best performance is all obtained on OKLC models or its simplified forms.

It is interesting to observe that the average combination model (OKLC model 2) performs quite well on the journal dataset but not on the disease dataset. This is probably because most of the sources in journal dataset are relevant to the problem, whereas in disease dataset some data sources are noisy, and thus the integration of disease data sources is a non-trivial task. We expect that the other two OKLC models (models 1 and 3) optimize the coefficients assigned on the kernels and the Laplacians to leverage multiple sources in integration and, at the same time, to increase the robustness of the combined model on combining relevant and irrelevant data sources. To evaluate whether the optimized weights assigned on individual sources have correlation with the performance, we compare the rank of coefficients with the rank of performance from Tables 36. As shown, the largest coefficients correctly indicate the best individual data sources. It is worth noting that in multiple kernel learning, the rank of coefficients are only moderately correlated with the rank of individual performance. In our experiments, the MeSH kernel gets the second largest weights though its performance in evaluation is low. In MKL, it is usual that the best individual kernel found by cross-validation may not lead to a large weight when used in combination (Ye et al., 2008). Kernel fusion combines multiple sources at a refined granularity, where the ‘moderate’ kernels containing weak and insignificant information could complement to other kernels to compose a ‘good’ kernel containing strong and significant information. Though such complementary information cannot be incorporated when cross-validation is used to choose a single best kernel, these ‘moderate’ kernels are still useful when combined with other kernels (Ye et al., 2008). Based on the ranks presented in Tables 5 and and6,6, we calculate the Spearman correlations between the ranks of weights and the ranks of performance on both datasets. The correlations of disease kernels, disease Laplacians, journal kernels and journal Laplacians are, respectively, 0.5657, 0.6, 0.8 and 0.4. In some relevant work, the average Spearman correlations are mostly around 0.4 (Lanckriet et al., 2004; Ye et al., 2008). Therefore, the optimal weights obtained in our experiments are generally consistent with the rank of performance.

Table 4.
The average values of coefficients of kernels and Laplacians in journal data set optimized by OKLC model 1

Table 3.
The average values of coefficients of kernels and Laplacians in disease dataset optimized by OKLC model 1
Table 5.
The average values of coefficients of kernels and Laplacians in disease data set optimized by OKLC model 3
Table 6.
The average values of coefficients of kernels and Laplacians in journal dataset optimized by OKLC model 3

As a spectral clustering algorithm, the optimal cluster number of OKLC can be estimated by checking the plot of eigenvalues (von Luxburg, 2007). To demonstrate this, we investigated the dominant eigenvalues of the optimized combination of kernels and Laplacians. In Figure 3, we compare the difference of three OKLC models with the pre-defined K (set as equal to the number of class labels). In practical research, one can predict the optimal cluster number by checking the ‘elbow’ of the eigenvalue plot. As shown in Figure 3, the ‘elbow’ in disease data is quite obvious at the number of 14. In journal data, the ‘elbow’ is more likely to range from 6 to 12. All the three OKLC models show a similar trend on the eigenvalue plot. Moreover, in Supplementary Material 9 we also compare the eigenvalue curves using different K values as input. As shown, the eigenvalue plot is quite stable with respect to the different inputs of K, which means the optimized kernel and Laplacian coefficients are quite independent with the K value. This advantage enables a reliable prediction about the optimal cluster number by integrating multiple data sources.

Fig. 3.
The plot of eigenvalues (A and B) of the optimal kernel-Laplacian combination obtained by all OKLC models. The parameter K is set as equivalent as the reference label numbers.

To investigate the computational time, we benchmark OKLC algorithms with other clustering methods on the two datasets. As shown in Table 7, when optimizing the coefficients, OKLC algorithm (models 1 and 3) spends longer time than the other methods to optimize the coefficients on the Laplacians and the kernels. However, the proposed algorithm is still efficient. Considering the fact that the proposed algorithm yields much better performance and more enriched information (the ranking of the individual sources) than other methods, it is worth spending extra computational complexity on a promising algorithm.

Table 7.
Comparison of CPU time of all algorithms

4 CONCLUSION

In this article, we propose a new clustering approach, OKLC, to optimize the combination of multiple kernels and Laplacians in clustering analysis. The objective of OKLC is formulated as a Rayleigh quotient function and is solved iteratively as a bi-level optimization procedure. In the simplest interface, the proposed algorithm only requires one input parameter, the cluster number K, from the user. Moreover, depending on user's expectation to select the most relevant sources or to evenly combine all sources, the sparseness of coefficient vector θ can be controlled via the parameter δ. In our article, we propose three variants of the OKLC algorithm and validate them on two real applications. The performance of clustering is systematically compared with a variety of algorithms and different experimental settings. The proposed OKLC algorithms perform significantly better than other methods. Moreover, the coefficients of kernels and Laplacians optimized by OKLC show strong correlation with the rank of performance of individual data source. Though in our evaluation the K values are predefined, in practical studies, the optimal cluster number can be consistently estimated from the eigenspectrum of the combined kernel Laplacian matrix.

The proposed OKLC algorithm demonstrates the advantage of combining and leveraging information from heterogeneous data structures and sources. It is potentially useful in bioinformatics and many other application areas, where there is a surge of interest to integrate similarity-based information and interaction-based relationships in statistical analysis and machine learning.

Funding: The work was supported by (i) Research Council KUL: ProMeta, GOA Ambiorics, GOA MaNet, CoE EF/05/006, PFV/10/016 SymBioSys, START 1, Optimization in Engineering(OPTEC), IOF-SCORES4CHEM, several PhD/postdoc & fellow grants; (ii) FWO: G.0302.07(SVM/Kernel), G.0318.05 (subfunctionalization), G.0553.06 (VitamineD), research communities (ICCoS, ANMMM, MLDM); G.0733.09 (3UTR), G.082409 (EGFR); (iii) IWT: PhD Grants, Eureka-Flite+, Silicos; SBO-BioFrame, SBO-MoKa, SBO LeCoPro, SBO Climaqs, SBO POM, TBM-IOTA3, O&O-Dsquare; (iv) Belgian Federal Science Policy Office: IUAP P6/25 (BioMaGNet, Bioinformatics and Modeling: from Genomes to Networks, 2007–2011), IUAP P6/04 (DYSCO, Dynamical systems, control and optimization, 2007-2011); (v) FOD:Cancer plans; (vi) Centre for R&D Monitoring of the Flemish Government; (vii) EU-RTD: ERNSI: European Research Network on System Identification; FP7-HEALTH CHeartED; FP7-HD-MPC (INFSO-ICT-223854), COST intelliCIS, FP7-EMBOCON (ICT-248940).

Conflict of Interest: none declared.

Supplementary Material

Supplementary Data:

REFERENCES

  • Ayad HG, Kamel MS. Cumulative voting consensus method for partitions with a variable number of clusters. IEEE Trans. PAMI. 2008;30:160–173. [PubMed]
  • Bach FR, et al. 21st International Conference on Machine Learning. Banff, Alberta: ACM; 2004. Multiple kernel learning, conic duality, and the SMO algorithm; pp. 6–13.
  • Baeza-Yates R, Ribeiro-Neto B. Modern Information Retrieval. ACM press; 1999.
  • Bishop CM. Pattern Recognition and Machine Learning. New York, NY: Springer; 2006.
  • Csiszar I, Tusnady G. Information geometry and alternating minimization procedures. Stat. Decis. 1984;(Suppl. 1):205–237.
  • Dhillon LS, et al. Proceedings of the 10th ACM KDD. Seattle, WA: ACM; 2004. Kernel k-means, spectral clustering and normalized cuts; pp. 551–556.
  • Ding C, He X. 21st International Conference on Machine Learning. Banff, Alberta: ACM; 2004. K-means clustering via principal component analysis; pp. 225–232.
  • Duda RO, et al. Pattern Classification. 2. New York, NY: John Wiley & Sons Inc.; 2001.
  • Fred ALN, Jain AK. Combining multiple clusterings using evidence accumulation. IEEE Trans. PAMI. 2005;27:835–850. [PubMed]
  • Girolami M. Mercer kernel-based clustering in feature space. IEEE Trans. Neural Netw. 2002;13:780–784. [PubMed]
  • Hagen L, Kahng A. New spectral methods for ratio cut partitioning and clustering. IEEE Trans. Comput. Aided Des. 1992;11:1074–1085.
  • Hastie T, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2. Springer; 2009.
  • Hubert L, Arabie P. Comparing partition. J. Classific. 1985;2:193–218.
  • Kloft M, et al. Advances in Neural Information Processing System 22. MIT Press; 2009. Efficient and accurate Lp-norm multiple Kernel learning.
  • Lanckriet G, et al. Learning the kernel matrix with semidefinite programming. J. Machine Learning Res. 2004;5:27–72.
  • Liu X, et al. Weighted hybrid clustering by combining text mining and bibliometrics on large-scale journal database. J. Am. Soc. Inform. Sci. Technol. 2010;61:1105–1119.
  • Mika S, et al. Fisher discriminant analysis with kernels. IEE N.N. Singal. Process. 1999;9:41–48.
  • Ng AY. On spectral clustering: analysis and an algorithm. Advances in Neural Information Processing. 2001;14:849–856.
  • Shawe-Taylor J, Cristianini N. Kernel Methods for Pattern Analysis. Cambridge: Cambridge University Press; 2004.
  • Shi J, Malik J. Normalized cuts and image segmentation. IEEE Trans. PAMI. 2000;22:888–905.
  • Sonnenburg S, et al. Large scale multiple Kernel learning. J. Mach. Learn. Res. 2006;7:1531–1565.
  • Strehl A, Ghosh J. Cluster ensembles: a knowledge Reuse framework for combining multiple partitions. J. Mach. Learn. Res. 2002;3:583–617.
  • Suykens JAK, et al. Least Squares Support Vector Machines. Singapore: World Scientific Publishing; 2002.
  • Topchy A, et al. Clustering ensembles: models of consensus and weak partitions. IEEE Trans. PAMI. 2005;27:1866–1881. [PubMed]
  • von Luxburg U. A tutorial on spectral clustering. Stat. Comput. 2007;17:395–416.
  • Wang F, et al. Proccedings of SDM 09. SIAM Press; 2009. Integrated KL(K-means-Laplacian) clustering: a new clustering approach by combining attribute data and pairwise relations; pp. 38–48.
  • Ye J, et al. Proccedings of the 13th ACM KDD. San Jose, CA: ACM; 2007. Nonlinear adaptive distance metric learning for clustering; pp. 123–132.
  • Ye J, et al. Multi-class discriminant kernel learning via convex programming. J. Mach. Learn. Res. 2008;9:719–758.
  • Yu S, et al. Gene prioritization and clustering by multi-view text mining. BMC Bioinformatics. 2010a;11:1–28. [PMC free article] [PubMed]
  • Yu S, et al. L2-norm multiple kernel learning and its application to biomedical data fusion. BMC Bioinformatics. 2010b;11:1–53. [PMC free article] [PubMed]

Articles from Bioinformatics are provided here courtesy of Oxford University Press