Spectral clustering in SCPS
The goal of SCPS is to infer homology relations between protein sequences based on pairwise sequence information only. The input dataset thus consists of either a set of protein sequences or a list of pairwise similarity scores between some protein domains. The output is a partition of the sequences such that each sequence is assigned to one and only one of the partitions in a way that the partitions represent groups of homologs.
A typical SCPS workflow starts with either a FASTA file containing sequences for the protein domains of interest, or a list of BLAST E-values for all pairs of proteins where significant sequence similarity was reported by BLAST. Besides spectral clustering [4
], SCPS currently supports connected component analysis, hierarchical clustering and TribeMCL [7
], and more algorithms will be added in the near future. The spectral clustering approach reformulates the problem of protein homology detection into that of finding an optimal partition of a weighted undirected graph G
. Each vertex of the graph corresponds to a protein sequence. Vertices are connected by undirected, weighted edges, each edge denoting a similarity relation between the two proteins it connects. The weight (label) of the edge is related to the probability of evolutionary relatedness. Edges with large weight are more likely to appear between domains of the same superfamily, hence the problem of partitioning the graph into subsets of vertices with mostly heavy-weight edges is an equivalent formulation of the original protein sequence classification problem. Spectral clustering solves the problem of finding the optimal partition by examining random walks on the similarity graph [14
Our approach is based on the spectral clustering algorithm of [15
]. The general workflow is depicted on Figure . The basic steps of the algorithm are as follows:
Figure 1 The steps of the spectral clustering algorithm. The main steps of the spectral clustering algorithm. Cylinders represent possible input file types, boxes represent processing steps. The numbers in the boxes refer to the steps of the algorithm as described (more ...)
1. If the input file is a FASTA sequence file, we conduct an all-against-all matching using BLAST and store the E-values.
2. Given the pairwise BLAST E-values obtained either from the previous step or directly from the input file, we build an affinity matrix based on a non-linear transformation from E-values to similarity scores. The matrix element in row i and column j contains the E-value corresponding to protein j when protein i was used as a query sequence.
3. Since the BLAST E-value corresponding to a query protein i matching protein j in the database is not necessarily equal to the case when the query protein j matches protein i in the same database, the affinity matrix has to be symmetrized. To obtain a symmetric matrix, we take the higher similarity score (i.e. the smaller E-value) in case of ambiguity. Let sij denote the symmetrized similarity score between protein i and protein j. The sij values together constitute the symmetrized affinity matrix S, whose main diagonal contains only ones.
4. We conduct a preliminary connected component analysis on the graph represented by the affinity matrix S to identify small connected components containing less than five sequences. It is unlikely that these components should be subdivided further, therefore we remove the rows and columns corresponding to these sequences from S, obtaining a reduced matrix S'.
5. We construct a symmetric matrix L
, where D
is a diagonal matrix formed of the vertex degrees (
), and find the eigenvectors corresponding to the K
largest eigenvalues of L
. Let us denote these eigenvectors by u1
, ..., uK
6. We build a matrix U s.t. the kth column of U is uk and normalize the rows of the matrix such that each row in U has unit length.
7. Treating the rows of U
as points in the k
-dimensional Euclidean space K
, we conduct a k-means clustering of these points into K
clusters. The initial centroid positions are chosen from the data points themselves, placed as orthogonally to each other as possible.
8. We assign node i in the original graph to cluster j if and only if row i of Y was assigned to cluster k in the previous step. Small connected components obtained in step 4 are also merged back into the dataset in this final stage.
An important advantage of this method is that the number of clusters (K) can be selected automatically by evaluating the eigenvalues of S'. In our implementation, K is set to the smallest integer k such that λk/λk+1 > ε. ε is adjustable and it is chosen to be 1.02 by default. The main role of ε is to control the granularity of the clustering obtained: larger ε values tend to produce more fine-grained clusters, while a smaller ε yields only a few large clusters. We found that the default choice works well in a wide variety of biological problems (see the Results section). Another way to control the granularity of the clustering is to override K manually either before the clustering process or after the eigenvalue calculation. Both methods are facilitated by the SCPS user interface.
The clustering results are presented in a separate window (see Figure ) where the user can examine and draw the clusters one by one, calculate various quality measures (e.g., mass fraction [16
] and modularity [17
]), visualize the heatmap of the rearranged similarity matrix or export the results in plain text or XGMML format. XGMML files can later be loaded into Cytoscape to facilitate further visualisation and analysis. The heatmap of the rearranged similarity matrix can also be exported in publication quality to a PNG file. SCPS can also retrieve human-readable protein descriptions based on GI numbers from NCBI to aid the interpretation of the results. Clusters are drawn using the Fruchterman-Reingold layout algorithm [18
], a force-directed iterative layout algorithm where nodes are considered as tiny particles that repel each other, while edges represent springs that pull the endpoints of the edge closer. The strength of the attraction force is proportional to the similarity score used in our analyses, hence the obtained layout will tend to place highly similar pairs of proteins close to each other. Figure shows an example of a cluster drawing produced by SCPS.
Figure 2 The result viewer. The result viewer of the graphical user interface showing the heatmap of the rearranged similarity matrix based on the calculated clustering. The individual clusters and other quality measures can be displayed by clicking on the appropriate (more ...)
Visualisation of a cluster calculated by SCPS. Visualisation of a cluster as calculated and exported by SCPS.
Finally, SCPS includes a command line interface which runs the clustering without user intervention and writes the results to the standard output or to a specific output file. This enables the integration of SCPS in batch processing pipelines.
SCPS uses the ARPACK library [19
] for eigenvector calculations. The ARPACK library implements the implicitly restarted Arnoldi method for eigenvector calculations, which is an iterative process that is able to calculate all the eigenvectors and eigenvalues or only the top K
ones. When one can provide a reasonable upper estimate on the number of clusters, the Arnoldi method is much more efficient than standard methods that solve the eigenvector equation directly. On the other hand, the convergence of iterative methods is affected negatively in the presence of eigenvalues with multiplicity greater than one. The multiplicity of the top eigenvalue of the affinity matrix S
is equal to the number of connected components in the input graph. Therefore, we first eliminate small connected components of size less than five sequences from the original graph (they will not be subdivided further) and then connect the remaining components by a small amount of random edges with weight less than 0.01. This decreases the multiplicity of the top eigenvalue to one and thus improve the stability of the eigenvector calculation process without affecting the final result.
The number of clusters can be selected using one of the following methods in our implementation:
• Automatic. This method uses the eigengaps to select the appropriate number of clusters. K is set to the smallest integer k such that λk/λk+1 > ε. ε is adjustable and it is chosen to be 1.02 by default.
• Bounded from above. This method is similar to the automatic selection, but it considers at most a given number of clusters. It takes advantage of the fact that the complete eigenspectrum is not needed in this case when using the spectral clustering, saving time and resources during the computation. If the maximum number of clusters is Kmax, SCPS will compute only the top Kmax eigenvalues and the corresponding eigenvectors.
• Exactly. The user can select the desired number of clusters either before the analysis or after the calculation of the eigenvalues and eigengaps.
Transforming BLAST E-values to similarities
A crucial step in the application of spectral clustering methods in the context of protein sequences is the transformation from BLAST E-values to similarities. SCPS uses an approach based on the statistical analysis of E-values within and between SCOP superfamilies. A randomly selected set of 10,000 E-values chosen from sequences within the same superfamily and 10,000 E-values chosen from sequences in different superfamilies were used to train a logistic regression model that discriminates between intra-superfamily and inter-superfamily E-values. The posterior probability returned by the model on any E-value is then interpreted as the probability of evolutionary relatedness. In case of asymmetric E-values for a pair of proteins, the lower E-value (i.e., the higher probability) is used. The proteins used for training the logistic regression model were not used later in performance assessments of the algorithm.
This section describes the various quality measures we implemented in SCPS. In the following subsections, we will use the following notations:
• sij is the similarity value labelling the edge between vertex i and j in a graph G. sij = sji since we always symmetrize the initial similarity values.
• δij is 1 if vertices i and j are within the same cluster, zero otherwise.
We will also need the following definitions:
Definition 1 (Vertex weight) The weight of vertex i is the sum of the weight (similarity) of all its adjacent edges:di = Σjsij.
Definition 2 (Cluster weight) The weight of cluster i is the sum of the weight of all the edges that lie fully within cluster i (i.e., both their endpoints are in cluster i).
The mass fraction [16
] is an internal quality measure of a clustering on a given graph G
. Intuitively, a clustering is good if the total weight of its clusters is comparable to the total weight of the whole network; in other words, most of the heavy-weight edges are within clusters. The mass fraction simply denotes the fraction of edge weights that is concentrated inside the clusters.
Definition 3 (Mass fraction) The mass fraction of a clustering defined by dij is given as follows:
A disadvantage of this measure is that it attains its maximum when all the vertices are in the same cluster, hence the mass fraction alone cannot be used to decide whether a given clustering is better than another.
] is another internal quality measure of a clustering on a given graph G
. The idea is that it is not enough for a clustering to be good when it contains much of the edge weights within the clusters; the clustering is good when it contains more
weight within the clusters than what we would expect if we rearranged the edges of the graph randomly while keeping the vertex weights constant. Therefore, the difference between the actual cluster weight and the expected cluster weight after such rearrangement is a good indicator of the general quality of the clustering. This measure also avoids the problem with trivial clusterings: a cluster containing all the vertices will contain exactly the same weight before and after rewiring as all the edges will stay within the same cluster, so the modularity score will be zero. Similarly, a clustering where each vertex is in its own cluster will also yield zero modularity as there are no intra-cluster edges at all.
Formally, the modularity score of a clustering is the normalized difference between the actual weight of the clusters and the expected cluster weight after a random rewiring that preserves the vertex weights. It can be shown that the expected weight of the edge between vertex i
after rewiring is
, where m
is the sum of all edge weights in the graph (m
= Σi ≥ jsij
]. The modularity formula then follows easily:
Definition 4 (Modularity) Let δij be 1 if vertices i and j are in the same cluster and zero otherwise. The modularity of the partition defined by δ is then as follows:
Positive modularity then means that there is more weight concentrated within the clusters than what we would expect from a completely random graph with the same vertex weight distribution.
Heatmap of the rearranged similarity matrix
This quality measure is not a single numeric value, but it provides a visual cue to the goodness of a clustering result. The basic idea is that the initial similarity matrix can be plotted as a greyscale heatmap where each pixel corresponds to a single cell of the matrix and the intensity of the pixel is proportional to the weight that the corresponding cell in the matrix represents. The rows and columns of the similarity matrix can be arranged in arbitrary order, but by arranging them in a way that rows and columns corresponding to the same cluster are next to each other, one can uncover a block-diagonal structure in the matrix if the clustering is good.