Definition of the algorithmic problem
On the intuitive level, the algorithmic problem is as follows. First, divide a given multiple sequence alignment into subfamilies (also called sequence clusters) such that each subfamily has a characteristic conservation signature at a number of sequence positions. Then, optimize the information in the subfamily division to achieve a reasonable compromise between the number of proteins in a subfamily and the number of characteristic residue positions used to distinguish the subfamilies from each other (the larger the number of proteins per subfamily, the smaller the number of characteristic residue positions, and vice versa; the two extremes of 'one sequence per subfamily' and 'all sequences in a single subfamily' are uninformative).
To solve this problem, one must introduce a measure to compare different distributions of sequences into subfamilies. The simplest measure is additive over the columns of the alignment. This means that the distribution of residues in an alignment column within a subfamily is treated independently (all possible permutations of residues in a column within a subfamily are equivalent). The total number of permutations in a column i of a subfamily k is given by a simple combinatorial formula [29]:

\[ P_{i,k} = \frac{N_k!}{\prod_{\alpha=1}^{21} N_{\alpha,i,k}!} \]

Here N_k is the number of sequences in subfamily k, and N_{α,i,k} is the number of residues of type α in column i of subfamily k. (Gaps are taken into account as a separate residue type; α = 21 corresponds to a gap.) The numerator is the total number of permutations of N_k symbols, and the product in the denominator divides out the number of indistinguishable permutations for each residue type α.
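The counting formula above can be checked directly in a few lines (a minimal sketch; Python's integer arithmetic keeps the factorials exact):

```python
from math import factorial
from collections import Counter

def column_permutations(column):
    """Number of distinguishable rearrangements of one alignment column of one
    subfamily: Nk! divided by the product of N_alpha! over residue types alpha
    (gaps '-' count as their own residue type)."""
    p = factorial(len(column))                 # Nk! rearrangements of all symbols
    for n_alpha in Counter(column).values():
        p //= factorial(n_alpha)               # divide out indistinguishable ones
    return p

print(column_permutations("AAAA"))   # 1: a conserved column has one arrangement
print(column_permutations("AABC"))   # 12 = 4! / (2! * 1! * 1!)
```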
We then use the statistical or combinatorial entropy [29]:

\[ S_i = \sum_{k} \ln \frac{N_k!}{\prod_{\alpha=1}^{21} N_{\alpha,i,k}!} \qquad (3) \]

which is an additive measure (both over alignment columns and over subfamilies) for comparing different distributions of residues. The statistical entropy depends on subfamily size: the entropy of the union of two subfamilies is always greater than or equal to the sum of the entropies of the individual subfamilies. The entropy equals zero when all sequences are separated into subfamilies of a single sequence each (maximal fragmentation) and is maximal when all sequences are united in one family (maximal unification). This dependence of the statistical entropy on subfamily sizes allows one to formulate an optimization problem, namely: find the distribution of sequences into subfamilies that is maximally different from a random distribution of sequences. Subfamilies of sequences with many conserved residue patterns (which change across subfamilies) contribute the most to the optimal solution.
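A small numerical illustration of the two properties just stated (zero entropy for conserved columns, superadditivity under union), using the single-subfamily column entropy S = ln(N_k!/∏_α N_α!):

```python
from math import lgamma
from collections import Counter

def column_entropy(column):
    """Combinatorial entropy S = ln(Nk! / prod N_alpha!) of one column of one
    subfamily (the log of the permutation count)."""
    s = lgamma(len(column) + 1)                # ln(Nk!)
    for n_alpha in Counter(column).values():
        s -= lgamma(n_alpha + 1)               # minus ln(N_alpha!)
    return s

a, b = "AAAA", "CCCC"
# Maximal fragmentation into conserved columns gives zero entropy ...
print(column_entropy(a) + column_entropy(b))   # 0.0
# ... while their union has strictly larger entropy: ln(8!/(4!4!)) ≈ 4.25
print(column_entropy(a + b))
```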
We define specificity residues (also called characteristic or key residues) as residues that are conserved in a subfamily but differ between subfamilies. Thus, one is challenged to determine simultaneously the best division of the set of sequences into subfamilies and the subset of residues that best discriminates between these subfamilies. 'Best' is defined in terms of a contrast function that aims to measure the degree to which the specificity residues are distinctly different in each subfamily. The value of the contrast function is minimal for the best solution, with the result reported as a set of specificity residues and corresponding sequence subfamilies. The sections below describe the contrast function, the meaning of 'best', the optimization algorithm, and a criterion for selecting the top-ranked specificity residues.
Definition of the contrast function in terms of combinatorial entropy
Suppose a multiple alignment is divided into subfamilies or clusters of sequences. For each column i (i = 1, ..., L) of the alignment, one can compute the combinatorial entropy S_i, as defined by Equation 3 (above). At one extreme, the column-specific S_i is zero if residues of one type populate this column in each of the clusters, no matter whether this residue type is the same in all clusters or differs between clusters (for example, see the specificity residue columns in the corresponding figure). So S_i = 0 for completely conserved residues or perfect specificity residues in column i. At the other extreme, for uniformly distributed residues, S_i has a maximal value given by the background entropy

\[ \bar{S}_i = \sum_{k} \ln \frac{N_k!}{\prod_{\alpha=1}^{21} \bar{N}_{\alpha,i,k}!} \]

where \bar{N}_{\alpha,i,k} is the expected number of residues of type α in column i of subfamily k, provided that all the residues in the column are uniformly mixed (across cluster boundaries), namely

\[ \bar{N}_{\alpha,i,k} = N_k \, \frac{n_{\alpha,i}}{N} \]

where n_{α,i} is the number of residues of type α in column i and N is the total number of sequences (rows) in the alignment. (Because the \bar{N}_{\alpha,i,k} can be noninteger numbers, X! is computed using the relation X! = Γ(X + 1) [30].)
As the numerical measure of order over disorder, we use the entropy difference ΔS_i between the observed and the uniformly mixed distribution, summed over all L columns of the alignment:

\[ \Delta S^{0} = \sum_{i=1}^{L} \Delta S_i = \sum_{i=1}^{L} \left( S_i - \bar{S}_i \right) \qquad (6) \]

This is the contrast function to be minimized in the process of finding the best decomposition into subfamilies. (Because ΔS0 is a negative number, minimizing it means that the absolute value of ΔS0 is maximized.)
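As a sketch of the per-column contrast, the observed and uniformly mixed entropies can be compared directly; `lgamma` handles the noninteger expected counts via X! = Γ(X + 1):

```python
from math import lgamma
from collections import Counter

def ln_fact(x):
    """x! via the Gamma function, X! = Gamma(X + 1); valid for noninteger x."""
    return lgamma(x + 1)

def col_entropy(counts, size):
    """Combinatorial entropy ln(size! / prod(counts!)) of one column of one cluster."""
    return ln_fact(size) - sum(ln_fact(c) for c in counts)

def delta_S_column(subfamilies):
    """Per-column contrast ΔSi = Si - S̄i for one alignment column split into
    subfamilies (each subfamily given as a string of residues, '-' for gaps)."""
    n_total = sum(len(s) for s in subfamilies)
    pooled = Counter("".join(subfamilies))      # n_alpha,i over the whole column
    s_obs = sum(col_entropy(Counter(s).values(), len(s)) for s in subfamilies)
    s_mix = sum(col_entropy([len(s) * pooled[a] / n_total for a in pooled], len(s))
                for s in subfamilies)
    return s_obs - s_mix

# A perfect specificity column (conserved within, different between clusters)
# gives a strongly negative ΔSi; a nondiscriminating column gives ΔSi ≈ 0.
print(delta_S_column(["AAAA", "CCCC"]))   # ≈ -3.58
print(delta_S_column(["ACAC", "CACA"]))   # ≈ 0.0
```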
The optimization algorithm
A straightforward solution to the optimization problem would be to enumerate all possible partitionings of the set of sequences into subfamilies, calculate the combinatorial entropy difference (the contrast function) as in Equation 6, and then choose the partitioning with the lowest value of ΔS0. The only problem with this approach is that the number of partitionings of N sequences into K clusters is astronomically large for all but very small values of N. One therefore needs an effective strategy for exploring a reasonable subset of partitionings with the aim of finding one with a value of the contrast function close to the global optimum. Often such complex value landscapes are explored using stochastic algorithms, which could be used in future implementations. In this report we use a simple deterministic hierarchical clustering method [19], with each clustering step guided by evaluation of a guide function (Equation 7) for all alternative choices in that step.
Starting from N clusters, each containing one sequence, in each clustering step all pairs of clusters are considered as merger candidates. The pair of clusters with the lowest value of the guide function is merged into one cluster. The merger steps are repeated until all sequences are in one cluster. At this stage the result is a complete trajectory of merger steps, which can be represented as a tree (not shown), and the task is to choose the best partitioning (tree level). The best partitioning is defined as the one with the minimal value of ΔS0, that is, the maximal absolute value of the combinatorial entropy difference between the actual and the uniformly mixed ('random') distribution of residue types (Equation 6). The complexity of the hierarchical clustering algorithm is O(N² ln N), where N is the number of sequences in the multiple alignment [31].
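The merger trajectory described above can be sketched as a generic greedy loop; the `guide` callable here is a hypothetical stand-in for the guide function of Equation 7:

```python
def cluster_trajectory(sequences, guide):
    """Greedy agglomerative clustering: start from singleton clusters, repeatedly
    merge the pair with the lowest guide-function value, and record every level.
    `guide` is any callable scoring a candidate merged cluster (a hypothetical
    stand-in for the guide function of Equation 7)."""
    clusters = [[s] for s in sequences]
    trajectory = [[tuple(c) for c in clusters]]          # level 0: N singletons
    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: guide(clusters[ij[0]] + clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        trajectory.append([tuple(c) for c in clusters])
    return trajectory   # one partitioning per tree level; pick the best by ΔS0

# Toy guide: prefer merging clusters whose sequences are already identical.
traj = cluster_trajectory(["AA", "AA", "CC"], lambda c: len(set(c)))
print(len(traj))   # 3 levels for 3 sequences
```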
To explore different partitionings of sequences into subfamilies, the guide function includes a penalty term [32]. The penalty term affects the clustering trajectory by favoring mergers that result in smaller clusters over those that result in larger clusters. To explore a larger space of alternative partitionings, we perform hierarchical clustering for different relative weights of the penalty term.
The guide function used to evaluate a particular clustering step (a potential merger of clusters k and m) is defined as follows:

\[ \tilde{S}_{k,m} = A \,\Delta S_{k,m} + (1 - A)\,\Delta S'_{k,m} \qquad (7) \]

The first term, ΔS_{k,m}, is the entropy difference computed for the new cluster resulting from the merger of clusters k and m,

\[ \Delta S_{k,m} = \frac{1}{L} \sum_{i=1}^{L} \left( S_i^{(k \cup m)} - \bar{S}_i^{(k \cup m)} \right) \]

averaged over all L columns of the alignment. The second term, ΔS'_{k,m}, the penalty term, makes reference to the combinatorial entropy of an ideal system of the same size:

\[ \Delta S'_{k,m} = \ln \frac{(N_k + N_m)!}{N_k! \, N_m!} \]

where N_k and N_m are the numbers of sequences in the corresponding clusters k and m. ΔS'_{k,m} is the maximal possible value of the combinatorial entropy (per column) after merging clusters of size N_k and N_m. This second term simply captures the mere size contribution to the entropy and counteracts the tendency toward trajectories with early emergence of dominant large clusters. This tendency is due to the fact that the entropy of a larger system is always greater than the sum of the entropy values of its subsystems. Whatever the trajectories explored and whatever the devices used to guide the exploration of trajectory space, the evaluation of the best partitioning is based exclusively on the combinatorial entropy difference of Equation 6.
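Assuming the guide function combines the two terms described above with weights A and (1 − A), and taking ΔS'_{k,m} as the mixing entropy ln[(N_k + N_m)!/(N_k! N_m!)] of two fully distinct clusters (both are reconstructions, not verbatim from the original), a sketch:

```python
from math import lgamma

def ln_fact(x):
    return lgamma(x + 1)        # X! via Gamma(X + 1)

def max_mixing_entropy(n_k, n_m):
    """ΔS'_{k,m}: maximal per-column combinatorial entropy of a merged cluster,
    reached when the two clusters share no residue types:
    ln((Nk + Nm)! / (Nk! * Nm!))."""
    return ln_fact(n_k + n_m) - ln_fact(n_k) - ln_fact(n_m)

def guide(delta_s_km, n_k, n_m, a):
    """Guide function sketch: entropy term weighted by the granularity parameter
    A plus the size-penalty term weighted by (1 - A); lower is better."""
    return a * delta_s_km + (1 - a) * max_mixing_entropy(n_k, n_m)

# With A near 0 the size penalty dominates, so the small merger scores lower
# (better) than the large one even for the same entropy difference:
print(guide(-5.0, 2, 2, 0.1) < guide(-5.0, 50, 50, 0.1))   # True
# With A = 1 the penalty vanishes and only the entropy difference matters:
print(guide(-5.0, 2, 2, 1.0) == guide(-5.0, 50, 50, 1.0))  # True
```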
Extreme values of the granularity parameter A in Equation 7 lead to radically different trajectories in clustering space. When A is approximately 1, the main contribution to the guide function of Equation 7 comes from the entropy difference due to sequence assignment to subfamilies; when A is approximately 0, clustering is driven by cluster size and mergers that produce smaller clusters are favored. Changing the granularity parameter A in the guide function over a reasonable range of values and repeating hierarchical clustering explores sufficiently diverse partitionings to reach an optimum (Figure 6).
Figure 6 Value landscape of the contrast function for a large protein family illustrating the optimization process. The algorithm searches for the minimal value of the contrast function (a combinatorial entropy difference [Equation 6]) by systematic exploration.
Note that although the guide function determines the details of each clustering step, the final optimum is chosen as the minimum of the combinatorial entropy difference (Equation 6) in the two-dimensional space spanned by the clustering step l and the penalty weight (1 - A). Typical optimal values of A in tests on diverse protein families range between 0.6 and 0.9.
Evidence for selective pressure and selection of specificity residues
Selective pressure in evolution results in patterns of conservation across all subfamilies (globally conserved residues) or in particular subfamilies (specificity residues). Examples of conserved residues are active-site residues in enzyme families; examples of specificity residues are residues lining active sites configured to bind a particular substrate optimally. The per-column combinatorial entropy difference ΔSi (Equation 6) is largest in magnitude for alignment columns with specificity residues (by definition; see above), but close to zero for 'nonspecific' columns that do not discriminate between subfamilies. Such 'nonspecific' columns contain globally conserved residues or diverse nonspecific residue distributions. All other residue columns have intermediate values of ΔSi. Thus, if we sort residue columns by their entropy difference ΔSi and plot the resulting distribution (Figure 7), we can typically identify two regions of particularly low and particularly high entropy difference. For typical alignments, one can visually identify these characteristic extreme regions as deviations from the linear central region of the plot.
Figure 7 Definition of specificity residues based on entropy values. Combinatorial entropy difference as a function of residue position (in rank order) for the actual (solid line) and randomized (dashed line) multiple alignment of 390 protein kinase sequences.
We compared entropy plots for the original alignment with the entropy plot for a randomized alignment (for details, see Figure 7). The differences between the original and the randomized entropy plots are drastic: there are no downturn and upturn regions in the entropy plots for randomized alignments, and the absolute values of the entropy differences produced for the randomized alignments are several times smaller than those of the original alignments.
To distinguish between globally conserved columns and other 'nonspecific' columns, we compute the combinatorial entropy for each alignment column:

\[ s_i = \frac{1}{N} \ln \frac{N!}{\prod_{\alpha=1}^{21} n_{\alpha,i}!} \qquad (10) \]

where s_i is the average entropy per residue for the residue distribution in alignment column i, and

\[ f_{\alpha,i} = \frac{n_{\alpha,i}}{N} \qquad (11) \]

is the fraction of residues of type α in column i (α = 21 for gaps). We require s_i < 0.03 and f_{21,i} < 0.5 for globally conserved columns; mathematical details related to Equations 10 and 11 are provided in Additional data file 3.
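The conserved-column test can be sketched as follows (the 0.03 and 0.5 thresholds are quoted in the text; the per-residue entropy here is a reconstruction of Equations 10 and 11, not the verbatim published formulas):

```python
from math import lgamma

def column_stats(column):
    """Per-residue combinatorial entropy s_i and gap fraction f_21,i for one
    alignment column (gaps written as '-')."""
    n = len(column)
    counts = {}
    for r in column:
        counts[r] = counts.get(r, 0) + 1
    # s_i = (1/N) ln(N! / prod n_alpha!), computed via lgamma
    s_i = (lgamma(n + 1) - sum(lgamma(c + 1) for c in counts.values())) / n
    f_gap = counts.get("-", 0) / n              # fraction of the gap type
    return s_i, f_gap

def is_globally_conserved(column):
    s_i, f_gap = column_stats(column)
    return s_i < 0.03 and f_gap < 0.5           # thresholds from the text

print(is_globally_conserved("A" * 10))    # True: fully conserved, no gaps
print(is_globally_conserved("ACDE" * 5))  # False: diverse column
```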
Test application: prediction of contact residues and evaluation of accuracy
Specificity residues - and, of course, globally conserved residues - reflect functional constraints that operate in evolution. They are an informational fossil record, most clearly visible over large evolutionary intervals during which the background distribution may vary considerably. The constraints can be of diverse origin, but it is plausible that all constraints can be traced to the requirements of intermolecular interactions that are important for survival. Therefore, prediction of specificity residues has broad applicability for the identification of functional interactions and, as a consequence, for ranking genetic variation, for planning mutation experiments, or for the molecular design of specificity.
Here, we test one particular application of the identification of specificity residues from multiple sequence alignments: the prediction of intermolecular interfaces. We use known three-dimensional structures of protein and DNA complexes from the Protein Data Bank (PDB) as defining experimental reality against which predictions are compared. A key limitation is that there may be several such interfaces in a given protein family and that the complexes in the PDB contain only a subset of these. Nonetheless, it is instructive to see the extent to which specificity residues, interpreted as predicted interface residues, overlap with known intermolecular interfaces. A large overlap indicates good prediction accuracy, but over-prediction (false positives) is expected.
To assess whether an observed overlap between specificity residues and intermolecular interface residues is statistically significant, we estimate the expected size of the overlap in a random model in which specificity residues are scattered randomly in the protein and may or may not end up in the known interface by chance. Suppose that the total number of protein residues is N, the number of known interface residues is L, the number of specificity residues is S, and the number of specificity residues in the interface is A. If the specificity residues are randomly distributed, what is the probability of observing A or more of the S specificity residues in the interface? For reasons of permutational degeneracy, one must count the indistinguishable assignments of the N residues to four sets: the a residues that are both specificity and interface residues, the remaining S - a specificity residues, the remaining L - a interface residues, and the N - S - L + a residues that belong to neither set:

\[ W(a) = \frac{N!}{a!\,(S - a)!\,(L - a)!\,(N - S - L + a)!} \]

Then, the probability of observing at random A or more of the S specificity residues among the L interface residues is given by the following ratio:

\[ P(\geq A) = \frac{\sum_{a=A}^{\min(S,L)} W(a)}{\sum_{a=0}^{\min(S,L)} W(a)} \qquad (12) \]

where the numerator counts all possible assignments for which the sets of size S and L have A or more residues in common, and the denominator counts all possible assignments, up to complete overlap of the two sets. To correct for the Nc globally conserved residues, which by definition are excluded from being identified as specificity residues, we use N - Nc in place of N in Equation 12.
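This ratio is equivalent to a hypergeometric tail probability, so the significance estimate can be computed exactly with binomial coefficients (a minimal sketch):

```python
from math import comb

def overlap_pvalue(n, l, s, a):
    """Probability that s randomly placed specificity residues share at least a
    positions with a fixed interface of l residues among n total: the
    hypergeometric tail, equivalent to the ratio of Equation 12."""
    total = comb(n, s)                        # all placements of the s residues
    tail = sum(comb(l, x) * comb(n - l, s - x)
               for x in range(a, min(l, s) + 1))
    return tail / total

# Hypothetical example: 10 of 12 predicted residues fall in a 30-residue
# interface of a 300-residue protein; the overlap is far beyond chance.
print(overlap_pvalue(300, 30, 12, 10))
```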
Choice of multiple sequence alignments
The multiple sequence alignments are the only source of information used in the predictions. Predictions are best for accurate, nonredundant alignments of diverse sequences without significant gap regions. In the interface prediction tests, we used alignments from the 'Superfamily' [33
] and PFAM [34
] collections, as well as the Homology-Derived Secondary Structure of Proteins database [35
] and curated alignments of human protein kinases [36
] from the Protein Kinase Resource [37
]. As needed, the original alignments were prepared for specificity analysis by trimming deletions and insertions across the whole alignment so as to preserve the continuity of the main sequence (the sequence of a given protein); removing redundant sequences (typically at the level of about 95% identical residues for large alignments) using the MView program [38
]; and removing sequences with many gaps (for example, with more than about 10% to 20% gaps compared with the main sequence). Finally, the total number of sequences in the alignment must be large (>100).
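The preparation steps can be sketched roughly as follows; the paper uses the MView program for redundancy removal, so the simple identity filter here is only an illustrative stand-in, with the thresholds taken from the text:

```python
def identity(a, b):
    """Fraction of identical positions between two aligned sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def prefilter(alignment, main_index=0, max_gap_frac=0.2, max_identity=0.95):
    """Sketch of the alignment preparation described above: trim columns that
    are gaps in the main sequence, then drop gappy and near-redundant rows."""
    main = alignment[main_index]
    keep = [i for i, r in enumerate(main) if r != "-"]   # main-sequence continuity
    trimmed = ["".join(seq[i] for i in keep) for seq in alignment]
    kept = []
    for seq in trimmed:
        if seq.count("-") / len(seq) > max_gap_frac:
            continue                                     # too gappy vs main sequence
        if any(identity(seq, other) > max_identity for other in kept):
            continue                                     # near-duplicate
        kept.append(seq)
    return kept

aln = ["ACDEFG", "ACDEFG", "AC--FG", "A-----"]
print(prefilter(aln))    # ['ACDEFG']: duplicate and gappy rows removed
```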