Many protein families contain sub-families with functional specialization, such as binding different ligands or being involved in different protein–protein interactions. A small number of amino acids generally determine functional specificity. The identification of these residues can aid the understanding of protein function and help in finding targets for experimental analysis. Here, we present multi-Harmony, an interactive web server for detecting sub-type-specific sites in proteins starting from a multiple sequence alignment. Combining our Sequence Harmony (SH) and multi-Relief (mR) methods in one web server allows simultaneous analysis and comparison of specificity residues; furthermore, both methods have been significantly improved and extended. SH has been extended to cope with more than two sub-groups. mR has been changed from a sampling implementation to a deterministic one, making it more consistent and user friendly. For both methods Z-scores are reported. The multi-Harmony web server produces a dynamic output page, which includes interactive connections to the Jalview and Jmol applets, thereby allowing interactive analysis of the results. Multi-Harmony is available at http://www.ibi.vu.nl/programs/shmrwww.
Many protein families contain sub-families that exhibit functional specialization, often involving differences in ligand binding or protein–protein interactions (1). Consequently, an increasing number of methods and/or web applications has become available, which offer functional analyses of specificity-determining residues within protein families (2–10). These methods often require a multiple sequence alignment (MSA) with pre-determined groups or a phylogenetic tree as input. SDPpred (3) uses mutual information to identify positions that ‘are well conserved within specificity groups but differ between these groups’. PROUST-II (11) is a method based on cumulative relative entropy of the differences between hidden Markov profiles of user-defined sub-families.
Other methods only require the MSA and automatically group the sequences into sub-groups using, for example, Between Group Analysis (6) or phylogeny (2,12). Xdet (13) uses a classification derived from the alignment, and is based on mutual behaviour analysis of ‘tree-determinant’ residues. It can also be run in supervised mode by supplying an external (functional) classification. ProteinKeys (14) implements combinatorial entropy optimization to identify both specificity-determining residues and sub-families. A more recent method by Georgi et al. (10) requires only sequences and carries out sub-group discovery with simultaneous identification of functional residues.
Identifying specificity-determining residues in proteins has proved a difficult task (15), and methods have varying, but often modest, success rates in determining these sites. Therefore, three different methods were combined by Chakrabarti and Panchenko (15) in an ensemble approach, and the predicted sites were studied in 3D context.
We here present a new interactive web server for the detection of sub-type-specific sites in proteins. It combines improved versions of the validated Sequence Harmony (SH) (5,16) and multi-Relief (mR) (8) methods in a single server, multi-Harmony. SH is based on Shannon's entropy and determines to what extent amino acid compositions between groups differ. mR identifies residues based on the feature weighting algorithm RELIEF (17). We have generalized SH to handle multiple sub-groups, reimplemented mR, and compared the performance of both with four other methods: SDPpred (3), ProteinKeys (14), PROUST-II (11) and Xdet (13).
In this article, we will guide the user through all stages of the multi-Harmony web application. We will look for sub-type-specific sites for the five sub-families of the Smad protein family. The sub-type-specific sites found are the best candidates to explain functional differences. Other relevant applications of the method include protein–protein interaction (18), ligand specificity and combinations of both (19).
Below, we briefly outline the Sequence Harmony and multi-Relief algorithms. For further details on the SH and mR algorithms, we refer to our earlier work (5,8,16) and the online documentation on the web server.
SH has now been generalized to handle more than two sequence groups. This generalization proceeds in two stages. First, the sum of residue probabilities over the two groups, used in the original SH equation (5), has been extended to a sum over all groups, where each term is the probability of a given residue type in a given group at a given alignment position. Shannon's ‘alphabet size’ for the amino acid types and sequences in a group is used as the base for the logarithm. Second, the averaging step is generalized correspondingly. SH values range from zero for completely non-overlapping residue compositions to one for identical compositions. A toy example with some typical columns and corresponding SH values is presented in Table 1.
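To make the behaviour of such a score concrete, the following Python sketch computes an entropy-style overlap for one alignment position across several groups. It is only an illustration under stated assumptions, not the server's formula: the function names are hypothetical, and the logarithm base is set to the number of groups purely so that the score is 0 for non-overlapping compositions and 1 for identical ones, matching the range described above. The base and normalization used by multi-Harmony itself may differ.

```python
import math
from collections import Counter


def residue_freqs(column):
    """Relative frequency of each residue type in one group's column."""
    counts = Counter(column)
    total = len(column)
    return {res: n / total for res, n in counts.items()}


def multi_group_harmony(groups):
    """Overlap score for one alignment position (needs at least 2 groups).

    groups: list of strings, one per sub-group, each holding that
    group's residues at this position.
    ASSUMPTION: the log base equals the number of groups, chosen only so
    that the score is 0 for disjoint and 1 for identical compositions.
    """
    n = len(groups)
    freqs = [residue_freqs(g) for g in groups]
    per_group = []
    for f in freqs:
        score = 0.0
        for res, p in f.items():
            # Summed probability of this residue type over all groups.
            pooled = sum(other.get(res, 0.0) for other in freqs)
            score -= p * math.log(p / pooled, n)
        per_group.append(score)
    # Average the per-group contributions.
    return sum(per_group) / n
```

For instance, `multi_group_harmony(["AAAA", "CCCC", "DDDD"])` gives zero because no residue type is shared, whereas three groups with the same composition give a value of one.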
mR works by iterating RELIEF over pairs of groups and returning the average over the positive weights per position, or over the negative if no positive weights were obtained for that position (8). Given sequences from two groups, RELIEF assigns weights to features (alignment columns) by summation of the weight vector obtained as the bit-vector difference between a given sequence and its nearest neighbour from the opposite group, the ‘nearest miss’, and subtracting from that bit-vector difference with its nearest neighbour from the same group, the ‘nearest hit’.
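The description above can be sketched in Python as follows. This is a minimal illustration, assuming Hamming distance for the nearest-neighbour search, exhaustive use of every sequence, and normalization by the number of sequences compared so that weights fall between −1 and 1; these choices and all function names are assumptions made for the example rather than details confirmed by the text.

```python
from itertools import combinations


def diff(a, b):
    """Bit vector: 1 where two aligned sequences differ, else 0."""
    return [int(x != y) for x, y in zip(a, b)]


def nearest(seq, candidates):
    """Candidate with the fewest mismatches to seq (Hamming distance)."""
    return min(candidates, key=lambda c: sum(diff(seq, c)))


def relief(group_a, group_b):
    """Two-group RELIEF weights, one per alignment column."""
    length = len(group_a[0])
    weights = [0.0] * length
    n = 0
    for own, other in ((group_a, group_b), (group_b, group_a)):
        for idx, seq in enumerate(own):
            peers = own[:idx] + own[idx + 1:]
            if not peers:
                continue  # a singleton group has no nearest hit
            miss = diff(seq, nearest(seq, other))  # nearest miss
            hit = diff(seq, nearest(seq, peers))   # nearest hit
            for col in range(length):
                weights[col] += miss[col] - hit[col]
            n += 1
    return [w / n for w in weights] if n else weights


def multi_relief(groups):
    """Per column: mean of positive pairwise weights, else of negative ones."""
    per_pair = [relief(a, b) for a, b in combinations(groups, 2)]
    result = []
    for col_weights in zip(*per_pair):
        pos = [w for w in col_weights if w > 0]
        chosen = pos if pos else [w for w in col_weights if w < 0]
        result.append(sum(chosen) / len(chosen) if chosen else 0.0)
    return result
```

With the groups `["AAV", "AAV"]` and `["ACI", "AGI"]`, the third column (conserved within each group, different between them) receives the maximum weight, while a column that varies within groups but is shared between them receives a negative weight.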
The sampling strategy of mR has been changed from a stochastic to an exhaustive deterministic implementation. An all-against-all comparison of the sequences is carried out and all ‘nearest hits’ and ‘nearest misses’ are now compared (cf. 8). Thus, the user is no longer confronted with fine-tuning the number of iterations, or with results that differ between runs. In addition, the mR web output now reports support values. The toy example in Table 1 also shows the corresponding mR weights.
Table 1 shows example values for a hypothetical alignment. If residues are completely different between groups (Table 1, pos. 3) or completely conserved within groups (pos. 2), the SH score is zero. In the latter case, the mR weight is one. Negative mR weights appear when a position has different residues within a sub-family but shows conservation between sub-families (pos. 5 and 6).
The output now includes a significance measure in the form of empirical Z-scores for both SH and mR values. These values are produced by permuting the group labels and re-running the calculation 100 times. For mR, for efficiency reasons, the random values are based on a sub-sampling of pairs of groups. The Z-score measures how many standard deviations (SD) the observed SH or mR value deviates from the mean of the respective ‘random’ scores for that data set. Completely conserved alignment columns have zero SD over the random scores, yielding an undefined Z-score.
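The permutation procedure can be illustrated with a short Python sketch. It assumes a generic per-position scoring function with the hypothetical signature `score_fn(column, labels)`; the toy `overlap` score below is only a stand-in for SH or mR, and the fixed random seed is an assumption added for reproducibility of the example.

```python
import random
import statistics


def overlap(column, labels):
    """Toy two-group score: shared probability mass (0 = disjoint, 1 = identical)."""
    groups = {0: [], 1: []}
    for res, lab in zip(column, labels):
        groups[lab].append(res)
    score = 0.0
    for res in set(column):
        score += min(groups[0].count(res) / len(groups[0]),
                     groups[1].count(res) / len(groups[1]))
    return score


def empirical_z(score_fn, column, labels, n_perm=100, seed=0):
    """Z-score of the observed score against label-permuted scores.

    Returns None when the permuted scores have zero standard deviation,
    as happens for a completely conserved column.
    """
    rng = random.Random(seed)
    observed = score_fn(column, labels)
    shuffled = list(labels)
    random_scores = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute group labels, keeping group sizes
        random_scores.append(score_fn(column, shuffled))
    sd = statistics.pstdev(random_scores)
    if sd == 0:
        return None  # undefined Z-score
    return (observed - statistics.mean(random_scores)) / sd
```

A column such as `"AAAACCCC"` with labels `[0, 0, 0, 0, 1, 1, 1, 1]` yields a negative Z-score for this toy score, because the observed overlap is lower than that of almost every relabelling, whereas a fully conserved column returns `None`.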
An MSA in one of the main formats (FASTA, ClustalW, Stockholm, SELEX or GCG MSF) and a definition of sub-family groups within the alignment need to be provided. Groups can either be defined within the sequence labels or provided as separate input. In addition, two optional inputs can be provided: (i) a reference sequence to compare the results for different alignments that contain the same reference sequence; (ii) a reference structure, either by PDB ID, file upload or on-the-fly ‘PDB BLAST’ against the PDB protein sequence database. Example input and output are provided, as well as the possibility to regenerate the example output.
The SH and mR methods rely on a ranking scheme that does not need ‘training’; only the cut-off values applied to the scores determine the number of selected sites. Validation and comparison to other state-of-the-art specificity detection methods have been carried out previously for SH (5) and for mR (8). mR has also been benchmarked and was among the three best performing methods out of five (15).
We here include the validation results for SH and mR on 7 data sets detailed in Table 2 and 15 data sets from another benchmarking study (15) (five overlapping families, Gprotein, LacI, Smad, RasRal and Rab56, were excluded). We follow the validation protocol described by Capra and Singh (21). Figure 1 shows box plots summarizing the distribution of ranks obtained by the different methods, as well as average precision/recall (PR) curves for SH and mR, and Table 3 summarizes the area under the PR curve per data set. Results for ProteinKeys, PROUST-II, SDPpred and Xdet are also shown for comparison. ProteinKeys was run with default settings and alignment filtering turned off. PROUST-II predictions were displayed with the default minimum ‘AA Prob’ of 0.2 and ranked on Z-score. Xdet was run both unsupervised and supervised; in the supervised case, the groups were supplied as binary matrices defining the membership of a sequence to a group.
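As an illustration of how a ranked list of positions translates into a PR curve and its area, consider the Python sketch below. It assumes a best-first ranking and a set of known functional sites, and it uses the step-wise "average precision" estimate of the area; the integration scheme used in the validation protocol followed here is not specified in this text, so this choice and the function names are assumptions.

```python
def precision_recall(ranked_positions, true_sites):
    """(recall, precision) after each rank in a best-first list of positions."""
    true_sites = set(true_sites)
    points, hits = [], 0
    for rank, pos in enumerate(ranked_positions, start=1):
        if pos in true_sites:
            hits += 1
        points.append((hits / len(true_sites), hits / rank))
    return points


def average_precision(ranked_positions, true_sites):
    """Step-wise area under the PR curve: mean precision at each true hit."""
    true_sites = set(true_sites)
    hits, total = 0, 0.0
    for rank, pos in enumerate(ranked_positions, start=1):
        if pos in true_sites:
            hits += 1
            total += hits / rank
    return total / len(true_sites)


# Hypothetical ranking of four alignment positions, best first;
# positions 5 and 9 are taken to be the known functional sites.
ranked = [5, 2, 9, 7]
known = {5, 9}
curve = precision_recall(ranked, known)
area = average_precision(ranked, known)
```

For SH the ranking would be by ascending score and for mR by descending weight, since lower SH and higher mR indicate stronger specificity.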
The PR plots in Figure 1B show that SH outperforms the other methods up to a recall of 20%. Beyond that, the performance remains comparable to the other state-of-the-art methods. SH would, therefore, seem to be a good choice when one is interested in a small number of highly significant specificity determining sites.
After uploading an MSA and sub-family groupings, the multi-Harmony server returns a highly dynamic results page, as shown in Figure 2. The results for each alignment position are displayed in an interactive table (Figure 2A). The user can sort the table on any of the numerical fields (e.g. SH score or Z-score), and can filter and highlight sites based on thresholds.
We illustrate multi-Harmony with receptor-regulated SMAD proteins (R-SMADs) (Figure 2). SMADs are transcription factors that play a crucial role in development (cell growth and differentiation) and disease (e.g. cancer) by mediating transforming growth factor β (TGF-β) signalling (24). SMADs can be divided into two major groups, as is clear from the alignment (Figure 2): SMAD1, SMAD5 and SMAD8 are activated in response to bone morphogenetic protein signals, while SMAD2 and SMAD3 are activated in response to TGF-β or activin signals. Most of the interactions with SMADs occur via the Mad homology 2 (MH2) domain, which is responsible for the specificity of binding (25). The input alignment consists of 33 homologous vertebrate sequences of the MH2 domain from the five R-Smad groups.
The output table can be filtered on SH or mR values. In the case of sub-type specificity, we are interested in finding residues that are unique to sub-families. An SH score ranges from 0 to 1 and an mR weight from −1 to 1. A lower SH (harmony) indicates a more specific residue, while a higher mR weight indicates a more group-specific residue. Thus, the lower the SH score or the higher the mR weight, the better.
If we filter the output table for residues using a stringent mR weight threshold of 0.9, 42 (of 211) positions are returned. These include 24 of the 28 known functional sites (Table 2, cf. 5). A further eight residues (I277, T289, R337, L350, A371, E389, Q400 and R410) have an mR weight of 1, which means that these positions optimally differentiate between at least two groups in the SMAD alignment. For example, position I277 (pos. 16) is a conserved valine in the SMAD8 group, while it is an isoleucine in the other SMADs.
We can also filter the output table on SH scores and/or SH and mR Z-scores. The Z-score provides an intuitive way to filter the SH results: a Z-score of −3 indicates that the SH score is three SDs below the mean score of the 100 randomizations. Since the SH score should be lower than the ‘random’ mean, the most negative Z-scores are the most interesting. However, a very negative Z-score could also be obtained for a high SH score. This happens for example when the alignment column shows only two residues: one conserved in a small sub-group and one in all others, as in Table 1 pos. 1 and pos. 16 in Figure 2B. This indeed often coincides with an mR weight of one.
In general, by changing the Z-score, it is possible to tune the expected false discovery rate. A typical Z-score threshold would be less than −3 or, more stringent, less than −6. Indeed, a Z-score threshold of less than −12 returns a validated functional position (ARG365 for the SMADs, see also Figure 2A) and possibly positions that are different among the groups, but are conserved within a group. Table 3 illustrates the influence of the Z-score on the performance of SH. If the Z-score is used as a filter (less than −9) to split the SH scores in two groups, followed by ranking on SH score, the performance of SH increases by about 4%. For mR, this filtering has no clear advantage on these data sets. An optimal threshold is data set dependent, and particularly rises strongly with an increasing number of sub-groups. We, therefore, set a modest Z-score threshold of −3 by default.
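The two-step use of the Z-score described above (filter first, then rank on SH) can be sketched in Python as follows. The row layout with keys `pos`, `sh` and `z` and the function name are hypothetical, chosen for illustration only; the default cut-off of −3 mirrors the default threshold mentioned in the text.

```python
def rank_with_z_filter(rows, z_cutoff=-3.0):
    """Rank positions: those passing the Z filter first, each part by ascending SH.

    rows: list of dicts with keys 'pos', 'sh' and 'z' (hypothetical field
    names); 'z' may be None when the Z-score is undefined.
    """
    passed = [r for r in rows if r["z"] is not None and r["z"] < z_cutoff]
    failed = [r for r in rows if r["z"] is None or r["z"] >= z_cutoff]

    def by_sh(row):
        return row["sh"]

    ordered = sorted(passed, key=by_sh) + sorted(failed, key=by_sh)
    return [r["pos"] for r in ordered]


# Hypothetical table rows, not taken from the SMAD results.
rows = [
    {"pos": 1, "sh": 0.0, "z": -1.0},
    {"pos": 2, "sh": 0.2, "z": -8.0},
    {"pos": 3, "sh": 0.1, "z": -4.0},
    {"pos": 4, "sh": 1.0, "z": None},
]
ranking = rank_with_z_filter(rows)
```

In this toy table, position 1 has the lowest SH score but does not pass the Z-score filter, so it is ranked after positions 3 and 2.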
The multi-Harmony server combines the enhanced Sequence Harmony and multi-Relief methods to study specificity-determining residues in proteins. The addition of multi-group handling to SH improves its usability. The new deterministic implementation of mR returns reproducible results, in contrast to the previous (sampling) implementation. Furthermore, the empirical significance estimates for SH and mR improve the reliability of the results. The multi-Harmony server provides tabular output as an interactive environment to analyse selected residues in multiple alignment context using Jalview and in their 3D context with Jmol.
ENFIN, a Network of Excellence funded by the European Commission within its FP6 Programme, under the thematic area ‘Life sciences, genomics and biotechnology for health’ (LSHG-CT-2005-518254). The open access charges for this paper were partially waived by Oxford University Press, the rest was paid by ENFIN.
Conflict of interest statement. None declared.