PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of narLink to Publisher's site
 
Nucleic Acids Res. 2010 July 1; 38(Web Server issue): W35–W40.
Published online 2010 June 4. doi:  10.1093/nar/gkq415
PMCID: PMC2896201

Multi-Harmony: detecting functional specificity from sequence alignment

Abstract

Many protein families contain sub-families with functional specialization, such as binding different ligands or being involved in different protein–protein interactions. A small number of amino acids generally determine functional specificity. The identification of these residues can aid the understanding of protein function and help finding targets for experimental analysis. Here, we present multi-Harmony, an interactive web sever for detecting sub-type-specific sites in proteins starting from a multiple sequence alignment. Combining our Sequence Harmony (SH) and multi-Relief (mR) methods in one web server allows simultaneous analysis and comparison of specificity residues; furthermore, both methods have been significantly improved and extended. SH has been extended to cope with more than two sub-groups. mR has been changed from a sampling implementation to a deterministic one, making it more consistent and user friendly. For both methods Z-scores are reported. The multi-Harmony web server produces a dynamic output page, which includes interactive connections to the Jalview and Jmol applets, thereby allowing interactive analysis of the results. Multi-Harmony is available at http://www.ibi.vu.nl/ programs/shmrwww.

INTRODUCTION

Many protein families contain sub-families that exhibit functional specialization, often involving differences in ligand binding or protein–protein interactions (1). Consequently, an increasing number of methods and/or web applications has become available, which offer functional analyses of specificity-determining residues within protein families (2–10). These methods often require a multiple sequence alignment (MSA) with pre-determined groups or a phylogenetic tree as input. SDPpred (3) uses mutual information to identify positions that ‘are well conserved within specificity groups but differ between these groups’. PROUST-II (11) is a method based on cumulative relative entropy of the differences between hidden Markov profiles of user-defined sub-families.

Other methods only require the MSA and automatically group the sequences into sub-groups using, for example, Between Group Analysis (6) or phylogeny (2,12). Xdet (13) uses a classification derived from the alignment, and is based on mutual behaviour analysis of ‘tree-determinant’ residues. It can also be used, supervised by supplying an external (functional) classification. ProteinKeys (14) implements combinatorial entropy optimization to identify both specificity-determining residues and sub-families. A more recent method by Georgi et al. (10) requires only sequences and carries out sub-group discovery with simultaneous identification of functional residues.

Identifying specificity-determining residues in proteins has proved a difficult task (15) and methods have varying, but often modest success rates in determining these sites. Therefore, three different methods were combined by Chakrabarti and Panchenko (15) in an ensemble approach, and the predicted sites were studies in 3D context.

We here present a new interactive web server for the detection of sub-type specific sites in proteins. It combines improved versions of the validated Sequence Harmony (SH) (5,16) and multi-Relief (mR) (8) methods in a single server, multi-Harmony. SH is based on Shannon's entropy and determines to what extent amino acid compositions between groups differ. mR identifies residues based on the feature weighting algorithm RELIEF (17). We have generalized SH to handle multiple sub-groups, reimplemented mR and compare their performance relative to four methods: SDPpred (3), ProteinKeys (14), PROUST-II (11) and Xdet (13).

In this article, we will guide the user through all stages of the multi-Harmony web application. We will look for sub-type-specific sites for the five sub-families of the Smad protein family. The sub-type-specific sites found are the best candidates to explain functional differences. Other relevant applications of the method include protein–protein interaction (18), ligand specificity and combinations of both (19).

METHODS

Algorithms

Below, we briefly outline the Sequence Harmony and multi-Relief algorithms. For further details on the SH and mR algorithms, we refer to our earlier work (5,8,16) and the online documentation on the web server.

Multi Sequence Harmony

SH now has been generalized to handle more than two sequence groups. This generalization goes in two stages. First, the sum of residue probabilities (An external file that holds a picture, illustration, etc.
Object name is gkq415i1.jpg) in the two groups An external file that holds a picture, illustration, etc.
Object name is gkq415i2.jpg and An external file that holds a picture, illustration, etc.
Object name is gkq415i3.jpg, used in the original SH Equation (5), has been extended to An external file that holds a picture, illustration, etc.
Object name is gkq415i4.jpg groups as follows:

An external file that holds a picture, illustration, etc.
Object name is gkq415um1.jpg

where An external file that holds a picture, illustration, etc.
Object name is gkq415i5.jpg is the probability of residue type An external file that holds a picture, illustration, etc.
Object name is gkq415i6.jpg in group An external file that holds a picture, illustration, etc.
Object name is gkq415i7.jpg at position An external file that holds a picture, illustration, etc.
Object name is gkq415i8.jpg. Shannon's ‘alphabet size’ An external file that holds a picture, illustration, etc.
Object name is gkq415i9.jpg for An external file that holds a picture, illustration, etc.
Object name is gkq415i10.jpg amino acid types and An external file that holds a picture, illustration, etc.
Object name is gkq415i11.jpg sequences in a group is used as base for the logarithm. Second, the average is generalized as: An external file that holds a picture, illustration, etc.
Object name is gkq415i12.jpg. SH values range from zero for completely non-overlapping residue compositions, to one for identical compositions. A toy example with some typical columns and corresponding SH values is presented in Table 1.

Table 1.
Hypothetical alignment of three sub-families to illustrate the SH scores (range from 0 to 1) and mR weights (range from −1 to 1)

multi-Relief

mR works by iterating RELIEF over pairs of groups and returning the average over the positive weights per position, or over the negative if no positive weights were obtained for that position (8). Given sequences from two groups, RELIEF assigns weights to features (alignment columns) by summation of the weight vector obtained as the bit-vector difference between a given sequence and its nearest neighbour from the opposite group, the ‘nearest miss’, and subtracting from that bit-vector difference with its nearest neighbour from the same group, the ‘nearest hit’.

The sampling strategy of mR has been changed from a stochastic to an exhaustive deterministic implementation. An all-against-all comparison of the sequences is carried out and all ‘nearest hits’ and ‘nearest misses’ are now compared (cf. 8). Thus, the user is no longer confronted with fine-tuning the number of iterations, or with results that differ between runs. In addition, the mR web output now reports support values. The toy example in Table 1 also shows the corresponding mR weights.

Toy example

Table 1 shows example values for a hypothetical alignment. If residues are completely different between groups (Table 1, pos. 3) or completely conserved within groups (pos. 2) the SH score is zero. In the latter case, the mR weight is one. Negative mR weights appear when the position has different residues within a sub-family, but show conservation between sub-families (pos. 5 and 6).

Statistical significance

The output now includes a significance measure in the form of empirical Z-scores for both SH and mR values. These values are produced by permuting the group labels and re-running An external file that holds a picture, illustration, etc.
Object name is gkq415i13.jpg (=100) times. For mR for efficiency reasons, the random values are based on a sub-sampling of pairs of groups. Z-score measures how many standard deviations (SD) the observed SH or mR value deviates from the mean of the respective ‘random’ scores for that data set. Completely conserved alignment columns have zero SD over the random scores, yielding an undefined Z-score.

Web server

User input

An MSA in one of the main formats (FASTA, ClustalW, Stockholm, SELEX or GCG MSF) and a definition of sub-family groups within the alignment needs to be provided. Groups can either be defined within the sequence labels or provided as separate input. In addition, two optional inputs can be provided: (i) a reference sequence to compare the results for different alignments that contain the same reference sequence; (ii) a reference structure, either by PDB ID, file upload or on-the-fly ‘PDB BLAST’ against the PDB protein sequence database. Example input and output are provided as well as the possibility to regenerate the example output.

Processing

The server scripts are coded in Javascript, PHP, and (Bio)Perl. The main SH and mR scripts are coded in Python. If a reference sequence and/or a PDB structure is provided, the positions in the alignment are mapped to the corresponding positions in the reference sequence and/or PDB structure. ‘PDB BLAST’ uses NCBI BLAST (20) with a locally installed non-redundant PDB protein sequence database (pdbaa from NCBI).

Validation

The SH and mR methods rely on a ranking scheme that does not need ‘training’, only cut-off values applied to the score values determine the number of selected sites. Validation and comparison to other state-of-the-art specificity detection methods have been carried out previously for SH (5) and for mR (8). mR has also been benchmarked and was among the three best performing methods out of five (15).

We here include the validation results for SH and mR on 7 data sets detailed in Table 2 and 15 data sets from another benchmarking study (15) (five overlapping families, Gprotein, LacI, Smad, RasRal and Rab56, were excluded). We follow the validation protocol described by Capra and Singh (21). Figure 1 shows box plots summarizing the distribution of ranks obtained by the different methods, as well as average precision/recall (PR) curves for SH and mR, and Table 3 summarizes area under the PR curve per dataset. For comparison, also results for ProteinKeys, PROUST-II, SDPpred and Xdet are shown. ProteinKeys has been run with default settings and alignment filtering turned off. PROUST-II predictions were displayed with the default minimum ‘AA Prob’ of 0.2 and ranked on Z-score. Xdet was run both unsupervised and supervised, in which case the groups were supplied as binary matrices defining the membership of a sequence to a group.

Figure 1.
Validation results for the SH and mR methods. ProteinKeys, PROUST-II, SDPpred v.2 and Xdet are shown for comparison. Results obtained by the different methods were averaged over all data sets weighted by the number of positives. (A) Box plots showing ...
Table 2.
Properties of our seven data sets used for benchmark comparison of the algorithms
Table 3.
Validation for detection of specificity sites by SH and mR scored as area under curve (AUC) for the PR plots versus gold-standard specificity sites in the 22 data sets, 7 sets as defined in Table 2 and 15 sets obtained from Chakrabarti and Panchenko ( ...

The PR plots in Figure 1B show that SH outperforms the other methods up to a recall of 20%. Beyond that, the performance remains comparable to the other state-of-the-art methods. SH would, therefore, seem to be a good choice when one is interested in a small number of highly significant specificity determining sites.

RESULTS AND DISCUSSION

After uploading an MSA and sub-family groupings, the multi-Harmony server returns a highly dynamic results page as shown in Figure 2 The results for each alignment position are displayed in an interactive table (Figure 2A). The user can sort the table on any of the numerical fields (e.g. SH score or Z-score), can filter and highlight sites based on thresholds.

Figure 2.
An example of the multi-Harmony output. (A) The main output table, sorted by SH score and filtered on SH score (An external file that holds a picture, illustration, etc.
Object name is gkq415i14.jpg0.5) and high mR weight (An external file that holds a picture, illustration, etc.
Object name is gkq415i15.jpg0.8). Only ALA278 at position 17 in the alignment is not a confirmed functional residue. The columns with arrows can ...

We have included the Jalview (22) and Jmol applets (23) and exploit their Javascript–Java interface for enhanced interactivity, as compared to the previous SH and mR servers, which provided only static output tables. The sequence groups, SH scores and mR weights are annotated on the Jalview alignment. In addition, the user can interactively add annotation tracks to the Jalview alignment to mark positions that pass the supplied table filter thresholds. Such a track is shown in Figure 2B. If a PDB structure is provided, the results can be visualized on the PDB structure (Jmol). The entire structure can be coloured according to SH scores or mR weights. Residues passing the filters can also be dynamically highlighted, thereby providing a view of these residues in 3D context (Figure 2C). Finally, the user can download the plain-text output of the analysis programs.

We illustrate multi-Harmony with receptor-regulated SMAD proteins (R-SMADs) (Figure 2). SMADs are transcription factors that play a crucial role in development (cell growth and differentiation) and disease (e.g. cancer) by mediating transforming growth factor β (TGF-β) signalling (24). SMADs can be divided into two major groups as is clear from the alignment (Figure 2) SMAD1, SMAD5 and SMAD8 are activated in response to bone morphogenetic protein signals, while SMAD2 and SMAD3 are activated in response to TGF-β or activin signals. Most of the interactions with SMADs occur via the Mad homology 2 (MH2) domain, which is responsible for the specificity of binding (25). The input alignment consists of 33 homologous vertebrate sequences of the MH2 domain from the five R-Smad groups.

The output table can be filtered on SH or mR values. In the case of sub-type specificity, we are interested in finding residues that are unique to sub-families. An SH score ranges from 0 to 1 and a mR weight from An external file that holds a picture, illustration, etc.
Object name is gkq415i18.jpg1 to 1. A lower SH (harmony) indicates a more specific residue, while a higher mR weight indicates a more group-specific residue. Thus, the lower the SH score or the higher the mR weight, the better.

If we filter the output table for residues using a stringent mR weight threshold of An external file that holds a picture, illustration, etc.
Object name is gkq415i19.jpg0.9, 42 (of 211) positions are returned. These include 24 of the 28 known functional sites (Table 2, cf. 5) Another additional eight residues (I277, T289, R337, L350, A371, E389, Q400 and R410) have an mR weight of 1, which means that these positions optimally differentiate between at least two groups in the SMAD alignment. For example, position I277 (pos. 16) is a conserved valine in the SMAD8 group, while it is an isoleucine in the other SMADs.

We can also filter the output table on SH scores and/or SH and mR Z-scores. The Z-score provides an intuitive way to filter the SH results: a Z-score of −3 indicates that the SH score is three SDs below the mean score of the 100 randomizations. Since the SH score should be lower than the ‘random’ mean, the most negative Z-scores are the most interesting. However, a very negative Z-score could also be obtained for a high SH score. This happens for example when the alignment column shows only two residues: one conserved in a small sub-group and one in all others, as in Table 1 pos. 1 and pos. 16 in Figure 2B. This indeed often coincides with an mR weight of one.

In general, by changing the Z-score, it is possible to tune the expected false discovery rate. A typical Z-score threshold would be less than −3 or, more stringent, less than −6. Indeed, a Z-score threshold of less than −12 returns a validated functional position (ARG365 for the SMADs, see also Figure 2A) and possibly positions that are different among the groups, but are conserved within a group. Table 3 illustrates the influence of the Z-score on the performance of SH. If the Z-score is used as a filter (less than −9) to split the SH scores in two groups, followed by ranking on SH score, the performance of SH increases by about 4%. For mR, this filtering has no clear advantage on these data sets. An optimal threshold is data set dependent, and particularly rises strongly with an increasing number of sub-groups. We, therefore, set a modest Z-score threshold of −3 by default.

CONCLUSION

This multi-Harmony server combines the enhanced Sequence Harmony and multi-Relief methods to study specificity-determining residues in proteins. The addition of multi-group handling to SH improves its useability. The new deterministic implementation of mR returns reproducible results in contrast to the previous (sampling) implementation. Furthermore, the empirical significance estimates for SH and mR improve the reliability of the results. The multi-Harmony server provides tabular output as an interactive environment to analyse selected residues in multiple alignment context using Jalview and in their 3D context with Jmol.

FUNDING

ENFIN, a Network of Excellence funded by the European Commission within its FP6 Programme, under the thematic area ‘Life sciences, genomics and biotechnology for health’ (LSHG-CT-2005-518254). The open access charges for this paper were partially waived by Oxford University Press, the rest was paid by ENFIN.

Conflict of interest statement. None declared.

REFERENCES

1. Whisstock JC, Lesk AM. Prediction of protein function from protein sequence and structure. Q. Rev. Biophys. 2003;36:307–340. [PubMed]
2. del Sol A, Pazos F, Valencia A. Automatic methods for predicting functionally important residues. J. Mol. Biol. 2003;326:1289–1302. [PubMed]
3. Kalinina OV, Mironov AA, Gelfand MS, Rakhmaninova AB. Automated selection of positions determining functional specificity of proteins by comparative analysis of orthologous groups in protein families. Protein Sci. 2004;13:443–456. [PubMed]
4. Donald JE, Shakhnovich EI. Predicting specificity-determining residues in two large eukaryotic transcription factor families. Nucleic Acids Res. 2005;33:4455–4465. [PMC free article] [PubMed]
5. Pirovano W, Feenstra KA, Heringa J. Sequence comparison by sequence harmony identifies subtype-specific functional sites. Nucleic Acids Res. 2006;34:6540–6548. [PubMed]
6. Wallace IM, Higgins DG. Supervised multivariate analysis of sequence groups to identify specificity determining residues. BMC Bioinformatics. 2007;8:135. [PMC free article] [PubMed]
7. Sankararaman S, Sjölander K. INTREPID—INformation-theoretic TREe traversal for protein functional site IDentification. Bioinformatics. 2008;10:2445–2452. [PubMed]
8. Ye K, Feenstra KA, Heringa J, IJzerman AP, Marchiori E. Multi-RELIEF: a method to recognize specificity determining residues from multiple sequence alignments using a machine-learning approach for feature weighting. Bioinformatics. 2008;24:18–25. [PubMed]
9. Kalinina O, Gelfand M, Russell R. Combining specificity determining and conserved residues improves functional site prediction. BMC Bioinformatics. 2009;10:174. [PMC free article] [PubMed]
10. Georgi B, Schultz J, Schliep A. Partially-supervised protein subclass discovery with simultaneous annotation of functional residues. BMC Struct. Biol. 2009;9:68. [PMC free article] [PubMed]
11. Hannenhalli SS, Russell RB. Analysis and prediction of functional sub-types from protein sequence alignments. J. Mol. Biol. 2000;303:61–76. [PubMed]
12. Pei J, Cai W, Kinch LN, Grishin NV. Prediction of functional specificity determinants from protein sequences using log-likelihood ratios. Bioinformatics. 2006;22:164–171. [PubMed]
13. Pazos F, Rausell A, Valencia A. Phylogeny-independent detection of functional residues. Bioinformatics. 2006;22:1440–1448. [PubMed]
14. Reva B, Antipin Y, Sander C. Determinants of protein function revealed by combinatorial entropy optimization. Genome Biol. 2007;8:R232. [PMC free article] [PubMed]
15. Chakrabarti S, Panchenko AR. Ensemble approach to predict specificity determinants: benchmarking and validation. BMC Bioinformatics. 2009;10:207. [PMC free article] [PubMed]
16. Feenstra KA, Pirovano W, Krab K, Heringa J. Sequence Harmony: detecting functional specificity from alignments. Nucleic Acids Res. 2007;35:W495–W498. [PMC free article] [PubMed]
17. Kononenko I. Estimating attributes: analysis and extensions of RELIEF. In: Bergadano F, De Raedt L, editors. European Conference on Machine Learning. Vol. 784. New York, Secaucus, NJ, USA: Springer-Verlag; 1994. pp. 171–182. LNCS.
18. Feenstra KA, Bastianelli G, Heringa J. Predicting protein interactions from functional specificity. In: Hansmann UHE, Meinke JH, Mohanty S, Nadler W, Zimmermann O, editors. From Computational Biophysics to Systems Biology (CBSB08) John von Neumann Institute for Computing, Jülich (Germany), Vol. 40 of NIC Series; 2008. pp. 89–92.
19. Rausell A, Juan D, Pazos F, Valencia A. Protein interactions and ligand binding: from protein subfamilies to functional specificity. Proc. Natl Acad. Sci. USA. 2010;107:1995–2000. [PubMed]
20. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. [PMC free article] [PubMed]
21. Capra JA, Mona Singh M. Characterization and prediction of residues determining protein functional specificity. Bioinformatics. 2008;24:1473–1480. [PMC free article] [PubMed]
22. Waterhouse AM, Procter JB, Martin DMA, Clamp M, Barton GJ. Jalview version 2—a multiple sequence alignment editor and analysis workbench. Bioinformatics. 2009;20:426–427. [PMC free article] [PubMed]
23. Herráez A. Biomolecules in the computer: Jmol to the rescue. Biochem. Mol. Biol. Educ. 2006;34:255–261. [PubMed]
24. Attisano L, Wrana JL. Signal transduction by the TGF-β superfamily. Science. 2002;296:1646–1647. [PubMed]
25. Feng XH, Derynck R. Specificity and versatility in TGF-β signaling through Smads. Annu. Rev. Cell. Dev. Biol. 2005;21:659–693. [PubMed]

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press