Nucleic Acids Res. Jul 2007; 35(Web Server issue): W653–W658.
Published online May 21, 2007. doi:  10.1093/nar/gkm293
PMCID: PMC1933213
COMPASS server for remote homology inference
Ruslan I. Sadreyev,1* Ming Tang,1 Bong-Hyun Kim,2 and Nick V. Grishin1,2
1Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX 75390-9050, USA and 2Department of Biochemistry, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX 75390-9050, USA
*To whom correspondence should be addressed. Phone: 214-645-5951; Fax: 214-645-5948; Email: sadreyev/at/chop.swmed.edu
Received January 31, 2007; Revised March 30, 2007; Accepted April 12, 2007.
COMPASS is a method for homology detection and local alignment construction based on the comparison of multiple sequence alignments (MSAs). The method derives numerical profiles from given MSAs, constructs local profile-profile alignments and analytically estimates E-values for the detected similarities. Until now, COMPASS was only available for download and local installation. Here, we present a new web server featuring the latest version of COMPASS, which provides (i) increased sensitivity and selectivity of homology detection; (ii) longer, more complete alignments; and (iii) faster computational speed. After submission of the query MSA or single sequence, the server performs searches versus a user-specified database. The server includes detailed and intuitive control of the search parameters. A flexible output format, structured similarly to BLAST and PSI-BLAST, provides an easy way to read and analyze the detected profile similarities. Brief help sections are available for all input parameters and output options, along with detailed documentation. To illustrate the value of this tool for protein structure-functional prediction, we present two examples of detecting distant homologs for uncharacterized protein families. Available at http://prodata.swmed.edu/compass
Accurate detection of sequence similarity between distantly related proteins is essential for many fields, including protein structure prediction, protein engineering, and comparative genomics. The performance of an automatic method for sequence comparison can be characterized by sensitivity, selectivity and accuracy of produced sequence alignments. All these parameters can be significantly improved by comparing multiple sequence alignments (MSAs) rather than individual sequences. The improvement comes from evolutionary information about residue preferences at sequence positions in the family represented by the MSA. This information can be extracted from MSAs in two numerical forms: ‘traditional’ position-specific profiles and hidden Markov models (HMMs). Well-known and popular methods for profile-sequence or HMM-sequence comparison include PSI-BLAST (1,2), HMMER (3), SAM-T (4,5) and others. A newer generation of methods involves the comparison of two profiles (6–10) or two HMMs (11,12), with several corresponding web servers available (13–16). These methods further improve the quality of homology detection and alignment construction (17,18). A number of publicly available web servers aimed at protein structure prediction use these and a variety of other techniques [for example, (19–23)].
COMPASS (9) is an established method for profile-based comparison of MSAs. COMPASS derives numerical profiles from given MSAs, constructs optimal local profile-profile alignments, and analytically estimates E-values for the detected similarities. As previously shown by us (9) and independently verified by others (12,18), COMPASS is a sensitive and selective tool for detection of remote sequence similarity that offers accurate local alignments. In many cases, COMPASS provides accurate homology detection and structure prediction that would be difficult or impossible to produce by PSI-BLAST (9,24).
As a standalone package, COMPASS has been used by different research groups (24–31). Until now, COMPASS was only available for download and local installation. Here, we present a new web server featuring the recently improved version of COMPASS.
To compare two MSAs, COMPASS performs four steps: (i) processing input MSAs and generating numerical profiles; (ii) calculating scores between individual positions of the compared profiles; (iii) finding optimal local alignment of the two profiles; and (iv) assessing statistical significance of the optimal alignment score (9).
Methodologically, COMPASS generalizes the PSI-BLAST approach to profile-sequence comparison to the case of profile-profile comparison. Numerical profiles represent effective counts and frequencies of 21 symbols (20 residue types and gaps) at each position of the input MSAs. To search with a query MSA against a database of MSAs, the database profiles are pre-computed. Scores for the similarity between individual profile positions are calculated using our original formula (9) and then rescaled so that their distribution is similar to a standard distribution with well-known properties (such as BLOSUM62 substitution scores). Rescaled positional scores are used to find the optimal local alignment using the Smith–Waterman algorithm. The statistical significance of the optimal alignment score is estimated using a simple formula for E-value (the expected number of hits in a random database with a score equal to or greater than the observed score). The parameters of this formula are based on our extensive simulations of random profile comparisons (9). As the final result of the search, a list of the most significant hits for the submitted query is displayed, followed by the optimal profile-profile alignments.
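The local alignment step can be illustrated with a minimal sketch of Smith–Waterman dynamic programming with affine gap penalties, applied to a matrix of already rescaled positional scores. This is not the COMPASS implementation: the function name, the default penalties and the convention that `gap_open` is the full cost of the first gapped position are assumptions made for illustration, and the positional scores themselves (computed in COMPASS by the formula of reference 9) are taken as given input.

```python
from typing import List, Tuple

NEG_INF = float("-inf")


def local_alignment_score(
    scores: List[List[float]],
    gap_open: float = 11.0,    # hypothetical default: cost of the first gapped position
    gap_extend: float = 1.0,   # hypothetical default: cost of each further gapped position
) -> Tuple[float, Tuple[int, int]]:
    """Return the optimal local alignment score and its 1-based end cell.

    scores[i][j] is the similarity score between position i of the first
    profile and position j of the second profile.
    """
    n = len(scores)
    m = len(scores[0]) if n else 0
    # H: best score of an alignment ending at (i, j)
    # E: best score ending with a gap in the first profile (horizontal move)
    # F: best score ending with a gap in the second profile (vertical move)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    best, best_end = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(H[i][j - 1] - gap_open, E[i][j - 1] - gap_extend)
            F[i][j] = max(H[i - 1][j] - gap_open, F[i - 1][j] - gap_extend)
            diagonal = H[i - 1][j - 1] + scores[i - 1][j - 1]
            # The floor at zero is what makes the alignment local.
            H[i][j] = max(0.0, diagonal, E[i][j], F[i][j])
            if H[i][j] > best:
                best, best_end = H[i][j], (i, j)
    return best, best_end
```

A traceback from the end cell would recover the aligned positions; it is omitted here to keep the sketch short.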
According to our results (9) and independent evaluations (12,18), COMPASS ranks among the top methods for profile comparison in both the quality of homology detection and the accuracy of local alignment construction. The presented web server features a newer version of COMPASS, with several major modifications to improve performance.
  • Higher quality of homology detection. Evaluation of the statistical significance of hits is improved by using a more realistic null model of random profile comparison. The original random model involved the profiles composed of randomly sampled positions from real MSAs. The score statistics were modeled depending on the profile lengths only, and a rough linear approximation of the dependency was used (9). We developed a new random model that captures additional important features of real profiles. First, in order to reproduce local correlations between different positions of MSA, we generate random profiles from fragments of real profiles corresponding to individual elements of secondary structure. Second, to model more accurately the distribution parameters K and λ (2,9) for optimal profile-profile scores, we introduce their dependence on the profile ‘thickness’ (sequence divergence within the profiles). Finally, we use more precise non-linear functions (combinations of quadratic and square-root) to describe the dependency of these parameters on profile length and ‘thickness’. According to our preliminary results, the new version of COMPASS shows roughly 20–25% improvement in the quality of similarity detection.
  • Longer, more complete local alignments. Rescaling of individual positional scores is modified, so that alignment coverage increases. In the original version, this procedure was similar to the composition-based statistic in PSI-BLAST (2), which standardized positional scores by adjusting the distribution parameter lambda (describing mainly the distribution width). In the new version, in order to make the rescaled distribution closer to standard, the mean of the distribution is also forced to a fixed value. As a result, positional scores are more compatible with the gap penalties that were empirically optimized for the standard substitution matrices (e.g. BLOSUM 62). The optimal alignments on average become longer and cover similarity regions better without compromising the overall alignment accuracy.
  • Improved speed. Several algorithmic modifications, as well as a general code optimization, lead to an order of magnitude improvement in computational speed over the original version. The resulting computational efficiency is now comparable to that of the fastest profile-profile methods (12,15), with a typical search taking a few minutes on one processor. This time period may increase when the server is heavily loaded or when the user requires generation of the query profile by PSI-BLAST search, which may take longer for queries with a large number of homologs in the sequence database.
  • Flexible control of input options. The server's front page (Figure 1A) allows the user to upload the query in several common alignment formats, choose the database and adjust search parameters and output options. The query MSA or single sequence can be either pasted in the input window or uploaded from a file. The available profile databases currently include PFAM (32), COG, KOG (33,34) and PSI-BLAST alignments produced from sequences with known 3D structure: chain representatives of the PDB database (35) and domain representatives of SCOP classification (36). The PDB representatives are full chains extracted from the whole set of available 3D structures (35), based on a 70% cutoff of sequence identity. The SCOP representatives are structural domains defined and classified by expert analysis into families, superfamilies, folds and classes (36). These representatives are based on 40% identity and are taken from the ASTRAL database (37). The PDB and ASTRAL sequences are used as queries for PSI-BLAST searches against NCBI nr database. The resulting MSAs of detected homologs are used to generate COMPASS profiles. To allow for the choice of different levels of sequence divergence within MSAs, the user can choose profiles corresponding to different numbers of PSI-BLAST iterations. PFAM (32), COG and KOG (33,34) databases include families of both known and unknown 3D structure, which cover protein sequence space more completely and provide alternative ways of family classification. These databases typically represent tighter sequence grouping, with more consideration of protein function, and clustering of orthologs from different genomes. PFAM profiles are generated by COMPASS from full family alignments provided by PFAM. COG and KOG profiles are generated from MSAs produced from the database sequences by MUSCLE (38). The profile databases are regularly updated when new versions of original databases are available.
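The rescaling of positional scores mentioned in the list above (the item on longer alignments) can be illustrated with a simple affine transformation that fixes both the mean and the spread of the score distribution. This is only a sketch under stated assumptions: COMPASS adjusts the distribution parameter lambda, whereas the code below uses the standard deviation as a stand-in for the distribution width, and the target values are placeholders supplied by the caller rather than the values used by the server.

```python
from statistics import mean, pstdev
from typing import List


def rescale_scores(
    raw: List[List[float]],
    target_mean: float,   # placeholder: mean of the reference score distribution
    target_std: float,    # placeholder: spread of the reference score distribution
) -> List[List[float]]:
    """Affinely rescale a matrix of positional scores.

    After rescaling, the scores have the requested mean and standard
    deviation, so that gap penalties tuned for the reference distribution
    remain on a comparable scale.
    """
    flat = [s for row in raw for s in row]
    mu = mean(flat)
    sigma = pstdev(flat)
    if sigma == 0:
        raise ValueError("all scores are identical; the spread cannot be rescaled")
    scale = target_std / sigma
    return [[(s - mu) * scale + target_mean for s in row] for row in raw]
```

Adjusting only the width would leave the mean wherever the raw scores put it; shifting the mean as well is what the new version adds, according to the description above.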
    Figure 1.
    (A) Front page of the COMPASS server. The main section allows the user to submit the query (by pasting in the window or by specifying the file), to choose the search database, and (if needed) to enter the email address to receive the results.
In order to gain more confidence in detected similarities and to find the best search conditions for a specific query, tuning the parameters controlling the generation of profiles and the construction of profile-profile alignments is advisable. The user can modify several such parameters. First, the input MSA (or sequence) can be used as a query for PSI-BLAST search, in order to produce a more diverse MSA of this family. The user can adjust the maximal number of iterations, as well as the requirements for a detected homolog to be included in the alignment: maximal E-value, minimal coverage of the query and minimal sequence identity to the query. Second, ‘Gap fraction threshold’ allows the user to control the maximal content of gaps in the MSA columns included in the COMPASS profile. If a column contains too many gaps, it is disregarded in the process of profile comparison, and shown in the final output as lower-case letters for residues and dots for gaps. The default value of this parameter is 0.5.
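The effect of the ‘Gap fraction threshold’ can be sketched as follows. The function below is an illustration, not the server's code: it counts gaps per column without sequence weighting, treats a column as excluded only when its gap fraction strictly exceeds the threshold, and uses hypothetical names throughout. Excluded columns are rendered as described above, with lower-case letters for residues and dots for gaps.

```python
from typing import List, Tuple


def mask_gappy_columns(
    msa: List[str],
    gap_fraction_threshold: float = 0.5,
    gap_chars: str = "-.",
) -> Tuple[List[bool], List[str]]:
    """Flag MSA columns by gap content and build a display version.

    Returns (keep, display): keep[c] is True when column c would enter the
    profile; display holds the sequences with excluded columns shown in
    lower case (residues) or as dots (gaps).
    """
    if not msa:
        return [], []
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    keep = []
    for col in range(length):
        gaps = sum(1 for seq in msa if seq[col] in gap_chars)
        # Assumption: unweighted gap fraction, strict comparison.
        keep.append(gaps / len(msa) <= gap_fraction_threshold)
    display = []
    for seq in msa:
        chars = []
        for ch, kept in zip(seq, keep):
            if kept:
                chars.append("-" if ch in gap_chars else ch.upper())
            else:
                chars.append("." if ch in gap_chars else ch.lower())
        display.append("".join(chars))
    return keep, display
```

For a toy alignment of four sequences in which one column has three gaps, that column is excluded at the default threshold of 0.5 and its single residue is shown in lower case.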
In the construction of profile-profile alignments, ‘Gap penalties’ are the score penalties for opening a new gap and for extending an existing one. ‘Effective length of the database’ is the parameter used in the calculation of E-values for the profile-profile alignments. For a given optimal alignment score, there is roughly a linear dependence of E-value on the assumed database length. ‘Matrix’ is a substitution matrix of the user's choice, BLOSUM62 by default. As described above, the choice of the matrix affects the rescaling of scores between individual profile positions that are used in the construction of the profile-profile alignment. Changing the scale of the positional scores would (i) make gap insertion more or less likely, affecting the resulting alignments, and (ii) change the optimal alignment scores and E-values.
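The roughly linear dependence of the E-value on the database length can be seen in a short sketch. The formula below is the Karlin-Altschul form familiar from BLAST statistics, which is assumed here because the text refers only to a simple formula; in COMPASS the parameters K and λ additionally depend on profile length and ‘thickness’, whereas this sketch treats them as constants with made-up values.

```python
import math


def e_value(
    score: float,
    query_length: int,
    database_length: float,
    k: float,      # assumed constant here; placeholder value in the example
    lam: float,    # assumed constant here; placeholder value in the example
) -> float:
    """Expected number of chance hits scoring at least `score`."""
    return k * query_length * database_length * math.exp(-lam * score)


if __name__ == "__main__":
    # Placeholder parameters chosen only to show the scaling behaviour.
    small = e_value(score=100.0, query_length=200, database_length=1e6, k=0.1, lam=0.3)
    large = e_value(score=100.0, query_length=200, database_length=2e6, k=0.1, lam=0.3)
    print(large / small)  # doubling the database length doubles the E-value
```

With K and λ held fixed, the same score therefore becomes less significant as the assumed database grows, which is why the ‘Effective length of the database’ setting changes the reported E-values without changing the alignments.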
Among the output formatting options, many are similar to those of PSI-BLAST. ‘Expect’ and ‘significance threshold’ are, respectively, the E-value cutoffs for the hit to be included in the output and to be considered significant. The hits outside the significance threshold are shown as potentially not meaningful. The user can also limit the total number of hits to display (‘Display up to’). Some output options are specific to profile-profile comparison. For example, the displayed profile-profile alignments can include different numbers of top sequences from the input MSAs (‘Top sequences to show’), as well as consensus sequences (‘Show consensus sequences’). Brief help sections are provided for every adjustable parameter, as well as a link to more detailed documentation (Figure 1A).
  • User-friendly output. The general structure of the output is similar to that of PSI-BLAST: the list of top hits is sorted by E-value and split into those below and above the significance threshold, followed by optimal profile-profile alignments with brief information about each hit. However, there are several significant differences, mainly in the format of alignments. The user can display the consensus sequences of profiles, as well as multiple top sequences from the input MSA. The number of top sequences displayed can range from zero (to show consensus only) to all sequences of the MSA. The complete query MSA is retrieved by clicking on the consensus link. For fast and convenient analysis, links to the original databases provide immediate access to the information available for detected protein families.
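The way the ‘Expect’, ‘significance threshold’ and ‘Display up to’ options shape the hit list can be sketched as a small filter. The record fields follow the four fields of the hit list (identifier, description, score and E-value); the class and function names, the default cutoffs and the choice of inclusive comparisons are assumptions made for this illustration and are not taken from the server.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Hit:
    identifier: str
    description: str
    score: float
    e_value: float


def select_hits(
    hits: List[Hit],
    expect: float = 10.0,                  # hypothetical default cutoff for inclusion
    significance_threshold: float = 0.01,  # hypothetical default cutoff for significance
    display_up_to: Optional[int] = None,
) -> Tuple[List[Hit], List[Hit]]:
    """Sort hits by E-value and split them at the significance threshold.

    Returns (significant, marginal). Hits with an E-value above `expect`
    are dropped, and at most `display_up_to` hits are kept in total.
    """
    shown = sorted(
        (h for h in hits if h.e_value <= expect), key=lambda h: h.e_value
    )
    if display_up_to is not None:
        shown = shown[:display_up_to]
    significant = [h for h in shown if h.e_value <= significance_threshold]
    marginal = [h for h in shown if h.e_value > significance_threshold]
    return significant, marginal
```

Hits in the second list correspond to those the output marks as potentially not meaningful.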
Examples of remote similarity detection
As an illustration, we describe the detection of distant sequence similarities that lead to fold predictions for two uncharacterized PFAM families annotated as ‘DUF’ (domain of unknown function). First, the COMPASS server detects homology between DUF185 (corresponding to COG1565 of the COG database) and SCOP domains of the S-adenosyl-l-methionine-dependent methyltransferase (SAM-Mtase) fold. Using the full DUF185 (PFAM 19.0) alignment as a query, with the default input parameters (Figure 1A), the server returns a list of hits that consistently belong to the same SCOP superfamily (c.66.1), both above and below the E-value cutoff (Figure 1B). In this list, each line consists of four fields: the identifier in the original database (implemented as a link to the database), a brief description of the protein, the COMPASS score and the corresponding E-value.
The next section of the output includes profile-profile alignments between the query and the hits. Each alignment is accompanied by a header with brief information about the hit. Unlike the PSI-BLAST format, the alignments can include different numbers of top sequences from input MSAs and/or consensus sequences. Figure 1C shows an example of such an alignment, with a single top sequence and consensus displayed for each profile. To distinguish the gaps introduced by COMPASS from the gaps that already occur in the input alignments, the former are shown as equal signs (=). The alignment in Figure 1C includes the region of similarity between the query (profile for DUF185) and a homologous profile based on the PSI-BLAST alignment for structural domain 1i4wA. In addition to similar patterns of hydrophobicity and small residues, DUF185 shows a strong conservation of SAM-Mtase signature motifs [reviewed in (39)]. The SAM-binding loop GxGxG (circled) and conserved acidic residue in the preceding β-strand (marked with a red dot) are parts of Motif I, whereas the invariant glutamate at the end of the next β-strand (marked with a red dot) is a part of Motif II (39).
This previously published prediction had been difficult to produce by PSI-BLAST, even for an expert user (24). However, it was more recently confirmed by the solved structure of a DUF185 member. This structure (PDB ID 1zkd, Northeast Structural Genomics Consortium) has been neither functionally annotated nor classified by SCOP or CATH, but possesses typical features of the SAM-Mtase fold (Figure 1D). The core of the domain contains a mixed β-sheet of seven β-strands surrounded by two sheets of α-helices. The strand order is 3214576, with strand 7 (colored red) anti-parallel to the rest and forming a characteristic methyltransferase β-hairpin with strand 6 (colored orange). In this domain, the β-hairpin contains an additional α-helical insert (orange helices). The presence of a glycine-rich loop (circled) and other signature motifs, including glutamates marked in Figure 1C (side chains shown in red), suggests that this domain is a functional methyltransferase.
The second prediction originates from searching with the RrnaAD methylase family as a query. This search reveals a newly identified similarity to a PFAM family of mainly hypothetical bacterial proteins with unknown structure and function, DUF519 (corresponding to COG2961 in the COG database). Thus, we suggest that DUF519/COG2961 proteins also possess the structural SAM-Mtase fold. This hypothesis is supported by the results of a search with the PFAM 19.0 DUF519 alignment as a query against the database of SCOP profiles (PSI-BLAST iteration 3). Homologs detected above the significance threshold, as well as multiple hits below the threshold, consistently belong to the SAM-Mtase fold.
Figure 2A shows the COMPASS alignment between DUF519 and the detected homolog, a domain of the SAM-Mtase fold (PDB ID 1qyrA), including the signature motifs of SAM-Mtases. This domain (not shown) possesses typical features of the fold and is similar to the structure shown in Figure 1D. Figure 2B shows the MSA of representatives from both families that covers SAM-Mtase Motifs I and II (39). In DUF519, this region includes the invariant glutamate aligned to a ligand-binding glutamate of SAM-Mtases (E95 in the top sequence, marked with a red dot), the characteristic location of conserved small residues in the SAM-binding loop (marked with a line) and a similar hydrophobicity pattern. Secondary structure prediction for this part of DUF519 is also consistent with the secondary structure of the SAM-Mtase fold. This prediction is additionally supported by other tools, e.g. by (i) significant scores for the similarity with the SCOP SAM-Mtase domains produced by the FFAS03 server (14); and (ii) the results of multiple iterations of PSI-BLAST search in a sequence database with a family representative as a query. After four iterations, PSI-BLAST detects the similarity between a DUF519 sequence Q9PHA1_XYLFA (gi|15836648, residues 32-291) and two proteins of known structure possessing the SAM-Mtase fold (PDB IDs 2ift and 2fpo).
Figure 2.
Search results for PFAM DUF519 suggest that this family possesses the structural fold of SAM-Mtases. (A) DUF519 is used as a query for the COMPASS search against the databases of PSI-BLAST alignments (iteration 3) for SCOP representatives.
ACKNOWLEDGEMENTS
The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing high-performance computing resources. We would like to thank Lisa Kinch and James Wrabl for discussions and critical reading of the manuscript. Funding to pay the Open Access publication charges for this article was provided by Howard Hughes Medical Institute.
Conflict of interest statement. None declared.
1. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
2. Schaffer AA, Aravind L, Madden TL, Shavirin S, Spouge JL, Wolf YI, Koonin EV, Altschul SF. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res. 2001;29:2994–3005.
3. Eddy SR. Profile hidden Markov models. Bioinformatics. 1998;14:755–763.
4. Karplus K, Barrett C, Cline M, Diekhans M, Grate L, Hughey R. Predicting protein structure using only sequence information. Proteins. 1999;37(Suppl. 3):121–125.
5. Karplus K, Karchin R, Draper J, Casper J, Mandel-Gutfreund Y, Diekhans M, Hughey R. Combining local-structure, fold-recognition, and new fold methods for protein structure prediction. Proteins. 2003;53(Suppl. 6):491–496.
6. Pietrokovski S. Searching databases of conserved sequence regions by aligning protein multiple-alignments. Nucleic Acids Res. 1996;24:3836–3845.
7. Rychlewski L, Jaroszewski L, Li W, Godzik A. Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci. 2000;9:232–241.
8. Yona G, Levitt M. Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. J. Mol. Biol. 2002;315:1257–1275.
9. Sadreyev RI, Grishin NV. COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. J. Mol. Biol. 2003;326:317–336.
10. Ginalski K, von Grotthuss M, Grishin NV, Rychlewski L. Detecting distant homology with Meta-BASIC. Nucleic Acids Res. 2004;32:W576–581.
11. Edgar RC, Sjolander K. COACH: profile-profile alignment of protein families using hidden Markov models. Bioinformatics. 2004;20:1309–1318.
12. Soding J. Protein homology detection by HMM-HMM comparison. Bioinformatics. 2005;21:951–960.
13. Frenkel-Morgenstern M, Singer A, Bronfeld H, Pietrokovski S. One-Block CYRCA: an automated procedure for identifying multiple-block alignments from single block queries. Nucleic Acids Res. 2005;33:W281–W283.
14. Jaroszewski L, Rychlewski L, Li Z, Li W, Godzik A. FFAS03: a server for profile–profile sequence alignments. Nucleic Acids Res. 2005;33:W284–W288.
15. Soding J, Biegert A, Lupas AN. The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Res. 2005;33:W244–W248.
16. Soding J, Remmert M, Biegert A, Lupas AN. HHsenser: exhaustive transitive profile search using HMM-HMM comparison. Nucleic Acids Res. 2006;34:W374–W378.
17. Ohlson T, Wallner B, Elofsson A. Profile-profile methods provide improved fold-recognition: a study of different profile-profile alignment methods. Proteins. 2004;57:188–197.
18. Wang G, Dunbrack RL Jr. Scoring profile-to-profile sequence alignments. Protein Sci. 2004;13:1612–1626.
19. Chivian D, Kim DE, Malmstrom L, Schonbrun J, Rohl CA, Baker D. Prediction of CASP6 structures using automated Robetta protocols. Proteins. 2005;61(Suppl. 7):157–166.
20. Ginalski K, Elofsson A, Fischer D, Rychlewski L. 3D-Jury: a simple approach to improve protein structure predictions. Bioinformatics. 2003;19:1015–1018.
21. Kelley LA, MacCallum RM, Sternberg MJ. Enhanced genome annotation using structural profiles in the program 3D-PSSM. J. Mol. Biol. 2000;299:499–520.
22. Shi J, Blundell TL, Mizuguchi K. FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol. 2001;310:243–257.
23. Zhou H, Zhou Y. SPARKS 2 and SP3 servers in CASP6. Proteins. 2005;61(Suppl. 7):152–156.
24. Sadreyev RI, Baker D, Grishin NV. Profile-profile comparisons by COMPASS predict intricate homologies between protein families. Protein Sci. 2003;12:2262–2272.
25. Birtle Z, Ponting CP. Meisetz and the birth of the KRAB motif. Bioinformatics. 2006;22:2841–2845.
26. Kim BH, Sadreyev R, Grishin NV. COG4849 is a novel family of nucleotidyltransferases. J. Mol. Recognit. 2005;18:422–425.
27. Theobald DL, Cervantes RB, Lundblad V, Wuttke DS. Homology among telomeric end-protection proteins. Structure. 2003;11:1049–1050.
28. Theobald DL, Wuttke DS. Prediction of multiple tandem OB-fold domains in telomere end-binding proteins Pot1 and Cdc13. Structure. 2004;12:1877–1879.
29. Theobald DL, Wuttke DS. Divergent evolution within protein superfolds inferred from profile-based phylogenetics. J. Mol. Biol. 2005;354:722–737.
30. Wels M, Francke C, Kerkhoven R, Kleerebezem M, Siezen RJ. Predicting cis-acting elements of Lactobacillus plantarum by comparative genomics with different taxonomic subgroups. Nucleic Acids Res. 2006;34:1947–1958.
31. Winter EE, Ponting CP. Mammalian BEX, WEX and GASP genes: coding and non-coding chimaerism sustained by gene conversion events. BMC Evol. Biol. 2005;5:54.
32. Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, et al. Pfam: clans, web tools and services. Nucleic Acids Res. 2006;34:D247–D251.
33. Tatusov RL, Koonin EV, Lipman DJ. A genomic perspective on protein families. Science. 1997;278:631–637.
34. Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, Kiryutin B, Galperin MY, Fedorova ND, et al. The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res. 2001;29:22–28.
35. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–242.
36. Andreeva A, Howorth D, Brenner SE, Hubbard TJ, Chothia C, Murzin AG. SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res. 2004;32:D226–D229.
37. Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE. The ASTRAL Compendium in 2004. Nucleic Acids Res. 2004;32:D189–D192.
38. Edgar RC. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics. 2004;5:113.
39. Schubert HL, Blumenthal RM, Cheng X. Many paths to methyltransfer: a chronicle of convergence. Trends Biochem. Sci. 2003;28:329–335.
Articles from Nucleic Acids Research are provided here courtesy of
Oxford University Press