COMPASS is a method for homology detection and local alignment construction based on the comparison of multiple sequence alignments (MSAs). The method derives numerical profiles from given MSAs, constructs local profile-profile alignments and analytically estimates E-values for the detected similarities. Until now, COMPASS was only available for download and local installation. Here, we present a new web server featuring the latest version of COMPASS, which provides (i) increased sensitivity and selectivity of homology detection; (ii) longer, more complete alignments; and (iii) faster computation. After submission of the query MSA or single sequence, the server performs searches against a user-specified database. The server includes detailed and intuitive control of the search parameters. A flexible output format, structured similarly to BLAST and PSI-BLAST, provides an easy way to read and analyze the detected profile similarities. Brief help sections are available for all input parameters and output options, along with detailed documentation. To illustrate the value of this tool for protein structure and function prediction, we present two examples of detecting distant homologs for uncharacterized protein families. Available at http://prodata.swmed.edu/compass
Accurate detection of sequence similarity between distantly related proteins is essential for many fields, including protein structure prediction, protein engineering, and comparative genomics. The performance of an automatic method for sequence comparison can be characterized by its sensitivity, its selectivity and the accuracy of the sequence alignments it produces. All these parameters can be significantly improved by comparing multiple sequence alignments (MSAs) rather than individual sequences. The improvement comes from evolutionary information about residue preferences at sequence positions in the family represented by the MSA. This information can be extracted from MSAs in two numerical forms: ‘traditional’ position-specific profiles and hidden Markov models (HMMs). Well-known and popular methods for profile-sequence or HMM-sequence comparison include PSI-BLAST (1,2), HMMER (3), SAM-T (4,5) and others. A newer generation of methods involves the comparison of two profiles (6–10) or two HMMs (11,12), with several corresponding web servers available (13–16). These methods further improve the quality of homology detection and alignment construction (17,18). There are a number of publicly available web servers aimed at protein structure prediction that use these and a variety of other techniques [for example, (19–23)].
COMPASS (9) is an established method for profile-based comparison of MSAs. COMPASS derives numerical profiles from given MSAs, constructs optimal local profile-profile alignments, and analytically estimates E-values for the detected similarities. As previously shown by us (9) and independently verified by others (12,18), COMPASS is a sensitive and selective tool for detection of remote sequence similarity that offers accurate local alignments. In many cases, COMPASS provides accurate homology detection and structure prediction that would be difficult or impossible to produce by PSI-BLAST (9,24).
As a standalone package, COMPASS has been used by different research groups (24–31). Until now, COMPASS was only available for download and local installation. Here, we present a new web server featuring the recently improved version of COMPASS.
To compare two MSAs, COMPASS performs four steps: (i) processing input MSAs and generating numerical profiles; (ii) calculating scores between individual positions of the compared profiles; (iii) finding optimal local alignment of the two profiles; and (iv) assessing statistical significance of the optimal alignment score (9).
Methodologically, COMPASS generalizes the PSI-BLAST approach to profile-sequence comparison to the comparison of two profiles. Numerical profiles represent effective counts and frequencies of 21 symbols (20 residue types and gaps) at each position of the input MSAs. To search with a query MSA against a database of MSAs, the database profiles are precomputed. Scores for the similarity between individual profile positions are calculated using our original formula (9) and then rescaled so that their distribution is similar to a standard distribution with well-known properties (such as that of BLOSUM62 substitution scores). Rescaled positional scores are used to find the optimal local alignment using the Smith–Waterman algorithm. The statistical significance of the optimal alignment score is estimated using a simple formula for the E-value (the expected number of hits in a random database with a score equal to or greater than the observed score). The parameters of this formula are based on our extensive simulations of random profile comparisons (9). As the final result of the search, a list of the most significant hits for the submitted query is displayed, followed by the optimal profile-profile alignments.
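The local alignment step can be illustrated with a minimal sketch of Smith–Waterman scoring over a precomputed matrix of positional scores. This is not the COMPASS implementation: the function name, the default penalties and the gap-cost convention (a gap of length k costs `gap_open + (k - 1) * gap_extend`) are assumptions made for illustration, and the positional scores themselves would come from the formula in (9), which is not reproduced here.

```python
def smith_waterman_score(scores, gap_open=10.0, gap_extend=1.0):
    """Best local alignment score for a matrix of positional scores.

    scores[i][j] is the (already rescaled) score for aligning position i
    of profile A with position j of profile B. Gap costs are affine; the
    convention and default values are illustrative assumptions.
    """
    n = len(scores)
    m = len(scores[0]) if n else 0
    neg = float("-inf")
    # H: best local score ending at (i, j); E/F: best score ending in a gap
    # that skips positions of profile B (E) or of profile A (F).
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    E = [[neg] * (m + 1) for _ in range(n + 1)]
    F = [[neg] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(H[i][j - 1] - gap_open, E[i][j - 1] - gap_extend)
            F[i][j] = max(H[i - 1][j] - gap_open, F[i - 1][j] - gap_extend)
            diag = H[i - 1][j - 1] + scores[i - 1][j - 1]
            # The zero floor lets an alignment start anywhere (local alignment).
            H[i][j] = max(0.0, diag, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


# Toy matrix: positions 0, 1 and 3 of profile A match positions 0, 1 and 2
# of profile B; position 2 of profile A matches nothing.
toy = [
    [5, -4, -4],
    [-4, 5, -4],
    [-4, -4, -4],
    [-4, -4, 5],
]
print(smith_waterman_score(toy, gap_open=2.0, gap_extend=1.0))
```

With a cheap gap the best path skips the unmatched position; with the default, more expensive gap the best local alignment is the ungapped match of the first two positions.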
According to our results (9) and independent evaluations (12,18), COMPASS is among the top methods for profile comparison, in both the quality of homology detection and the accuracy of local alignment construction. The presented web server features a newer version of COMPASS, with several major modifications to improve performance.
In order to gain more confidence in detected similarities and to find the best search conditions for a specific query, tuning the parameters controlling the generation of profiles and the construction of profile-profile alignments is advisable. The user can modify several such parameters. First, the input MSA (or sequence) can be used as a query for PSI-BLAST search, in order to produce a more diverse MSA of this family. The user can adjust the maximal number of iterations, as well as the requirements for a detected homolog to be included in the alignment: maximal E-value, minimal coverage of the query and minimal sequence identity to the query. Second, ‘Gap fraction threshold’ allows the user to control the maximal content of gaps in the MSA columns included in the COMPASS profile. If a column contains too many gaps, it is disregarded in the process of profile comparison, and shown in the final output as lower-case letters for residues and dots for gaps. The default value of this parameter is 0.5.
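The effect of the ‘Gap fraction threshold’ can be sketched as follows. The sketch assumes an unweighted gap fraction per column, `-` as the gap character in the input, and that a column is dropped only when its gap fraction strictly exceeds the threshold; the server may weight sequences or treat the boundary differently, and the function name is hypothetical.

```python
def mask_gappy_columns(msa, gap_fraction_threshold=0.5, gap_char="-"):
    """Return (kept_columns, display_rows) for an MSA given as equal-length strings.

    Columns whose gap fraction exceeds the threshold are excluded from
    kept_columns and rendered in display_rows with lower-case residues
    and dots for gaps. Unweighted counting is an assumption.
    """
    if not msa:
        return [], []
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all MSA rows must have the same length")

    kept = []
    for col in range(length):
        gaps = sum(1 for row in msa if row[col] == gap_char)
        if gaps / len(msa) <= gap_fraction_threshold:
            kept.append(col)

    kept_set = set(kept)
    display = []
    for row in msa:
        chars = []
        for col, ch in enumerate(row):
            if col in kept_set:
                chars.append(ch)        # column used in the profile
            elif ch == gap_char:
                chars.append(".")       # gap in a disregarded column
            else:
                chars.append(ch.lower())  # residue in a disregarded column
        display.append("".join(chars))
    return kept, display


kept, shown = mask_gappy_columns(["ACD-", "A-D-", "AC-K"])
print(kept)
print("\n".join(shown))
```

In this toy alignment only the last column has more than half gaps, so it is the only one excluded from the profile and rendered in lower case and dots.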
In the construction of profile-profile alignments, ‘Gap penalties’ are score penalties for opening and extending a new gap. ‘Effective length of the database’ is the parameter used in the calculation of E-values for the profile-profile alignments. For a given optimal alignment score, there is roughly a linear dependence of E-value on the assumed database length. ‘Matrix’ is a substitution matrix of the user's choice, BLOSUM62 by default. As described above, the choice of the matrix affects the rescaling of scores between individual profile positions that are used in the construction of the profile-profile alignment. Changing the scale of the positional scores would (i) make gap insertion more or less likely, affecting the resulting alignments, and (ii) change the optimal alignment scores and E-values.
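The roughly linear dependence of the E-value on the assumed database length can be illustrated with a short sketch. It assumes a Karlin–Altschul-style form, E = K · m · n · exp(−λS), as used for BLAST statistics; the text above only states that a simple formula is used, so this form is an assumption, and the values of `k` and `lam` are illustrative placeholders rather than the parameters fitted for COMPASS in (9).

```python
import math


def e_value(score, query_length, db_length, k=0.134, lam=0.318):
    """E-value under an assumed Karlin-Altschul form.

    k and lam are placeholder values chosen for illustration only.
    """
    return k * query_length * db_length * math.exp(-lam * score)


# Same alignment score and query length, two assumed database lengths.
small = e_value(score=60, query_length=200, db_length=1_000_000)
large = e_value(score=60, query_length=200, db_length=2_000_000)
print(large / small)  # ratio of the two E-values
```

Under this assumed form, doubling the effective database length doubles the E-value for a fixed score, which is why a hit that is significant against a small database can fall outside the threshold against a larger one.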
Among the output formatting options, many are similar to those of PSI-BLAST. ‘Expect’ and ‘significance threshold’ are, respectively, the E-value cutoffs for the hit to be included in the output and to be considered significant. The hits outside the significance threshold are shown as potentially not meaningful. The user can also limit the total number of hits to display (‘Display up to’). Some output options are specific to profile-profile comparison. For example, the displayed profile-profile alignments can include different numbers of top sequences from the input MSAs (‘Top sequences to show’), as well as consensus sequences (‘Show consensus sequences’). Brief help sections are provided for every adjustable parameter, as well as a link to more detailed documentation (Figure 1A).
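How the ‘Expect’, ‘significance threshold’ and ‘Display up to’ options interact can be sketched as a simple filter over a hit list. The `Hit` fields mirror the four fields of the server's hit list described in this article; the class, the function name, the default values and the use of inclusive comparisons are assumptions for illustration, not the server's code.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    identifier: str
    description: str
    score: float
    e_value: float


def select_hits(hits, expect=10.0, significance_threshold=0.01, display_up_to=100):
    """Return (hit, is_significant) pairs in order of increasing E-value.

    Hits with E-value above `expect` are dropped; those at or below
    `significance_threshold` are flagged as significant. Defaults are
    hypothetical.
    """
    shown = sorted((h for h in hits if h.e_value <= expect), key=lambda h: h.e_value)
    shown = shown[:display_up_to]
    return [(h, h.e_value <= significance_threshold) for h in shown]


# Hypothetical hits, not real search results.
example = [
    Hit("hit_c", "weak similarity", 20.0, 25.0),
    Hit("hit_a", "strong similarity", 95.0, 1e-12),
    Hit("hit_b", "borderline similarity", 40.0, 0.5),
]
for hit, significant in select_hits(example):
    marker = "" if significant else "  (below significance threshold)"
    print(f"{hit.identifier}\t{hit.score}\t{hit.e_value:g}{marker}")
```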
As an illustration, we describe the detection of distant sequence similarities that lead to fold predictions for two uncharacterized PFAM families annotated as ‘DUF’ (domain of unknown function). First, the COMPASS server detects homology between DUF185 (corresponding to COG1565 of the COG database) and SCOP domains of the S-adenosyl-l-methionine-dependent methyltransferase (SAM-Mtase) fold. Using the full DUF185 (PFAM 19.0) alignment as a query, with the default input parameters (Figure 1A), the server returns a list of hits that consistently belong to the same SCOP superfamily (c.66.1), both above and below the E-value cutoff (Figure 1B). In this list, each line consists of four fields: the identifier in the original database (implemented as a link to the database), a brief description of the protein, the COMPASS score and the corresponding E-value.
The next section of the output includes profile-profile alignments between the query and the hits. Each alignment is accompanied by a header with brief information about the hit. Unlike the PSI-BLAST format, the alignments can include different numbers of top sequences from input MSAs and/or consensus sequences. Figure 1C shows an example of such an alignment, with a single top sequence and consensus displayed for each profile. To distinguish the gaps introduced by COMPASS from the gaps that already occur in the input alignments, the former are shown as equal signs (=). The alignment in Figure 1C includes the region of similarity between the query (profile for DUF185) and a homologous profile based on the PSI-BLAST alignment for structural domain 1i4wA. In addition to similar patterns of hydrophobicity and small residues, DUF185 shows a strong conservation of SAM-Mtase signature motifs [reviewed in (39)]. The SAM-binding loop GxGxG (circled) and the conserved acidic residue in the preceding β-strand (marked with a red dot) are parts of Motif I, whereas the invariant glutamate at the end of the next β-strand (marked with a red dot) is a part of Motif II (39).
This previously published prediction had been difficult to produce by PSI-BLAST, even for an expert user (24). However, it was more recently confirmed by the solved structure of a DUF185 member. This structure (PDB ID 1zkd, Northeast Structural Genomics Consortium) has been neither functionally annotated nor classified by SCOP or CATH, but possesses typical features of the SAM-Mtase fold (Figure 1D). The core of the domain contains a mixed β-sheet of seven β-strands surrounded by two sheets of α-helices. The strand order is 3214576, with strand 7 (colored red) anti-parallel to the rest and forming a characteristic methyltransferase β-hairpin with strand 6 (colored orange). In this domain, the β-hairpin contains an additional α-helical insert (orange helices). The presence of a glycine-rich loop (circled) and other signature motifs, including the glutamates marked in Figure 1C (side chains shown in red), suggests that this domain is a functional methyltransferase.
The second prediction originates from a search with the RrnaAD methylase family as a query. This search reveals a newly identified similarity to DUF519 (corresponding to COG2961 in the COG database), a PFAM family of mainly hypothetical bacterial proteins with unknown structure and function. Thus, we suggest that DUF519/COG2961 proteins also possess the structural SAM-Mtase fold. This hypothesis is supported by the results of a search with the PFAM 19.0 DUF519 alignment as a query against the database of SCOP profiles (PSI-BLAST iteration 3). Homologs detected above the significance threshold, as well as multiple hits below the threshold, consistently belong to the SAM-Mtase fold.
Figure 2A shows the COMPASS alignment between DUF519 and the detected homolog, a domain of the SAM-Mtase fold (PDB ID 1qyrA). This domain (not shown) possesses typical features of the fold and is similar to the structure shown in Figure 1D. The alignment in Figure 2A includes the signature motifs of SAM-Mtases. Figure 2B shows the MSA of representatives from both families that covers SAM-Mtase Motifs I and II (39). In DUF519, this region includes the invariant glutamate aligned to a ligand-binding glutamate of SAM-Mtases (E95 in the top sequence, marked with a red dot), the characteristic location of conserved small residues in the SAM-binding loop (marked with a line) and a similar hydrophobicity pattern. Secondary structure prediction for this part of DUF519 is also consistent with the secondary structure of the SAM-Mtase fold. This prediction is additionally supported by other tools, e.g. by (i) significant scores for the similarity with the SCOP SAM-Mtase domains produced by the FFAS03 server (14); and (ii) the results of multiple iterations of a PSI-BLAST search in a sequence database with a family representative as a query. After four iterations, PSI-BLAST detects the similarity between a DUF519 sequence Q9PHA1_XYLFA (gi|15836648, residues 32-291) and two proteins of known structure possessing the SAM-Mtase fold (PDB IDs 2ift and 2fpo).
The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing high-performance computing resources. We would like to thank Lisa Kinch and James Wrabl for discussions and critical reading of the manuscript. Funding to pay the Open Access publication charges for this article was provided by Howard Hughes Medical Institute.
Conflict of interest statement. None declared.