As an example, we analyze the sequence of Stage V sporulation protein T (SpoVT) from Bacillus subtilis, which is known to regulate forespore-specific σG-dependent transcription (32) (annotated as ‘transcriptional regulator’ in GenBank). Input parameters are set as shown in . The results consist of two parts (Figure 2): a summary list with matching database sequences (‘templates’) and a list of query–template alignments below.
Figure 2. Search results for SpoVT from Bacillus subtilis. The summary hit list at the top shows that SpoVT consists of two domains: the N-terminal domain is very similar to AbrB (rank 1) and clearly homologous to MazE (rank 3) and the C-terminal domain is similar ...
The first column of the summary hit list has indices that link to the corresponding alignment further down. Next are the first 30 characters from the description of the HMM. The ‘Prob’ column lists the probability in percent that the database match is a true positive, i.e. that it is homologous to the query sequence at least in some core part. This is the most relevant statistical measure of significance and can be interpreted quite literally. The true-positive probability is a conservative measure in the sense that it corrects for occasional high-scoring false positives. (The major cause of high-scoring false positives is corrupted alignments containing non-homologous sequences that slipped in during the automated alignment-building with PSI-BLAST.) [See (28) for details.] The E-values in HHpred are defined in the same way as in BLAST or PSI-BLAST. (The E-value for a sequence match is the expected number of false positives per database search with a score at least as good as the score of this sequence match.) But it is important to note that, in contrast to the true-positive probability, HHpred E-values do not take into account the secondary structure similarity. Hits can therefore be significant by the true-positive probability criterion even when the E-value is ~1. The P-value is equal to the E-value divided by the number of HMMs in the searched database. The ‘Score’ column gives the total score, which includes the score from the secondary structure comparison listed in the next column (‘SS’). ‘Cols’ contains the total number of matched columns in the query–template alignment, and the remaining columns describe the range of aligned residues in the query and template.
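The relation between P-value and E-value stated above can be written as a short sketch. The function names and the database size in the example are illustrative assumptions and are not taken from HHpred itself.

```python
def p_value_from_e_value(e_value: float, num_hmms: int) -> float:
    """P-value = E-value / number of HMMs in the searched database."""
    if num_hmms <= 0:
        raise ValueError("num_hmms must be a positive integer")
    return e_value / num_hmms


def e_value_from_p_value(p_value: float, num_hmms: int) -> float:
    """Inverse relation: E-value = P-value * number of HMMs."""
    if num_hmms <= 0:
        raise ValueError("num_hmms must be a positive integer")
    return p_value * num_hmms


# Hypothetical database size, chosen only for illustration.
EXAMPLE_DB_SIZE = 5000

# A hit with E-value ~1 in a database of this size corresponds to a
# per-comparison P-value of 1 / 5000.
example_p = p_value_from_e_value(1.0, EXAMPLE_DB_SIZE)
```

Because the E-value scales with database size while the P-value does not, the same pairwise match receives a larger E-value when a larger database is searched.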
From the summary list in Figure 2 it is evident that the SpoVT protein consists of two domains, one from residue 1 to ~51 and the other from residue 52 to 178. The N-terminal domain has two significant hits in SCOP at ranks 1 and 3. The first hit is the DNA-binding domain of transition-state regulator AbrB (33), a known close homolog of SpoVT. AbrB is a protein that is broadly represented in bacterial species and is involved in switching from exponential growth to stationary phase by integrating a great number of environmental factors. The second hit is to MazE, the antidote of the antidote-toxin addiction module MazEF (34). How can both AbrB and MazE be homologous to the query if they are not even classified into the same class, let alone fold or superfamily, by the SCOP database? Can the match with MazE be a false positive despite the rather significant 84% probability?
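The way a domain architecture is read off the aligned query ranges in the summary list can be sketched in code. This is only an illustration of the reasoning, not an HHpred feature: the function name, the overlap threshold and the residue ranges in the example are all invented for this sketch and are not the actual values from the search.

```python
from typing import Dict, List, Tuple

# (rank, first aligned query residue, last aligned query residue)
Hit = Tuple[int, int, int]


def group_hits_by_query_region(hits: List[Hit], min_overlap: float = 0.5) -> List[Dict]:
    """Greedily cluster hits whose aligned query ranges overlap.

    A hit joins the first region it overlaps by at least `min_overlap`
    of the shorter of the two ranges; otherwise it starts a new region.
    """
    regions: List[Dict] = []
    for rank, start, end in sorted(hits):
        for region in regions:
            overlap = min(end, region["end"]) - max(start, region["start"]) + 1
            shorter = min(end - start + 1, region["end"] - region["start"] + 1)
            if overlap >= min_overlap * shorter:
                region["start"] = min(region["start"], start)
                region["end"] = max(region["end"], end)
                region["ranks"].append(rank)
                break
        else:
            regions.append({"start": start, "end": end, "ranks": [rank]})
    return regions


# Made-up ranges for illustration only.
example_hits = [(1, 1, 51), (2, 60, 175), (3, 2, 50), (4, 55, 178)]
example_regions = group_hits_by_query_region(example_hits)
```

With these made-up ranges the hits fall into two non-overlapping query regions, which is the pattern that suggests a two-domain protein.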
To elucidate this, we can look at the SpoVT–MazE alignment below. Five representative (i.e. maximally diverse) sequences from each of the two underlying alignments are shown for each HMM. (Their amino acids can be colored by biochemical properties by pressing one of the radio buttons entitled ‘color alignments’ above the summary hit list.) First, we note that the predicted secondary structure of SpoVT (sequence ‘Q ss_pred’) agrees very well with the actual secondary structure of MazE determined by the program DSSP (sequence ‘T ss_dssp’). Second, the hydrophobicity pattern in the aligned HMMs looks quite similar, which is especially evident with the coloring. Third, the HMM–HMM alignment contains a single gap in MazE, at a position where some sequences in SpoVT also exhibit a gap. All in all, the alignment looks very much like what one would expect for a distant homologous relationship.
The conflict posed by the manifest homology between MazE and AbrB and their grossly different structural topology prompted us to undertake a thorough bioinformatic investigation of the AbrB-like superfamily and to redetermine the AbrB structure by NMR (M. Coles and S. Djuranovic et al., manuscript submitted, PDB ID: 1YFB). Indeed, we found that the published structure of AbrB (PDB ID: 1EKT) is incorrect and that the correct structure for AbrB places it in the same superfamily as MazE.
Hits 2 and 4–9 in the summary list are all proteins from the same SCOP fold d.110. Clicking on the SCOP family IDs opens a window with the corresponding entry in SCOP. Irrespective of the specific significance values, the fact that so many quite divergent members from the same two superfamilies d.110.2 (GAF-domain) and d.110.3 (PAS-domain) appear among the best hits strongly indicates that these are not high-scoring chance hits but true homologs. Regardless of whether the C-terminal domain looks more like a GAF or a PAS domain, we can now generate an approximate structural model that could help guide experiments to investigate what regulatory substrate this domain may actually bind (32).
By clicking ‘Create CM Model’ one can select the templates to be used for comparative modelling. HHpred then returns a multiple alignment in PIR format with the query sequence and the selected templates. This alignment may be edited by the user and then fed to the MODELLER software (35), accessible via the MPI toolkit for users of HHpred.
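For readers who want to edit such an alignment by hand, the general layout of a PIR entry as read by MODELLER can be sketched as follows: a `>P1;` header line, a description line with ten colon-separated fields, and the aligned sequence terminated by `*`. The helper below is a hypothetical illustration. Its name and parameters are assumptions, and the exact fields that HHpred writes may differ.

```python
def format_pir_entry(code: str, aligned_seq: str, is_template: bool,
                     first: object = "", last: object = "",
                     chain: str = "", width: int = 75) -> str:
    """Format one aligned sequence as a MODELLER-style PIR entry.

    Templates with known structure use the 'structureX' type, the query
    uses 'sequence'. Unused description fields are left empty.
    """
    entry_type = "structureX" if is_template else "sequence"
    # Ten fields: type, code, first residue, chain, last residue, chain,
    # protein name, source, resolution, R-factor.
    fields = [entry_type, code, str(first), chain, str(last), chain,
              "", "", "", ""]
    terminated = aligned_seq + "*"  # '*' marks the end of the sequence
    chunks = [terminated[i:i + width] for i in range(0, len(terminated), width)]
    return "\n".join([">P1;" + code, ":".join(fields)] + chunks)


# Toy aligned sequences, not real proteins; '-' marks an alignment gap.
query_entry = format_pir_entry("query", "ACD-EFGH", is_template=False)
template_entry = format_pir_entry("tmpl", "ACDKEF-H", is_template=True,
                                  first=1, last=7, chain="A")
```

All entries in one file must have the same aligned length, so any manual edit that inserts or removes a gap in one sequence has to be balanced in the others.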
A very useful feature is the possibility to view and manually improve the query alignment that was used to generate the query HMM; via the tab ‘Edit Query Alignment’ the user can modify the query alignment that appears in a text field and start a new search with the modified alignment.
By pressing ‘Realign’ at the top, the user may also realign the identified templates in the summary hit list with different parameters without the need to rerun the database search. One can change the alignment mode from global to local, set the number of representative sequences or use filters to narrow down the set of sequences allowed into the query and template alignments. If the user wants to search another database with the same query HMM, she can select ‘Restart with Query HMM’.