Summary: Profile-based similarity search is an essential step in structure-function studies of proteins. However, inclusion of non-homologous sequence segments in a profile corrupts it and results in false positives. Profile corruption is common in multidomain proteins, and single domains with long insertions are a significant source of errors. We developed a procedure (HangOut) that, for a single domain with a specified insertion position, cleans erroneously extended PSI-BLAST alignments to generate better profiles.
Availability: HangOut is implemented in Python 2.3 and runs on all Unix-compatible platforms. The source code is available under the GNU GPL license at http://prodata.swmed.edu/HangOut/
Supplementary information: Supplementary data are available at Bioinformatics online.
PSI-BLAST (Altschul et al., 1997) is an indispensable tool for remote homology inference and structure-function predictions (Devos and Valencia, 2000; Friedberg, 2006; Grishin, 2001; Hegyi and Gerstein, 2001). However, false positives in PSI-BLAST can cause errors in automated annotations (Bork and Koonin, 1998). One major source of such false positives is profile corruption, usually resulting from extension of alignments over non-homologous sequence regions (Galperin and Koonin, 1998). For instance, for two 2-domain proteins, AB and A′C, PSI-BLAST may extend a correct alignment of the homologous domains A and A′ to include sequences from the non-homologous domains B and C. Despite significant effort devoted to this multidomain problem, no satisfactory solution exists (Gonzalez and Pearson, 2010; Galzitskaya and Melnik, 2003; George and Heringa, 2002; Nagarajan and Yona, 2004). Currently, the best approach is to start PSI-BLAST with precisely defined query sequence bounds (Corpet et al., 2000; Wheeler et al., 2001).
However, we found that even a single, well-defined domain does not guarantee a corruption-free profile. Domains hosting insertions, which represent close to 5% of domains in the structural classification of proteins (SCOP) 1.75 database (Murzin et al., 1995), may generate a corrupted PSI-BLAST profile due to incorrect alignment extension around the insertion position. Our analysis shows that the N- and C-terminal segments of the host domain are frequently aligned as separate PSI-BLAST high-scoring pairs (HSPs), and the two HSPs overlap when mapped onto the query sequence. Each alignment can be divided into two segments: (i) correctly aligned and (ii) incorrectly aligned or extended (Fig. 1a and Supplementary Fig. S1). These incorrectly aligned ‘overhangs’ are detected and removed by the HangOut program to clean the profile and prepare it for subsequent remote homology searches with various tools, such as PSI-BLAST and HHsearch.
The HangOut input is a single domain query sequence with the insertion boundary specified. The HangOut algorithm proceeds as follows (Fig. 1a): (1) Run BLAST with the input sequence against the NCBI non-redundant database with an e-value threshold of 0.001. (2) Detect and remove the lower-scoring regions of HSPs (defined later in this paragraph) and regions matching a PSI-BLAST profile of the inserted domain (see Supplementary Figures S2 and S3 for rationale). (3) Terminate upon convergence or when the iteration limit is reached. Otherwise, repeat Steps 1 to 3 with the following modifications: (i) PSI-BLAST replaces BLAST, seeded (-B option) with the cleaned profile from Step 2 and (ii) profile scores (PSSM) replace BLOSUM62 scores (for HSP removal). Thus, HangOut builds multiple sequence alignments similarly to PSI-BLAST, but has a ‘clean-up’ step after each iteration intended to remove incorrect extensions. HangOut is based on two assumptions: (i) each HSP contains at least one correctly aligned region, and (ii) incorrectly extended regions exist in every HSP that crosses the insertion point. Based on these assumptions, HangOut splits all local alignments into two segments with a boundary at the insertion point and selects the higher-scoring (BLOSUM62 or PSSM) segment of each split pair. The lower-scoring segment is removed as a possibly erroneous extension. In addition to this HangOut procedure, we applied RemoveHit, a simpler method that does not require a defined insertion point and removes entire alignments for hits with two overlapping HSPs (Supplementary Fig. S4).
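The two clean-up rules described above can be illustrated with a minimal Python sketch. It is not the HangOut source code: the `HSP` class, the function names and the `toy_score` column score are hypothetical stand-ins (HangOut scores segments with BLOSUM62 or the PSSM), and the insertion point is assumed to be given as a 0-based index into the discontinuous query.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class HSP:
    """Hypothetical minimal HSP record (not the HangOut data model)."""
    subject_id: str
    query_start: int   # 0-based query index of the first aligned query residue
    query_aln: str     # aligned query string, '-' marks a gap
    subject_aln: str   # aligned subject string, same length as query_aln

    @property
    def query_end(self) -> int:
        """Exclusive end index of the HSP on the query."""
        return self.query_start + sum(c != "-" for c in self.query_aln)


def toy_score(q: str, s: str) -> int:
    """Placeholder column score; an assumption, NOT BLOSUM62 or a PSSM."""
    if q == "-" or s == "-":
        return -2
    return 2 if q == s else -1


def keep_best_segment(
    hsp: HSP, insertion_pos: int, score: Callable[[str, str], int] = toy_score
) -> Tuple[str, str]:
    """Split an HSP at the insertion point and keep the higher-scoring side."""
    left, right = [], []
    qpos = hsp.query_start
    for q, s in zip(hsp.query_aln, hsp.subject_aln):
        # Columns before the insertion point go left, all others go right.
        (left if qpos < insertion_pos else right).append((q, s))
        if q != "-":
            qpos += 1
    if left and right:
        left_score = sum(score(q, s) for q, s in left)
        right_score = sum(score(q, s) for q, s in right)
        kept = left if left_score >= right_score else right
    else:
        kept = left or right  # HSP does not cross the insertion point
    return "".join(q for q, _ in kept), "".join(s for _, s in kept)


def remove_hits_with_overlapping_hsps(hsps: List[HSP]) -> List[HSP]:
    """RemoveHit-style filter: drop every hit whose HSPs overlap on the query."""
    by_subject: Dict[str, List[HSP]] = {}
    for h in hsps:
        by_subject.setdefault(h.subject_id, []).append(h)
    kept: List[HSP] = []
    for group in by_subject.values():
        ordered = sorted(group, key=lambda h: h.query_start)
        # After sorting by start, any overlap shows up between neighbours.
        overlap = any(a.query_end > b.query_start
                      for a, b in zip(ordered, ordered[1:]))
        if not overlap:
            kept.extend(group)
    return kept


# Toy data: two HSPs of subject "s1" overlap on the query; "s2" has one HSP.
h1 = HSP("s1", 0, "ACDEFGHIK", "ACDEFWWWW")
h2 = HSP("s1", 3, "EFGHIK", "WWGHIK")
h3 = HSP("s2", 0, "ACD", "ACD")
```

With an insertion point at query index 5, `keep_best_segment(h1, 5)` keeps the well-aligned N-terminal side and `keep_best_segment(h2, 5)` keeps the C-terminal side, whereas the RemoveHit-style filter discards both HSPs of `s1` outright. The sketch omits the second part of Step 2, the removal of regions matching the inserted-domain profile.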
HangOut was tested on a set of 40% representative SCOP 1.75 domains defined to contain insertions (302 domains, see Supplementary Table 1 for the list) to measure the number of corrupted profiles (false positives) and the number of correct homologs found for each discontinuous query domain sequence (with the insertion sequence removed). The 302 hidden Markov models (HMMs) built from each of the PSI-BLAST, HangOut or RemoveHit profile sets were compared to HMMs built from all 9528 SCOP 1.75 40% representative domains (Murzin et al., 1995) using HHsearch ver. 1.5.1 (Soding, 2005). The number of corrupted profiles was increased by one if HHsearch found homologs of inserted domains with probability higher than 0.9. The number of homologs found was counted as the number of hits that had strong profile similarity (HHsearch probability above 0.9) and overall structural similarity (DaliLite Z-score higher than 4) (Holm and Park, 2000) or belonged to the same SCOP superfamilies as the query domains.
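The two counting criteria can be sketched in Python as follows. The hit records, field names and identifiers are hypothetical, not HHsearch or DaliLite output formats. The sketch also assumes one particular reading of the homolog criterion: the HHsearch probability cutoff always applies, combined with either the DaliLite Z-score cutoff or shared superfamily membership.

```python
from typing import Dict, Iterable, List, Set

PROB_CUTOFF = 0.9     # HHsearch probability threshold stated in the text
DALI_Z_CUTOFF = 4.0   # DaliLite Z-score threshold stated in the text


def is_corrupted(hits: Iterable[Dict], insert_homologs: Set[str]) -> bool:
    """A profile counts as corrupted if it confidently hits a homolog of
    the inserted domain."""
    return any(
        h["prob"] > PROB_CUTOFF and h["target"] in insert_homologs
        for h in hits
    )


def count_true_homologs(hits: Iterable[Dict], query_superfamily: str) -> int:
    """Count confident hits supported by structure or by SCOP superfamily.

    Assumed reading: prob > cutoff AND (Z > cutoff OR same superfamily).
    """
    return sum(
        1
        for h in hits
        if h["prob"] > PROB_CUTOFF
        and (h["dali_z"] > DALI_Z_CUTOFF
             or h["superfamily"] == query_superfamily)
    )


# Invented hits for one query profile; all identifiers are placeholders.
hits: List[Dict] = [
    {"target": "t1", "prob": 0.98, "dali_z": 2.0, "superfamily": "sf_insert"},
    {"target": "t2", "prob": 0.95, "dali_z": 6.5, "superfamily": "sf_other"},
    {"target": "t3", "prob": 0.92, "dali_z": 1.0, "superfamily": "sf_host"},
    {"target": "t4", "prob": 0.50, "dali_z": 9.0, "superfamily": "sf_host"},
]
```

In this invented example the profile would be flagged as corrupted if `t1` is a homolog of the inserted domain, and two hits (`t2` and `t3`) would be counted as true homologs of a query from `sf_host`.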
HangOut is intended to clean PSI-BLAST generated profiles of erroneous extensions caused by domain insertions. One typical example of this domain problem is shown in Figure 1b: an α/β P-loop hydrolase (yellow in Fig. 1b) is inserted into an α-helical bundle (blue and red in Fig. 1b). Corruption of the PSI-BLAST alignment built from hits to the α-helical bundle is evidenced by a profile-based similarity search (HHsearch), which finds the α/β P-loop hydrolase domain with probability 98%. Since the query α-helical bundle does not share any sequence or structural similarities with the hydrolase domain, the high HHsearch probability results from profile corruption (for details see Supplementary Fig. S2).
Given the success of this example, we tested the ability of HangOut to clean profiles of all SCOP domains with defined insertions (302 domains). As a basis for comparison, 91 PSI-BLAST profiles (30%) were corrupted. RemoveHit cleans only 23 of these profiles, while HangOut cleans all but one (Fig. 1c). The single exception is probably due to distant homology, since both the host and inserted domain represent similar doubly wound Rossmann folds (Supplementary Fig. S5). Because the removal of sequence segments from alignments may impoverish the profile, we also checked for the loss of true hits. Surprisingly, cleaned HangOut profiles retained ~98% of the homologs found by PSI-BLAST profiles (99.6% for RemoveHit), suggesting that useful information is not lost from the profiles. Compared to RemoveHit, the added complexity of HangOut, which uses domain boundary information, appears to be necessary to clean corrupted profiles. The presence of overlapping HSPs (the criterion used by RemoveHit) is not a sufficient indicator of corrupted segments: for remote homologs, only a single HSP may be found and incorrectly extended to cover part of the insertion. Although our current HangOut procedure does not offer a comprehensive solution to the multidomain problem, it addresses a special case of domains with insertions that represent the major source of profile corruption when PSI-BLAST is initiated with single, discontinuous domain queries. HangOut will be especially useful for large-scale bioinformatics efforts that are initiated from defined structural domains and require uncorrupted sequence profiles for subsequent analysis. Additional work will be done to offer a general solution that does not require prior knowledge of domain boundaries.
The authors thank Lisa N. Kinch, Jimin Pei and Jeremy Semeiks for helpful comments.
Funding: Welch Foundation I1505 (to N.V.G.).
Conflict of Interest: none declared.