Correct functional and structural classification of proteins requires an understanding of the phylogenetic relationships among existing proteins. The polypeptide chain folds into a stable, unique, highly ordered conformation that is necessary for maintaining its particular function. Many observations suggest that protein evolution takes place under strong structural constraints; as a result, proteins that drifted apart through divergent evolution may still resemble each other structurally despite the absence of detectable sequence similarity. Such proteins are remote homologs that share the same evolutionary origin. Homology in these cases can be inferred from similarity in function and/or from the presence of conserved atypical sequence or structural features.1
Structural similarity, however, does not necessarily imply evolutionary divergence. It is believed that similarity in overall protein topology can occur independently due to the limited number of topological arrangements or folding patterns.2–6 This type of similarity caused by convergent evolution is usually referred to as “analogous.”
Several studies have addressed the problem of distinguishing structural similarity due to common origin from that due to convergent evolution. Russell et al.,5,7 for example, found that secondary structure and sequence similarity were more conserved in remote homologs than in analogs, and that substitution matrices derived from homologous proteins preserved amino acid chemical properties and performed quite well in homology recognition. The success rate in fold recognition experiments was also shown to be much higher for homologous than for analogous fold pairs.7,8
At the same time, it has been observed that the degree of conservation of chemical properties in proteins decreases quite rapidly with decreasing sequence similarity for both homologs and analogs, which makes their populations almost indistinguishable at large evolutionary distances.9,10
Indeed, several observations have indicated that various measures of pairwise sequence and structure similarity such as sequence identity, root-mean-square superposition residual (RMSD), the proportion of conserved side-chain contacts, and others do not distinguish well between remote homologs and analogs, which suggests that other aspects of protein similarity should be taken into account.4,10,11
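Two of the pairwise measures named above can be illustrated with a minimal sketch. It assumes that the two structures have already been superposed and that residue equivalences are given by an alignment; the function names `rmsd` and `sequence_identity` and the toy inputs are hypothetical, and the choice of aligned non-gap columns as the denominator for identity is only one of several conventions in use.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation over equivalent, already superposed positions.

    Assumption: the optimal superposition has been applied beforehand; this
    function only evaluates the residual.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def sequence_identity(aligned_a, aligned_b, gap="-"):
    """Fraction of identical residues over aligned (non-gap) columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


# Toy inputs, invented purely for illustration.
ca_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ca_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(ca_a, ca_b))
print(sequence_identity("ACDEF-G", "ACEEFHG"))
```

Both quantities are computed only over equivalent aligned positions, so neither carries any information about the nonaligned regions.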
The correct classification of homologous and analogous proteins requires sensitive measures of structural, sequence, or functional similarity. So far, the comparative analysis of proteins has focused primarily on regions that are recognizably conserved and can be aligned by various methods. The most commonly used similarity measures compare sequence and structural features at equivalent aligned positions. In any alignment, however, the conserved regions are separated by nonconserved ones, where the structures and sequences deviate locally from each other and do not superpose well. Such regions, which arise mostly through insertion or deletion (indel) events, appear not to be critical for structural integrity but may be crucial for inferring the phylogenetic history of a protein family. Modeling insertion–deletion events in evolution is particularly difficult, and many researchers simply ignore alignment uncertainty when reconstructing evolutionary events. Traditionally, insertions and deletions in sequence alignments have been scored with affine gap penalties, even though this simple model does not adequately describe the evolution of indels.12–14
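For readers unfamiliar with the affine model, the following minimal sketch shows how such a penalty is typically computed. The function name `affine_gap_penalty` and the parameter values are hypothetical, and the convention used here (the opening cost covers the first gapped residue) is only one of several in common use.

```python
def affine_gap_penalty(length, gap_open=10.0, gap_extend=1.0):
    """Penalty for a single gap of `length` residues under an affine model.

    Assumed convention: gap_open is charged once for the first gapped
    residue, gap_extend for every additional residue. The default values
    are illustrative only.
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


# One long gap is cheaper than two short gaps covering the same six residues.
print(affine_gap_penalty(6))      # 15.0
print(2 * affine_gap_penalty(3))  # 24.0
```

Under this model the penalty depends only on the number and length of the gaps, not on the evolutionary distance between the sequences or on the structural context of the indel.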
It was observed several years ago that the probability of a gap in the alignment of two homologous protein sequences is a function of the evolutionary distance between them, and that the number of residues in indels is linearly related to evolutionary distance.15,16
One possible explanation of this observation would suggest an incremental change in loops by stepwise insertion or deletion processes.17
At the same time, it was shown that most of the structural variation in aligned regions of homologous proteins is strongly correlated to the changes in sequence,9,18–21 while the structural variation among nonhomologous proteins is not coupled with the sequence similarity.20,21
Based on these observations, one might argue that closely related proteins should differ less in their nonaligned regions than distantly related ones, whereas the variability of loop regions in structural analogs should be higher than in homologous proteins and, in general, should not depend on evolutionary distance. One might therefore gauge protein relatedness from the degree of difference displayed by the nonconserved loop regions.
In this article, we describe a new similarity measure that takes into account the degree of structural difference in nonconserved, looped-out regions of proteins. The new measure is based on the Hausdorff metric, which is used in topology to define a distance between point sets of a metric space. Using a benchmark of homologous and analogous protein structures to assess performance, we compare the loop-based Hausdorff measure (LHM) with conventional quantities that score similarity in the aligned regions. We show that scoring based on the loop regions of protein domains can be as sensitive as conventional scoring in discriminating between analogous and homologous folds. Moreover, we show that the new similarity measure can be successfully applied to test the evolutionary relatedness of different proteins from the most populated superfolds.
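As background, the generic Hausdorff distance between two finite point sets can be sketched as follows. This is the textbook definition only, not the LHM itself, whose exact construction is not given in this section; the function names and the toy coordinates are hypothetical.

```python
import math


def directed_hausdorff(points_a, points_b):
    """Largest distance from any point of A to its nearest neighbor in B."""
    return max(min(math.dist(p, q) for q in points_b) for p in points_a)


def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed_hausdorff(points_a, points_b),
               directed_hausdorff(points_b, points_a))


# Toy 3D coordinates, invented for illustration: two "loops" that agree in
# their first two points and differ in the third.
loop_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
loop_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 3.0, 0.0)]

print(directed_hausdorff(loop_a, loop_b))  # 1.0
print(directed_hausdorff(loop_b, loop_a))  # 3.0
print(hausdorff(loop_a, loop_b))           # 3.0
```

Unlike RMSD, this distance requires neither a one-to-one residue correspondence nor point sets of equal size, which is what makes it a natural candidate for comparing regions that cannot be aligned. The directed distances are not symmetric, so the symmetric form takes their maximum.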