As the volume of diverse biological data continues to grow, it is important to devise new methods of analysis and to improve existing ones. The ability to detect that two proteins have diverged from a common ancestor allows one to infer functional similarity between them. A common way to identify similarity between proteins is to use sequence alignment tools such as FASTA [1] and BLAST [2], which provide an alignment of two sequences and a score indicating whether the alignment is significant or could be attributed to chance. Comparing protein structures allows one to peer farther back into evolutionary time, based on the concept that structure remains similar long after sequence similarity has become undetectable [3]. There are many methods [7] and databases [16] currently available for protein structure comparison. While the performance of the available methods and databases is for the most part satisfactory, it is not unusual for them to miss biologically related protein structures that can be identified by human inspection. One may consider two directions when attempting to improve the detection of structural similarity. The first is to improve the similarity search method itself, either by using a novel approach for constructing an alignment or by optimizing an existing method. The second is to improve the definition of the objects that the methods compare. Although at first the former may appear more fruitful, a great deal can be done with the data itself.
It has long been understood that proteins have an intermediate level of organization, typically called a domain, that is larger than secondary structure and smaller than the full-length chain of amino acids [20]. This fact considerably complicates sequence and structural alignment, because two long proteins may share a similar domain that is much smaller than either entire protein. Ideally we want to recognize this situation, but it is difficult to detect true similarity of small subregions while excluding the small similarities that occur by chance. One part of the solution lies in testing the statistical significance of alignment scores or other similarity measures; even so, small but important similarities can be missed. Another part of the solution, which is possible in the case of structure comparison, is to identify the smaller subregions of potential similarity (the domains) and to compare them directly.
Thus, it becomes critical to identify the domains appropriately before performing structure similarity searches. Structurally compact domains are currently used for computing related structures in MMDB. Recent studies comparing several structure-based domain parsers against expert-curated structural domain boundaries have indicated the limitations of the different methods and potential improvements [23]. Here we ask two questions: "How often do structurally identified domain boundaries disagree with those determined by sequence conservation?" and "Does either domain type perform better in structure similarity searches when disagreement occurs?"
In this study, taking advantage of the availability of large collections of manually curated domains based on conservation among sequences across a protein family, we investigated how the structure search performance of VAST is affected by using sequence-based versus structure-based protein domains. A sequence-based domain can generally be defined as an evolutionarily conserved region of a protein. Domains of this type are identified as similar blocks of residues occurring in several proteins. There are currently many databases of such domains, such as Pfam [25] and the Conserved Domain Database (CDD) [28], generally built from multiple sequence alignments and hidden Markov model methods. A structure-based domain can be defined as a three-dimensionally isolated region, and is considered by some to correspond to a compact folding unit. Structure-based domains are generally identified by manual inspection of a protein structure, as is the case with the Structural Classification of Proteins (SCOP) database [17], or by computationally delineating compact substructures, as is done in MMDB. Although in most cases the domains derived from sequence and from structure are consistent, there are times when the boundaries do not agree. An example of a common domain boundary disagreement can be seen in human ABL1 tyrosine kinase (PDB id: 2FO0, chain: A) (Figure 1) and is observed in many kinases. In this instance, the MMDB structure domain parser has divided the chain into four domains, as there are four geometrically distinct regions. However, two of the structure domains occur together, with similar residue content, across a diverse range of species in sufficient instances to be identified by sequence analysis as a single domain, the tyrosine kinase catalytic domain (CDD id: cd00192). In this case, we want to investigate whether combining both structure-based domains into a single domain allows detection of similarity to other kinases while avoiding detection of similarity to unrelated structures in which smaller regions share a common arrangement of helices and strands.
Figure 1. Example of sequence/structure domain disagreement due to difference in concept. The structure of ABL1 tyrosine kinase (PDB id: 2FO0, chain A) and the sequence domains SH3, SH2, and tyrosine kinase C-terminal region (CDD ids: cd00174, cd00173, cd00192).
In this work, we first systematically compared the domain boundaries of the sequence-based domains in the Conserved Domain Database with the structure-based domains in the medium-redundancy subset of MMDB. We identified a noticeable fraction of sequence-based domains that differ significantly from those derived from structural compactness. The new domains were then used as queries to identify related structures using VAST, and changes in the structure similarity search results were analyzed. Using SCOP as a standard of truth, we observed interesting cases in which the new domain boundaries perform better than the original domains in terms of homologous structure recognition. We also found that the overall performance of sequence domains is comparable to that of whole-chain and structure-domain-based queries.
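
To make the notion of a boundary disagreement concrete, the following is a minimal sketch of one way such a comparison could be scored. It is an illustration only and not the criterion used in this study: the overlap measure (shared residues divided by the union of residues), the 0.8 threshold, the function names, and all residue ranges are assumptions chosen for the example. The ranges are hypothetical and are not the actual boundaries of 2FO0.

```python
from typing import List, Set, Tuple

# A domain is a list of inclusive residue ranges, so that
# discontinuous domains can be represented as several segments.
Segment = Tuple[int, int]
Domain = List[Segment]


def residues(domain: Domain) -> Set[int]:
    """Expand a domain into the set of residue positions it covers."""
    covered: Set[int] = set()
    for start, end in domain:
        covered.update(range(start, end + 1))
    return covered


def overlap_fraction(a: Domain, b: Domain) -> float:
    """Shared residues divided by the union of residues (Jaccard index)."""
    ra, rb = residues(a), residues(b)
    union = ra | rb
    return len(ra & rb) / len(union) if union else 0.0


def boundaries_disagree(seq_domain: Domain,
                        struct_domains: List[Domain],
                        threshold: float = 0.8) -> bool:
    """True when no single structure domain matches the sequence domain.

    The threshold is an assumed cutoff, not a value from the study.
    """
    return all(overlap_fraction(seq_domain, sd) < threshold
               for sd in struct_domains)


# Hypothetical example: one sequence-based domain that spans two
# structure-based domains, in the spirit of the kinase case above.
seq_domain: Domain = [(101, 300)]
struct_domains: List[Domain] = [[(1, 100)], [(101, 180)], [(181, 300)]]

# Merging the two structure domains yields a single candidate domain.
merged: Domain = struct_domains[1] + struct_domains[2]

if __name__ == "__main__":
    print(boundaries_disagree(seq_domain, struct_domains))
    print(overlap_fraction(seq_domain, merged))
```

In this toy case neither structure-based domain alone covers the sequence-based domain well enough to count as a match, while the merged domain coincides with it exactly. A symmetric overlap measure is used here so that a small domain nested inside a much larger one is not scored as agreement; other choices, such as comparing boundary positions directly, would be equally reasonable.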