Today, biologists are faced with rapidly growing amounts of unknown sequence and structure data related to protein databases. Taking advantage of efficient analysis tools, biologists are highly motivated to derive biological insights from these biomolecules. Sequence comparison tools are commonly used to determine the similarities between proteins with a high degree of similarity, whereas structure comparison methods are essentially utilized to highlight the evolutionary relationships among proteins. Additionally, scientists consider the biological role for these macromolecules as being strongly dependent on their 3D-structure, which has attracted their interest to employ accurate and reliable structure comparison tools with respect to such molecules.
Considering the large amount of unknown data which exists in structural biology, efficient powerful tools are needed to investigate, analyze and classify the properties and functionalities of this data. Furthermore, several tools are available for structural analysis and comparison of biological data. The lack of a universal measurement standard for the evaluation of these methods has persuaded biologists to use different tools and scoring functions in their inquiries. This diversity has provided facilities for biologists to extract their required information for query data more efficiently.
In the structure comparison problem, determining the structural alignment of a protein pair is a fundamental step. The simplest case of the problem occurs when an initial correspondence map between the residue pairs is provided by a sequence alignment procedure. However, providing this initial correspondence map is not possible for protein pairs with low sequence identity. Thus, the methods generally involve comparisons without any prior specified correspondence between residues. Therefore, there is a problem finding the optimal alignment of two structures in unbounded dimensions known as NP-hard [1
]. Many algorithms have been developed using heuristic strategies to compare geometrical coordinates of the Cα
backbone atoms in order to find the best optimal correspondence between residue pairs. The techniques include distance matrices comparison (DALI) [2
], vector alignment of secondary structure alignment (VAST) [3
], combinatorial extension (CE) [4
], matching molecular models obtained from theory (MAMMOTH) [5
], secondary structure matching (SSM) [6
], dynamic programming on TMscore
rotation matrix (TM-align) [7
], genetic algorithm for non-sequential gapped protein structure alignment (GANGSTA) [8
] and many others ([9
]). Several comprehensive reviews and evaluation of the methods have been reported in literatures ([13
In order to overcome the complexity of the structure comparison problem and the difficulties of searching a large structure database, existing methods commonly proceed to represent protein structure in a summarized form. Recently, various methods have been explored to model a 3D-structure of protein in a linear 1D-sequence, in which known sequence alignment tools like FASTA [16
], BLAST [17
] or other language modelling techniques are employed to capture structural similarities among proteins. The methods include protein structure modelling in a set of topology strings (TOPSCAN) [18
], hidden Markov model derived structural alphabet (SA-Search) [19
], representing discrete internal angles of protein backbone as a sequence (YAKUSA) [20
], kappa-alpha (κ, α) plot derived structural alphabet and BLOSUM-like substitution matrix (3D-BLAST) [21
], Structural similarity search by Ramachandran codes (SARST) [22
], structure-to-string translator and a hash table to store n-grams (Lajolla) [23
] and protein structure representation as a bag-of-words of backbone fragments (FragBag) [24
]. Linear encoding techniques adopted from these methods commonly reduce running time of the algorithms as they run hundreds of times faster than geometrical methods like CE [21
]. Additionally, sequence-based schemes are more relevant to be extended for use in multiple structure alignment, fold recognition and genomic annotation studies [22
]. However, 3D-structure conversion into 1D-sequence leads to lose some of the structural details of proteins. Consequently, these methods obtain lower alignment accuracy when compared to highly accurate geometrical search tools. Accordingly, it is a challenging task to develop an algorithm with which the speed advantage of linear encoding techniques with regard to their competitive alignment accuracy can be gained.
Recently, a novel text modelling approach has been developed by this group [25
] in which the structural comparison and alignment of biomolecules can be carried out. The algorithm summarizes protein 3D-structure in a textual sequence and applies a cross-entropy concept over n-gram modelling in order to capture similarities between protein sequences. Considering the fruitful results of this method in terms of accuracy and running speed, an extension of the approach, hereby named TS-AMIR, is introduced to be used for secondary structure modelling in a topology string with the ultimate goal of developing a structural protein alignment tool. TS-AMIR is a simple and fast method with comparable accuracy to state of the art programs.