Determining the 3D structure of a protein greatly helps in unravelling its function and binding mechanisms. Such structural
information can also aid in designing mutagenesis experiments and can be utilized for structure-guided drug development or virtual
screening [1–2]. Since experimental
structures are available only for a small fraction of sequenced proteins, alternative strategies are required to predict reliable models
of protein structures when X-ray diffraction or NMR data are not yet available [3].
Among the strategies currently used for constructing 3D structures of proteins, comparative
modelling (also termed homology modelling) is the most accurate of the computational methods, yielding reliable models
[4–5].
Another approach, termed “ab initio” modelling, is not yet practical for the construction of reliable models
[6]. Usually, in comparative modelling the template is chosen by virtue of having
the highest level of sequence similarity with the target and a similar secondary and tertiary structure (that is, it belongs to the same
“fold”). Baker and Sali [7] have shown that a comparative model for
a protein of at least medium size with a sequence identity of less than 30% to the template crystal structure is unreliable.
However, the rule that the sequence identity score should exceed 30% does not specify how identity should be distributed along the sequence. The quality of
the models is assessed by comparing predicted structures to X-ray solved structures via superimposition and calculation of the atomic root mean square
deviation (RMSD). A model can be considered ‘accurate’ or ‘reliable’ when its RMSD is
less than 3–4 Å.
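As a minimal sketch of the RMSD assessment described above, the following Python snippet computes the atomic RMSD between a model and a reference structure. It assumes that the two coordinate sets contain the same atoms in the same order and have already been optimally superimposed (for example with a Kabsch-type fit, which is not performed here). The function names `rmsd` and `is_reliable` and the default cutoff of 3.0 Å (the lower end of the 3–4 Å range quoted above) are illustrative assumptions, not part of the original analysis.

```python
import math


def rmsd(coords_a, coords_b):
    """Atomic RMSD between two equally long lists of (x, y, z) tuples.

    Assumes both structures are already superimposed and that
    atom i in coords_a corresponds to atom i in coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(coords_a))


def is_reliable(rmsd_value, cutoff=3.0):
    """Hypothetical helper: apply an RMSD cutoff in angstroms (assumed 3.0)."""
    return rmsd_value < cutoff


# Toy example with two atoms; the second atom is displaced by 2 angstroms.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
value = rmsd(model, reference)
print(round(value, 3), is_reliable(value))
```

In practice the RMSD is usually computed over a chosen atom subset (for example Cα atoms only), and the choice of subset changes the value, so the cutoff should be interpreted relative to that choice.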
The comparative modelling procedure for protein structure prediction generally consists of a few steps. After identification of a
homologous protein with known 3D structure, a sequence alignment (scored by identity or similarity) is performed. Usually, the
structurally conserved regions (SCRs) are identified and coordinates for the core of the model are generated. Following the core
generation, one predicts the conformations of the structurally variable regions (termed loops) [8]
and adds the side chains [9]. Some approaches first align multiple known structures
and then identify structurally conserved regions to construct an average structure for modelling these regions of the query
protein. The optimal homology-based model is obtained when the correct template is chosen and each residue pair is correctly aligned in
the target-template sequence alignment [10].
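To illustrate the identity score used in the alignment step, the sketch below computes the percentage identity of a pairwise target-template alignment, together with a sliding-window profile that shows how the identity is distributed along the sequence. It assumes that the two sequences are already aligned, that gaps are written as `-`, and that identity is normalised by the number of gap-free columns; other normalisations (for example by the length of the shorter sequence) are also in use. The function names and the window size are hypothetical choices made for illustration.

```python
def percent_identity(aligned_target, aligned_template):
    """Percentage of identical residues over gap-free alignment columns."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == "-" or b == "-":
            continue  # skip columns that contain a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def windowed_identity(aligned_target, aligned_template, window=10):
    """Identity per sliding window of alignment columns (step of one column)."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    n = len(aligned_target)
    return [
        percent_identity(aligned_target[i:i + window],
                         aligned_template[i:i + window])
        for i in range(0, n - window + 1)
    ]


# Toy alignment: four identical residues out of five gap-free columns.
print(percent_identity("ACDE-G", "ACDQFG"))
print(windowed_identity("AAAA", "AAGG", window=2))
```

Two alignments with the same overall identity can have very different window profiles, for instance when all matches are concentrated in one domain, which is why a single global score may not fully characterise a candidate template.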
In this communication, we analysed a large set of 4753 sequence and structure alignments and tried to answer a few
questions: (1) Can we predict the accuracy of the modelled structure based on the sequence identity score? (2) Is it always justified to
select the protein with the highest identity score as a template for comparative modelling? (3) How can we improve the accuracy of
homology-based models?