This section contains (i) a machine learning method to detect HSFBs, using both local and global structure information; and (ii) a new scoring function and an optimization algorithm to search for the best MSA starting from the HSFBs.
2.1 HSFB
We use both local and global structure environments to determine how likely two residues should be aligned. To the best of our knowledge, no previous methods consider the global structure environment. The local structure environment of a residue consists of 10 structure segments of length 2k+1(k=1, 2,…, 10) centered at this residue. Given two residues of two different proteins, the similarity of their local structure environments is measured by 10 TMscore values, each measuring the similarity of two structure segments of the same size. The global structure environment of a residue i is defined as follows. Let N(i, d) denote all the residues in the same protein within distance cutoff d from i. The global structure environment of i (under a given distance cutoff d) consists of all the 5mer segments centered at the residues in N(i, d). Given two residues of different proteins, to calculate the similarity of their global structure environments, we first generate a rigid body transformation for them by minimizing the RMSD of two center 5mer segments. Then we fix this transformation and run dynamic programming to align them by maximizing TMscore. We generate 10 such global structure environment features, using 10 distance cutoff values 6, 7,…, 15 Å.
Modeling pairwise structure alignment using CRF: CRFs are probabilistic graphical models that have been applied to protein secondary structure prediction (
Wang et al., 2010), protein conformation sampling (
Zhao et al., 2010a;
Zhao et al., 2010b), protein sequence alignment (
Do et al., 2006) and protein threading (
Peng and Xu, 2010). Here, we use CRF to model protein structure alignment, which then is used to identify HSFBs among proteins under consideration.
Let
p,
q denotes two input proteins and their associated structural features (i.e. local and global structure environment similarity scores). Let
X={
M,
Ip,
Iq} be a set of three possible alignment states. Meanwhile,
M indicates that two residues are aligned and
Ip and
Iq indicate insertion at proteins
p and
q, respectively. Let
A={
a1,
a2,…,
aNA} denote an alignment between
p and
q where
ai
X represents the state (or label) at position
i and
NA denotes the length of this alignment. Our CRF model defines the conditional probability of an alignment
A given
p and
q as follows:
and θ={λ
1, λ
2,…, λ
d} is the model parameter and
F(
A,
p,
q,
i) is a function estimating the log-likelihood of an alignment at position
i:
where
ek(
ai−1,
ai) and ν
l(a
i,
p,
q,
i) are called edge and label feature functions, respectively. The edge features model the dependency of the state transition from alignment position
i −1 to
i. Here, we assume
ek(
ai−1,
ai) is independent of protein features to make our formulation simple.
ek(a
i−1,
ai) is equal to 1 if the transition
ai−1→
ai exists at position
i; otherwise 0. We forbid transition from
Iq to
Ip, so there are in total eight state transitions. That is,
k ranges from 1 to 8. The label features model the relationship between
ai and the local and global structure environment similarity scores at the alignment position
i. There are in total 20 different label feature functions, each corresponding to one local or global structure environment similarity score. Therefore,
l ranges from 1 to 20. To make the formulation simple, we assume an insertion state is independent of protein features, so modeling of insertions is implicitly taken into consideration in the edge feature functions.
We train the model parameters using a set of reference alignments taken from FSSP (
Holm and Sander, 1994). In particular, we randomly selected 50 pairs for training and the other 50 pairs for test. The training and test sets have no overlap with our benchmarks. All the structural alignments (i.e. reference alignments) were generated by DALI (
Holm and Sander, 1993) and each alignment has a DALI Z-score >8.0. The model parameters are initialized randomly to a value between 0 and 1 and the training process converges within ~100 iterations. Five-fold cross-validation is conducted to determine the regularization factor in the CRF model. Once the model parameters are determined, for a given protein pair
p and
q, we can generate their alignment by maximizing
Pθ(
A|
p,
q). This alignment may not be the best, but can be used as an initial alignment between
p and
q. Starting from these initial alignments, we can build very accurate pairwise alignments (
Supplementary Material).
As shown in
Supplementary Figure S1, the global features corresponding to distance cutoffs 7, 8, 14 and 15 Å have relatively large weight factors. This is consistent with the findings in
da Silveira et al. (2009), which shows that 7 Å is the best cutoff for distance-based contact definition. Both 14 and 15 Å may be interpreted as the distance cutoffs for second-order contacts. To speed up, we exclude six local structure environment features and eight global structure environment features with small weight factors from our final CRF model. By using only six features, we will not lose much accuracy in identifying HSFBs. The remaining two global features correspond to global structure environments at radius 8.0 and 14.0 Å. The remaining four local features correspond to structure segments with lengths 9, 13, 17 and 21, respectively.
Detecting HSFBs using CRF: given two protein structures, we can calculate the marginal probability of two fragments being aligned using the forward–backward algorithm (
Lafferty et al., 2001). In this article, we consider only fragments of length
L=12. Such a fragment is likely to cover at least one secondary structure segment. A slight change of
L will not impact the final result much. This marginal probability is defined as the similarity score of two structure fragments. Given a short fragment
F1 in protein
p1 and another protein
pi, let
Fi denote the fragment of the same size in
pi, which has the highest similarity score with
F1. All the fragments
F1,
F2,…,
FM form an HSFB with
F1 being the pivot fragment. A protein of size
N in total have
N −
L+1 HSFBs. Note that the ‘highest similarity’ relationship is asymmetric. That is, among all fragments in protein
pi,
Fi is the most similar one to
F1 may not imply that among all fragments of
p1,
F1 also is the most similar one to
Fi. Therefore, given
M proteins with lengths
N1,
N2,…,
NM, there are in total (∑
i=1MNi)−ML+
M HSFBs.
Ranking HSFBs by spatial consistency: two HSFBs may be geometrically inconsistent with each other. That is, we cannot superimpose well the fragments in both HSFBs using a single set of rigid body transformations. We can calculate the degree to which two HSFBs are geometrically consistent and then rank all the HSFBs according to their consistency with others. The HSFB with the highest consistency score is very likely contained in the best MSA. We use a simple method to estimate the consistency score of one HSFB as follows. For each fragment in the HSFB, we generate a rigid body transformation by minimizing the RMSD between this fragment (
Kabsch, 1976) and the pivot fragment. Let
B1={
F11,
F12,…,
F1M} denote the HSFB for which we want to calculate its consistency score and
F11 is the pivot fragment. Let
Ti(
i=2, 3,…,
M) denote the rigid body transformations derived from superimposing
F1i to
F11. Let
B2={
F21,
F22,…,
F2M} denote another HSFB. For any fragment
F2i (
i=2, 3,…,
M) in
B2, we superimpose
F2i to
F21 using the transformation
Ti and then calculate the RMSD between
F2i and
F21. If the distance is within 3 Å, we increase the consistency score of
B1 by 1, otherwise by 0.
2.2 Algorithm for MSA
Overview: as shown in , 3DCOMB first generates a list of pivot structures. By default, this list contains all the input proteins, so TopK is equal to M. For each pivot structure, 3DCOMB uses the CRF model to generate HSFBs, which are ranked by their spatial consistency scores and only the TopJ with the highest scores are extended to initial MSAs. 3DCOMB identifies those ‘unanchored’ proteins which are not well aligned to the pivot. To improve an initial MSA, 3DCOMB conducts TopF trials to realign each of the ‘unanchored’ proteins to the pivot. Finally, 3DCOMB refine the whole MSA based on the consensus structure derived from the MSA.
Scoring function: a good MSA should have a large number of cores (i.e. CORE-LEN) and also a small core RMSD. A core is a fully aligned column, consisting of one residue from each input protein. In addition, pairwise alignments in an MSA should also be of high quality. It is challenging to develop an algorithm that can optimize these criteria simultaneously since sometimes they contradict with one another. For example, a large CORE-LEN usually leads to a large RMSD value. A simple solution is to fix one criterion and then optimize the others, e.g. maximizing CORE-LEN while restricting RMSD. This solution is not very flexible in that we have to determine RMSD in advance, not to mention that neither CORE-LEN nor RMSD is the best measure.
We use

as the scoring function where

of an MSA is defined as the average TMscore of all the pairwise alignments implied in the MSA. TMscore is widely used to measure the pairwise structure similarity and the quality of a protein model. It is defined as follows:
Meanwhile, Lali is the alignment between protein
p and
q,
LS is the shorter protein length,
di is the deviation of two aligned residues and
d0 is the length-related normalization term.
Our scoring function has the following features: (i) it leads to an MSA with not only a large number of conserved cores, but also high-quality pairwise alignments; (ii) the distance between two aligned residues is used as denominator, so it favors the aligned residue pairs within small distance and disfavors or even ignores those with large deviation. This enables us to detect even a small conserved region among proteins of very different structures. In contrast, RMSD for the whole alignment, even normalized by alignment length, is (
Levitt and Gerstein, 1998;
Siew et al., 2000) greatly affected by a small number of badly aligned residue pairs and not sensitive in detecting small but conserved regions; (iii) TMscore is also better than the alignment length since the latter does not take into consideration the distance deviation of aligned residue pairs; (iv) TMscore is almost independent of the protein length (
Zhang and Skolnick, 2004) since the distance at each position is normalized by a length-dependent factor. This is particularly useful for the alignment of a large set of proteins with very different lengths; (v) TMscore is not only good for alignment, but also very sensitive in detecting fold-level similarity. As reported in
Xu and Zhang (2010), when TMscore >0.6, it is very likely (90% of chance) that two proteins have the same fold. When TMscore <0.4, it is very likely (90% of chance) that two proteins have different folds. When TMscore >0.5, there is 50% of chance that two proteins have similar folds.
Building an initial MSA from an HSFB: given an HSFB of M structures, we first generate a set of M − 1 rigid body transformations by superimposing each fragment in the HSFB to the pivot fragment and minimizing the RMSD of these two fragments. Then we superimpose each structure to the pivot structure using the transformation generated from fragment superimposition, and run dynamic programming to generate an alignment of the two structures by maximizing the TMscore. Finally, we assemble the M − 1 pairwise alignments into an initial MSA using the pivot structure as the anchor.
Adjustment of pairwise alignment: given the initial MSA, we may refine it by adjusting the pairwise alignment between each input structure and the pivot structure (
Wang and Zheng, 2008). First, we calculate the TMscore of the pairwise alignment between each input structure and the pivot. If TMscore <0.5 (
Xu and Zhang, 2010), this input structure is called ‘unanchored’. We adjust the alignment between each unanchored structure and the pivot using rigid body transformations derived from other
TopF HSFBs. In particular, for each top HSFB, let
F1 and
F2 denote the fragments in the HSFB belonging to the pivot and the unanchored structure, respectively. We realign the unanchored structure to the pivot structure using the rigid body transformation generated from minimizing the RMSD between
F1 and
F2. The pairwise alignment with the maximum TMscore is kept in the MSA.
Consensus-based MSA refinement: 3DCOMB refines an MSA by realigning each input structure to the consensus structure, which is constructed as follows. At each column of this MSA, we calculate the center of all the aligned residues (only Cα is considered). Second, we merge two neighbor columns into a single one if the following two conditions are satisfied: (i) the total number of aligned residues in these two columns is not more than the total number of input structures; and (ii) the distance between their two centers is <3.0 Å. We use 3.0 Å as the cutoff because in native protein structures, >99% of Cα−Cα virtual bonds are >3.0 Å. This merge procedure is repeated until no columns can be merged. The consensus structure consists of all the centers. This refinement procedure is repeated until a given number of iterations or the scoring function cannot be improved further.