The MetaTM algorithm
At the top level, the consensus prediction is split into two major parts: (1) the segments consensus for finding TM segments and signal peptides (SPs), and (2) the N-terminal consensus. The latter determines whether the N-terminal end of the amino-acid sequence is located on the cytoplasmic or non-cytoplasmic side (also referred to as inside and outside, respectively) of the membrane. The two parts are predicted independently by two different SVM models and afterwards combined into a final consensus topology.
The segments consensus can be roughly subdivided into the following steps. Initially, the method scans the results of all incorporated predictors towards the C-terminus for the first occurring segment (see Figure ). If such a segment is found, segments from the other predictors that overlap with it are detected. This can also be thought of as applying a window reaching from the beginning of the first segment to its end, and then looking for other segments that intersect with this window (see Figure ). Subsequently, the SVM segment model predicts the consensus, which can be either a TM segment, an SP or no segment (i.e. loop). We termed this procedure voting.
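The scanning and window steps described above can be sketched as follows. This is a minimal illustration, not the MetaTM implementation: the names `first_segment` and `overlapping` are hypothetical, and segments are assumed to be inclusive `(start, end)` residue positions, one list per predictor.

```python
from typing import List, Optional, Tuple

Segment = Tuple[int, int]  # assumed: inclusive (start, end) residue positions


def first_segment(predictions: List[List[Segment]]) -> Optional[Tuple[int, Segment]]:
    """Return (predictor index, segment) for the segment that starts first."""
    best: Optional[Tuple[int, Segment]] = None
    for idx, segments in enumerate(predictions):
        for seg in segments:
            if best is None or seg[0] < best[1][0]:
                best = (idx, seg)
    return best


def overlapping(window: Segment, predictions: List[List[Segment]]) -> List[List[Segment]]:
    """For each predictor, list the segments that intersect the window."""
    w_start, w_end = window
    return [
        [s for s in segments if s[0] <= w_end and s[1] >= w_start]
        for segments in predictions
    ]


# Three hypothetical predictors, as in the simplified figure.
predictions = [[(10, 30), (50, 70)], [(15, 35)], [(60, 80)]]
found = first_segment(predictions)
if found is not None:
    _, window = found
    print(window, overlapping(window, predictions))
```

The window is simply the extent of the first segment, so a predictor with no intersecting segment contributes an empty list to the vote.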
Figure 1 Segments consensus workflow. For clarity only three predictors are drawn. The orange elements represent predicted TM segments. (A) Scanning the results for the first segment. (B) Detecting overlapping segments and voting whether the group of overlapping ...
Before the SVM segment model can predict the consensus, the results of the incorporated predictors for each window have to be encoded as a vector. Here they are represented by a nine-dimensional vector of boolean values: six for the TM topology predictors, two for the SP predictors, and one that indicates whether the current window is the first for the current query sequence. This last value is an additional indicator for the prediction of signal peptides, as they can only appear at the N-terminal end of a sequence (and therefore only within the first window of a query sequence).
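As a rough sketch of this encoding, the function below builds the nine-dimensional input for one window. The function name, the ordering of the components, and the reading that each predictor's boolean means "this predictor has a segment intersecting the window" are assumptions made for illustration; the text only fixes the number and kind of values.

```python
from typing import List, Sequence


def encode_window(tm_hits: Sequence[bool],
                  sp_hits: Sequence[bool],
                  is_first_window: bool) -> List[int]:
    """Encode one voting window as a 9-dimensional 0/1 vector.

    tm_hits: one flag per TM topology predictor (six expected).
    sp_hits: one flag per SP predictor (two expected).
    is_first_window: True if this is the first window of the query sequence.
    """
    if len(tm_hits) != 6 or len(sp_hits) != 2:
        raise ValueError("expected 6 TM flags and 2 SP flags")
    # Assumed ordering: TM flags, then SP flags, then the first-window flag.
    return [int(b) for b in tm_hits] + [int(b) for b in sp_hits] + [int(is_first_window)]


vector = encode_window(
    tm_hits=[True, True, False, True, False, False],
    sp_hits=[True, False],
    is_first_window=True,
)
print(vector)
```

A vector of this shape could then be passed to any binary or multi-class SVM implementation; the choice of library and kernel is not specified here.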
If the voting was positive (i.e. either an SP or a TM segment should be added to the consensus prediction), the averages of the overlapping segments' start positions and of their end positions are calculated. If a window contains both SPs and TM segments, only those segments that belong to the class predicted by the SVM model are used for the averaging (see also Figure ). All segments used for the prediction of the consensus segment are then masked so that they are not used in subsequent predictions. Afterwards the rest of the sequence is scanned for the next segment (see Figure ), and the cycle starts again from the beginning until no more unmasked segments are present.
As shown in Figure , if the voting result for a given group of overlapping segments is negative (i.e. the SVM model predicts a loop), only the first segment is masked and excluded from the further prediction process (see also Figure ). All overlapping segments are masked only if the voting result is positive (see also Figure ). This increases the chance of detecting consensus segments, because the remaining segments of a rejected group can still take part in a later vote.
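The scan, vote, average and mask cycle can be sketched as below. This is a simplified illustration under stated assumptions, not the published implementation: the SVM is replaced by a caller-supplied `vote` function, only TM segments are handled (the SP/TM class filter is omitted), segments are inclusive `(start, end)` pairs, and averages are rounded to whole residue positions. The name `consensus_segments` is hypothetical.

```python
from typing import Callable, List, Tuple

Segment = Tuple[int, int]  # assumed: inclusive (start, end) residue positions


def consensus_segments(predictions: List[List[Segment]],
                       vote: Callable[[List[Segment]], bool]) -> List[Segment]:
    """Run the scan/vote/mask cycle with a stand-in voting function."""
    # Flatten to (start, end, predictor index), ordered towards the C-terminus.
    pool = sorted((s, e, p) for p, segs in enumerate(predictions) for s, e in segs)
    masked = [False] * len(pool)
    consensus: List[Segment] = []
    while not all(masked):
        # The first unmasked segment defines the voting window.
        first = next(i for i, m in enumerate(masked) if not m)
        w_start, w_end, _ = pool[first]
        group = [i for i, (s, e, _) in enumerate(pool)
                 if not masked[i] and s <= w_end and e >= w_start]
        if vote([pool[i][:2] for i in group]):
            # Positive vote: average starts and ends, then mask the whole group.
            starts = [pool[i][0] for i in group]
            ends = [pool[i][1] for i in group]
            consensus.append((round(sum(starts) / len(starts)),
                              round(sum(ends) / len(ends))))
            for i in group:
                masked[i] = True
        else:
            # Negative vote: mask only the segment that opened the window.
            masked[first] = True
    return consensus


# Four hypothetical predictors; the stand-in vote requires three overlapping segments.
preds = [[(5, 20)], [(18, 40)], [(22, 42)], [(24, 44)]]
print(consensus_segments(preds, lambda group: len(group) >= 3))
```

In this example the first window (5 to 20) gathers only two segments and is rejected. Because only its opening segment is masked, the segment at 18 to 40 remains available and later forms a group of three that passes the vote.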
Figure 2 The masking procedure. For clarity only four predictors are drawn. The orange elements represent predicted TM segments. It is assumed in this example that three overlapping TM segments need to intersect with the voting window to have a positive consensus ...
N-terminal location consensus
The N-terminal consensus in MetaTM is reached by a voting mechanism based on a second SVM model. Each predictor contributes to the result by voting either for N-terminus on the inside (cytoplasmic side) or N-terminus on the outside (non-cytoplasmic side). The results are encoded as an eight-dimensional vector with the following boolean values: six for the TM topology predictors, where 0 stands for the N-terminus being located on the inside and 1 for the outside, and two for the SP prediction of PolyPhobius and SignalP, respectively (1 if an SP has been predicted, otherwise 0). The last two values assist the N-terminal prediction such that the occurrence of an SP automatically leads to an outside N-terminal location. This is due to the biological fact that SPs are cleaved off from the remainder of the protein after it has been inserted across the membrane.
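A minimal sketch of the N-terminal encoding follows, together with one plausible way the predicted N-terminal side could be combined with the consensus TM segments. The function names are hypothetical. The second function relies on the general property that each membrane-spanning segment moves the chain to the opposite side of the membrane; it is an assumption about how the combination step might look, not a description of MetaTM's code.

```python
from typing import List, Sequence


def encode_n_terminal_votes(tm_outside: Sequence[bool],
                            polyphobius_sp: bool,
                            signalp_sp: bool) -> List[int]:
    """Encode the votes as an 8-dimensional 0/1 vector.

    tm_outside: six flags, 0 = N-terminus inside, 1 = N-terminus outside.
    polyphobius_sp, signalp_sp: 1 if that predictor reported an SP.
    """
    if len(tm_outside) != 6:
        raise ValueError("expected votes from 6 TM topology predictors")
    return [int(v) for v in tm_outside] + [int(polyphobius_sp), int(signalp_sp)]


def loop_sides(n_terminal_outside: bool, n_tm_segments: int) -> List[str]:
    """Label the N-terminal region and each following loop (assumed alternation)."""
    sides = []
    outside = n_terminal_outside
    for _ in range(n_tm_segments + 1):
        sides.append("outside" if outside else "inside")
        outside = not outside  # each TM segment crosses the membrane once
    return sides


print(encode_n_terminal_votes([False, True, True, False, True, True], True, False))
print(loop_sides(n_terminal_outside=False, n_tm_segments=2))
```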
Comparison with single predictors
The prediction accuracy of MetaTM was assessed based on a data set containing 1460 TM protein sequences with known topologies and 2362 globular proteins, both with and without SPs [see Additional file 1]. This is the largest data set used for benchmarking TM topology predictors so far. To uncover strengths and weaknesses of MetaTM and other predictors, the data set was split into six categories (see Table ).
First, the quality of the N-terminal location prediction was assessed on the four categories that contain TM protein sequences (see Table ). PolyPhobius was clearly the best single method for sequences with signal peptides, as a predicted SP automatically leads to an N-terminus located on the outside. On the other hand, Memsat was the superior method for sequences without SPs. Although MetaTM reached almost the same accuracy as PolyPhobius in the first category and matched PolyPhobius in the second, its predictions were slightly less accurate than Memsat's in the latter two. However, since both PolyPhobius and Memsat predict rather poorly on two of the four sets, the overall performance of our consensus method was 5.6 and 8.3 percentage points better, respectively. Although it might not be obvious at first, all predictors contribute positively to the MetaTM result to some extent. Even TopPred, the weakest method, was able to tip the scales in favor of the correct prediction from time to time.
N-terminal location prediction results
The next comparison addressed the prediction of the correct number of TM segments on all six categories (see Table ). Again, PolyPhobius delivered very good results for sequences with SPs, but MetaTM was even better than PolyPhobius in two of the three SP data sets and equally good in the third. For TM proteins without preceding signal peptides, Memsat and HMMTOP were the best among the single predictors. In these categories, too, MetaTM performed very well: it tied with HMMTOP for the best result in TMsingleNoSP and was only slightly behind HMMTOP in TMmultiNoSP. For sequences with neither SPs nor TM segments (i.e. those in GLBnoSP), TMHMM reached the highest prediction accuracy and our consensus method the second highest. On average, MetaTM performed better than all single predictors, followed by PolyPhobius (1.9 percentage points less accurate) and Memsat (12.0 percentage points less accurate).
Number of TM segments prediction results
The prediction of the entire TM topology (i.e. the N-terminal location and the TM segments, where each predicted TM segment has to overlap the experimentally determined one by at least 5 residues) can be considered the most demanding task in TM topology prediction (see Table ). The results largely resemble a combination of the N-terminal location comparison and the TM segment number comparison. In this test, the most important one, the performance of MetaTM was remarkably good. It was the best method in four of the six categories and reached second place in the remaining two. On average the consensus method was 4.4 percentage points better than PolyPhobius, which took second place, and 12.6 percentage points more accurate than Memsat, which was third in this comparison.
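The correctness criterion for an entire topology can be sketched as a small check. The sketch assumes inclusive residue coordinates, that predicted and observed segments are paired in sequence order, and that the number of segments must match; the last two points are implied rather than stated in the text, and the function names are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # assumed: inclusive (start, end) residue positions


def overlap_length(a: Segment, b: Segment) -> int:
    """Number of residues shared by two inclusive segments (0 if disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def topology_correct(predicted: List[Segment],
                     observed: List[Segment],
                     predicted_n_outside: bool,
                     observed_n_outside: bool,
                     min_overlap: int = 5) -> bool:
    """True if the N-terminal side matches and every segment pair overlaps enough."""
    if predicted_n_outside != observed_n_outside:
        return False
    if len(predicted) != len(observed):
        return False
    return all(overlap_length(p, o) >= min_overlap
               for p, o in zip(predicted, observed))


# A predicted segment 10-30 shares residues 26-30 with an observed segment 26-45.
print(overlap_length((10, 30), (26, 45)))
print(topology_correct([(10, 30)], [(26, 45)], True, True))
```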
Entire topology prediction results
None of the comparisons discussed so far directly involved the SP prediction, simply because most of the methods do not predict signal peptides. However, it is possible to assess the signal peptide prediction accuracy for PolyPhobius, SignalP and MetaTM. In Table the prediction behavior of these three methods is shown. While SignalP misses fewer signal peptides than PolyPhobius and MetaTM, it also over-predicts more (4.5 percentage points less accurate than MetaTM on average). MetaTM and PolyPhobius deliver quite similar results, although our consensus method is slightly better (1.2 percentage points on average).
Signal peptide prediction results
Comparison with previous consensus predictors
We wanted to compare MetaTM's results with ConPred II [25], the most sophisticated of the existing consensus predictors. Unfortunately, the program is not available for local use, and an evaluation via its web interface was not feasible. Due to these limitations, a comparison between the two consensus methods could only be carried out by comparing MetaTM's results on the data set described in the ConPred II paper [25] with ConPred's results reported in the same paper. It has to be mentioned that this data set is rather small (231 sequences) and only contains TM proteins without signal peptides. Thus, this comparison is far from complete. As one can see in Table , MetaTM and ConPred perform similarly on the N-terminal location prediction and the number of correctly predicted TM segments, although MetaTM achieved a slightly higher accuracy (1.8 and 0.5 percentage points better, respectively). However, when predicting the entire topology MetaTM was 2.6 percentage points better than ConPred. While MetaTM was always better than any single predictor, ConPred performed slightly worse than PolyPhobius in the case of entire topology prediction.
ConPred II data set prediction results