Particular architecture used
The key information for developing the PROFtmb architecture originated from a careful interpretation of the details in eight experimental high-resolution structures of barrels. The idea was to encode the 3D structures through a discrete set of ‘structure states’. In doing this, we assumed that residues in similar micro-environments (such as the ‘aromatic cuff’) or following similar structural ‘grammars’ (e.g. all residues in the first position of a four-residue periplasmic beta-hairpin) would share selective pressure, and thus have a strongly biased residue composition. This was a natural extension from the observations of the ‘aromatic cuff’ and ‘hydrophobic belt’ (1). Each discrete structural state was represented as an ‘architectural state’ in the HMM (rectangles, Fig. ). Once each residue in the training set had been labelled with a particular structure state, the grammar followed as a consequence: if the state ‘aromatic cuff’ is followed by the state ‘extra-cellular loop’ in the training set, this directed connection is specified in the HMM architecture (arrows, Fig. ). A few specific features were naturally modelled by this approach. First, the aromatic cuff states faced outward toward the lipid bilayer by definition. We defined all other beta-strand states (whether embedded in the membrane or overhanging on either side) relative to this position. Thus, the alternating pattern (pore-facing, lipid-facing, …) was implicitly modelled. Secondly, the variable lengths of strands overhanging on either side of the outer membrane were modelled simply by the presence of additional states that were all connected directly to the states ‘outer loop’, ‘inner loop’ and ‘hairpin’ (Fig. , dashed rectangles/lines). Note that we did explicitly use the observation of the enrichment of tyrosine and phenylalanine in the latitude described as the extra-cellular aromatic cuff, to help determine its position. Technically, we gave the states two-letter names (more details in Materials and Methods and Supplementary Material).
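How a labelling of the training residues fixes the allowed connections can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the two-letter state names and the toy label paths below are hypothetical and are not the names or training data used in PROFtmb.

```python
from collections import defaultdict

def derive_transitions(labelled_paths):
    """Collect every directed state-to-state connection observed in the labels.

    Each path is a list of structure-state labels, one label per residue.
    """
    allowed = defaultdict(set)
    for path in labelled_paths:
        for current, following in zip(path, path[1:]):
            allowed[current].add(following)
    return {state: sorted(nexts) for state, nexts in allowed.items()}

# Hypothetical two-letter labels (assumptions, not the PROFtmb names):
# IL = inner (periplasmic) loop, UP = up-strand, AC = aromatic cuff,
# OL = outer (extra-cellular) loop, DN = down-strand.
toy_paths = [
    ["IL", "UP", "UP", "AC", "OL", "OL", "DN", "DN", "IL"],
    ["IL", "UP", "AC", "OL", "DN", "IL"],
]

transitions = derive_transitions(toy_paths)
for state, nexts in transitions.items():
    print(state, "->", nexts)
```

In this toy example the cuff state connects only to the outer loop, because that is the only succession that occurs in the labels; connections that never occur in the labelling are simply absent from the topology.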
Per-residue performance: most residues predicted correctly
In terms of per-residue accuracy, our method predicted ~80% of all strand residues correctly, reaching a Matthews correlation coefficient (equation 5) as high as 0.7 (Table ). Many methods developed on high-resolution structures are evaluated on the protein sequence deposited in the PDB. These sequences often constitute only fragments of the full-length protein (65). For TMBs, the major differences are that PDB sequences miss N-terminal residues including, but not restricted to, the signal peptide. Additionally, the sequence for OmpA (PDB identifier: 1qjp) lacked 154 C-terminal residues. Therefore, we also examined our method on the full-length protein sequences taken from SWISS-PROT. As it turned out, the performance differed between the two sets (SetTMB versus SetTMBfull in Table ). The overall two-state per-residue accuracy was higher for the full-length sequences simply because most additional residues were trivially recognized as being ‘not membrane strand’. Overall, our method behaved very differently for full-length proteins: for PDB sequence fragments it over-predicted and for full-length sequences from SWISS-PROT it slightly under-predicted residues in membrane strands. At the same time, the observed strands were predicted even more accurately for full-length proteins (Table ). The non-realistic over-prediction on the PDB data set was also the main difference in per-residue accuracy between our HMM-based method and the one published previously by Martelli and colleagues (36) (Table : SetTMBcomp). Martelli and colleagues did not evaluate their method on full-length proteins. Since their HMM-based method was not publicly available during this work’s development, we could not explore whether the difference that we observed is generic or particular to our method.
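The per-residue measures discussed above can be illustrated with a short Python sketch. It assumes the standard definitions of two-state accuracy (Q2) and of the Matthews correlation coefficient, which may differ in detail from equation 5; the label strings are hypothetical toy data (`S` = membrane strand, `-` = other), and the function name is our own.

```python
import math

def two_state_scores(observed, predicted, positive="S"):
    """Return (Q2, MCC, predicted/observed strand-residue ratio)."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    tp = fp = tn = fn = 0
    for obs, pred in zip(observed, predicted):
        if pred == positive and obs == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif obs == positive:
            fn += 1
        else:
            tn += 1
    q2 = (tp + tn) / len(observed)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    # A ratio above 1 indicates over-prediction, below 1 under-prediction.
    ratio = (tp + fp) / (tp + fn) if (tp + fn) else float("nan")
    return q2, mcc, ratio

# Toy "fragment" (hypothetical labels)
observed = "--SSSSSS----SSSSSS--"
predicted = "--SSSSSSS---SSSSS---"
q2, mcc, ratio = two_state_scores(observed, predicted)

# Toy "full-length" version: 20 extra residues, all trivially non-strand
pad = "-" * 20
q2_full, mcc_full, ratio_full = two_state_scores(pad + observed, pad + predicted)
print(f"fragment Q2={q2:.2f}, full-length Q2={q2_full:.2f}")
```

The padded example shows why the overall two-state accuracy rises on longer sequences even when the strand predictions themselves are unchanged: the added residues only increase the count of correctly predicted non-strand positions.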
Per-residue accuracy for different methods and different data sets
Detailed four-state model surprisingly accurate
Although our HMM internally represents many structural states that correspond to the barrel grammar (Fig. ), the actual per-residue predictions were obtained by collapsing all these states into two (membrane-strand/other). Usually, two-state predictions reach numerically higher values than, e.g. four-state predictions because the random background is higher (over-simplified: random is 50% for two states and 25% for four). We were thus surprised to observe that our four-state model was almost as accurate as the two-state reduction (Table ). In fact, PROFtmb was extremely successful in distinguishing between upward- and downward-strands and between periplasmic- and outer loops (bold-face in Table ). For example, 1171 residues were correctly predicted as membrane-strand, and only 14 of these confused the states up-strand and down-strand. Similarly, only 15 of the 1706 correctly predicted non-membrane-strand residues confused periplasmic and outer loops. The latter may be due to the strong difference in length distributions of the (short) periplasmic loops and (long) outer loops.
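The relation between the four-state and the collapsed two-state view can be sketched as follows. The one-letter labels and toy strings are hypothetical (`U` up-strand, `D` down-strand, `I` periplasmic loop, `O` outer loop); only the four counts at the end are taken from the text above.

```python
# Hypothetical one-letter four-state labels mapped to two states
FOUR_TO_TWO = {"U": "S", "D": "S", "I": "-", "O": "-"}

def collapse(labels):
    """Reduce a four-state label string to membrane-strand ('S') / other ('-')."""
    return "".join(FOUR_TO_TWO[x] for x in labels)

def accuracy(observed, predicted):
    """Fraction of positions at which the two label strings agree."""
    return sum(o == p for o, p in zip(observed, predicted)) / len(observed)

observed = "IIUUUUOOODDDDII"
predicted = "IIUUUDOOODDDDIO"   # one up/down swap and one loop-type swap
q4 = accuracy(observed, predicted)
q2 = accuracy(collapse(observed), collapse(predicted))

# Counts quoted in the text
strand_correct, strand_confused = 1171, 14
other_correct, other_confused = 1706, 15
two_state_correct = strand_correct + other_correct
four_state_correct = two_state_correct - strand_confused - other_confused
retained = four_state_correct / two_state_correct
print(f"Q4={q4:.3f}, Q2={q2:.3f}, retained={retained:.4f}")
```

With the quoted counts, 2848 of the 2877 residues that are correct in two states are also correct in four states, i.e. about 99%, which is why the four-state accuracy stays close to the two-state value.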
Confusion matrix on four-state predictions
Multiple sequence alignments improved performance
Profiles from multiple sequence alignments contain important information about protein structure. In particular, using alignment information improved our model on average by about 18 percentage points in terms of two-state per-residue accuracy. Nevertheless, such profiles also constitute a source of noise, arising from alignment errors. Given the dramatic improvement due to the use of profiles, it is likely that additional improvement is attainable through more careful construction of alignments and extraction of profiles.
About 45% coverage at 100% accuracy in finding TMB
Although we made a special effort to curate our large data set used to establish the false positive rates of our method (SetROC), we found only minor differences between this large SWISS-PROT-based set and the data set taken from the PDB (SetROCcomp, Fig. ). The two data sets gave significantly different ROC scores (82.5 versus 54.9%). Also, the larger SWISS-PROT data (SetROC) yielded a much larger standard deviation in a simple bootstrap (66) experiment (14.5 versus 3%). We used the larger data set to estimate the accuracy for whole-protein discrimination.
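A simple bootstrap estimate of the spread of a ROC-type score can be sketched as below. This is an illustration under assumptions, not the procedure used here: the score is taken to be the area under the ROC curve, the number of resamples and the seed are arbitrary, and the scores and labels are hypothetical toy data.

```python
import random
import statistics

def roc_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_std(scores, labels, n_resamples=200, seed=0):
    """Standard deviation of the score over resamples drawn with replacement."""
    rng = random.Random(seed)
    indices = range(len(scores))
    estimates = []
    while len(estimates) < n_resamples:
        sample = [rng.choice(indices) for _ in indices]
        ys = [labels[i] for i in sample]
        if all(ys) or not any(ys):
            continue  # resample lacks one class; draw again
        estimates.append(roc_auc([scores[i] for i in sample], ys))
    return statistics.stdev(estimates)

# Hypothetical whole-protein scores; 1 = TMB, 0 = non-TMB
scores = [12.1, 9.4, 8.2, 3.1, -0.5, 2.0, -1.2, -4.8, -6.0, 1.5]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
auc = roc_auc(scores, labels)
spread = bootstrap_std(scores, labels)
print(f"AUC={auc:.2f}, bootstrap std={spread:.3f}")
```

Fixing the seed makes the estimate reproducible; the spread reported for a real data set depends on its size and class balance.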
High accuracy and coverage above whole-protein score of 10
Applying our method to entirely sequenced proteomes required introducing a threshold in the reliability of the prediction. This threshold reduced the number of incorrectly predicted membrane barrels. Our largest data set (SetROC) suggested that all proteins identified above whole-protein discrimination scores of 10 (equation 6) were indeed TMBs (100% accuracy, Fig. ). At this threshold ~45% of the TMB proteins in the data set were correctly identified (45% coverage, Fig. ). Although we did not thoroughly characterise the distribution of scores using standard statistical methods (such as Z-scores), we feel that the whole-protein score suffices as a reasonable estimate for accuracy and coverage, especially considering the very conservative choice of cut-off thresholds that we used.
Figure 3. Threshold for accurate discrimination. Higher whole-protein discrimination scores (equation 6) yielded higher accuracy (correctly predicted TMBs/predicted TMBs) in discriminating between TMBs and non-TMBs (black line with filled circles).
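The trade-off between accuracy and coverage at a given score threshold can be sketched as follows. Accuracy follows the definition in the caption (correctly predicted TMBs/predicted TMBs); coverage is assumed to be correctly predicted TMBs/observed TMBs. The scores and labels are hypothetical toy values, not data from SetROC.

```python
def accuracy_and_coverage(scores, is_tmb, threshold):
    """Accuracy and coverage for proteins scoring above the threshold."""
    predicted = [y for s, y in zip(scores, is_tmb) if s > threshold]
    true_pos = sum(predicted)
    total_tmb = sum(is_tmb)
    accuracy = true_pos / len(predicted) if predicted else float("nan")
    coverage = true_pos / total_tmb if total_tmb else float("nan")
    return accuracy, coverage

# Hypothetical whole-protein scores; 1 = TMB, 0 = non-TMB
scores = [12.0, 10.5, 9.0, 8.5, 6.0, 3.0, 9.5, 4.0, -2.0, -5.0]
is_tmb = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

acc_10, cov_10 = accuracy_and_coverage(scores, is_tmb, threshold=10)
acc_8, cov_8 = accuracy_and_coverage(scores, is_tmb, threshold=8)
print(f"score > 10: accuracy={acc_10:.2f}, coverage={cov_10:.2f}")
print(f"score >  8: accuracy={acc_8:.2f}, coverage={cov_8:.2f}")
```

Lowering the threshold in the toy data admits one non-TMB, so accuracy drops while coverage rises; this is the same qualitative behaviour that motivates a conservative cut-off.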
Case study for known TMBs
We ran PROFtmb on six known TMBs proposed to us. These were (number of transmembrane strands predicted in parentheses): adhesin AIDA-I precursor from E.coli (20 strands), S-layer protein, putative from Deinococcus radiodurans (30 strands), hypothetical protein TM0476 from Thermotoga maritima (36 strands), putative exported protein from Yersinia pestis (10 strands), SomB from Synechococcus sp. (16 strands) and S-layer protein precursor from Thermus aquaticus (no per-residue prediction given). PROFtmb identified the first four with scores over 8, corresponding to 95% accuracy. SomB was given a score of 3.15, corresponding to 90% accuracy. Finally, S-layer protein precursor obtained a score of –4.8 (accuracy 25%), too low for PROFtmb to provide a per-residue prediction. It is apparent that PROFtmb over-predicted the number of transmembrane strands in at least two of these proteins (TM0476 and S-layer protein), but gave reasonable per-residue predictions for the other three proteins.
Most TMBs appear to be known
We collected all proteins in each fully sequenced proteome of 72 Gram-negative, 15 ‘typical’ and five ‘atypical’ (Mycolata) Gram-positive bacteria (Fig. ). We also defined sets with ‘integral outer membrane’ proteins (IOM), their homologues (IOM_homo), outer-membrane proteins (OM), and their homologues (OM_homo; see Materials and Methods). We applied PROFtmb to all proteins in all proteomes and retained predictions with scores >8 (corresponding to 95% accuracy and 45% coverage, Fig. ). While PROFtmb identified 46% (69/148) of the experimentally known IOM proteins, it identified only 28% (388/1388) of the proteins that might have been labelled as ‘IOM’ based on sequence similarity alone (set of homologues). The significant discrepancy between these two results (IOM/IOM_homo) suggested that homology-based inference alone is likely to generate too many false positives. In contrast, PROFtmb identified only ~16% (91/560) of the proteins labelled as ‘outer membrane’ and only 5% (191/3829) of their homologues. Most likely this low percentage reflects a combination of actual TMB proteins missed by PROFtmb and peripheral outer membrane proteins that were not annotated precisely enough. Finally, PROFtmb found 164 new proteins at the 95% accuracy score cut-off which, to the best of our knowledge, had previously neither been annotated as outer membrane nor been similar in sequence to any outer membrane protein, not even at liberal PSI-BLAST E-values <0.01.
Closer inspection of 164 new findings
Of the 164 completely novel finds, only two were from ‘typical’ Gram-positive bacteria, and thus false positives; all others originated from only 34 of the 72 Gram-negative proteomes. Those with six or more new proteins were: Vibrio vulnificus (14 proteins), Vibrio parahaemolyticus (12 proteins), Xanthomonas campestris (12 proteins), Shewanella oneidensis (12 proteins), Xanthomonas axonopodis (nine proteins), E.coli (eight proteins), Bacteroides thetaiotaomicron (eight proteins) and Yersinia pestis (seven proteins). ‘Atypical’ Gram-positive bacteria have outer membranes nearly twice as thick as those of Gram-negatives, composed of mycolic acid and a variety of extractible lipids, and contain pore-forming proteins (73). A 1.7 nm electron microscopic image of MspA from Mycobacterium smegmatis revealed the pore to be 10 nm, in contrast to the ~4 nm pores of Gram-negative bacteria (75). Recently, the first structure of a mycobacterial outer membrane protein was solved (76). Though it presents invaluable new information, we did not attempt to include it in our model for this study. Among the ‘atypical’ Gram-positives, PROFtmb identified four previously unidentified proteins. These were: conserved hypothetical protein (GI 15805996) and hypothetical protein (GI 15805156) from D.radiodurans, PPE (GI 15610479) from M.tuberculosis, and secreted endo-1,4-beta-xylanase B (GI 21220761) from Streptomyces coelicolor. Additionally, PROFtmb correctly identified a single protein, S-layer protein, putative (GI 15807560) from D.radiodurans. However, very permissive sequence searches picked 32 proteins from these five proteomes that had some sequence similarity to known IOMs. PROFtmb detected none of these, possibly because the sequence similarity threshold was too permissive (hence the findings constituted false positives), but more likely because TMBs traversing thicker membranes may have to differ in detail and hence might not be modelled accurately by PROFtmb.
Comparison to other methods
We compared our findings in E.coli to those of Zhai and Saier (35). Their BBF program identifies 118 proteins: 47 previously known TMBs and 71 additional unknown proteins. PROFtmb identifies 54 proteins in E.coli: 30 IOMs, 16 OMs (with no annotation as regards ‘integral or peripheral’; see Materials and Methods) and eight previously unknown. Between BBF and PROFtmb, only 24 proteins were commonly identified, with only one of those (yjbH protein precursor, GI 7451212) previously unknown. While this discrepancy reflects the substantial differences in the two procedures and in the stringency of the cut-off thresholds used, the small overlap (24 out of 118 and 54, respectively) is still surprising. If accurate, however, BBF is complementary to PROFtmb, at least for E.coli. We applied PROFtmb to eukaryotic proteins, but found that the program failed to accurately identify putative TMB proteins in these organisms. Since PROFtmb was trained exclusively on bacterial TMBs, failure to detect any eukaryotic TMBs is most likely due to the significantly different statistics of these structures. Attempting to address this problem, Schleiff et al. use a pipeline to identify TMBs in the outer membrane of the chloroplast of A.thaliana. However, the authors did not report any explicit predictions.
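For the E.coli comparison between BBF and PROFtmb, the size of the overlap can be put into perspective with a few lines of arithmetic. The three counts are those quoted in the text; the Jaccard index is our own choice of summary and is not used in the original analysis.

```python
bbf_total = 118      # proteins identified by BBF in E.coli
proftmb_total = 54   # proteins identified by PROFtmb in E.coli
shared = 24          # proteins identified by both programs

union = bbf_total + proftmb_total - shared
frac_of_bbf = shared / bbf_total
frac_of_proftmb = shared / proftmb_total
jaccard = shared / union

print(f"{frac_of_bbf:.1%} of BBF hits, {frac_of_proftmb:.1%} of PROFtmb hits, "
      f"Jaccard index {jaccard:.1%}")
```

The shared proteins make up about 20% of the BBF hits and about 44% of the PROFtmb hits, out of 148 distinct proteins identified by either program.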