The transmembrane β-barrel (TMBB) is one of two major structural classes of membrane-spanning proteins; TM helical bundles are the other. TMBBs are found in the outer membranes of Gram-negative bacteria, mitochondria and chloroplasts, while TM helical bundles are found in the cytoplasmic membranes of all living organisms. Although genes that encode TMBBs are estimated to represent at least 3% of all protein-coding genes in Gram-negative bacteria, TMBBs represent only 1% of the solved protein structures from Gram-negative organisms. As a rapidly expanding number of genomic sequences becomes available, using in silico methods to identify previously unknown TMBBs is an appealing alternative to more difficult and time-consuming experimental methods such as crystallography. Computational TMBB prediction methods can identify candidate genes so that experimental validation or structural proteomics can be performed on a more focused population. These methods also provide the opportunity to identify and characterize TMBBs that may not be expressed under standard culturing conditions and thus would go unobserved by traditional screening methods such as proteomic analysis.
Computational prediction methods have been used to predict TM helices with an accuracy of 99% for nearly a decade. TM helices are simple stretches of 19–25 hydrophobic residues, which can be predicted with near-perfect accuracy using experimentally determined hydrophobicity scales; an example of such a program is MPEX (Jayasinghe et al.; Snider et al.). However, the prediction of TMBBs presents a more difficult challenge due to the cryptic nature of the TMBB structure (Wimley, 2002). The TMBB structure is a series of anti-parallel β-strands that are arranged in a cylindrical geometry forming a structure that resembles a barrel (Schulz, 2000). The TM β-strands of TMBBs consist of ~10 amino acids arranged in an alternating, dyad repeat pattern of hydrophobic and hydrophilic residues, where the hydrophobic side-chains face the lipid environment and the hydrophilic side-chains face the interior of the β-barrel. The β-hairpin, which is the major structural unit of the TMBB, is a pair of anti-parallel TM β-strands connected by a short loop of 3–7 residues (i.e., hairpin turn). The β-hairpins are connected to each other by loops of varying length. The complexities and irregularities in the structure including the variations in loop length and composition, deviations from the pattern of hydrophobicity in some β-strands, and the low information content (e.g., only five hydrophobic residues in a TM strand) make the identification of TMBBs especially problematic (Wimley, 2003).
A wide variety of TMBB prediction algorithms utilize machine learning methods, ranging from Bayesian networks to k-nearest neighbor methods. Machine learning methods are designed to identify the common features of the TMBBs in a training dataset as well as features that distinguish TMBBs from other types of proteins. The distinguishing variables, as interpreted by the algorithm, are used as rules to classify a test sequence (Gromiha and Suwa, 2006). Although these methods can yield reasonable TMBB prediction accuracies (64–97%), their predictions are still less reliable than those made for TM helical bundles (Gromiha and Suwa, 2006; Hu and Yan, 2008). Besides achieving less than ideal prediction accuracy, a major disadvantage of machine learning methods is that they cannot be used for hypothesis testing: the variables used to make the predictions are either hidden or arbitrary, so there is no discernible link between the variables and the physicochemical properties of the experimentally solved TMBB structures.
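To make the machine learning approach concrete, the following is a minimal sketch of a k-nearest neighbor classifier. It assumes amino acid composition as the only feature and Euclidean distance as the similarity measure; published methods use richer features, and the training sequences and labels here are made up for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Fraction of each of the 20 standard amino acids in the sequence."""
    counts = Counter(seq)
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


def knn_predict(query, training, k=3):
    """Label the query by majority vote among its k nearest training sequences.

    `training` is a list of (sequence, label) pairs.
    """
    q = composition(query)
    ranked = sorted(
        (math.dist(q, composition(seq)), label) for seq, label in training
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up training set, for illustration only.
toy_training = [
    ("LSVNLSVSYT", "TMBB"),
    ("VTLSVNYRLS", "TMBB"),
    ("YSLTVNLSVR", "TMBB"),
    ("LLIVALLGAV", "other"),
    ("AEELLKKAEE", "other"),
    ("KKEEDDRRKK", "other"),
]

prediction_a = knn_predict("LTVSLNYSVR", toy_training)
prediction_b = knn_predict("KEELLKKAEE", toy_training)
```

The classification here depends only on distance in a 20-dimensional composition space. Nothing in the decision refers to strand length, dyad repeats or hairpin turns, which illustrates why such variables are difficult to relate to the physicochemical properties of solved structures.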
A TMBB prediction algorithm based on the physicochemical properties of TMBBs was developed in this lab (Wimley, 2002). This algorithm is based on an analysis of the structure and composition of known TMBBs. The algorithm identifies the positions of TM β-strands using a simple pattern-recognition scheme, which utilizes the statistical amino acid abundance data derived from known structures. The observed amino acid abundances from the TM β-strands are compared to the expected genomic abundance, and the difference between the two abundances yields information about patterns and composition unique to the TM segments of TMBBs. The algorithm uses the resulting abundance values to identify 10-residue-long β-strands with dyad repeat patterns. Next, adjacent β-strands are scored for β-hairpin-forming potential, and the β-hairpin scores are combined in a function that assigns each protein sequence a single β-barrel score. The β-barrel score is a rating of the overall propensity of the sequence to fold into a TMBB.
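The sequence of steps described above (abundance values, strand scores, hairpin scores, barrel score) can be sketched as follows. This is only an illustration of the general scheme and not the published algorithm: the log-ratio form of the abundance value, the way the two dyad phases are scored, the pairing rule for hairpins and the length normalization of the barrel score are all assumptions, and the residue counts are toy numbers over a reduced five-letter alphabet.

```python
import math

STRAND_LEN = 10            # strand length given in the text
TURN_MIN, TURN_MAX = 3, 7  # hairpin-turn length given in the text


def log_abundance(observed, expected):
    """Log ratio of observed to expected relative abundance per residue (assumed form)."""
    obs_total = sum(observed.values())
    exp_total = sum(expected.values())
    return {aa: math.log((observed[aa] / obs_total) / (expected[aa] / exp_total))
            for aa in observed}


def strand_score(segment, lipid, interior):
    """Score a segment under both dyad phases and keep the better one."""
    phase_a = sum((lipid if i % 2 == 0 else interior).get(aa, 0.0)
                  for i, aa in enumerate(segment))
    phase_b = sum((interior if i % 2 == 0 else lipid).get(aa, 0.0)
                  for i, aa in enumerate(segment))
    return max(phase_a, phase_b)


def strand_scores(seq, lipid, interior):
    """Strand score for every 10-residue window in the sequence."""
    return [strand_score(seq[i:i + STRAND_LEN], lipid, interior)
            for i in range(len(seq) - STRAND_LEN + 1)]


def hairpin_scores(scores):
    """Pair each strand with the best partner strand that follows a 3-7 residue turn."""
    result = []
    for i, score in enumerate(scores):
        partners = []
        for turn in range(TURN_MIN, TURN_MAX + 1):
            j = i + STRAND_LEN + turn
            if j < len(scores):
                partners.append(scores[j])
        if partners:
            result.append(score + max(partners))
    return result


def barrel_score(seq, lipid, interior):
    """Sum of positive hairpin scores, normalized by sequence length (assumed form)."""
    hairpins = hairpin_scores(strand_scores(seq, lipid, interior))
    return sum(h for h in hairpins if h > 0) / len(seq)


# Toy counts over a reduced alphabet; not real abundance data.
GENOMIC = {"L": 20, "V": 20, "S": 20, "N": 20, "G": 20}
LIPID = log_abundance({"L": 40, "V": 30, "S": 10, "N": 5, "G": 15}, GENOMIC)
INTERIOR = log_abundance({"L": 10, "V": 10, "S": 35, "N": 30, "G": 15}, GENOMIC)

barrel_like = "LSVNLSVSLNGGGG" * 4   # alternating strands separated by 4-residue turns
hydrophobic = "LVLLVVLVLLVLVV" * 4   # uniformly hydrophobic, no dyad repeat

score_barrel = barrel_score(barrel_like, LIPID, INTERIOR)
score_hydrophobic = barrel_score(hydrophobic, LIPID, INTERIOR)
```

With these toy tables, the alternating sequence accumulates positive hairpin scores, while the uniformly hydrophobic sequence is penalized at every interior-facing position and scores close to zero. Separate tables for lipid-facing and interior-facing positions are a design choice of this sketch; the text itself states only that abundance values are derived from the TM β-strands.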
The initial goal of this work was to rigorously evaluate the performance of this algorithm because it was intended to make predictions for genomic sequences, which will be listed in an annotated database. The performance of the original algorithm was evaluated using a non-redundant protein database (NRPDB) with 14 238 proteins of known structure from the Protein Data Bank (PDB; Berman et al.). Each sequence was given a β-barrel score, which was used as a threshold-dependent binomial classifier to identify each sequence as either a TMBB or non-TMBB. Using the NRPDB as a stringent test set, the performance of the original prediction algorithm, as well as that of other prediction algorithms, was unsatisfactory because of very high rates of false positive predictions.
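The effect of false positives in a test set dominated by non-TMBBs can be illustrated with a small sketch of threshold-based classification. The scores, labels and threshold below are made up and do not reproduce any result from the NRPDB; the choice of metrics (accuracy, false positive rate, precision and the Matthews correlation coefficient) is likewise an assumption of this example.

```python
import math


def confusion(scores, labels, threshold):
    """Count (TP, FP, TN, FN) when sequences scoring >= threshold are called TMBBs."""
    tp = fp = tn = fn = 0
    for score, is_tmbb in zip(scores, labels):
        predicted = score >= threshold
        if predicted and is_tmbb:
            tp += 1
        elif predicted:
            fp += 1
        elif is_tmbb:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Summary statistics for a binary classifier; undefined ratios are reported as 0.0."""
    total = tp + fp + tn + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Toy test set: 2 TMBBs among 100 sequences (made-up scores).
toy_scores = [0.9, 0.8] + [0.7] * 10 + [0.1] * 88
toy_labels = [True, True] + [False] * 98

toy_counts = confusion(toy_scores, toy_labels, threshold=0.5)
toy_metrics = metrics(*toy_counts)
```

In this toy set every TMBB is recovered and the accuracy is 90%, yet only 2 of the 12 sequences called TMBBs are real. When true positives are rare, a modest false positive rate is enough to swamp them, which is why a test set made up mostly of non-TMBBs is a stringent one.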
The algorithm described in this work was developed to address the specific weaknesses in the ability of the original algorithm to discriminate against non-TMBBs. The modified algorithm, which we call the Freeman–Wimley algorithm, showed a substantial improvement in accuracy, from 87% to 99%, when analyzing the NRPDB. The accuracy of the Freeman–Wimley algorithm is comparable to the accuracy of TM helix prediction and exceeds the accuracy of other TMBB prediction methods. Furthermore, an analysis of the Escherichia coli genome has revealed that the Freeman–Wimley algorithm is more efficient at distinguishing TMBBs from non-TMBBs in genomic databases than in the NRPDB. This work represents significant progress in the computational identification of genomic TMBB sequences.