The evolution of proteins is one of the fundamental processes that has delivered the diversity and complexity of life we see around us today. While we tend to define protein evolution in terms of sequence-level mutations, insertions and deletions, it is hard to translate these processes to a more complete picture incorporating a polypeptide's structure and function. By considering how protein structures change over time we can gain an entirely new appreciation of their long-term evolutionary dynamics. In this work we seek to identify how populations of proteins at different stages of evolution explore their possible structure space. We apply an annotation of superfamily age to this space and explore the relationship between these ages and a diverse set of properties pertaining to a superfamily's sequence, structure and function. We note several marked differences between the populations of newly evolved and ancient structures, such as in their length distributions, secondary structure content and tertiary packing arrangements. In particular, many of these differences suggest a less elaborate structure for newly evolved superfamilies when compared with their ancient counterparts. We show that the structural preferences we report are not a residual effect of a more fundamental relationship with function. Furthermore, we demonstrate the robustness of our results, using significant variation in the algorithm used to estimate the ages. We present these age estimates as a useful tool to analyse protein populations. In particular, we apply this in a comparison of domains containing greek key or jelly roll motifs.
Proteins are the molecular workers of the cell. They are formed from a string of amino acids which folds into an elaborate three-dimensional structure. While there is a relationship between a protein's sequence and its structure, this relationship is highly complex and not fully understood. Protein structures tend to evolve differently from their sequences. They are far more conserved and so tend to change more slowly. The aim of this paper was to identify trends in the way that protein structures evolve, rather than adapting models of sequence evolution. To do this we have provided a database of ages for structural superfamilies. These ages are robust to drastic differences in the evolutionary assumptions underlying their estimation and can be used to study differences between populations of proteins. For example, we have compared newly evolved structures against those with a long evolutionary history and found that, overall, a shorter evolutionary history corresponds to a less elaborate structure. We have also demonstrated here how these ages can be used to compare particular structural motifs present in a large number of protein structures and have shown that the jelly roll motif is significantly younger than the greek key.
Consistently predicting biopolymer structure at atomic resolution from sequence alone remains a difficult problem, even for small sub-segments of large proteins. Such loop prediction challenges, which arise frequently in comparative modeling and protein design, can become intractable as loop lengths exceed 10 residues and if surrounding side-chain conformations are erased. Current approaches, such as the protein local optimization protocol or kinematic inversion closure (KIC) Monte Carlo, involve stages that coarse-grain proteins, simplifying modeling but precluding a systematic search of all-atom configurations. This article introduces an alternative modeling strategy based on a ‘stepwise ansatz’, recently developed for RNA modeling, which posits that any realistic all-atom molecular conformation can be built up by residue-by-residue stepwise enumeration. When harnessed to a dynamic-programming-like recursion in the Rosetta framework, the resulting stepwise assembly (SWA) protocol enables enumerative sampling of a 12 residue loop at a significant but achievable cost of thousands of CPU-hours. In a previously established benchmark, SWA recovers crystallographic conformations with sub-Angstrom accuracy for 19 of 20 loops, compared to 14 of 20 by KIC modeling with a comparable expenditure of computational power. Furthermore, SWA gives high accuracy results on an additional set of 15 loops highlighted in the biological literature for their irregularity or unusual length. Successes include cis-Pro touch turns, loops that pass through tunnels of other side-chains, and loops of lengths up to 24 residues. Remaining problem cases are traced to inaccuracies in the Rosetta all-atom energy function. In five additional blind tests, SWA achieves sub-Angstrom accuracy models, including the first such success in a protein/RNA binding interface, the YbxF/kink-turn interaction in the fourth ‘RNA-puzzle’ competition. 
These results establish all-atom enumeration as an unusually systematic approach to ab initio protein structure modeling that can leverage high performance computing and physically realistic energy functions to more consistently achieve atomic accuracy.
Lattice models are a common abstraction used in the study of protein structure, folding, and refinement. They are advantageous because the discretisation of space can make extensive protein evaluations computationally feasible. Various approaches to the protein chain lattice fitting problem have been suggested but only a single backbone-only tool is available currently. We introduce LatFit, a new tool to produce high-accuracy lattice protein models. It generates both backbone-only and backbone-side-chain models in any user defined lattice. LatFit implements a new distance RMSD-optimisation fitting procedure in addition to the known coordinate RMSD method. We tested LatFit's accuracy and speed using a large nonredundant set of high resolution proteins (SCOP database) on three commonly used lattices: 3D cubic, face-centred cubic, and knight's walk. Fitting speed compared favourably to other methods and both backbone-only and backbone-side-chain models show low deviation from the original data (~1.5 Å RMSD in the FCC lattice). To our knowledge this represents the first comprehensive study of lattice quality for on-lattice protein models including side chains while LatFit is the only available tool for such models.
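The distance RMSD mentioned above can be illustrated with a minimal sketch. It assumes the common definition of dRMSD (root mean square difference between all corresponding intra-model pairwise distances), which needs no superposition; the exact objective used inside LatFit may differ, and the coordinates below are hypothetical.

```python
import math
from itertools import combinations


def distance_rmsd(coords_a, coords_b):
    """Distance RMSD between two equally long lists of (x, y, z) points.

    Compares every pairwise distance within model A with the matching
    pairwise distance within model B, so no superposition is required.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("both models need the same number of points")
    if len(coords_a) < 2:
        raise ValueError("need at least two points")
    squared_diffs = []
    for i, j in combinations(range(len(coords_a)), 2):
        d_a = math.dist(coords_a[i], coords_a[j])
        d_b = math.dist(coords_b[i], coords_b[j])
        squared_diffs.append((d_a - d_b) ** 2)
    return math.sqrt(sum(squared_diffs) / len(squared_diffs))


# Hypothetical C-alpha trace and a cubic-lattice fit with 3.8 Å spacing.
original = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.6, 0.0), (8.6, 4.8, 0.0)]
lattice = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (7.6, 3.8, 0.0)]
print(f"dRMSD = {distance_rmsd(original, lattice):.2f} Å")
```

Because only internal distances are compared, the value is unchanged by rotating or translating either model, which is one reason dRMSD is convenient when the lattice orientation is itself being optimised.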
Membrane proteins are estimated to be the targets of 50% of drugs that are currently in development, yet we have few membrane protein crystal structures. As a result, for a membrane protein of interest, the much-needed structural information usually comes from a homology model. Current homology modelling software is optimized for globular proteins, and ignores the constraints that the membrane is known to place on protein structure. Our Memoir server produces homology models using alignment and coordinate generation software that has been designed specifically for transmembrane proteins. Memoir is easy to use, with the only inputs being a structural template and the sequence that is to be modelled. We provide a video tutorial and a guide to assessing model quality. Supporting data aid manual refinement of the models. These data include a set of alternative conformations for each modelled loop, and a multiple sequence alignment that incorporates the query and template. Memoir works with both α-helical and β-barrel types of membrane proteins and is freely available at http://opig.stats.ox.ac.uk/webapps/memoir.
Protein-protein interfaces hold the key to understanding protein-protein interactions. In this paper we investigated local interaction network patterns beyond pair-wise contact sites by considering interfaces as contact networks among residues. A contact site was defined as any residue on the surface of one protein which was in contact with a residue on the surface of another protein. We labeled the sub-graphs of these contact networks by their amino acid types. The observed distributions of these labeled sub-graphs were compared with the corresponding background distributions and the results suggested that there were preferred chemical patterns of closely packed residues at the interface. These preferred patterns point to biological constraints on physical proximity between those residues on one protein which were involved in binding to residues which were close on the interacting partner. Interaction interfaces were far from random and contain information beyond pairs and triangles. To illustrate the possible application of the local network patterns observed, we introduced a signature method, called iScore, based on these local patterns to assess interface predictions. On our data sets iScore achieved 83.6% specificity with 82% sensitivity.
Loops are irregular structures which connect two secondary structure elements in proteins. They often play important roles in function, including enzyme reactions and ligand binding. Despite their importance, their structure remains difficult to predict. Most protein loop structure prediction methods sample local loop segments and score them. In particular, protein loop classifications and database search methods depend heavily on local properties of loops. Here we examine the distance between a loop’s end points (span). We find that the distribution of loop span appears to be independent of the number of residues in the loop; in other words, the separation between the anchors of a loop does not increase with an increase in the number of loop residues. Loop span is also unaffected by the secondary structures at the end points, unless the two anchors are part of an anti-parallel beta sheet. As loop span appears to be independent of global properties of the protein we suggest that its distribution can be described by a random fluctuation model based on the Maxwell–Boltzmann distribution. It is believed that the primary difficulty in protein loop structure prediction comes from the number of residues in the loop. Following the idea that loop span is an independent local property, we investigate its effect on protein loop structure prediction and show how normalised span (loop stretch) is related to the structural complexity of loops. Highly contracted loops are more difficult to predict than stretched loops.
Protein structure; Protein loop; Protein structure prediction; Protein loop structure; Protein loop structure prediction; Protein; Loop stretch; Loop span
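Loop span and loop stretch can be sketched as below. Span is taken as the distance between the C-alpha atoms of the two anchor residues. The normalisation is an assumption made for illustration: span is divided by an upper bound for a fully extended chain, using the typical 3.8 Å spacing of consecutive C-alpha atoms. The published definition of loop stretch may use a different maximum, and the coordinates are hypothetical.

```python
import math

CA_CA_DISTANCE = 3.8  # Å, typical spacing of consecutive C-alpha atoms


def loop_span(anchor_start, anchor_end):
    """Distance in Å between the C-alpha atoms of the two loop anchors."""
    return math.dist(anchor_start, anchor_end)


def loop_stretch(span, n_loop_residues, ca_ca=CA_CA_DISTANCE):
    """Span divided by an assumed maximum span for a loop of this length.

    Assumption: the maximum is an upper bound for a fully extended chain,
    one C-alpha step per loop residue plus one step to the second anchor.
    """
    if n_loop_residues < 1:
        raise ValueError("a loop needs at least one residue")
    max_span = ca_ca * (n_loop_residues + 1)
    return span / max_span


# Hypothetical anchors of an 8-residue loop.
span = loop_span((12.1, 4.0, -3.2), (20.5, 9.3, 1.1))
print(f"span = {span:.1f} Å, stretch = {loop_stretch(span, 8):.2f}")
```

Under this normalisation a stretch near 1 corresponds to an almost fully extended loop, while small values correspond to the contracted loops that the abstract reports as harder to predict.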
Alternative splicing has the potential to increase the diversity of the transcriptome and proteome. Where more than one transcript arises from a gene they are often so different that they are quite unlikely to have the same function. However, it remains unclear if alternative splicing generally leads to a gene being involved in multiple biological processes or whether it alters the function within a single process. Knowing that genetic interactions occur between functionally related genes, we have used them as a proxy for functional versatility, and have analysed the sets of genes of two well-characterised model organisms: Caenorhabditis elegans and Drosophila melanogaster. Using network analyses we find that few genes are functionally homogeneous (only involved in a few functionally-related biological processes). Moreover, there are differences between alternatively spliced genes and genes with a single transcript; specifically, genes with alternative splicing are, on average, involved in more biological processes. Finally, we suggest that factors other than specific functional classes determine whether a gene is alternatively spliced.
Male factor and idiopathic infertility contribute significantly to global infertility, with abnormal testicular gene expression considered to be a major cause. Certain types of male infertility are caused by failure of the sperm to activate the oocyte, a process normally regulated by calcium oscillations, thought to be induced by a sperm-specific phospholipase C, PLCzeta (PLCζ). Previously, we identified a point mutation in an infertile male resulting in the substitution of histidine with proline at position 398 of the protein sequence (PLCζH398P), leading to abnormal PLCζ function and infertility.
METHODS AND RESULTS
Here, using a combination of direct-sequencing and mini-sequencing of the PLCζ gene from the patient and his family, we report the identification of a second PLCζ mutation in the same patient resulting in a histidine to leucine substitution at position 233 (PLCζH233L). Predictive 3D modelling indicates that this mutation disrupts local protein interactions in a manner similar to PLCζH398P, and cRNA injection into mouse oocytes showed that it exhibits abnormal calcium oscillatory ability. We show that PLCζH233L and PLCζH398P exist on distinct parental chromosomes, the former inherited from the patient's mother and the latter from his father. Neither mutation was detected utilizing custom-made single-nucleotide polymorphism assays in 100 fertile males and females, or 8 infertile males with characterized oocyte activation deficiency.
Collectively, our findings provide further evidence regarding the importance of PLCζ at oocyte activation and forms of male infertility where this is deficient. Additionally, we show that the inheritance patterns underlying male infertility are more complex than previously thought and may involve maternal mechanisms.
infertility; oocyte activation; sperm; phospholipase C zeta (PLCzeta); inheritance
The ability to predict the effect of mutations on protein stability is important for a wide range of tasks, from protein engineering to assessing the impact of SNPs to understanding basic protein biophysics. A number of methods have been developed that make these predictions, but assessing the accuracy of these tools is difficult given the limitations and inconsistencies of the experimental data. We evaluate four different methods based on the ability of these methods to generate consistent results for forward and back mutations, and examine how this ability varies with the nature and location of the mutation. We find that, while one method seems to outperform the others, the ability of these methods to make accurate predictions is limited.
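The forward/back consistency test described above rests on a thermodynamic identity: the stability change of a mutation and of its exact reverse should sum to zero. A minimal sketch of such a check follows; the function names and the predicted values are hypothetical and do not come from any of the evaluated methods.

```python
def antisymmetry_error(ddg_forward, ddg_reverse):
    """Deviation from ddG(A->B) + ddG(B->A) = 0, in the units supplied."""
    return ddg_forward + ddg_reverse


def mean_absolute_inconsistency(pairs):
    """Average absolute antisymmetry error over (forward, reverse) pairs."""
    if not pairs:
        raise ValueError("need at least one forward/reverse pair")
    return sum(abs(antisymmetry_error(f, r)) for f, r in pairs) / len(pairs)


# Hypothetical predictions (kcal/mol) for three mutations and their reverses.
predictions = [(1.2, -1.0), (0.4, 0.3), (-2.1, 2.1)]
print(f"mean inconsistency = {mean_absolute_inconsistency(predictions):.2f} kcal/mol")
```

A perfectly self-consistent predictor would score zero here regardless of the sign convention used for stabilising and destabilising mutations, which makes the check independent of experimental data.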
The notion that sequence homology implies functional similarity underlies much of computational biology. In the case of protein-protein interactions, an interaction can be inferred between two proteins on the basis that sequence-similar proteins have been observed to interact. The use of transferred interactions is common, but the legitimacy of such inferred interactions is not clear. Here we investigate transferred interactions and whether data incompleteness explains the lack of evidence found for them. Using definitions of homology associated with functional annotation transfer, we estimate that conservation rates of interactions are low even after taking interactome incompleteness into account. For example, at a blastp E-value threshold of , we estimate the conservation rate to be about between S. cerevisiae and H. sapiens. Our method also produces estimates of interactome sizes (which are similar to those previously proposed). Using our estimates of interaction conservation we estimate the rate at which protein-protein interactions are lost across species. To our knowledge, this is the first such study based on large-scale data. Previous work has suggested that interactions transferred within species are more reliable than interactions transferred across species. By controlling for factors that are specific to within-species interaction prediction, we propose that the transfer of interactions within species might be less reliable than transfers between species. Protein-protein interactions appear to be very rarely conserved unless very high sequence similarity is observed. Consequently, inferred interactions should be used with care.
It is widely assumed that knowledge gained in one species can be transferred to another species, even among species that are widely separated on the tree of life. This transfer is often done at the level of proteins under the assumption that if two proteins have similar sequences, they will share similar properties. In this paper, we investigate the validity of this assumption for the case of protein-protein interactions. The transfer of protein interactions across species is a common procedure and it is known to have shortcomings but these are generally ascribed to the incompleteness of protein interaction data. We introduce a framework to take such incomplete information into account, and under its assumptions show that the procedure is unreliable when using sequence-similarity thresholds typically thought to allow the transfer of functional information. Our results imply that, unless using strict definitions of homology, interactions rewire at a rate too fast to allow reliable transfer across species. We urge caution in interpreting the results of such transfers.
Predicting protein contacts solely based on sequence information remains a challenging problem, despite the huge amount of sequence data at our disposal. Mutual Information (MI), an information theory measure, has been extensively employed and modified to identify residues within a protein (intra-protein) that are in contact. More recently MI and its variants have also been used in the prediction of contacts between proteins (inter-protein).
Here we assess the predictive power of MI and variants for domain-domain contact prediction. We test original MI and these variants, which are called MIp, MIc and ZNMI, on 40 domain-domain test cases containing 10,753 sequences. We also propose and evaluate two new versions of MI that consider triangles of residues and the physicochemical properties of the amino acids, respectively.
We found that all versions of MI are skewed towards predicting surface residues. Since domain-domain contacts are on the surface of each domain, we considered only surface residues when attempting to predict contacts. Our analysis shows that MIc is the best current MI domain-domain contact predictor. At 20% recall MIc achieved a precision of 44.9% when only surface residues were considered. Our triangle and reduced alphabet variants of MI highlight the delicate trade-off between signal and noise in the use of MI for domain-domain contact prediction. We also examine a specific “successful” case study and demonstrate that here, when considering surface residues, even the most accurate domain-domain contact predictor, MIc, performs no better than random.
All tested variants of MI are skewed towards predicting surface residues. When considering surface residues only, we find MIc to be the best current MI domain-domain contact predictor. Its performance, however, is not as good as a non-MI based contact predictor, i-Patch. Additionally, the intra-protein contact prediction capabilities of MIc outperform its domain-domain contact prediction abilities.
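For readers unfamiliar with the measure, plain MI between two columns of a multiple sequence alignment can be sketched as follows. This is the textbook definition estimated from raw frequency counts; the corrections applied by MIp, MIc and ZNMI are not shown, gaps are treated as ordinary symbols, and the toy columns are hypothetical.

```python
import math
from collections import Counter


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two aligned columns.

    Each column is a sequence of residue symbols, one per aligned sequence,
    so position k in both columns comes from the same sequence.
    """
    if len(col_a) != len(col_b):
        raise ValueError("columns must cover the same sequences")
    if not col_a:
        raise ValueError("columns must not be empty")
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Toy columns: the first pair co-varies perfectly, the second is independent.
print(mutual_information("AAKK", "DDEE"))
print(mutual_information("AAKK", "DEDE"))
```

A fully conserved column always gives zero MI with any other column, which is one reason raw MI needs enough sequence variation, and usually a background correction, before it carries a contact signal.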
Phosphosignalling pathways are an attractive option for the synthetic biologist looking for a wide repertoire of modular components from which to build. We demonstrate that two-component systems can be used in synthetic biology. However, their potential is limited by the fact that host cells contain many of their own phosphosignalling pathways and these may interact with, and cross-talk to, the introduced synthetic components. In this paper we also demonstrate a simple bioinformatic tool that can help predict whether interspecies cross-talk between introduced and native two-component signalling pathways will occur and show both in vitro and in vivo that the predicted interactions do take place. The ability to predict potential cross-talk prior to designing and constructing novel pathways or choosing a host organism is essential for the promise that phosphosignalling components hold for synthetic biology to be realised.
Protein-protein interactions play an essential role in cellular processes. Certain proteins form stable complexes with their partner proteins, whereas others function by forming transient complexes. The conventional protein-protein interaction model describes an interaction between two proteins under the assumption that a protein binds to its partner protein through a single binding site. In this study, we improved the conventional interaction model by developing a Multiple-Site (MS) model in which a protein binds to its partner protein through closely located multiple binding sites on a surface of the partner protein by transiently docking at each binding site with individual binding free energies. To test this model, we used the protein-protein interaction mediated by Src homology 3 (SH3) domains. SH3 domains recognize their partners via a weak, transient interaction and are therefore promiscuous in nature. Because the MS model requires large amounts of data compared with the conventional interaction model, we used experimental data from the positionally addressable syntheses of peptides on cellulose membranes (SPOT-synthesis) technique. From the analysis of the experimental data, individual binding free energies for each binding site of peptides were extracted. A comparison of the individual binding free energies from the analysis with those from atomistic force fields gave a correlation coefficient of 0.66. Furthermore, application of the MS model to 10 SH3 domains lowers the prediction error by up to 9% compared with the conventional interaction model. This improvement in prediction originates from a more realistic description of complex formation than the conventional interaction model. The results suggested that, in many cases, SH3 domains increased the protein complex population through multiple binding sites of their partner proteins. 
Our study indicates that the consideration of general complex formation is important for the accurate description of protein complex formation, and especially for those of weak or transient protein complexes.
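One way to see why several weak sites can raise the complex population is to combine their individual binding free energies into a single effective value. The sketch below uses a standard statistical-thermodynamic expression for mutually exclusive binding modes, ΔG_eff = -RT ln Σ exp(-ΔG_i/RT). This is an assumption chosen for illustration and is not necessarily the formulation of the MS model; the site energies are hypothetical.

```python
import math

GAS_CONSTANT = 1.987204e-3  # kcal/(mol*K)


def effective_binding_free_energy(site_energies, temperature=298.15):
    """Effective free energy (kcal/mol) for alternative binding sites.

    Assumes the sites are mutually exclusive, so their association
    constants add: K_eff = sum_i exp(-dG_i / RT).
    """
    if not site_energies:
        raise ValueError("need at least one binding site")
    rt = GAS_CONSTANT * temperature
    total = sum(math.exp(-dg / rt) for dg in site_energies)
    return -rt * math.log(total)


# Hypothetical peptide with three overlapping weak sites (kcal/mol).
sites = [-4.8, -4.5, -4.1]
print(f"best single site: {min(sites):.2f} kcal/mol")
print(f"effective:        {effective_binding_free_energy(sites):.2f} kcal/mol")
```

Under this assumption the effective free energy is always at least as favourable as the best single site, so a single-site description would underestimate binding whenever several sites have comparable energies.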
Computational prediction of residues that participate in protein-protein interactions is a difficult task, and state of the art methods have shown only limited success in this arena. One possible problem with these methods is that they try to predict interacting residues without incorporating information about the partner protein, although it is unclear how much partner information could enhance prediction performance. To address this issue, the following two comparisons are of crucial significance: (a) comparison between the predictability of inter-protein residue pairs, i.e., predicting exactly which residue pairs interact with each other given two protein sequences; this can be achieved by either combining conventional single-protein predictions or making predictions using a new model trained directly on the residue pairs, and the performance of these two approaches may be compared; (b) comparison between the predictability of the interacting residues in a single protein (irrespective of the partner residue or protein) from conventional methods and predictions converted from the pair-wise trained model. Using these two streams of training and validation procedures and employing similar two-stage neural networks, we showed that the models trained on pair-wise contacts outperformed the partner-unaware models in predicting both interacting pairs and interacting single-protein residues. Prediction performance decreased with the size of the conformational change upon complex formation; this trend is similar to docking, even though no structural information was used in our prediction. An example application that predicts two partner-specific interfaces of a protein was shown to be effective, highlighting the potential of the proposed approach. Finally, a preliminary attempt was made to score docking decoy poses using prediction of interacting residue pairs; this analysis produced an encouraging result.
Motivation: Membrane proteins are both abundant and important in cells, but the small number of solved structures restricts our understanding of them. Here we consider whether membrane proteins undergo different substitutions from their soluble counterparts and whether these can be used to improve membrane protein alignments, and therefore improve prediction of their structure.
Results: We construct substitution tables for different environments within membrane proteins. As data is scarce, we develop a general metric to assess the quality of these asymmetric tables. Membrane proteins show markedly different substitution preferences from soluble proteins. For example, substitution preferences in lipid tail-contacting parts of membrane proteins are found to be distinct from all environments in soluble proteins, including buried residues. A principal component analysis of the tables identifies the greatest variation in substitution preferences to be due to changes in hydrophobicity; the second largest variation relates to secondary structure. We demonstrate the use of our tables in pairwise sequence-to-structure alignments (also known as ‘threading’) of membrane proteins using the FUGUE alignment program. On average, in the 10–25% sequence identity range, alignments are improved by 28 correctly aligned residues compared with alignments made using FUGUE's default substitution tables. Our alignments also lead to improved structural models.
Availability: Substitution tables are available at: http://www.stats.ox.ac.uk/proteins/resources.
Motivation: Membrane proteins (MPs) are important drug targets but knowledge of their exact structure is limited to relatively few examples. Existing homology-based structure prediction methods are designed for globular, water-soluble proteins. However, we are now beginning to have enough MP structures to justify the development of a homology-based approach specifically for them.
Results: We present a MP-specific homology-based coordinate generation method, MEDELLER, which is optimized to build highly reliable core models. The method outperforms the popular structure prediction programme Modeller on MPs. The comparison of the two methods was performed on 616 target–template pairs of MPs, which were classified into four test sets by their sequence identity. Across all targets, MEDELLER gave an average backbone root mean square deviation (RMSD) of 2.62 Å versus 3.16 Å for Modeller. On our ‘easy’ test set, MEDELLER achieves an average accuracy of 0.93 Å backbone RMSD versus 1.56 Å for Modeller.
Availability and Implementation: http://medeller.info; Implemented in Python, Bash and Perl CGI for use on Linux systems; Supplementary data are available at http://www.stats.ox.ac.uk/proteins/resources.
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: A wealth of protein–protein interaction (PPI) data has recently become available. These data are organized as PPI networks and an efficient and biologically meaningful method to compare such PPI networks is needed. As a first step, we would like to compare observed networks to established network models, under the aspect of small subgraph counts, as these are conjectured to relate to functional modules in the PPI network. We employ the software tool GraphCrunch with the Graphlet Degree Distribution Agreement (GDDA) score to examine the use of such counts for network comparison.
Results: Our results show that the GDDA score has a pronounced dependency on the number of edges and vertices of the networks being considered. This should be taken into account when testing the fit of models. We provide a method for assessing the statistical significance of the fit between random graph models and biological networks based on non-parametric tests. Using this method we examine the fit of Erdős–Rényi (ER), ER with fixed degree distribution and geometric (3D) models to PPI networks. Under these rigorous tests none of these models fits the PPI networks. The GDDA score is not stable in the region of graph density relevant to current PPI networks. We hypothesize that this score instability is due to the networks under consideration having a graph density in the threshold region for the appearance of small subgraphs. This is true for both geometric (3D) and ER random graph models. Such threshold behaviour may be linked to the robustness and efficiency properties of the PPI networks.
Supplementary information: Supplementary data are available at Bioinformatics online.
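The link between graph density and the appearance of small subgraphs can be made concrete with two standard quantities: the density of a simple undirected graph, and the expected number of triangles in an ER graph G(n, p), which equals C(n, 3) p^3 and becomes non-negligible once p is of order 1/n. The sketch below only illustrates these formulas; it is not the GDDA computation, and the network size is hypothetical.

```python
from math import comb


def graph_density(n_vertices, n_edges):
    """Fraction of possible edges present in a simple undirected graph."""
    if n_vertices < 2:
        raise ValueError("need at least two vertices")
    return n_edges / comb(n_vertices, 2)


def expected_triangles_er(n_vertices, p):
    """Expected triangle count in an Erdős–Rényi graph G(n, p)."""
    return comb(n_vertices, 3) * p ** 3


# Hypothetical PPI network: 5000 proteins and 20000 interactions.
n, m = 5000, 20000
p = graph_density(n, m)
print(f"density = {p:.5f}, n * p = {n * p:.1f}")
print(f"expected triangles in a matching ER graph: {expected_triangles_er(n, p):.1f}")
```

Comparing n * p with 1 gives a quick indication of how close a network of a given size and density sits to the triangle threshold of the corresponding ER model; larger subgraphs have their own thresholds.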
The idea of “date” and “party” hubs has been influential in the study of protein–protein interaction networks. Date hubs display low co-expression with their partners, whilst party hubs have high co-expression. It was proposed that party hubs are local coordinators whereas date hubs are global connectors. Here, we show that the reported importance of date hubs to network connectivity can in fact be attributed to a tiny subset of them. Crucially, these few, extremely central, hubs do not display particularly low expression correlation, undermining the idea of a link between this quantity and hub function. The date/party distinction was originally motivated by an approximately bimodal distribution of hub co-expression; we show that this feature is not always robust to methodological changes. Additionally, topological properties of hubs do not in general correlate with co-expression. However, we find significant correlations between interaction centrality and the functional similarity of the interacting proteins. We suggest that thinking in terms of a date/party dichotomy for hubs in protein interaction networks is not meaningful, and it might be more useful to conceive of roles for protein-protein interactions rather than for individual proteins.
Proteins are key components of cellular machinery, and most cellular functions are executed by groups of proteins acting in concert. The study of networks formed by protein interactions can help reveal how the complex functionality of cells emerges from simple biochemistry. Certain proteins have a particularly large number of interaction partners; some have argued that these “hubs” are essential to biological function. Previous work has suggested that such hubs can be classified into just two varieties: party hubs, which coordinate a specific cellular process or protein complex; and date hubs, which link together and convey information between different function-specific modules or complexes. In this study, we re-examine the ideas of date and party hubs from multiple perspectives. By computationally partitioning protein interaction networks into functionally coherent subnetworks, we show that the roles of hubs are more diverse than a binary classification allows. We also show that the position of an interaction in the network is related to the functional similarity of the two interacting proteins: the most important interactions holding the network together appear to be between the most dissimilar proteins. Thus, examining interaction roles may be relevant to understanding the organisation of protein interaction networks.
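The date/party distinction rests on how strongly a hub is co-expressed with its interaction partners. A minimal sketch of one such measure, the average Pearson correlation between a hub's expression profile and those of its partners, is given below. The protein names and expression values are hypothetical, and the studies discussed may use additional filtering or a different summary statistic.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / (norm_x * norm_y)


def average_coexpression(hub, partners, expression):
    """Mean correlation between a hub and each of its interaction partners."""
    if not partners:
        raise ValueError("hub has no partners")
    scores = [pearson(expression[hub], expression[p]) for p in partners]
    return sum(scores) / len(scores)


# Hypothetical expression profiles over four conditions.
expression = {
    "HUB1": [1.0, 2.0, 3.0, 4.0],
    "P1": [2.1, 3.9, 6.2, 8.0],
    "P2": [4.0, 3.1, 2.2, 0.9],
    "P3": [1.0, 3.0, 1.5, 2.5],
}
print(f"{average_coexpression('HUB1', ['P1', 'P2', 'P3'], expression):.2f}")
```

A single average can hide very different partner-by-partner correlations, which is in line with the suggestion above that roles may be better assigned to interactions than to whole proteins.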
Translation of mRNA into protein is a unidirectional information flow process. Analysing the input (mRNA) and output (protein) of translation, we find that local protein structure information is encoded in the mRNA nucleotide sequence. The Coding Sequence and Structure (CSandS) database developed in this work provides a detailed mapping between over 4000 solved protein structures and their mRNA. CSandS facilitates a comprehensive analysis of codon usage over many organisms. In assigning translation speed, we find that relative codon usage is less informative than tRNA concentration. For all speed measures, no evidence was found that domain boundaries are enriched with slow codons. In fact, genes seemingly avoid slow codons around structurally defined domain boundaries. Translation speed, however, does decrease at the transition into secondary structure. Codons are identified that have structural preferences significantly different from the amino acid they encode. However, each organism has its own set of ‘significant codons’. Our results support the premise that codons encode more information than merely amino acids and give insight into the role of translation in protein folding.
Ever since the ground-breaking work of Anfinsen et al. in which a denatured protein was found to refold to its native state, it has been frequently stated by the protein fold prediction community that all the information required for protein folding lies in the amino acid sequence. Recent in vitro experiments and in silico computational studies, however, have shown that cotranslation may affect the folding pathway of some proteins, especially those of ancient folds. In this paper aspects of cotranslational folding have been incorporated into a protein structure prediction algorithm by adapting the Rosetta program to fold proteins as the nascent chain elongates. This makes it possible to conduct a pairwise comparison of folding accuracy, by comparing folds created sequentially from each end of the protein.
A single main result emerged: in 94% of proteins analyzed, following the sense of translation, from N-terminus to C-terminus, produced better predictions than following the reverse sense, from C-terminus to N-terminus. Two secondary results also emerged. First, this superiority of N-terminus to C-terminus folding was more marked for proteins showing stronger evidence of cotranslation. Second, an algorithm following the sense of translation produced predictions comparable to, and occasionally better than, Rosetta.
There is a directionality effect in protein fold prediction. At present, prediction methods appear to be too noisy to take advantage of this effect; as techniques refine, it may be possible to draw benefit from a sequential approach to protein fold prediction.
Chemotaxis is the process by which motile bacteria sense their chemical environment and move towards more favourable conditions. Escherichia coli utilises a single sensory pathway, but little is known about signalling pathways in species with more complex systems.
To investigate whether chemotaxis pathways in other bacteria follow the E. coli paradigm, we analysed 206 species encoding at least 1 homologue of each of the 5 core chemotaxis proteins (CheA, CheB, CheR, CheW and CheY). Of these, 61 species encode more than one homologue of each of the 5 proteins, suggesting they have multiple chemotaxis pathways. Operon information is not available for most bacteria, so we developed a novel statistical approach to cluster che genes into putative operons. Using operon-based models, we reconstructed putative chemotaxis pathways for all 206 species. We show that cheA-cheW and cheR-cheB have strong preferences to occur in the same operon as two-gene blocks, which may reflect a functional requirement for co-transcription. However, other che genes, most notably cheY, are more dispersed on the genome. Comparison of our operons with shuffled equivalents demonstrates that specific patterns of genomic location may be a determining factor for the observed in vivo chemotaxis pathways.
We then examined the chemotaxis pathways of Rhodobacter sphaeroides. Here, the PpfA protein is known to be critical for correct partitioning of proteins in the cytoplasmically localised pathway. We found ppfA in che operons of many species, suggesting that partitioning of cytoplasmic Che protein clusters is common. We also examined the apparently non-typical chemotaxis components, CheA3, CheA4 and CheY6. We found that, although variants of CheA proteins are rare, the CheY6 variant may be a common type of CheY, with a significantly disordered C-terminal region that may be functionally important.
We find that many bacterial species potentially have multiple chemotaxis pathways, with grouping of che genes into operons likely to be a major factor in keeping signalling pathways distinct. Gene order is highly conserved with cheA-cheW and cheR-cheB blocks, perhaps reflecting functional linkage. CheY behaves differently to other Che proteins, both in its genomic location and its putative protein interactions, which should be considered when modelling chemotaxis pathways.
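The statistical approach used to cluster che genes into putative operons is not specified here. As a stand-in only, the sketch below uses a much simpler heuristic: neighbouring genes on the same strand are placed in the same putative operon when the gap between them is small. The function name `group_into_operons`, the tuple layout and the 100 bp threshold are all assumptions made for illustration.

```python
def group_into_operons(genes, max_gap=100):
    """Group genes into putative operons by strand and intergenic distance.

    Assumptions (not taken from the text): each gene is a tuple
    (name, start, end, strand) on one replicon, and `max_gap` is an
    arbitrary illustrative threshold in base pairs.
    """
    ordered = sorted(genes, key=lambda g: g[1])  # sort by start coordinate
    operons = []
    for gene in ordered:
        _, start, _, strand = gene
        if operons:
            previous = operons[-1][-1]
            same_strand = previous[3] == strand
            close_enough = start - previous[2] <= max_gap
            if same_strand and close_enough:
                operons[-1].append(gene)
                continue
        operons.append([gene])  # start a new putative operon
    return [[g[0] for g in operon] for operon in operons]


example = [
    ("cheA", 100, 2000, "+"),
    ("cheW", 2010, 2500, "+"),
    ("cheY", 9000, 9400, "+"),
    ("cheR", 12000, 12800, "-"),
    ("cheB", 12810, 13800, "-"),
]
print(group_into_operons(example))
```

In this made-up example the cheA-cheW and cheR-cheB pairs fall into two-gene blocks while cheY sits alone, which mirrors the pattern described above but says nothing about how often it occurs in real genomes.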
Motivation: Functional module detection within protein interaction networks is a challenging problem due to the sparsity of data and presence of errors. Computational techniques for this task range from purely graph theoretical approaches involving single networks to alignment of multiple networks from several species. Current network alignment methods all rely on protein sequence similarity to map proteins across species.
Results: Here we carry out network alignment using a protein functional similarity measure. We show that using functional similarity to map proteins across species improves network alignment in terms of functional coherence and overlap with experimentally verified protein complexes. Moreover, the results from functional similarity-based network alignment display little overlap (<15%) with sequence similarity-based alignment. Our combined approach integrating sequence and function-based network alignment alongside graph clustering properties offers a 200% increase in coverage of experimental datasets and comparable accuracy to current network alignment methods.
Availability: Program binaries and source code are freely available at http://www.stats.ox.ac.uk/research/bioinfo/resources
Supplementary Information: Supplementary data are available at Bioinformatics online.
Summary: iMembrane is a homology-based method, which predicts a membrane protein's position within a lipid bilayer. It projects the results of coarse-grained molecular dynamics simulations onto any membrane protein structure or sequence provided by the user. iMembrane is simple to use and is currently the only computational method allowing the rapid prediction of a membrane protein's lipid bilayer insertion. Bilayer insertion data are essential in the accurate structural modelling of membrane proteins or the design of drugs that target them.
Availability: http://imembrane.info. iMembrane is available under a non-commercial open-source licence, upon request.
Supplementary information: Supplementary data are available at Bioinformatics online and at http://www.stats.ox.ac.uk/proteins/resources.
Protein interactions play a vital part in the function of a cell. As experimental techniques for detection and validation of protein interactions are time consuming, there is a need for computational methods for this task. Protein interactions appear to form a network with a relatively high degree of local clustering. In this paper we exploit this clustering by suggesting a score based on triplets of observed protein interactions. The score utilises both protein characteristics and network properties. Our score based on triplets is shown to complement existing techniques for predicting protein interactions, outperforming them on data sets which display a high degree of clustering. The predicted interactions score highly against test measures for accuracy. Compared to a similar score derived from pairwise interactions only, the triplet score displays higher sensitivity and specificity. By looking at specific examples, we show how an experimental set of interactions can be enriched and validated. As part of this work we also examine the effect of different prior databases upon the accuracy of prediction and find that interactions from the same kingdom give better results than those from across kingdoms, suggesting that there may be fundamental differences between the networks. These results all emphasize that network structure is important and helps in the accurate prediction of protein interactions. The protein interaction data set and the program used in our analysis, and a list of predictions and validations, are available at http://www.stats.ox.ac.uk/bioinfo/resources/PredictingInteractions.
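To make the idea of exploiting local clustering concrete, the sketch below ranks unobserved protein pairs by the number of interaction partners they share, that is, by how many triangles the candidate interaction would close. This common-neighbour count is a deliberately simplified stand-in: the triplet score described above also uses protein characteristics, which are not modelled here, and the function names are hypothetical.

```python
from collections import defaultdict


def build_adjacency(interactions):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    adjacency = defaultdict(set)
    for a, b in interactions:
        if a != b:  # ignore self-interactions
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def triangle_score(adjacency, u, v):
    """Count shared partners of u and v (triangles the edge u-v would close)."""
    return len(adjacency.get(u, set()) & adjacency.get(v, set()))


def rank_candidates(interactions):
    """Rank unobserved pairs by shared-partner count, highest first."""
    adjacency = build_adjacency(interactions)
    proteins = sorted(adjacency)
    scored = []
    for i, u in enumerate(proteins):
        for v in proteins[i + 1:]:
            if v in adjacency[u]:
                continue  # already an observed interaction
            score = triangle_score(adjacency, u, v)
            if score > 0:
                scored.append((score, u, v))
    return sorted(scored, key=lambda t: (-t[0], t[1], t[2]))


observed = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
for score, u, v in rank_candidates(observed):
    print(u, v, score)
```

On a sparse network with little clustering, most candidate pairs share no partners and receive no score, which is consistent with the observation above that a triplet-based approach helps most on data sets with a high degree of clustering.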
A complete and error-free network of the protein interactions that occur in an organism would be a significant step towards understanding the complex activities within it. The large amount of experimentally derived data now available has provided us with a chance to study the complicated behaviour of protein interactions. The power of such studies, however, has been limited by the high false positive and false negative rates in the datasets. We propose a network-based method, taking advantage of the tendency of clustering in protein interaction networks, to validate experimental data and to predict unknown interactions. The integration of multiple protein characteristics (e.g., structure and function) allows our predictive method to significantly outperform two other approaches based on homology and protein-domain relationships on datasets which contain a large number of interactions, but not much detailed information on the proteins involved in the interactions. In addition, our predictive score based on triadic interaction patterns improves over a pair-wise approach, suggesting the importance of network structure. Moreover, using pooled interactions as prior information, we find evidence for fundamental differences in protein interaction networks between eukaryotes and prokaryotes.
Microtubules (MTs), which play crucial roles in normal cell function, are regulated by MT associated proteins (MAPs). Using a combinatorial approach that includes biochemistry, proteomics and bioinformatics, we have recently identified 270 putative MAPs from Drosophila embryos and characterized some of those required for correct progression through mitosis. Here we identify functional groups of these MAPs using a reciprocal hits sequence alignment technique and assign InterPro functional domains to 28 previously uncharacterized proteins. This approach gives insight into the potential functions of MAPs and how their roles may affect MTs.
Drosophila; domain; microtubule; MAP; alignment