Only a small fraction of known proteins have been functionally characterized, making protein function prediction essential to propose annotations for uncharacterized proteins. In recent years many function prediction methods have been developed using various sources of biological data from protein sequence and structure to gene expression data. Here we present the CombFunc web server, which makes Gene Ontology (GO)-based protein function predictions. CombFunc incorporates ConFunc, our existing function prediction method, with other approaches for function prediction that use protein sequence, gene expression and protein–protein interaction data. In benchmarking on a set of 1686 proteins CombFunc obtains precision and recall of 0.71 and 0.64 respectively for gene ontology molecular function terms. For biological process GO terms precision of 0.74 and recall of 0.41 is obtained. CombFunc is available at http://www.sbg.bio.ic.ac.uk/combfunc.
An ‘intrinsically disordered protein’ (IDP) is assumed to be unfolded in the cell and perform its biological function in that state. We contend that most intrinsically disordered proteins are in fact proteins waiting for a partner (PWPs), parts of a multi-component complex that do not fold correctly in the absence of other components. Flexibility, not disorder, is an intrinsic property of proteins, exemplified by X-ray structures of many enzymes and protein-protein complexes. Disorder is often observed with purified proteins in vitro and sometimes also in crystals, where it is difficult to distinguish from flexibility. In the crowded environment of the cell, disorder is not compatible with the known mechanisms of protein-protein recognition, and, foremost, with its specificity. The self-assembly of multi-component complexes may, nevertheless, involve the specific recognition of nascent polypeptide chains that are incompletely folded, but then disorder is transient, and it must remain under the control of molecular chaperones and of the quality control apparatus that obviates the toxic effects it can have on the cell.
Increasingly, experimental data on biological systems are obtained from several sources and computational approaches are required to integrate this information and derive models for the function of the system. Here, we demonstrate the power of a logic-based machine learning approach to propose hypotheses for gene function integrating information from two diverse experimental approaches. Specifically, we use inductive logic programming that automatically proposes hypotheses explaining the empirical data with respect to logically encoded background knowledge. We study the capsular polysaccharide biosynthetic pathway of the major human gastrointestinal pathogen Campylobacter jejuni. We consider several key steps in the formation of capsular polysaccharide consisting of 15 genes of which 8 have assigned function, and we explore the extent to which functions can be hypothesised for the remaining 7. Two sources of experimental data provide the information for learning—the results of knockout experiments on the genes involved in capsule formation and the absence/presence of capsule genes in a multitude of strains of different serotypes. The machine learning uses the pathway structure as background knowledge. We propose assignments of specific genes to five previously unassigned reaction steps. For four of these steps, there was an unambiguous optimal assignment of gene to reaction, and to the fifth, there were three candidate genes. Several of these assignments were consistent with additional experimental results. We therefore show that the logic-based methodology provides a robust strategy to integrate results from different experimental approaches and propose hypotheses for the behaviour of a biological system.
► A challenge in systems biology modelling is to integrate different data sources. ► A logic-based approach was used to hypothesise gene function for the C. jejuni glycome. ► Gene knockout and serotype strain data were integrated with pathway information. ► Functions were hypothesised for capsule polysaccharide biosynthetic pathway genes. ► Logic-based learning proposed hypotheses for the behaviour of a biological system.
ILP, inductive logic programming; CPS, capsular polysaccharide; HR-MAS, high-resolution magic angle spinning; CE-ESMS, capillary electrophoresis coupled to electrospray mass spectrometry; BBSRC, Biotechnology and Biological Sciences Research Council; systems biology; Campylobacter jejuni; machine learning; capsular polysaccharide; pathway modelling
The Critical Assessment of protein Structure Prediction experiment (CASP) is a blind assessment of the prediction of protein structure and related topics including function prediction. We present our results in the function/binding site prediction category. Our approach to identify binding sites combined the use of the predicted structure of the targets with both residue conservation and the location of ligands bound to homologous structures. We obtained average coverage of 83% and 56% accuracy. Analysis of our predictions suggests that overprediction reduces the accuracy obtained due to large areas of conservation around the binding site that do not bind the ligand. In some proteins such conserved residues may have a functional role. A server version of our method will soon be available.
bioinformatics; binding site; CASP; function prediction; structural biology
Motivation: Analysis of protein–protein interaction networks (PPINs) at the system level has become increasingly important in understanding biological processes. Comparison of the interactomes of different species not only provides a better understanding of species evolution but also helps with detecting conserved functional components and in function prediction.
Method and Results: Here we report a PPIN alignment method, called PINALOG, which combines information from protein sequence, function and network topology. Alignment of human and yeast PPINs reveals several conserved subnetworks between them that participate in similar biological processes, notably the proteasome and transcription related processes. PINALOG has been tested for its power in protein complex prediction as well as function prediction. Comparison with PSI-BLAST in predicting protein function in the twilight zone also shows that PINALOG is valuable in predicting protein function.
Availability and implementation: The PINALOG web-server is freely available from http://www.sbg.bio.ic.ac.uk/~pinalog. The PINALOG program and associated data are available from the Download section of the web-server.
Supplementary data are available at Bioinformatics online.
We carried out a genome-wide association study of hemoglobin levels in 16,001 individuals of European and Indian Asian ancestry. The most closely associated SNP (rs855791) results in nonsynonymous (V736A) change in the serine protease domain of TMPRSS6 and a blood hemoglobin concentration 0.13 (95% CI 0.09–0.17) g/dl lower per copy of allele A (P = 1.6 × 10−13). Our findings suggest that TMPRSS6, a regulator of hepcidin synthesis and iron handling, is crucial in hemoglobin level maintenance.
3DLigandSite is a web server for the prediction of ligand-binding sites. It is based upon successful manual methods used in the eighth round of the Critical Assessment of techniques for protein Structure Prediction (CASP8). 3DLigandSite utilizes protein-structure prediction to provide structural models for proteins that have not been solved. Ligands bound to structures similar to the query are superimposed onto the model and used to predict the binding site. In benchmarking against the CASP8 targets 3DLigandSite obtains a Matthew’s correlation co-efficient (MCC) of 0.64, and coverage and accuracy of 71 and 60%, respectively, similar results to our manual performance in CASP8. In further benchmarking using a large set of protein structures, 3DLigandSite obtains an MCC of 0.68. The web server enables users to submit either a query sequence or structure. Predictions are visually displayed via an interactive Jmol applet. 3DLigandSite is available for use at http://www.sbg.bio.ic.ac.uk/3dligandsite.
Macromolecular crowding has a profound effect upon biochemical processes in the cell. We have computationally studied the effect of crowding upon protein folding for 12 small domains in a simulated cell using a coarse-grained protein model, which is based upon Langevin dynamics, designed to unify the often disjoint goals of protein folding simulation and structure prediction. The model can make predictions of native conformation with accuracy comparable with that of the best current template-free models. It is fast enough to enable a more extensive analysis of crowding than previously attempted, studying several proteins at many crowding levels and further random repetitions designed to more closely approximate the ensemble of conformations. We found that when crowding approaches 40% excluded volume, the maximum level found in the cell, proteins fold to fewer native-like states. Notably, when crowding is increased beyond this level, there is a sudden failure of protein folding: proteins fix upon a structure more quickly and become trapped in extended conformations. These results suggest that the ability of small protein domains to fold without the help of chaperones may be an important factor in limiting the degree of macromolecular crowding in the cell. Here, we discuss the possible implications regarding the relationship between protein expression level, protein size, chaperone activity and aggregation.
TM, template modelling; macromolecular crowding; protein structure prediction; protein misfolding; protein aggregation; protein expression
Loops represent an important part of protein structures. The study of loop is critical for two main reasons: First, loops are often involved in protein function, stability and folding. Second, despite improvements in experimental and computational structure prediction methods, modeling the conformation of loops remains problematic. Here, we present a structural classification of loops, ArchDB, a mine of information with application in both mentioned fields: loop structure prediction and function prediction. ArchDB (http://sbi.imim.es/archdb) is a database of classified protein loop motifs. The current database provides four different classification sets tailored for different purposes. ArchDB-40, a loop classification derived from SCOP40, well suited for modeling common loop motifs. Since features relevant to loop structure or function can be more easily determined on well-populated clusters, we have developed ArchDB-95, a loop classification derived from SCOP95. This new classification set shows a ~40% increase in the number of subclasses, and a large 7-fold increase in the number of putative structure/function-related subclasses. We also present ArchDB-EC, a classification of loop motifs from enzymes, and ArchDB-KI, a manually annotated classification of loop motifs from kinases. Information about ligand contacts and PDB sites has been included in all classification sets. Improvements in our classification scheme are described, as well as several new database features, such as the ability to query by conserved annotations, sequence similarity, or uploading 3D coordinates of a protein. The lengths of classified loops range between 0 and 36 residues long. ArchDB offers an exhaustive sampling of loop structures. Functional information about loops and links with related biological databases are also provided. All this information and the possibility to browse/query the database through a web-server outline an useful tool with application in the comparative study of loops, the analysis of loops involved in protein function and to obtain templates for loop modeling.
function annotation; loop structure classification; loop modeling
This paper reports two studies to model the inter-relationships between protein sequence, structure and function. First, an automated pipeline to provide a structural annotation of proteomes in the major genomes is described. The results are stored in a database at Imperial College, London (3D-GENOMICS) that can be accessed at www.sbg.bio.ic.ac.uk. Analysis of the assignments to structural superfamilies provides evolutionary insights. 3D-GENOMICS is being integrated with related proteome annotation data at University College London and the European Bioinformatics Institute in a project known as e-protein (http://www.e-protein.org/). The second topic is motivated by the developments in structural genomics projects in which the structure of a protein is determined prior to knowledge of its function. We have developed a new approach PHUNCTIONER that uses the gene ontology (GO) classification to supervise the extraction of the sequence signal responsible for protein function from a structure-based sequence alignment. Using GO we can obtain profiles for a range of specificities described in the ontology. In the region of low sequence similarity (around 15%), our method is more accurate than assignment from the closest structural homologue. The method is also able to identify the specific residues associated with the function of the protein family.
bioinformatics; proteome annotation; protein function
The base excision repair (BER) pathway is essential for the removal of DNA bases damaged by alkylation or oxidation. A key step in BER is the processing of an apurinic/apyrimidinic (AP) site intermediate by an AP endonuclease. The major AP endonuclease in human cells (APE1, also termed HAP1 and Ref-1) accounts for >95% of the total AP endonuclease activity, and is essential for the protection of cells against the toxic effects of several classes of DNA damaging agents. Moreover, APE1 overexpression has been linked to radio- and chemo-resistance in human tumors. Using a newly developed high-throughput screen, several chemical inhibitors of APE1 have been isolated. Amongst these, CRT0044876 was identified as a potent and selective APE1 inhibitor. CRT0044876 inhibits the AP endonuclease, 3′-phosphodiesterase and 3′-phosphatase activities of APE1 at low micromolar concentrations, and is a specific inhibitor of the exonuclease III family of enzymes to which APE1 belongs. At non-cytotoxic concentrations, CRT0044876 potentiates the cytotoxicity of several DNA base-targeting compounds. This enhancement of cytotoxicity is associated with an accumulation of unrepaired AP sites. In silico modeling studies suggest that CRT0044876 binds to the active site of APE1. These studies provide both a novel reagent for probing APE1 function in human cells, and a rational basis for the development of APE1-targeting drugs for antitumor therapy.
The 3D-GENOMICS database (http://www.sbg.bio.ic.ac.uk/3dgenomics/) provides structural annotations for proteins from sequenced genomes. In August 2003 the database included data for 93 proteomes. The annotations stored in the database include homologous sequences from various sequence databases, domains from SCOP and Pfam, patterns from Prosite and other predicted sequence features such as transmembrane regions and coiled coils. In addition to annotations at the sequence level, several precomputed cross- proteome comparative analyses are available based on SCOP domain superfamily composition. Annotations are available to the user via a web interface to the database. Multiple points of entry are available so that a user is able to: (i) directly access annotations for a single protein sequence via keywords or accession codes, (ii) examine a sequence of interest chosen from a summary of annotations for a particular proteome, or (iii) access precomputed frequency-based cross-proteome comparative analyses.
The annotation of protein function has become a crucial problem with the advent of sequence and structural genomics initiatives. A large body of evidence suggests that protein structural information is frequently encoded in local sequences, and that folds are mainly made up of a number of simple local units of super-secondary structural motifs, consisting of a few secondary structures and their connecting loops. Moreover, protein loops play an important role in protein function. Here we present ArchDB, a classification database of structural motifs, consisting of one loop plus its bracing secondary structures. ArchDB currently contains 12 665 super-secondary elements classified into 1496 motif subclasses. The database provides an easy way to retrieve functional information from protein structures sharing a common motif, to search motifs found in a given SCOP family, superfamily or fold, or to search by keywords on proteins with classified loops. The ArchDB database of loops is located at http://sbi.imim.es/archdb.
Genome3D, available at http://www.genome3d.eu, is a new collaborative project that integrates UK-based structural resources to provide a unique perspective on sequence–structure–function relationships. Leading structure prediction resources (DomSerf, FUGUE, Gene3D, pDomTHREADER, Phyre and SUPERFAMILY) provide annotations for UniProt sequences to indicate the locations of structural domains (structural annotations) and their 3D structures (structural models). Structural annotations and 3D model predictions are currently available for three model genomes (Homo sapiens, E. coli and baker’s yeast), and the project will extend to other genomes in the near future. As these resources exploit different strategies for predicting structures, the main aim of Genome3D is to enable comparisons between all the resources so that biologists can see where predictions agree and are therefore more trusted. Furthermore, as these methods differ in whether they build their predictions using CATH or SCOP, Genome3D also contains the first official mapping between these two databases. This has identified pairs of similar superfamilies from the two resources at various degrees of consensus (532 bronze pairs, 527 silver pairs and 370 gold pairs).