Construction of a reliable network remains the bottleneck for network-based protein function prediction. We built an artificial network model called the protein overlap network (PON) for the entire genomes of yeast, fly, worm, and human. Each node of the network represents a protein, and two proteins are connected if they share a domain according to the InterPro database.
The function of a protein can be predicted by counting the occurrence frequency of GO (gene ontology) terms associated with domains of direct neighbors. The average success rate and coverage were 34.3% and 43.9%, respectively, for the test genomes, and were increased to 37.9% and 51.3% when a composite PON of the four species was used for the prediction. As a comparison, the success rate was 7.0% in the random control procedure. We also made predictions with GO term annotations of the second layer nodes using the composite network and obtained an impressive success rate (>30%) and coverage (>30%), even for small genomes. Further improvement was achieved by statistical analysis of manually annotated GO terms for each neighboring protein.
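The neighbor-counting idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `predict_go_terms`, the toy network, and the GO term assignments are all hypothetical, and the domain-to-GO mapping is assumed to have been collapsed into a per-protein set of terms beforehand.

```python
from collections import Counter

def predict_go_terms(protein, network, annotations, top_k=1):
    """Rank GO terms by how often they occur among the direct neighbors."""
    counts = Counter()
    for neighbor in network.get(protein, ()):
        counts.update(annotations.get(neighbor, ()))
    # Sort by descending count, then by term ID so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]

# Hypothetical toy data: P1 shares a domain with P2, P3 and P4.
network = {"P1": {"P2", "P3", "P4"}}
annotations = {
    "P2": {"GO:0005524", "GO:0016301"},
    "P3": {"GO:0016301"},
    "P4": {"GO:0003677"},
}
top_terms = predict_go_terms("P1", network, annotations, top_k=2)
```

Extending the loop to neighbors of neighbors would give a second-layer variant, again only as an assumption about how such a prediction might be organized.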
The PONs are composed of dense modules accompanied by a few long distance connections. Based on the PONs, we developed multiple approaches effective for protein function prediction.
Protein overlap network; Protein function prediction; Composite network; Functional genomics
Identification of cis- and trans-acting factors regulating gene expression remains an important problem in biology. Bioinformatics analyses of regulatory regions are hampered by several difficulties. One is that binding sites for regulatory proteins are often not significantly over-represented in the set of DNA sequences of interest, because of high levels of false positive predictions, and because of positional restrictions on functional binding sites with regard to the transcription start site.
We have developed a novel method for the detection of regulatory motifs based on their local over-representation in sets of regulatory regions. The method makes use of a Parzen window-based approach for scoring local enrichment, and during evaluation of significance it takes into account GC content of sequences. We show that the accuracy of our method compares favourably to that of other methods, and that our method is capable of detecting not only generally over-represented regulatory motifs, but also locally over-represented motifs that are often missed by standard motif detection approaches. Using a number of examples we illustrate the validity of our approach and suggest applications, such as the analysis of weaker binding sites.
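To illustrate what scoring local enrichment with a Parzen window can look like, the sketch below estimates a Gaussian kernel density of motif hit positions and compares its peak with a uniform expectation over the region. The function names, the bandwidth, the uniform background, and the hit positions are assumptions made for illustration; the GC-content correction used in the significance evaluation is not modeled here.

```python
import math

def parzen_density(hits, position, bandwidth):
    """Gaussian Parzen-window estimate of the hit density at `position`."""
    if not hits:
        return 0.0
    norm = 1.0 / (len(hits) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(
        math.exp(-0.5 * ((position - h) / bandwidth) ** 2) for h in hits
    )

def local_enrichment(hits, region_start, region_end, bandwidth=10.0):
    """Return (position, ratio): the density peak and its ratio to uniform."""
    uniform = 1.0 / (region_end - region_start)
    best = max(
        range(region_start, region_end),
        key=lambda p: parzen_density(hits, p, bandwidth),
    )
    return best, parzen_density(hits, best, bandwidth) / uniform

# Hypothetical hit positions relative to the transcription start site (0).
hits = [-52, -50, -48, -50, -300, -120]
peak, ratio = local_enrichment(hits, -500, 0)
```

A motif whose hits cluster at one distance from the transcription start site yields a high ratio even when its total number of hits is unremarkable, which is the situation the abstract describes.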
Our approach can be used to suggest testable hypotheses for wet-lab experiments. It has potential for future analyses, such as the prediction of weaker binding sites. An online application of our approach, called LocaMo Finder (Local Motif Finder), is available at http://sysimm.ifrec.osaka-u.ac.jp/tfbs/locamo/.
Regulation of transcription; Promoter sequence; Transcription factor binding site; Parzen window
We report a major update of the MAFFT multiple sequence alignment program. This version has several new features, including options for adding unaligned sequences into an existing alignment, adjustment of direction in nucleotide alignment, constrained alignment and parallel processing, which were implemented after the previous major update. This report shows actual examples to explain how these features work, alone and in combination. Some examples incorrectly aligned by MAFFT are also shown to clarify its limitations. We discuss how to avoid misalignments, and our ongoing efforts to overcome such limitations.
multiple sequence alignment; metagenome; protein structure; progressive alignment; parallel processing
Multiple transcription factors (TFs) are involved in the generation of gene expression patterns, such as tissue-specific gene expression and pleiotropic immune responses. However, how combinations of TFs orchestrate diverse gene expression patterns is poorly understood. Here we propose a new measure for regulatory motif co-occurrence and a new methodology to systematically identify TF pairs significantly co-occurring in a set of promoter sequences.
Initial analyses suggest that non-CpG promoters have a higher potential for combinatorial regulation than CpG island-associated promoters, and that co-occurrences are strongly influenced by motif similarity. We applied our method to large-scale gene expression data from various tissues, and showed how our measure for motif co-occurrence is not biased by motif over-representation. Our method identified, amongst others, the binding motifs of HNF1 and FOXP1 to be significantly co-occurring in promoters of liver/kidney specific genes. Binding sites tend to be positioned proximally to each other, suggesting interactions exist between this pair of transcription factors. Moreover, the binding sites of several TFs were found to co-occur with NF-κB and IRF sites in sets of genes with similar expression patterns in dendritic cells after Toll-like receptor stimulation. Of these, we experimentally verified that CCAAT enhancer binding protein alpha positively regulates its target promoters synergistically with NF-κB.
Both computational and experimental results indicate that the proposed method can clarify TF interactions that could not be observed by currently available prediction methods.
We describe the development of new force fields for protein side chain modeling called OSCAR (Optimized Side Chain Atomic eneRgy). The distance-dependent energy functions (OSCAR-d) and side-chain dihedral angle potential energy functions were represented as power and Fourier series, respectively. The resulting 802 adjustable parameters were optimized by discriminating the native side chain conformations from non-native conformations, using a training set of 12000 side-chains for each residue type. In the course of optimization, for every residue, its side chain was replaced by varying rotamers, whereas conformations for all other residues were kept as they appeared in the crystal structure. Then the OSCAR-d were multiplied by an orientation dependent function to yield OSCAR-o. 1087 parameters of the orientation-dependent energy functions (OSCAR-o) were optimized by maximizing the energy gap between the native conformation and subrotamers calculated as low energy by OSCAR-d. When OSCAR-o with optimized parameters were used to model side chain conformations simultaneously for 218 recently released protein structures, the prediction accuracies were 88.8% for χ1, 79.7% for χ1+2, 1.24 Å overall RMSD (root mean square deviation), and 0.62 Å RMSD for core residues, respectively, compared with the next-best performing side-chain modeling program which achieved 86.6% for χ1, 75.7% for χ1+2, 1.40 Å overall RMSD, and 0.86 Å RMSD for core residues, respectively. The continuous energy functions obtained in this study are suitable for gradient-based optimization techniques for protein structure refinement. A program with built-in OSCAR for protein side chain prediction is available for download at http://sysimm.ifrec.osaka-u.ac.jp/OSCAR/.
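As an illustration of why a Fourier-series dihedral term is convenient for gradient-based refinement, the sketch below evaluates a truncated series and its analytical derivative. The coefficient values and function names are hypothetical and are not the optimized OSCAR parameters.

```python
import math

def dihedral_energy(chi, a, b):
    """E(chi) = sum over n of a[n]*cos(n*chi) + b[n]*sin(n*chi), n = 1..N."""
    return sum(
        an * math.cos(n * chi) + bn * math.sin(n * chi)
        for n, (an, bn) in enumerate(zip(a, b), start=1)
    )

def dihedral_gradient(chi, a, b):
    """Analytical derivative dE/dchi of the series above."""
    return sum(
        n * (-an * math.sin(n * chi) + bn * math.cos(n * chi))
        for n, (an, bn) in enumerate(zip(a, b), start=1)
    )

# Hypothetical coefficients: a pure threefold term, E(chi) = cos(3*chi),
# which has a minimum at chi = 60 degrees.
a = [0.0, 0.0, 1.0]
b = [0.0, 0.0, 0.0]
energy_at_60 = dihedral_energy(math.pi / 3, a, b)
slope_at_60 = dihedral_gradient(math.pi / 3, a, b)
```

Because both the energy and its derivative are smooth closed-form expressions, they can be passed directly to a standard gradient-based minimizer.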
Regulation of gene expression, protein synthesis, replication and assembly of many viruses involve RNA–protein interactions. Although some successful computational tools have been reported to recognize RNA binding sites in proteins, the problem of specificity remains poorly investigated. After the nucleotide base composition, the dinucleotide is the smallest unit of RNA sequence information and many RNA-binding proteins simply bind to regions enriched in one dinucleotide. Interaction preferences of protein subsequences and dinucleotides can be inferred from protein-RNA complex structures, enabling a training-based prediction approach.
We analyzed basic statistics of amino acid-dinucleotide contacts in protein-RNA complexes and found their pairing preferences could be identified. Using a standard approach to represent protein subsequences by their evolutionary profile, we trained neural networks to predict multiclass target vectors corresponding to 16 possible contacting dinucleotide subsequences. In the cross-validation experiments, the accuracies of the optimum network, measured as areas under the curve (AUC) of the receiver operating characteristic (ROC) graphs, were in the range of 65-80%.
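The 16-class target described above can be made concrete with a small sketch. The encoding below (a binary vector ordered AA, AC, ..., UU) and the helper name `target_vector` are assumptions for illustration; the abstract does not specify the ordering or how the targets were encoded.

```python
from itertools import product

# The 16 possible RNA dinucleotides, in a fixed (assumed) order.
DINUCLEOTIDES = ["".join(pair) for pair in product("ACGU", repeat=2)]

def target_vector(contacted):
    """Binary 16-dimensional target: 1 where the residue contacts that dinucleotide."""
    return [1 if d in contacted else 0 for d in DINUCLEOTIDES]

# Hypothetical residue observed in contact with AU and GG in a complex.
vec = target_vector({"AU", "GG"})
```

A network trained on such vectors would output one score per dinucleotide for every residue, which matches the 16-dimensional prediction the abstract refers to.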
Dinucleotide-specific contact predictions have also been extended to the prediction of interacting protein and RNA fragment pairs, which shows the applicability of this method to predict targets of RNA-binding proteins. A web server predicting the 16-dimensional contact probability matrix directly from a user-defined protein sequence was implemented and made available at: http://tardis.nibio.go.jp/netasa/srcpred.
Many viruses contain genes that originate from their hosts. Some of these acquired genes give viruses the ability to interfere with host immune responses by various mechanisms. Genes of host origin that appear commonly in viruses code for proteins that span a wide range of functions, from kinases and phosphatases, to cytokines and their receptors, to ubiquitin ligases and proteases. While many important cases of such lateral gene transfer in viruses have been documented, there has yet to be a genome-wide survey of viral-encoded genes acquired from animal hosts.
Here we carry out such a survey in order to gain insight into the host immune system. We made the results available in the form of a web-based tool that allows viral-centered or host-centered queries to be performed (http://imm.ifrec.osaka-u.ac.jp/musvirus/). We examine the relationship between acquired genes and immune function, and compare host-virus homology with gene expression data in stimulated dendritic cells and T-cells. We found that genes whose expression changes significantly during the innate antiviral immune response had more homologs in animal viruses than genes whose expression did not change or genes involved in the adaptive immune response.
Statistics gathered from the MusVirus database support earlier reports of gene transfer from host to virus and indicate that viruses are more likely to acquire genes involved in innate antiviral immune responses than those involved in acquired immune responses.
The Protein Data Bank Japan (PDBj, http://pdbj.org) is a member of the worldwide Protein Data Bank (wwPDB) and accepts and processes the deposited data of experimentally determined macromolecular structures. While maintaining the archive in collaboration with other wwPDB partners, PDBj also provides a wide range of services and tools for analyzing structures and functions of proteins, which are summarized in this article. To enhance the interoperability of the PDB data, we have recently developed PDB/RDF, PDB data in the Resource Description Framework (RDF) format, along with its ontology in the Web Ontology Language (OWL) based on the PDB mmCIF Exchange Dictionary. Being in the standard format for the Semantic Web, the PDB/RDF data provide a means to integrate the PDB with other biological information resources.
Proteasomes are multisubunit proteases that play a critical role in maintaining cellular function through the selective degradation of ubiquitinated proteins. When 3 additional β subunits, expression of which is induced by IFN-γ, are substituted for their constitutively expressed counterparts, the structure is converted to an immunoproteasome. However, the underlying roles of immunoproteasomes in human diseases are poorly understood. Using exome analysis, we found a homozygous missense mutation (G197V) in immunoproteasome subunit, β type 8 (PSMB8), which encodes one of the β subunits induced by IFN-γ in patients from 2 consanguineous families. Patients bearing this mutation suffered from autoinflammatory responses that included recurrent fever and nodular erythema together with lipodystrophy. This mutation increased assembly intermediates of immunoproteasomes, resulting in decreased proteasome function and ubiquitin-coupled protein accumulation in the patient’s tissues. In the patient’s skin and B cells, IL-6 was highly expressed, and there was reduced expression of PSMB8. Downregulation of PSMB8 inhibited the differentiation of murine and human adipocytes in vitro, and injection of siRNA against Psmb8 in mouse skin reduced adipocyte tissue volume. These findings identify PSMB8 as an essential component and regulator not only of inflammation, but also of adipocyte differentiation, and indicate that immunoproteasomes have pleiotropic functions in maintaining the homeostasis of a variety of cell types.
Summary: We developed a fast and accurate side-chain modeling program [Optimized Side Chain Atomic eneRgy (OSCAR)-star] based on orientation-dependent energy functions and a rigid rotamer model. The average computing time was 18 s per protein for 218 test proteins with higher prediction accuracy (1.1% increase for χ1 and 0.8% increase for χ1+2) than the best performing program developed by other groups. We show that the energy functions, which were calibrated to tolerate the discrete errors of rigid rotamers, are appropriate for protein loop selection, especially for decoys without extensive structural refinement.
Availability: OSCAR-star and the 218 test proteins are available for download at http://sysimm.ifrec.osaka-u.ac.jp/OSCAR
Supplementary information: Supplementary data are available at Bioinformatics online.
Macrophages represent the front lines of our immune system; they recognize and engulf pathogens or foreign particles thus initiating the immune response. Imaging macrophages presents unique challenges, as most optical techniques require labeling or staining of the cellular compartments in order to resolve organelles, and such stains or labels have the potential to perturb the cell, particularly in cases where incomplete information exists regarding the precise cellular reaction under observation. Label-free imaging techniques such as Raman microscopy are thus valuable tools for studying the transformations that occur in immune cells upon activation, both on the molecular and organelle levels. Due to extremely low signal levels, however, Raman microscopy requires sophisticated image processing techniques for noise reduction and signal extraction. To date, efficient, automated algorithms for resolving sub-cellular features in noisy, multi-dimensional image sets have not been explored extensively.
We show that hybrid z-score normalization and standard regression (Z-LSR) can highlight the spectral differences within the cell and provide image contrast dependent on spectral content. In contrast to typical Raman imaging processing methods using multivariate analysis, such as singular value decomposition (SVD), our implementation of the Z-LSR method can operate nearly in real-time. In spite of its computational simplicity, Z-LSR can automatically remove background and bias in the signal, improve the resolution of spatially distributed spectral differences and enable sub-cellular features to be resolved in Raman microscopy images of mouse macrophage cells. Significantly, the Z-LSR processed images automatically exhibited subcellular architectures whereas SVD, in general, requires human assistance in selecting the components of interest.
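The abstract does not give the details of Z-LSR, so the following is only one plausible reading of combining z-score normalization with least-squares regression: each pixel spectrum and a reference spectrum are z-scored, and the regression slope between them is used as the pixel's contrast value. All names, the choice of reference, and the toy spectra are assumptions.

```python
import statistics

def zscore(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mean) / sd for v in values]

def regression_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx

def contrast(pixel_spectrum, reference_spectrum):
    """Slope between z-scored spectra; equals their Pearson correlation."""
    return regression_slope(zscore(reference_spectrum), zscore(pixel_spectrum))

# Hypothetical spectra: the pixel is the reference with a gain and an offset.
reference = [1.0, 3.0, 2.0, 5.0, 4.0]
pixel = [2.0 * v + 10.0 for v in reference]
value = contrast(pixel, reference)
```

In this reading, the z-scoring step is what removes constant background and gain differences, and the per-pixel cost is linear in the number of spectral channels, which is consistent with near real-time operation.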
The computational efficiency of Z-LSR enables automated resolution of sub-cellular features in large Raman microscopy data sets without compromise in image quality or information loss in associated spectra. These results motivate further use of label free microscopy techniques in real-time imaging of live immune cells.
In order to characterize mammalian intrinsically disordered domains (IDDs) we examined the patterns in their amino acid abundance as well as overrepresented local sequence motifs. We considered IDDs from mouse proteins associated with innate immune responses as well as a set of generic human genes. These sets were compared with artificially generated random sequences with the same overall amino acid abundance and length distributions. IDDs were then clustered by amino acid abundance, and further analyzed in terms of co-occurrence of clusters with functionally characterized Pfam domains.
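The random control described above can be sketched as follows, assuming residues are drawn independently with probabilities equal to the pooled amino acid frequencies while each sequence keeps its original length. The function name and the example sequences are hypothetical, and the actual generation procedure may differ.

```python
import random
from collections import Counter

def random_like(sequences, rng):
    """Random sequences with the same lengths and pooled composition (in expectation)."""
    pool = Counter("".join(sequences))
    residues, weights = zip(*sorted(pool.items()))
    return [
        "".join(rng.choices(residues, weights=weights, k=len(seq)))
        for seq in sequences
    ]

# Hypothetical input; a seeded generator keeps the control reproducible.
observed = ["ACDE", "GGGSSS"]
controls = random_like(observed, random.Random(0))
```

Shuffling the pooled residues instead of sampling them would preserve the overall composition exactly rather than in expectation.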
Overall, IDDs were very different from randomly generated sequences. The deviation from random distributions was at least as great as that for ordered domains, for which the deviation can be rationalized in terms of strong evolutionary pressure for structure and function. The co-occurrence of certain Pfam domains with specific IDD clusters was found to be significant (p-value < 0.01). Local sequence motifs that were over-represented in the innate immune set consisted mostly of low complexity fragments, primarily characterized by amino acid repeats, and could not be assigned an obvious functional role.
Our results suggest that IDDs are constrained within a narrow subset of possible sequences. This is most likely a result of biophysical restraints that have yet to be elucidated. More detailed examination of the functional relationship between the IDDs and associated Pfam domains is one possible avenue of investigation.
Accurate prediction of antigenic epitopes is important for immunologic research and medical applications, but it is still an open problem in bioinformatics. The case for discontinuous epitopes is even worse - currently there are only a few discontinuous epitope prediction servers available, though discontinuous peptides constitute the majority of all B-cell antigenic epitopes. The small number of structures for antigen-antibody complexes limits the development of reliable discontinuous epitope prediction methods and an unbiased benchmark to evaluate developed methods.
In this work, we present two novel server applications for discontinuous epitope prediction: EPSVR and EPMeta, where EPMeta is a meta server. EPSVR, EPMeta, and datasets are available at http://sysbio.unl.edu/services.
The server application for discontinuous epitope prediction, EPSVR, uses a Support Vector Regression (SVR) method to integrate six scoring terms. Furthermore, we combined EPSVR with five existing epitope prediction servers to construct EPMeta. All methods were benchmarked on our curated independent test set, in which all antigens had no complex structures with the antibody, and their epitopes were identified by various biochemical experiments. The area under the receiver operating characteristic curve (AUC) of EPSVR was 0.597, higher than that of any other existing single server, and EPMeta had a better performance than any single server - with an AUC of 0.638, significantly higher than PEPITO and DiscoTope (p-value < 0.05).
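For readers unfamiliar with the AUC figures quoted above, the sketch below computes the area under the ROC curve as the probability that a randomly chosen epitope residue scores higher than a randomly chosen non-epitope residue, with ties counted as half. The scores and labels are made up for illustration and are unrelated to the benchmark.

```python
def roc_auc(scores, labels):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical per-residue scores; label 1 marks an epitope residue.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
auc = roc_auc(scores, labels)  # 3 of 4 positive/negative pairs are ranked correctly
```

An AUC of 0.5 corresponds to random ranking, which puts values such as 0.597 and 0.638 in context.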
Infection by Toxoplasma gondii down-regulates the host innate immune responses, such as proinflammatory cytokine production, in a Stat3-dependent manner. A forward genetic approach recently demonstrated that the type II strain fails to suppress immune responses because of a potential defect in a highly polymorphic parasite-derived kinase, ROP16. We generated ROP16-deficient type I parasites by reverse genetics and found a severe defect in parasite-induced Stat3 activation, culminating in enhanced production of interleukin (IL) 6 and IL-12 p40 in the infected macrophages. Furthermore, overexpression of ROP16 but not ROP18 in mammalian cells resulted in Stat3 phosphorylation and strong activation of Stat3-dependent promoters. In addition, kinase-inactive ROP16 failed to activate Stat3. Comparison of type I and type II ROP16 revealed that a single amino acid substitution in the kinase domain determined the strain difference in terms of Stat3 activation. Moreover, ROP16 bound Stat3 and directly induced phosphorylation of this transcription factor. These results formally establish an essential and direct requirement of ROP16 in parasite-induced Stat3 activation and the significance of a single amino acid replacement in the function of type II ROP16.
Motivation: Functional similarity between proteins is evident at both the sequence and structure levels. SeSAW is a web-based program for identifying functionally or evolutionarily conserved motifs in protein structures by locating sequence and structural similarities, and quantifying these at the level of individual residues. Results can be visualized in 2D, as annotated alignments, or in 3D, as structural superpositions. An example is given for both an experimentally determined query structure and a homology model.
Availability and Implementation: The web server is located at http://www.pdbj.org/SeSAW/
The anaerobic lifestyle of the intestinal parasite Blastocystis raises questions about the biochemistry and function of its mitochondria-like organelles. We have characterized the Blastocystis succinyl-CoA synthetase (SCS), a tricarboxylic acid cycle enzyme that conserves energy by substrate-level phosphorylation. We show that SCS localizes to the enigmatic Blastocystis organelles, indicating that these organelles might play a similar role in energy metabolism as classic mitochondria. Although analysis of residues inside the nucleotide-binding site suggests that Blastocystis SCS is GTP-specific, we demonstrate that it is ATP-specific. Homology modelling, followed by flexible docking and molecular dynamics simulations, indicates that while both ATP and GTP fit into the Blastocystis SCS active site, GTP is destabilized by electrostatic dipole interactions with Lys 42 and Lys 110, the side-chains of which lie outside the nucleotide-binding cavity. It has been proposed that residues in direct contact with the substrate determine nucleotide specificity in SCS. However, our results indicate that, in Blastocystis, an electrostatic gatekeeper controls which ligands can enter the binding site.
Structure alignment methods offer the possibility of measuring distant evolutionary relationships between proteins that are not visible by sequence-based analysis. However, the question of how structural differences and similarities ought to be quantified in this regard remains open. In this study we construct a training set of sequence-unique CATH and SCOP domains, from which we develop a scoring function that can reliably identify domains with the same CATH topology and SCOP fold classification. The score is implemented in the ASH structure alignment package, for which the source code and a web service are freely available from the PDBj website.
The new ASH score shows increased selectivity and sensitivity compared with values reported for several popular programs using the same test set of 4,298,905 structure pairs, yielding an area of 0.96 under the receiver operating characteristic (ROC) curve. In addition, weak sequence homologies between similar domains are revealed that could not be detected by BLAST sequence alignment. Also, a subset of domain pairs is identified that exhibit high similarity, even though their CATH and SCOP classification differs. Finally, we show that the ranking of alignment programs based solely on geometric measures depends on the choice of the quality measure.
ASH shows high selectivity and sensitivity with regard to domain classification, an important step in defining distantly related protein sequence families. Moreover, the CPU cost per alignment is competitive with the fastest programs, making ASH a practical option for large-scale structure classification studies.
We introduce GASH, a new, publicly accessible program for structural alignment and superposition. Alignments are scored by the Number of Equivalent Residues (NER), a quantitative measure of structural similarity that can be applied to any structural alignment method. Multiple alignments are optimized by conjugate gradient maximization of the NER score within the genetic algorithm framework. Initial alignments are generated by the program Local ASH, and can be supplemented by alignments from any other program.
We compare GASH to DaliLite, CE, and our earlier program Global ASH on a difficult test set consisting of 3,102 structure pairs, as well as a smaller set derived from the Fischer-Eisenberg set. The extent of alignment crossover and the completeness of the initial set of alignments are examined. The quality of the superpositions is evaluated both by NER and by the number of aligned residues under three different RMSD cutoffs (2, 4, and 6 Å). In addition to the numerical assessment, the alignments for several biologically related structural pairs are discussed in detail.
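One simple way to count aligned residues under an RMSD cutoff is sketched below. It assumes the two structures are already superposed and that the superposition is not re-fitted as pairs are dropped; the function name and the distances are hypothetical, and the evaluation in the paper may have been carried out differently.

```python
import math

def count_within_rmsd(distances, cutoff):
    """Largest number of aligned pairs whose RMSD stays at or below `cutoff`.

    Pairs are added from the closest to the farthest, so the running RMSD
    never decreases and the first violation ends the search.
    """
    total_sq, count = 0.0, 0
    for d in sorted(distances):
        if math.sqrt((total_sq + d * d) / (count + 1)) > cutoff:
            break
        total_sq += d * d
        count += 1
    return count

# Hypothetical C-alpha distances (in Å) between aligned residue pairs.
distances = [0.5, 1.0, 1.5, 3.0, 5.0, 9.0]
counts = [count_within_rmsd(distances, c) for c in (2.0, 4.0, 6.0)]
```

Reporting the count at several cutoffs, as in the abstract, shows whether a program's advantage comes from a tight core or from a longer but looser alignment.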
Regardless of which criterion is used to judge the superposition accuracy, GASH achieves the best overall performance, followed by DaliLite, Global ASH, and CE. In terms of CPU usage, DaliLite, CE, and GASH perform similarly for query proteins under 500 residues, but for larger proteins DaliLite is faster than GASH or CE. Both an HTTP interface and a Simple Object Access Protocol (SOAP) interface to the GASH program are available at .
Toll-like receptor (TLR) signaling pathways constitute an evolutionarily conserved host defense system that protects against a broad range of infectious agents. Modeling of TLR signaling has been carried out at several levels. Structural models of TLRs and their adaptors, which utilize a small number of structural domains to recognize a diverse range of pathogens, provide a starting point for understanding how pathogens are recognized and signaling events initiated. Various experimental and computational techniques have been used to construct models of downstream signal transduction networks from the measurements of gene expression and chromatin structure under resting and perturbed conditions along with predicted regulatory sequence motifs. Although a complete and accurate mathematical model of all TLR signaling pathways has yet to be derived, many important modules have been identified and investigated, enhancing our understanding of innate immune responses. Extensions of these models based on emerging experimental techniques are discussed. © 2012 Wiley Periodicals, Inc.
We describe the proceedings and conclusions from a “Workshop on Applications of Protein Models in Biomedical Research” that was held at University of California at San Francisco on 11 and 12 July, 2008. At the workshop, international scientists involved with structure modeling explored (i) how models are currently used in biomedical research, (ii) what the requirements and challenges for different applications are, and (iii) how the interaction between the computational and experimental research communities could be strengthened to advance the field.