Protein structure data in the Protein Data Bank (PDB) are widely used in studies of protein function and evolution and in protein structure prediction. However, there are two main barriers to large-scale use of PDB data: 1) PDB data are highly redundant in terms of sequence and structure similarity; and 2) many PDB files have issues caused by inconsistent data and standards as well as missing residues, so that automated retrieval and analysis are often difficult.
To address these issues, we have created MUFOLD-DB (http://mufold.org/mufolddb.php), a web-based database that collects and processes the weekly PDB files, thereby providing users with non-redundant, cleaned and partially predicted structure data. For each of the non-redundant sequences, we annotate the SCOP domain classification and predict structures of missing regions by loop modelling. In addition, evolutionary information, secondary structure, disordered regions, and the processed three-dimensional structure are computed and visualized to help users better understand the protein.
MUFOLD-DB integrates processed PDB sequence and structure data and multiple computational results, provides a friendly interface for users to retrieve, browse and download these data, and offers several useful functionalities to facilitate users' data operation.
A new real-space refinement method for low-resolution X-ray crystallography is presented. The method is based on the molecular dynamics flexible fitting protocol targeted at addressing large-scale deformations of the search model to achieve refinement with minimal manual intervention. An explanation of the method is provided, augmented by results from the refinement of both synthetic and experimental low-resolution data, including an independent electrophysiological verification of the xMDFF-refined crystal structure of a voltage-sensor protein.
X-ray crystallography remains the dominant method for solving atomic structures. However, for relatively large systems, the availability of only medium-to-low-resolution diffraction data often limits the determination of all-atom details. A new molecular dynamics flexible fitting (MDFF)-based approach, xMDFF, for determining structures from such low-resolution crystallographic data is reported. xMDFF employs a real-space refinement scheme that flexibly fits atomic models into an iteratively updating electron-density map. It addresses significant large-scale deformations of the initial model to fit the low-resolution density, as tested with synthetic low-resolution maps of d-ribose-binding protein. xMDFF has been successfully applied to re-refine six low-resolution protein structures of varying sizes that had already been submitted to the Protein Data Bank. Finally, via systematic refinement of a series of data from 3.6 to 7 Å resolution, xMDFF refinements together with electrophysiology experiments were used to validate the first all-atom structure of the voltage-sensing protein Ci-VSP.
xMDFF; molecular dynamics flexible fitting
While more than a thousand protein kinases (PK) have been identified in the Arabidopsis thaliana genome, relatively little progress has been made towards identifying their individual client proteins. Herein we describe the use of a mass spectrometry-based in vitro phosphorylation strategy, termed Kinase Client assay (KiC assay), to study a targeted aspect of signaling. A synthetic peptide library comprising 377 in vivo phosphorylation sequences from developing seed was screened using 71 recombinant A. thaliana PK. Among the initial results, we identified 23 proteins as putative clients of 17 PK. In one instance, protein phosphatase inhibitor-2 (AtPPI-2) was phosphorylated at multiple sites by three distinct PK: casein kinase 1-like 10, AME3, and a Ser PK-like protein. To confirm this result, full-length recombinant AtPPI-2 was reconstituted with each of these PK. The results confirmed multiple distinct phosphorylation sites within this protein. Biochemical analyses indicate that AtPPI-2 inhibits type 1 protein phosphatase (TOPP) activity, and that the phosphorylated forms of AtPPI-2 are more potent inhibitors. Structural modeling revealed that phosphorylation of AtPPI-2 induces conformational changes that modulate TOPP binding.
mass spectrometry; kinase; phosphorylation; protein-protein interaction; phosphatase inhibitor; signaling network
De novo protein structure prediction often generates a large population of candidates (models), and then selects near-native models through clustering. Existing structural model clustering methods are time consuming because they calculate pairwise distances between models. In this paper, we present a novel method for fast model clustering without losing clustering accuracy. Instead of the commonly used pairwise root mean square deviation and TM-score values, we propose two new distance measures, Dscore1 and Dscore2, based on the comparison of the protein distance matrices for describing the difference and the similarity among models, respectively. The analysis indicates that both the correlation between Dscore1 and root mean square deviation and the correlation between Dscore2 and TM-score are high. Compared to the existing methods, whose calculation time is quadratic in the number of models, our Dscore1-based clustering achieves linear time complexity while obtaining almost the same accuracy for near-native model selection. By using Dscore2 to select representatives of clusters, we can further improve the quality of the representatives with little increase in computing time. In addition, for large sets of models (~500 k), we can give a fast data visualization based on the Dscore distribution in seconds to minutes. Our method has been implemented in a package named MUFOLD-CL, available at http://mufold.org/clustering.php.
Bioinformatics; Distance matrix; Dscore; Near-native model selection; Protein model clustering; Visualization of distance distribution
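To illustrate why comparing intra-model distance matrices can avoid the quadratic cost of all-against-all comparison, here is a minimal Python sketch. It is not the authors' Dscore1 or Dscore2 (their definitions are not given here). It assumes a generic root-mean-square difference between distance matrices and scores every model against the mean distance matrix of the pool, which takes one pass to build the mean and one pass to score. The function names and toy coordinates are hypothetical.

```python
import math


def distance_matrix(coords):
    """Flattened upper triangle of intra-model point-to-point distances."""
    n = len(coords)
    return [math.dist(coords[i], coords[j])
            for i in range(n) for j in range(i + 1, n)]


def matrix_rms_difference(dm_a, dm_b):
    """Root-mean-square difference between two flattened distance matrices.

    Assumption: a generic distance-matrix measure, not the published Dscore.
    """
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(dm_a, dm_b)) / len(dm_a))


def score_against_mean(models):
    """Score each model by its deviation from the pool's mean distance matrix.

    The cost grows linearly with the number of models because no pair of
    models is ever compared directly.
    """
    dms = [distance_matrix(m) for m in models]
    mean_dm = [sum(column) / len(dms) for column in zip(*dms)]
    return [matrix_rms_difference(dm, mean_dm) for dm in dms]


# Toy pool: two identical extended 3-point chains and one bent outlier.
toy_models = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0), (7.0, 5.0, 5.0)],  # same shape, translated
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)],
]
scores = score_against_mean(toy_models)
print(scores)
```

Because distance matrices do not change under rotation or translation, no structural superposition is needed before comparing models; in this toy pool the bent chain receives the largest deviation.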
Quality assessment (QA) for predicted protein structural models is an important and challenging research problem in protein structure prediction. Consensus Global Distance Test (CGDT) methods assess each decoy (predicted structural model) based on its structural similarity to all others in a decoy set and have been shown to work well when good decoys are in a majority cluster. Scoring functions evaluate each single decoy based on its structural properties. Both methods have their merits and limitations. In this paper, we present a novel method called PWCom, which uses two neural networks in sequence to combine CGDT and single-model scoring methods such as RW, DDFire and OPUS-Ca. Specifically, for every pair of decoys, the difference of the corresponding feature vectors is input to the first neural network, which predicts whether the two decoys differ significantly in terms of their GDT scores to the native. If so, the second neural network is used to decide which one of the two is closer to the native structure. The quality score for each decoy in the pool is based on the number of times it wins during the pairwise comparisons. Test results on three benchmark datasets from different model generation methods showed that PWCom significantly improves over consensus GDT and single scoring methods. The QA server (MUFOLD-Server) applying this method in the CASP 10 QA category was ranked second in terms of Pearson and Spearman correlation performance.
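The two-stage pairwise comparison and win counting described above can be sketched as follows. This is a minimal illustration, not the PWCom implementation: the two neural networks are replaced by hypothetical callables (`is_different` and `first_is_better`), and each decoy is represented by a single made-up number instead of a feature vector.

```python
from itertools import combinations


def pairwise_win_scores(features, is_different, first_is_better):
    """Count, for each decoy, how many pairwise comparisons it wins.

    is_different(a, b)    -> stage 1: are the two decoys significantly different?
    first_is_better(a, b) -> stage 2: is decoy a closer to the native than b?
    Both callables are hypothetical stand-ins for the trained networks.
    """
    wins = [0] * len(features)
    for i, j in combinations(range(len(features)), 2):
        if not is_different(features[i], features[j]):
            continue  # stage 1 says "too close to call": nobody scores
        if first_is_better(features[i], features[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    return wins


# Toy stand-ins: one made-up quality-like feature per decoy.
toy_features = [0.80, 0.78, 0.50, 0.30]
wins = pairwise_win_scores(
    toy_features,
    is_different=lambda a, b: abs(a - b) > 0.05,
    first_is_better=lambda a, b: a > b,
)
print(wins)  # [2, 2, 1, 0]
```

The first two decoys are too similar for the stage-1 stand-in to separate, so neither wins against the other and they tie on the final count.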
Salinity is one of the most common abiotic stresses in agricultural production. Salt tolerance of rice (Oryza sativa) is an important trait controlled by various genes. The mechanism of rice salt tolerance, which is currently poorly understood, is of great interest to molecular breeding for improving grain yield. In this study, a gene regulatory network of rice salt tolerance is constructed using a systems biology approach with a number of novel computational methods. We developed an improved volcano plot method in conjunction with a new machine-learning method for gene selection based on gene expression data and applied the method to choose genes related to salt tolerance in rice. The results were then assessed by quantitative trait loci (QTL), co-expression and regulatory binding motif analysis. The selected genes were assembled into a number of network modules based on predicted protein interactions, including modules of phosphorylation activity, ubiquity activity, and several proteinase activities such as peroxidase, aspartic proteinase, glucosyltransferase, and flavonol synthase. All of these discovered modules are related to the salt tolerance mechanisms of signal transduction, ion pumping, abscisic acid mediation, reactive oxygen species scavenging and ion sequestration. We also predicted the three-dimensional structures of some crucial proteins related to the salt tolerance QTL to understand the roles of these proteins in the network. Our computational study sheds some new light on the mechanism of salt tolerance and provides a systems biology pipeline for studying plant traits in general.
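For readers unfamiliar with volcano plots, the following Python sketch shows the standard form of volcano-plot gene selection: a gene is kept when both its fold change and its statistical significance pass a threshold. It does not reproduce the improved method mentioned above. The thresholds, gene names and expression values are all hypothetical.

```python
import math


def volcano_select(genes, min_abs_log2fc=1.0, max_p=0.05):
    """Return genes that pass both the fold-change and the p-value cut.

    genes: iterable of (name, control_mean, treated_mean, p_value).
    Thresholds are illustrative assumptions, not values from the study.
    """
    selected = []
    for name, control_mean, treated_mean, p_value in genes:
        log2fc = math.log2(treated_mean / control_mean)  # x axis of the plot
        neg_log10_p = -math.log10(p_value)               # y axis of the plot
        if abs(log2fc) >= min_abs_log2fc and p_value <= max_p:
            selected.append((name, log2fc, neg_log10_p))
    return selected


# Made-up expression summaries for four hypothetical genes.
toy_genes = [
    ("geneA", 10.0, 40.0, 0.001),  # strongly up, significant
    ("geneB", 10.0, 12.0, 0.001),  # significant but small change
    ("geneC", 40.0, 10.0, 0.20),   # large change but not significant
    ("geneD", 40.0, 10.0, 0.01),   # strongly down, significant
]
selected_names = [name for name, _, _ in volcano_select(toy_genes)]
print(selected_names)  # ['geneA', 'geneD']
```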
The aim of this study was to investigate the correlation between aortic/carotid atherosclerotic plaques and cerebral infarction. We examined 116 cases of cerebral infarction using transcranial Doppler ultrasound in order to exclude cerebrovascular stenosis. Transesophageal echocardiography and color Doppler ultrasound were used to detect aortic atherosclerotic plaques (AAPs) and carotid atherosclerotic plaques (CAPs). AAPs were detected in a total of 70 of the 116 cases (60.3%), including 56 with moderate/severe atherosclerotic changes (48.3%). The difference in the incidence of various types of infarction between AAP severity levels was significant (P<0.01). Of the 116 cases, 64 had CAPs (55.2%), including 46 with unstable plaque (39.7%). The difference in the incidence of various types of infarction between CAP stability levels was significant (P<0.01). The results indicate that moderate/severe AAP and unstable CAP are significant causes of embolic infarction without stenosis in the internal carotid arteries.
aortic atherosclerotic plaque; carotid atherosclerotic plaque; cerebral infarction
Protein tertiary structures are essential for studying the functions of proteins at the molecular level. An indispensable approach for protein structure solution is computational prediction. Most protein structure prediction methods generate candidate models first and select the best candidates by model quality assessment (QA). In many cases, good models can be produced but the QA tools fail to select the best ones from the candidate model pool. Because of incomplete understanding of protein folding, each QA method reflects only partial facets of a structure model and thus has limited discerning power, with no one method consistently outperforming the others. In this paper, we developed a set of new QA methods, including two QA methods for target/template alignments, a molecular dynamics (MD) based QA method, and three consensus QA methods with selected references, to reveal new facets of protein structures complementary to the existing methods. Moreover, the underlying relationships among different QA methods were analyzed and then integrated into a multi-layer evaluation approach to guide model generation and model selection in prediction. All methods are integrated and implemented in an improved prediction system hereafter referred to as MUFOLD. In CASP8 and CASP9, MUFOLD demonstrated proof of principle in terms of both QA discerning power and structure prediction accuracy.
Protein structure prediction; Structural model quality assessment; Consensus quality assessment; CASP; MUFOLD
The mitochondrial pyruvate dehydrogenase complex (mtPDC) is regulated by reversible seryl-phosphorylation of the E1α subunit by a dedicated, intrinsic kinase. The phospho-complex is reactivated when dephosphorylated by an intrinsic PP2C-type protein phosphatase. Both the position of the phosphorylated Ser-residue and the sequences of the flanking amino acids are highly conserved. We have used the synthetic peptide-based kinase client (KiC) assay plus recombinant pyruvate dehydrogenase E1α and E1α-kinase to perform “scanning mutagenesis” of the residues flanking the site of phosphorylation. Consistent with the results from “phylogenetic analysis” of the flanking sequences, the direct peptide-based kinase assays tolerated very few changes. Even conservative changes such as Leu, Ile, or Val for Met, or Glu for Asp, gave very marked reductions in phosphorylation. Overall the results indicate that regulation of the mtPDC by reversible phosphorylation is an extreme example of multiple, interdependent instances of co-evolution.
KiC assay; mass spectrometry; mitochondrial; phosphorylation site; pyruvate dehydrogenase complex; synthetic peptides
There have been steady improvements in protein structure prediction during the past 2 decades. However, current methods are still far from consistently predicting structural models accurately with computing power accessible to common users. Toward achieving more accurate and efficient structure prediction, we developed a number of novel methods and integrated them into a software package, MUFOLD. First, a systematic protocol was developed to identify useful templates and fragments from Protein Data Bank for a given target protein. Then, an efficient process was applied for iterative coarse-grain model generation and evaluation at the Cα or backbone level. In this process, we construct models using interresidue spatial restraints derived from alignments by multidimensional scaling, evaluate and select models through clustering and static scoring functions, and iteratively improve the selected models by integrating spatial restraints and previous models. Finally, the full-atom models were evaluated using molecular dynamics simulations based on structural changes under simulated heating. We have continuously improved the performance of MUFOLD by using a benchmark of 200 proteins from the Astral database, where no template with >25% sequence identity to any target protein is included. The average root-mean-square deviation of the best models from the native structures is 4.28 Å, which shows significant and systematic improvement over our previous methods. The computing time of MUFOLD is much shorter than many other tools, such as Rosetta. MUFOLD demonstrated some success in the 2008 community-wide experiment for protein structure prediction CASP8.
protein structure prediction; CASP; multidimensional scaling; scoring function; clustering; molecular dynamics simulation
Refined, high-throughput protein interaction detection methods have provided a protein–protein interaction network in yeast. The challenge that comes with the network is to find better ways to make it accessible for biological investigation. Visualization would help to extract meaningful biological information from the network. However, traditional ways of visualizing the network are unsuitable because of the large number of proteins. Here, we provide a simple but information-rich approach for visualization which integrates topological and biological information. In our method, topological information such as quasi-cliques or spoke-like modules of the network is extracted into a clustering tree, onto which biological information ranging from protein functional annotation to expression profile correlations can be annotated. We have developed a software tool named PINC based on our approach. Compared with previous clustering methods, our clustering method ADJW performs well both in retaining a meaningful image of the protein interaction network and in enriching the image with biological information, and is therefore more suitable for visualization of the network.
A new respiratory infectious epidemic, severe acute respiratory syndrome (SARS), broke out and spread throughout the world. By now the putative pathogen of SARS has been identified as a new coronavirus, a single positive-strand RNA virus. RNA viruses commonly have a high rate of genetic mutation. It is therefore important to know the mutation rate of the SARS coronavirus as it spreads through the population. Moreover, finding a date for the last common ancestor of SARS coronavirus strains would be useful for understanding the circumstances surrounding the emergence of the SARS pandemic and the rate at which the SARS coronavirus diverges.
We propose a mathematical model to estimate the evolution rate of the SARS coronavirus genome and the time of the last common ancestor of the sequenced SARS strains. Under some common assumptions and justifiable simplifications, a few simple equations incorporating the evolution rate (K) and time of the last common ancestor of the strains (T0) can be deduced. We then implemented the least squares method to estimate K and T0 from the dataset of sequences and corresponding times. Monte Carlo simulation was employed to examine the results.
Based on 6 strains with accurate dates of host death, we estimated the time of the last common ancestor to be about August or September 2002, and the evolution rate to be about 0.16 base/day, that is, the SARS coronavirus would on average change one base roughly every six to seven days. We validated our method by dividing the strains into two groups, which coincided with the results from comparative genomics.
The applied method is simple to implement and avoids the difficulty and subjectivity of choosing the root of a phylogenetic tree. Based on 6 strains with accurate dates of host death, we estimated a time of the last common ancestor that is consistent with epidemic investigations, and an evolution rate in the same range as that reported for the HIV-1 virus.
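The least-squares step in the method above can be illustrated with a short Python sketch. It assumes a simplified linear model that is not taken from the paper: the number of base changes d accumulated by a strain sampled at day t is d = K (t - T0). Fitting a straight line d = a t + b by ordinary least squares then gives K = a and T0 = -b / a. The function name and the data are hypothetical; the data are synthetic, generated from K = 0.16 and T0 = 20 so that the fit can be checked.

```python
def fit_rate_and_origin(times, distances):
    """Ordinary least-squares fit of d = K * (t - T0).

    Returns (K, T0). Assumes the simplified linear model stated above,
    not the exact equations of the study.
    """
    n = len(times)
    mean_t = sum(times) / n
    mean_d = sum(distances) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(times, distances))
    slope = sxy / sxx                    # estimate of K (bases per day)
    intercept = mean_d - slope * mean_t  # equals -K * T0 under the model
    return slope, -intercept / slope


# Synthetic, noise-free data (not real SARS data): K = 0.16, T0 = day 20.
sample_days = [100, 150, 200, 250]
base_changes = [0.16 * (t - 20) for t in sample_days]

k_est, t0_est = fit_rate_and_origin(sample_days, base_changes)
print(round(k_est, 3), round(t0_est, 1))  # 0.16 20.0
print(round(1 / k_est, 2))                # days per base change: 6.25
```

With real data the points would scatter around the line, and repeating the fit on resampled or simulated data is one common way to gauge the uncertainty of K and T0.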
Interaction detection methods have led to the discovery of thousands of interactions between proteins, and discerning relevance within large-scale data sets is important to present-day biology. Here, a spectral method derived from graph theory was introduced to uncover hidden topological structures (i.e. quasi-cliques and quasi-bipartites) of complicated protein–protein interaction networks. Our analyses suggest that these hidden topological structures consist of biologically relevant functional groups. This result motivates a new method to predict the function of uncharacterized proteins based on the classification of known proteins within topological structures. Using this spectral analysis method, 48 quasi-cliques and six quasi-bipartites were isolated from a network involving 11 855 interactions among 2617 proteins in budding yeast, and 76 uncharacterized proteins were assigned functions.
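As a rough illustration of how a spectral approach can expose a densely connected group, the Python sketch below computes the principal eigenvector of a small adjacency matrix by power iteration and keeps the nodes with large components. This is only a sketch under stated assumptions, not the published procedure: the full method would examine several eigenvectors, and the cutoff, function names and toy graph are hypothetical.

```python
def principal_eigenvector(adj, iterations=200):
    """Power iteration on (A + I) for a symmetric 0/1 adjacency matrix.

    Adding the identity leaves the eigenvectors unchanged but prevents
    oscillation on bipartite graphs. The result is scaled so that the
    largest component equals 1.
    """
    n = len(adj)
    vec = [1.0] * n
    for _ in range(iterations):
        nxt = [vec[i] + sum(adj[i][j] * vec[j] for j in range(n))
               for i in range(n)]
        norm = max(abs(x) for x in nxt)
        vec = [x / norm for x in nxt]
    return vec


def dense_group(adj, cutoff=0.5):
    """Nodes whose principal-eigenvector component is at least `cutoff`."""
    vec = principal_eigenvector(adj)
    return [i for i, x in enumerate(vec) if x >= cutoff]


# Toy network: nodes 0-3 form a clique; nodes 4 and 5 hang off as a tail.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
size = 6
adjacency = [[0] * size for _ in range(size)]
for a, b in edges:
    adjacency[a][b] = adjacency[b][a] = 1

group = dense_group(adjacency)
print(group)  # [0, 1, 2, 3]
```

Uncharacterized proteins that fall inside such a group could then be assigned the function most common among the group's annotated members, which is the idea behind the function prediction described above.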
We report improved whole-genome shotgun sequences for the genomes of indica and japonica rice, both with multimegabase contiguity, or almost 1,000-fold improvement over the drafts of 2002. Tested against a nonredundant collection of 19,079 full-length cDNAs, 97.7% of the genes are aligned, without fragmentation, to the mapped super-scaffolds of one or the other genome. We introduce a gene identification procedure for plants that does not rely on similarity to known genes to remove erroneous predictions resulting from transposable elements. Using the available EST data to adjust for residual errors in the predictions, the estimated gene count is at least 38,000–40,000. Only 2%–3% of the genes are unique to any one subspecies, comparable to the amount of sequence that might still be missing. Despite this lack of variation in gene content, there is enormous variation in the intergenic regions. At least a quarter of the two sequences could not be aligned, and where they could be aligned, single nucleotide polymorphism (SNP) rates varied from as little as 3.0 SNP/kb in the coding regions to 27.6 SNP/kb in the transposable elements. A more inclusive new approach for analyzing duplication history is introduced here. It reveals an ancient whole-genome duplication, a recent segmental duplication on Chromosomes 11 and 12, and massive ongoing individual gene duplications. We find 18 distinct pairs of duplicated segments that cover 65.7% of the genome; 17 of these pairs date back to a common time before the divergence of the grasses. More important, ongoing individual gene duplications provide a never-ending source of raw material for gene genesis and are major contributors to the differences between members of the grass family.
Comparative genome sequencing of indica and japonica rice reveals that duplication of genes and genomic regions has played a major part in the evolution of grass genomes