Search tips
Search criteria

Results 1-12 (12)

Clipboard (0)

Select a Filter Below

Year of Publication
Document Types
author:("Bu, dongdao")
1.  Protein Structure Idealization: How accurately is it possible to model protein structures with dihedral angles? 
Previous studies show that the same type of bond lengths and angles fit Gaussian distributions well with small standard deviations on high resolution protein structure data. The mean values of these Gaussian distributions have been widely used as ideal bond lengths and angles in bioinformatics. However, we are not aware of any research done to evaluate how accurately we can model protein structures with dihedral angles and ideal bond lengths and angles.
Here, we introduce the protein structure idealization problem. We focus on the protein backbone structure idealization. We describe a fast O(nm/ε) dynamic programming algorithm to find an idealized protein backbone structure that is approximately optimal according to our scoring function. The scoring function evaluates not only the free energy, but also the similarity with the target structure. Thus, the idealized protein structures found by our algorithm are guaranteed to be protein-like and close to the target protein structure.
We have implemented our protein structure idealization algorithm and idealized the high resolution protein structures with low sequence identities of the CULLPDB_PC30_RES1.6_R0.25 data set. We demonstrate that idealized backbone structures always exist with small changes and significantly better free energy. We also applied our algorithm to refine protein pseudo-structures determined in NMR experiments.
PMCID: PMC3655034  PMID: 23442792
Protein structure idealization; Ideal bond length and angle; Dihedral angle space
2.  Finding Nearly Optimal GDT Scores 
Journal of Computational Biology  2011;18(5):693-704.
Global Distance Test (GDT) is one of the commonly accepted measures to assess the quality of predicted protein structures. Given a set of distance thresholds, GDT maximizes the percentage of superimposed (or matched) residue pairs under each threshold, and reports the average of these percentages as the final score. The computation of GDT score was conjectured to be NP-hard. All available methods are heuristic and do not guarantee the optimality of scores. These heuristic strategies usually result in underestimated GDT scores. Contrary to the conjecture, the problem can be solved exactly in polynomial time, albeit the method would be too slow for practical usage. In this paper we propose an efficient tool called OptGDT to obtain GDT scores with theoretically guaranteed accuracies. Denote ℓ as the number of matched residue pairs found by OptGDT for a given threshold d. Let ℓ′ be the optimal number of matched residues pairs for threshold d/(1 + ε), where ε is a parameter in our computation. OptGDT guarantees that ℓ ≥ ℓ′. We applied our tool to CASP8 (The eighth Critical Assessment of Structure Prediction Techniques) data. For 87.3% of the predicted models, better GDT scores are obtained when OptGDT is used. In some cases, the number of matched residue pairs were improved by at least 10%. The tool runs in time O(n3 log n/ε5) for a given threshold d and parameter ε. In the case of globular proteins, the tool can be improved to a randomized algorithm of O(n log2 n) runtime with probability at least 1 − O(1/n). Released under the GPL license and downloadable from∼scli/OptGDT/.
PMCID: PMC3607910  PMID: 21554017
algorithms; alignment; computational molecular biology; linear programming; protein folding
3.  ProbPS: A new model for peak selection based on quantifying the dependence of the existence of derivative peaks on primary ion intensity 
BMC Bioinformatics  2011;12:346.
The analysis of mass spectra suggests that the existence of derivative peaks is strongly dependent on the intensity of the primary peaks. Peak selection from tandem mass spectrum is used to filter out noise and contaminant peaks. It is widely accepted that a valid primary peak tends to have high intensity and is accompanied by derivative peaks, including isotopic peaks, neutral loss peaks, and complementary peaks. Existing models for peak selection ignore the dependence between the existence of the derivative peaks and the intensity of the primary peaks. Simple models for peak selection assume that these two attributes are independent; however, this assumption is contrary to real data and prone to error.
In this paper, we present a statistical model to quantitatively measure the dependence of the derivative peak's existence on the primary peak's intensity. Here, we propose a statistical model, named ProbPS, to capture the dependence in a quantitative manner and describe a statistical model for peak selection. Our results show that the quantitative understanding can successfully guide the peak selection process. By comparing ProbPS with AuDeNS we demonstrate the advantages of our method in both filtering out noise peaks and in improving de novo identification. In addition, we present a tag identification approach based on our peak selection method. Our results, using a test data set, suggest that our tag identification method (876 correct tags in 1000 spectra) outperforms PepNovoTag (790 correct tags in 1000 spectra).
We have shown that ProbPS improves the accuracy of peak selection which further enhances the performance of de novo sequencing and tag identification. Thus, our model saves valuable computation time and improving the accuracy of the results.
PMCID: PMC3179969  PMID: 21849060
4.  PI: An open-source software package for validation of the SEQUEST result and visualization of mass spectrum 
BMC Bioinformatics  2011;12:234.
Tandem mass spectrometry (MS/MS) has emerged as the leading method for high- throughput protein identification in proteomics. Recent technological breakthroughs have dramatically increased the efficiency of MS/MS data generation. Meanwhile, sophisticated algorithms have been developed for identifying proteins from peptide MS/MS data by searching available protein sequence databases for the peptide that is most likely to have produced the observed spectrum. The popular SEQUEST algorithm relies on the cross-correlation between the experimental mass spectrum and the theoretical spectrum of a peptide. It utilizes a simplified fragmentation model that assigns a fixed and identical intensity for all major ions and fixed and lower intensity for their neutral losses. In this way, the common issues involved in predicting theoretical spectra are circumvented. In practice, however, an experimental spectrum is usually not similar to its SEQUEST -predicted theoretical one, and as a result, incorrect identifications are often generated.
Better understanding of peptide fragmentation is required to produce more accurate and sensitive peptide sequencing algorithms. Here, we designed the software PI of novel and exquisite algorithms that make a good use of intensity property of a spectrum.
We designed the software PI with the novel and effective algorithms which made a good use of intensity property of the spectrum. Experiments have shown that PI was able to validate and improve the results of SEQUEST to a more satisfactory degree.
PMCID: PMC3123612  PMID: 21672262
5.  Incorporating Ab Initio energy into threading approaches for protein structure prediction 
BMC Bioinformatics  2011;12(Suppl 1):S54.
Native structures of proteins are formed essentially due to the combining effects of local and distant (in the sense of sequence) interactions among residues. These interaction information are, explicitly or implicitly, encoded into the scoring function in protein structure prediction approaches—threading approaches usually measure an alignment in the sense that how well a sequence adopts an existing structure; while the energy functions in Ab Initio methods are designed to measure how likely a conformation is near-native. Encouraging progress has been observed in structure refinement where knowledge-based or physics-based potentials are designed to capture distant interactions. Thus, it is interesting to investigate whether distant interaction information captured by the Ab Initio energy function can be used to improve threading, especially for the weakly/distant homologous templates.
In this paper, we investigate the possibility to improve alignment-generating through incorporating distant interaction information into the alignment scoring function in a nontrivial approach. Specifically, the distant interaction information is introduced through employing an Ab Initio energy function to evaluate the “partial” decoy built from an alignment. Subsequently, a local search algorithm is utilized to optimize the scoring function.
Experimental results demonstrate that with distant interaction items, the quality of generated alignments are improved on 68 out of 127 query-template pairs in Prosup benchmark. In addition, compared with state-to-art threading methods, our method performs better on alignment accuracy comparison.
Incorporating Ab Initio energy functions into threading can greatly improve alignment accuracy.
PMCID: PMC3044312  PMID: 21342587
6.  Improving consensus contact prediction via server correlation reduction 
Protein inter-residue contacts play a crucial role in the determination and prediction of protein structures. Previous studies on contact prediction indicate that although template-based consensus methods outperform sequence-based methods on targets with typical templates, such consensus methods perform poorly on new fold targets. However, we find out that even for new fold targets, the models generated by threading programs can contain many true contacts. The challenge is how to identify them.
In this paper, we develop an integer linear programming model for consensus contact prediction. In contrast to the simple majority voting method assuming that all the individual servers are equally important and independent, the newly developed method evaluates their correlation by using maximum likelihood estimation and extracts independent latent servers from them by using principal component analysis. An integer linear programming method is then applied to assign a weight to each latent server to maximize the difference between true contacts and false ones. The proposed method is tested on the CASP7 data set. If the top L/5 predicted contacts are evaluated where L is the protein size, the average accuracy is 73%, which is much higher than that of any previously reported study. Moreover, if only the 15 new fold CASP7 targets are considered, our method achieves an average accuracy of 37%, which is much better than that of the majority voting method, SVM-LOMETS, SVM-SEQ, and SAM-T06. These methods demonstrate an average accuracy of 13.0%, 10.8%, 25.8% and 21.2%, respectively.
Reducing server correlation and optimally combining independent latent servers show a significant improvement over the traditional consensus methods. This approach can hopefully provide a powerful tool for protein structure refinement and prediction use.
PMCID: PMC2689239  PMID: 19419562
7.  Designing succinct structural alphabets 
Bioinformatics  2008;24(13):i182-i189.
Motivation: The 3D structure of a protein sequence can be assembled from the substructures corresponding to small segments of this sequence. For each small sequence segment, there are only a few more likely substructures. We call them the ‘structural alphabet’ for this segment. Classical approaches such as ROSETTA used sequence profile and secondary structure information, to predict structural fragments. In contrast, we utilize more structural information, such as solvent accessibility and contact capacity, for finding structural fragments.
Results: Integer linear programming technique is applied to derive the best combination of these sequence and structural information items. This approach generates significantly more accurate and succinct structural alphabets with more than 50% improvement over the previous accuracies. With these novel structural alphabets, we are able to construct more accurate protein structures than the state-of-art ab initio protein structure prediction programs such as ROSETTA. We are also able to reduce the Kolodny's library size by a factor of 8, at the same accuracy.
Availability: The online FRazor server is under construction,,
PMCID: PMC2718643  PMID: 18586712
8.  A novel scoring schema for peptide identification by searching protein sequence databases using tandem mass spectrometry data 
BMC Bioinformatics  2006;7:222.
Tandem mass spectrometry (MS/MS) is a powerful tool for protein identification. Although great efforts have been made in scoring the correlation between tandem mass spectra and an amino acid sequence database, improvements could be made in three aspects, including characterization ofpeaks in spectra, adoption of effective scoring functions and access to thereliability of matching between peptides and spectra.
A novel scoring function is presented, along with criteria to estimate the performance confidence of the function. Through learning the typesof product ions and the probability of generating them, a hypothetic spectrum was generated for each candidate peptide. Then relative entropy was introduced to measure the similarity between the hypothetic and the observed spectra. Based on the extreme value distribution (EVD) theory, a threshold was chosen to distinguish a true peptide assignment from a random one. Tests on a public MS/MS dataset demonstrated that this method performs better than the well-known SEQUEST.
A reliable identification of proteins from the spectra promises a more efficient application of tandem mass spectrometry to proteomes with high complexity.
PMCID: PMC1463009  PMID: 16638152
9.  The interactome as a tree—an attempt to visualize the protein–protein interaction network in yeast 
Nucleic Acids Research  2004;32(16):4804-4811.
The refinement and high-throughput of protein interaction detection methods offer us a protein–protein interaction network in yeast. The challenge coming along with the network is to find better ways to make it accessible for biological investigation. Visualization would be helpful for extraction of meaningful biological information from the network. However, traditional ways of visualizing the network are unsuitable because of the large number of proteins. Here, we provide a simple but information-rich approach for visualization which integrates topological and biological information. In our method, the topological information such as quasi-cliques or spoke-like modules of the network is extracted into a clustering tree, where biological information spanning from protein functional annotation to expression profile correlations can be annotated onto the representation of it. We have developed a software named PINC based on our approach. Compared with previous clustering methods, our clustering method ADJW performs well both in retaining a meaningful image of the protein interaction network as well as in enriching the image with biological information, therefore is more suitable in visualization of the network.
PMCID: PMC519110  PMID: 15356297
10.  Date of origin of the SARS coronavirus strains 
A new respiratory infectious epidemic, severe acute respiratory syndrome (SARS), broke out and spread throughout the world. By now the putative pathogen of SARS has been identified as a new coronavirus, a single positive-strand RNA virus. RNA viruses commonly have a high rate of genetic mutation. It is therefore important to know the mutation rate of the SARS coronavirus as it spreads through the population. Moreover, finding a date for the last common ancestor of SARS coronavirus strains would be useful for understanding the circumstances surrounding the emergence of the SARS pandemic and the rate at which SARS coronavirus diverge.
We propose a mathematical model to estimate the evolution rate of the SARS coronavirus genome and the time of the last common ancestor of the sequenced SARS strains. Under some common assumptions and justifiable simplifications, a few simple equations incorporating the evolution rate (K) and time of the last common ancestor of the strains (T0) can be deduced. We then implemented the least square method to estimate K and T0 from the dataset of sequences and corresponding times. Monte Carlo stimulation was employed to discuss the results.
Based on 6 strains with accurate dates of host death, we estimated the time of the last common ancestor to be about August or September 2002, and the evolution rate to be about 0.16 base/day, that is, the SARS coronavirus would on average change a base every seven days. We validated our method by dividing the strains into two groups, which coincided with the results from comparative genomics.
The applied method is simple to implement and avoid the difficulty and subjectivity of choosing the root of phylogenetic tree. Based on 6 strains with accurate date of host death, we estimated a time of the last common ancestor, which is coincident with epidemic investigations, and an evolution rate in the same range as that reported for the HIV-1 virus.
PMCID: PMC516801  PMID: 15028113
11.  Topological structure analysis of the protein–protein interaction network in budding yeast 
Nucleic Acids Research  2003;31(9):2443-2450.
Interaction detection methods have led to the discovery of thousands of interactions between proteins, and discerning relevance within large-scale data sets is important to present-day biology. Here, a spectral method derived from graph theory was introduced to uncover hidden topological structures (i.e. quasi-cliques and quasi-bipartites) of complicated protein–protein interaction networks. Our analyses suggest that these hidden topological structures consist of biologically relevant functional groups. This result motivates a new method to predict the function of uncharacterized proteins based on the classification of known proteins within topological structures. Using this spectral analysis method, 48 quasi-cliques and six quasi-bipartites were isolated from a network involving 11 855 interactions among 2617 proteins in budding yeast, and 76 uncharacterized proteins were assigned functions.
PMCID: PMC154226  PMID: 12711690
12.  The Genomes of Oryza sativa: A History of Duplications 
Yu, Jun | Wang, Jun | Lin, Wei | Li, Songgang | Li, Heng | Zhou, Jun | Ni, Peixiang | Dong, Wei | Hu, Songnian | Zeng, Changqing | Zhang, Jianguo | Zhang, Yong | Li, Ruiqiang | Xu, Zuyuan | Li, Shengting | Li, Xianran | Zheng, Hongkun | Cong, Lijuan | Lin, Liang | Yin, Jianning | Geng, Jianing | Li, Guangyuan | Shi, Jianping | Liu, Juan | Lv, Hong | Li, Jun | Wang, Jing | Deng, Yajun | Ran, Longhua | Shi, Xiaoli | Wang, Xiyin | Wu, Qingfa | Li, Changfeng | Ren, Xiaoyu | Wang, Jingqiang | Wang, Xiaoling | Li, Dawei | Liu, Dongyuan | Zhang, Xiaowei | Ji, Zhendong | Zhao, Wenming | Sun, Yongqiao | Zhang, Zhenpeng | Bao, Jingyue | Han, Yujun | Dong, Lingli | Ji, Jia | Chen, Peng | Wu, Shuming | Liu, Jinsong | Xiao, Ying | Bu, Dongbo | Tan, Jianlong | Yang, Li | Ye, Chen | Zhang, Jingfen | Xu, Jingyi | Zhou, Yan | Yu, Yingpu | Zhang, Bing | Zhuang, Shulin | Wei, Haibin | Liu, Bin | Lei, Meng | Yu, Hong | Li, Yuanzhe | Xu, Hao | Wei, Shulin | He, Ximiao | Fang, Lijun | Zhang, Zengjin | Zhang, Yunze | Huang, Xiangang | Su, Zhixi | Tong, Wei | Li, Jinhong | Tong, Zongzhong | Li, Shuangli | Ye, Jia | Wang, Lishun | Fang, Lin | Lei, Tingting | Chen, Chen | Chen, Huan | Xu, Zhao | Li, Haihong | Huang, Haiyan | Zhang, Feng | Xu, Huayong | Li, Na | Zhao, Caifeng | Li, Shuting | Dong, Lijun | Huang, Yanqing | Li, Long | Xi, Yan | Qi, Qiuhui | Li, Wenjie | Zhang, Bo | Hu, Wei | Zhang, Yanling | Tian, Xiangjun | Jiao, Yongzhi | Liang, Xiaohu | Jin, Jiao | Gao, Lei | Zheng, Weimou | Hao, Bailin | Liu, Siqi | Wang, Wen | Yuan, Longping | Cao, Mengliang | McDermott, Jason | Samudrala, Ram | Wang, Jian | Wong, Gane Ka-Shu | Yang, Huanming
PLoS Biology  2005;3(2):e38.
We report improved whole-genome shotgun sequences for the genomes of indica and japonica rice, both with multimegabase contiguity, or almost 1,000-fold improvement over the drafts of 2002. Tested against a nonredundant collection of 19,079 full-length cDNAs, 97.7% of the genes are aligned, without fragmentation, to the mapped super-scaffolds of one or the other genome. We introduce a gene identification procedure for plants that does not rely on similarity to known genes to remove erroneous predictions resulting from transposable elements. Using the available EST data to adjust for residual errors in the predictions, the estimated gene count is at least 38,000–40,000. Only 2%–3% of the genes are unique to any one subspecies, comparable to the amount of sequence that might still be missing. Despite this lack of variation in gene content, there is enormous variation in the intergenic regions. At least a quarter of the two sequences could not be aligned, and where they could be aligned, single nucleotide polymorphism (SNP) rates varied from as little as 3.0 SNP/kb in the coding regions to 27.6 SNP/kb in the transposable elements. A more inclusive new approach for analyzing duplication history is introduced here. It reveals an ancient whole-genome duplication, a recent segmental duplication on Chromosomes 11 and 12, and massive ongoing individual gene duplications. We find 18 distinct pairs of duplicated segments that cover 65.7% of the genome; 17 of these pairs date back to a common time before the divergence of the grasses. More important, ongoing individual gene duplications provide a never-ending source of raw material for gene genesis and are major contributors to the differences between members of the grass family.
Comparative genome sequencing of indica and japonica rice reveals that duplication of genes and genomic regions has played a major part in the evolution of grass genomes
PMCID: PMC546038  PMID: 15685292

Results 1-12 (12)