PMCC PMCC

Search tips
Search criteria

Advanced
Results 1-8 (8)
 

Clipboard (0)
None

Select a Filter Below

Journals
Authors
more »
Year of Publication
Document Types
1.  Gene prediction in metagenomic fragments based on the SVM algorithm 
BMC Bioinformatics  2013;14(Suppl 5):S12.
Background
Metagenomic sequencing is becoming a powerful technology for exploring micro-ogranisms from various environments, such as human body, without isolation and cultivation. Accurately identifying genes from metagenomic fragments is one of the most fundamental issues.
Results
In this article, we present a novel gene prediction method named MetaGUN for metagenomic fragments based on a machine learning approach of SVM. It implements in a three-stage strategy to predict genes. Firstly, it classifies input fragments into phylogenetic groups by a k-mer based sequence binning method. Then, protein-coding sequences are identified for each group independently with SVM classifiers that integrate entropy density profiles (EDP) of codon usage, translation initiation site (TIS) scores and open reading frame (ORF) length as input patterns. Finally, the TISs are adjusted by employing a modified version of MetaTISA. To identify protein-coding sequences, MetaGun builds the universal module and the novel module. The former is based on a set of representative species, while the latter is designed to find potential functionary DNA sequences with conserved domains.
Conclusions
Comparisons on artificial shotgun fragments with multiple current metagenomic gene finders show that MetaGUN predicts better results on both 3' and 5' ends of genes with fragments of various lengths. Especially, it makes the most reliable predictions among these methods. As an application, MetaGUN was used to predict genes for two samples of human gut microbiome. It identifies thousands of additional genes with significant evidences. Further analysis indicates that MetaGUN tends to predict more potential novel genes than other current metagenomic gene finders.
doi:10.1186/1471-2105-14-S5-S12
PMCID: PMC3622649  PMID: 23735199
2.  Leaderless genes in bacteria: clue to the evolution of translation initiation mechanisms in prokaryotes 
BMC Genomics  2011;12:361.
Background
Shine-Dalgarno (SD) signal has long been viewed as the dominant translation initiation signal in prokaryotes. Recently, leaderless genes, which lack 5'-untranslated regions (5'-UTR) on their mRNAs, have been shown abundant in archaea. However, current large-scale in silico analyses on initiation mechanisms in bacteria are mainly based on the SD-led initiation way, other than the leaderless one. The study of leaderless genes in bacteria remains open, which causes uncertain understanding of translation initiation mechanisms for prokaryotes.
Results
Here, we study signals in translation initiation regions of all genes over 953 bacterial and 72 archaeal genomes, then make an effort to construct an evolutionary scenario in view of leaderless genes in bacteria. With an algorithm designed to identify multi-signal in upstream regions of genes for a genome, we classify all genes into SD-led, TA-led and atypical genes according to the category of the most probable signal in their upstream sequences. Particularly, occurrence of TA-like signals about 10 bp upstream to translation initiation site (TIS) in bacteria most probably means leaderless genes.
Conclusions
Our analysis reveals that leaderless genes are totally widespread, although not dominant, in a variety of bacteria. Especially for Actinobacteria and Deinococcus-Thermus, more than twenty percent of genes are leaderless. Analyzed in closely related bacterial genomes, our results imply that the change of translation initiation mechanisms, which happens between the genes deriving from a common ancestor, is linearly dependent on the phylogenetic relationship. Analysis on the macroevolution of leaderless genes further shows that the proportion of leaderless genes in bacteria has a decreasing trend in evolution.
doi:10.1186/1471-2164-12-361
PMCID: PMC3160421  PMID: 21749696
3.  A mutation degree model for the identification of transcriptional regulatory elements 
BMC Bioinformatics  2011;12:262.
Background
Current approaches for identifying transcriptional regulatory elements are mainly via the combination of two properties, the evolutionary conservation and the overrepresentation of functional elements in the promoters of co-regulated genes. Despite the development of many motif detection algorithms, the discovery of conserved motifs in a wide range of phylogenetically related promoters is still a challenge, especially for the short motifs embedded in distantly related gene promoters or very closely related promoters, or in the situation that there are not enough orthologous genes available.
Results
A mutation degree model is proposed and a new word counting method is developed for the identification of transcriptional regulatory elements from a set of co-expressed genes. The new method comprises two parts: 1) identifying overrepresented oligo-nucleotides in promoters of co-expressed genes, 2) estimating the conservation of the oligo-nucleotides in promoters of phylogenetically related genes by the mutation degree model. Compared with the performance of other algorithms, our method shows the advantages of low false positive rate and higher specificity, especially the robustness to noisy data. Applying the method to co-expressed gene sets from Arabidopsis, most of known cis-elements were successfully detected. The tool and example are available at http://mcube.nju.edu.cn/jwang/lab/soft/ocw/OCW.html.
Conclusions
The mutation degree model proposed in this paper is adapted to phylogenetic data of different qualities, and to a wide range of evolutionary distances. The new word-counting method based on this model has the advantage of better performance in detecting short sequence of cis-elements from co-expressed genes of eukaryotes and is robust to less complete phylogenetic data.
doi:10.1186/1471-2105-12-262
PMCID: PMC3228546  PMID: 21708002
4.  Genome reannotation of Escherichia coli CFT073 with new insights into virulence 
BMC Genomics  2009;10:552.
Background
As one of human pathogens, the genome of Uropathogenic Escherichia coli strain CFT073 was sequenced and published in 2002, which was significant in pathogenetic bacterial genomics research. However, the current RefSeq annotation of this pathogen is now outdated to some degree, due to missing or misannotation of some essential genes associated with its virulence. We carried out a systematic reannotation by combining automated annotation tools with manual efforts to provide a comprehensive understanding of virulence for the CFT073 genome.
Results
The reannotation excluded 608 coding sequences from the RefSeq annotation. Meanwhile, a total of 299 coding sequences were newly added, about one third of them are found in genomic island (GI) regions while more than one fifth of them are located in virulence related regions pathogenicity islands (PAIs). Furthermore, there are totally 341 genes were relocated with their translational initiation sites (TISs), which resulted in a high quality of gene start annotation. In addition, 94 pseudogenes annotated in RefSeq were thoroughly inspected and updated. The number of miscellaneous genes (sRNAs) has been updated from 6 in RefSeq to 46 in the reannotation. Based on the adjustment in the reannotation, subsequent analysis were conducted by both general and case studies on new virulence factors or new virulence-associated genes that are crucial during the urinary tract infections (UTIs) process, including invasion, colonization, nutrition uptaking and population density control. Furthermore, miscellaneous RNAs collected in the reannotation are believed to contribute to the virulence of strain CFT073. The reannotation including the nucleotide data, the original RefSeq annotation, and all reannotated results is freely available via http://mech.ctb.pku.edu.cn/CFT073/.
Conclusion
As a result, the reannotation presents a more comprehensive picture of mechanisms of uropathogenicity of UPEC strain CFT073. The new genes change the view of its uropathogenicity in many respects, particularly by new genes in GI regions and new virulence-associated factors. The reannotation thus functions as an important source by providing new information about genomic structure and organization, and gene function. Moreover, we expect that the detailed analysis will facilitate the studies for exploration of novel virulence mechanisms and help guide experimental design.
doi:10.1186/1471-2164-10-552
PMCID: PMC2785843  PMID: 19930606
5.  Computational evaluation of TIS annotation for prokaryotic genomes 
BMC Bioinformatics  2008;9:160.
Background
Accurate annotation of translation initiation sites (TISs) is essential for understanding the translation initiation mechanism. However, the reliability of TIS annotation in widely used databases such as RefSeq is uncertain due to the lack of experimental benchmarks.
Results
Based on a homogeneity assumption that gene translation-related signals are uniformly distributed across a genome, we have established a computational method for a large-scale quantitative assessment of the reliability of TIS annotations for any prokaryotic genome. The method consists of modeling a positional weight matrix (PWM) of aligned sequences around predicted TISs in terms of a linear combination of three elementary PWMs, one for true TIS and the two others for false TISs. The three elementary PWMs are obtained using a reference set with highly reliable TIS predictions. A generalized least square estimator determines the weighting of the true TIS in the observed PWM, from which the accuracy of the prediction is derived. The validity of the method and the extent of the limitation of the assumptions are explicitly addressed by testing on experimentally verified TISs with variable accuracy of the reference sets. The method is applied to estimate the accuracy of TIS annotations that are provided on public databases such as RefSeq and ProTISA and by programs such as EasyGene, GeneMarkS, Glimmer 3 and TiCo. It is shown that RefSeq's TIS prediction is significantly less accurate than two recent predictors, Tico and ProTISA. With convincing proofs, we show two general preferential biases in the RefSeq annotation, i.e. over-annotating the longest open reading frame (LORF) and under-annotating ATG start codon. Finally, we have established a new TIS database, SupTISA, based on the best prediction of all the predictors; SupTISA has achieved an average accuracy of 92% over all 532 complete genomes.
Conclusion
Large-scale computational evaluation of TIS annotation has been achieved. A new TIS database much better than RefSeq has been constructed, and it provides a valuable resource for further TIS studies.
doi:10.1186/1471-2105-9-160
PMCID: PMC2362131  PMID: 18366730
6.  A dynamic Bayesian network approach to protein secondary structure prediction 
BMC Bioinformatics  2008;9:49.
Background
Protein secondary structure prediction method based on probabilistic models such as hidden Markov model (HMM) appeals to many because it provides meaningful information relevant to sequence-structure relationship. However, at present, the prediction accuracy of pure HMM-type methods is much lower than that of machine learning-based methods such as neural networks (NN) or support vector machines (SVM).
Results
In this paper, we report a new method of probabilistic nature for protein secondary structure prediction, based on dynamic Bayesian networks (DBN). The new method models the PSI-BLAST profile of a protein sequence using a multivariate Gaussian distribution, and simultaneously takes into account the dependency between the profile and secondary structure and the dependency between profiles of neighboring residues. In addition, a segment length distribution is introduced for each secondary structure state. Tests show that the DBN method has made a significant improvement in the accuracy compared to other pure HMM-type methods. Further improvement is achieved by combining the DBN with an NN, a method called DBNN, which shows better Q3 accuracy than many popular methods and is competitive to the current state-of-the-arts. The most interesting feature of DBN/DBNN is that a significant improvement in the prediction accuracy is achieved when combined with other methods by a simple consensus.
Conclusion
The DBN method using a Gaussian distribution for the PSI-BLAST profile and a high-ordered dependency between profiles of neighboring residues produces significantly better prediction accuracy than other HMM-type probabilistic methods. Owing to their different nature, the DBN and NN combine to form a more accurate method DBNN. Future improvement may be achieved by combining DBNN with a method of SVM type.
doi:10.1186/1471-2105-9-49
PMCID: PMC2266706  PMID: 18218144
7.  ProTISA: a comprehensive resource for translation initiation site annotation in prokaryotic genomes 
Nucleic Acids Research  2007;36(Database issue):D114-D119.
Correct annotation of translation initiation site (TIS) is essential for both experiments and bioinformatics studies of prokaryotic translation initiation mechanism as well as understanding of gene regulation and gene structure. Here we describe a comprehensive database ProTISA, which collects TIS confirmed through a variety of available evidences for prokaryotic genomes, including Swiss-Prot experiments record, literature, conserved domain hits and sequence alignment between orthologous genes. Moreover, by combining the predictions from our recently developed TIS post-processor, ProTISA provides a refined annotation for the public database RefSeq. Furthermore, the database annotates the potential regulatory signals associated with translation initiation at the TIS upstream region. As of July 2007, ProTISA includes 440 microbial genomes with more than 390 000 confirmed TISs. The database is available at http://mech.ctb.pku.edu.cn/protisa
doi:10.1093/nar/gkm799
PMCID: PMC2238952  PMID: 17942412
8.  MED: a new non-supervised gene prediction algorithm for bacterial and archaeal genomes 
BMC Bioinformatics  2007;8:97.
Background
Despite a remarkable success in the computational prediction of genes in Bacteria and Archaea, a lack of comprehensive understanding of prokaryotic gene structures prevents from further elucidation of differences among genomes. It continues to be interesting to develop new ab initio algorithms which not only accurately predict genes, but also facilitate comparative studies of prokaryotic genomes.
Results
This paper describes a new prokaryotic genefinding algorithm based on a comprehensive statistical model of protein coding Open Reading Frames (ORFs) and Translation Initiation Sites (TISs). The former is based on a linguistic "Entropy Density Profile" (EDP) model of coding DNA sequence and the latter comprises several relevant features related to the translation initiation. They are combined to form a so-called Multivariate Entropy Distance (MED) algorithm, MED 2.0, that incorporates several strategies in the iterative program. The iterations enable us to develop a non-supervised learning process and to obtain a set of genome-specific parameters for the gene structure, before making the prediction of genes.
Conclusion
Results of extensive tests show that MED 2.0 achieves a competitive high performance in the gene prediction for both 5' and 3' end matches, compared to the current best prokaryotic gene finders. The advantage of the MED 2.0 is particularly evident for GC-rich genomes and archaeal genomes. Furthermore, the genome-specific parameters given by MED 2.0 match with the current understanding of prokaryotic genomes and may serve as tools for comparative genomic studies. In particular, MED 2.0 is shown to reveal divergent translation initiation mechanisms in archaeal genomes while making a more accurate prediction of TISs compared to the existing gene finders and the current GenBank annotation.
doi:10.1186/1471-2105-8-97
PMCID: PMC1847833  PMID: 17367537

Results 1-8 (8)