
Results 1-17 (17)

1.  Constraint Programming Based Biomarker Optimization 
BioMed Research International  2015;2015:910515.
Efficient and intuitive characterization of biological big data is becoming a major challenge for modern bio-OMIC based scientists. Interactive visualization and exploration of big data has proven to be one of the successful solutions. Most existing feature selection algorithms do not allow interactive input from users during the optimization process of feature selection. This study addresses the problem by fixing a few user-input features in the finally selected feature subset and formulating these user-input features as constraints for a programming model. The proposed algorithm, fsCoP (feature selection based on constrained programming), performs similarly to or much better than the existing feature selection algorithms, even with constraints drawn from both the literature and the existing algorithms. An fsCoP biomarker may be intriguing for further wet lab validation, since it satisfies both the classification optimization function and the biomedical knowledge. fsCoP may also be used for the interactive exploration of bio-OMIC big data by interactively adding user-defined constraints for modeling.
PMCID: PMC4437250  PMID: 26075274
2.  WinHAP2: an extremely fast haplotype phasing program for long genotype sequences 
BMC Bioinformatics  2014;15:164.
The haplotype phasing problem tries to screen for phenotype-associated genomic variations from millions of candidate data. Most current computer programs handle this problem with high requirements of computing power and memory. By replacing the computation-intensive step of constructing the maximum spanning tree with a heuristic based on estimated initial haplotypes, we released the WinHAP algorithm version 1.0, which outperforms the other algorithms in terms of both running speed and overall accuracy.
This work further speeds up the WinHAP algorithm to version 2.0 (WinHAP2) by utilizing the divide-and-conquer strategy and the OpenMP parallel computing mode. WinHAP2 can phase 500 genotypes with 1,000,000 SNPs using just 12.8 MB of memory and 2.5 hours on a personal computer, whereas the other programs require unacceptable memory or running times. The parallel running mode further improves WinHAP2's running speed by several orders of magnitude, compared with the other programs, including Beagle, SHAPEIT2 and 2SNP.
WinHAP2 is an extremely fast haplotype phasing program which can handle a large-scale genotyping study with any number of SNPs in the current literature and at least in the near future.
PMCID: PMC4094983  PMID: 24884701
Haplotype phasing; Genotype; SNP; Long sequence; Parallel computing
3.  A novel molecular typing method of Mycobacteria based on DNA barcoding visualization 
Different subtypes of Mycobacterium tuberculosis (MTB) may induce diverse severe human infections, and some of their symptoms are similar to those caused by other pathogens, e.g. nontuberculous mycobacteria (NTM). Determination of mycobacterium subtypes therefore facilitates the effective control of MTB infection and proliferation. This study exploits a novel DNA barcoding visualization method for molecular typing of 17 mycobacteria genomes published in the NCBI prokaryotic genome database. Three mycobacterium genes (Rv0279c, Rv3508 and Rv3514) from the PE/PPE family of MTB were found to best represent the inter-strain pathogenetic variations. An accurate and fast MTB substrain typing method was proposed based on the combination of the aforementioned three biomarker genes and the 16S rRNA gene. The protocol for establishing a bacterial substrain typing system used in this study may also be applied to other pathogens.
PMCID: PMC3931916  PMID: 24555538
Mycobacterium; Molecular typing; Typing biomarker; Bioinformatics; Differential diagnosis of mycobacteria
4.  Normalizing Electrocardiograms of Both Healthy Persons and Cardiovascular Disease Patients for Biometric Authentication 
PLoS ONE  2013;8(8):e71523.
Although the electrocardiogram (ECG) fluctuates over time and with physical activity, some of its intrinsic measurements serve well as biometric features. Considering its constant availability and the difficulty of faking it, the ECG signal is becoming a promising factor for biometric authentication. The majority of the currently available algorithms only work well on healthy participants. A novel normalization and interpolation algorithm is proposed to convert an ECG signal into multiple template cycles, which are comparable between any two ECGs, regardless of sampling rate or health status. The overall accuracies reach 100% and 90.11% for healthy participants and cardiovascular disease (CVD) patients, respectively.
PMCID: PMC3748040  PMID: 23977063
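The idea of making ECG cycles comparable across sampling rates can be illustrated with a minimal sketch. This is not the algorithm from the paper, whose details the abstract does not give; it only assumes that a single heartbeat cycle has already been segmented, resamples it onto a fixed number of points by linear interpolation, and rescales its amplitude to the range 0 to 1. The function names and the default of 100 points are hypothetical.

```python
def resample_cycle(samples, n_points=100):
    """Linearly interpolate one cycle onto n_points evenly spaced positions.

    Assumption: `samples` is a single, already segmented ECG cycle.
    """
    if len(samples) < 2 or n_points < 2:
        raise ValueError("need at least two samples and two output points")
    step = (len(samples) - 1) / (n_points - 1)
    resampled = []
    for j in range(n_points):
        pos = j * step
        lo = min(int(pos), len(samples) - 2)  # left neighbour index
        frac = pos - lo                       # distance from left neighbour
        resampled.append(samples[lo] * (1 - frac) + samples[lo + 1] * frac)
    return resampled


def normalize_amplitude(samples):
    """Rescale amplitudes to [0, 1]; a flat signal maps to all zeros."""
    lo, hi = min(samples), max(samples)
    if hi == lo:
        return [0.0] * len(samples)
    return [(s - lo) / (hi - lo) for s in samples]
```

After both steps, two cycles recorded at different sampling rates have the same length and amplitude range, so they can be compared point by point.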
5.  Clinical Effects of Xinmailong Therapy in Patients with Chronic Heart Failure 
Over the last 100 years, intensive studies have sought systematic approaches to curing chronic heart failure; however, the problem remains unresolved due to its complicated pathogenesis and ineffective early diagnosis. The present investigation aimed to evaluate the potential effects of the traditional Chinese medicine Xinmailong on chronic heart failure (CHF) patients, as compared to the standard western medical treatment available so far. In our study, we selected two groups of voluntary CHF patients at the Xiangya Hospital, who were administered Xinmailong or standard treatments, respectively. Another group of voluntary healthy individuals was recruited as the control group. The treatment effectiveness was measured by five symptomatic factors, i.e. angiotensin II (Ang_II), high sensitivity C-reactive protein (hs_CRP), left ventricular end systolic volume index (LVESVI), left ventricular ejection fraction (LVEF) and pro-B-type natriuretic peptide (NT_proBNP), compared between the control group and the CHF patients at different stages of drug administration and in different treatment groups. The timeline for the full dose administration was set to 15 days, and the five measurements indicated above were taken on days 0, 7 and 15 of the drug administration. Similar symptomatic measurements were observed on day 0 in both treatment groups, and slight improvements were observed on day 7. After a full course of drug administration for 15 days, both treatment groups achieved statistically significant improvements in all five measures, but the improvements with Xinmailong were found to be more statistically significant (almost double) than those with the available drug treatments for chronic heart failure.
PMCID: PMC3619101  PMID: 23569425
Chronic heart failure; Traditional Chinese Medicine; Xinmailong.
6.  WinHAP: An Efficient Haplotype Phasing Algorithm Based on Scalable Sliding Windows 
PLoS ONE  2012;7(8):e43163.
Haplotype phasing represents an essential step in studying the association of genomic polymorphisms with complex genetic diseases, and in determining targets for drug design. In recent years, huge amounts of genotype data have been produced by the rapidly evolving high-throughput sequencing technologies, and the data volume challenges the community to develop more efficient haplotype phasing algorithms, in terms of both running time and overall accuracy. 2SNP is one of the fastest haplotype phasing algorithms, with error rates comparably low to those of the other algorithms. The most time-consuming step of 2SNP is the construction of a maximum spanning tree (MST) among all the heterozygous SNP pairs. We simplified this step by replacing the MST with the initial haplotypes of adjacent heterozygous SNP pairs. The multi-SNP haplotypes were estimated within a sliding window along the chromosomes. The comparative studies on four different-scale genotype datasets suggest that our algorithm WinHAP outperforms 2SNP and most of the other haplotype phasing algorithms in terms of both running speeds and overall accuracies. To facilitate WinHAP's application in more practical biological datasets, we released the software for free at:
PMCID: PMC3419172  PMID: 22905221
7.  QServer: A Biclustering Server for Prediction and Assessment of Co-Expressed Gene Clusters 
PLoS ONE  2012;7(3):e32660.
Biclustering is a powerful technique for identification of co-expressed gene groups under any (unspecified) substantial subset of given experimental conditions, which can be used for elucidation of transcriptionally co-regulated genes.
We have previously developed a biclustering algorithm, QUBIC, which can solve more general biclustering problems than previous biclustering algorithms. To fully utilize the analysis power the algorithm provides, we have developed a web server, QServer, for prediction, computational validation and analyses of co-expressed gene clusters. Specifically, the QServer has the following capabilities in addition to biclustering by QUBIC: (i) prediction and assessment of conserved cis regulatory motifs in promoter sequences of the predicted co-expressed genes; (ii) functional enrichment analyses of the predicted co-expressed gene clusters using Gene Ontology (GO) terms, and (iii) visualization capabilities in support of interactive biclustering analyses. QServer supports the biclustering and functional analysis for a wide range of organisms, including human, mouse, Arabidopsis, bacteria and archaea, whose underlying genome database will be continuously updated.
We believe that QServer provides an easy-to-use and highly effective platform useful for hypothesis formulation and testing related to transcription co-regulation.
PMCID: PMC3293860  PMID: 22403692
8.  Insights into plant biomass conversion from the genome of the anaerobic thermophilic bacterium Caldicellulosiruptor bescii DSM 6725 
Nucleic Acids Research  2011;39(8):3240-3254.
Caldicellulosiruptor bescii DSM 6725 utilizes various polysaccharides and grows efficiently on untreated high-lignin grasses and hardwood at an optimum temperature of ∼80°C. It is a promising anaerobic bacterium for studying high-temperature biomass conversion. Its genome contains 2666 protein-coding sequences organized into 1209 operons. Expression of 2196 genes (83%) was confirmed experimentally. At least 322 genes appear to have been obtained by lateral gene transfer (LGT). Putative functions were assigned to 364 conserved/hypothetical protein (C/HP) genes. The genome contains 171 and 88 genes related to carbohydrate transport and utilization, respectively. Growth on cellulose led to the up-regulation of 32 carbohydrate-active (CAZy), 61 sugar transport, 25 transcription factor and 234 C/HP genes. Some C/HPs were overproduced on cellulose or xylan, suggesting their involvement in polysaccharide conversion. A unique feature of the genome is enrichment with genes encoding multi-modular, multi-functional CAZy proteins organized into one large cluster, the products of which are proposed to act synergistically on different components of plant cell walls and to aid the ability of C. bescii to convert plant biomass. The high duplication of CAZy domains coupled with the ability to acquire foreign genes by LGT may have allowed the bacterium to rapidly adapt to changing plant biomass-rich environments.
PMCID: PMC3082886  PMID: 21227922
9.  cBar: a computer program to distinguish plasmid-derived from chromosome-derived sequence fragments in metagenomics data 
Bioinformatics  2010;26(16):2051-2052.
Summary: Huge amounts of metagenomic sequence data have been produced as a result of the rapidly increasing efforts worldwide in studying microbial communities as a whole. Most, if not all, sequenced metagenomes are complex mixtures of chromosomal and plasmid sequence fragments from multiple organisms, possibly from different kingdoms. Computational methods for prediction of genomic elements such as genes are significantly different for chromosomes and plasmids, hence raising the need for separation of chromosomal from plasmid sequences in a metagenome. We present a program for classification of a metagenome set into chromosomal and plasmid sequences, based on their distinguishing pentamer frequencies. On a large training set consisting of all the sequenced prokaryotic chromosomes and plasmids, the program achieves ∼92% in classification accuracy. On a large set of simulated metagenomes with sequence lengths ranging from 300 bp to 100 kbp, the program has classification accuracy from 64.45% to 88.75%. On a large independent test set, the program achieves 88.29% classification accuracy.
Availability: The program has been implemented as a standalone prediction program, cBar, which is available at∼ffzhou/cBar
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2916713  PMID: 20538725
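The pentamer frequencies mentioned in the cBar abstract above can be computed as in the following minimal sketch. It covers only the feature-extraction step under simple assumptions (overlapping windows, windows containing non-ACGT symbols ignored); it is not cBar's implementation, and the classifier that consumes these features is not shown. The function name is hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=5):
    """Return relative frequencies of all 4**k DNA k-mers (pentamers by default).

    Windows containing symbols other than A, C, G, T (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return {m: 0.0 for m in kmers}
    return {m: counts[m] / total for m in kmers}
```

Each sequence fragment then becomes a fixed-length vector of 1,024 values, which is a form that a standard supervised classifier can take as input.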
10.  GASdb: a large-scale and comparative exploration database of glycosyl hydrolysis systems 
BMC Microbiology  2010;10:69.
The genomes of numerous cellulolytic organisms have recently been sequenced or are in the pipeline to be sequenced. Systematic analyses of these genomes, as well as of the recently sequenced metagenomes, could lead to discoveries of novel biomass-degradation systems in nature.
We have identified 4,679 and 49,099 free acting glycosyl hydrolases with or without carbohydrate binding domains, respectively, by scanning through all the proteins in the UniProt Knowledgebase and the JGI Metagenome database. Cellulosome components were observed only in bacterial genomes, and 166 cellulosome-dependent glycosyl hydrolases were identified. We observed, from our analysis data, unexpected wide distributions of two less well-studied bacterial glycosyl hydrolysis systems in which glycosyl hydrolases may bind to the cell surface directly rather than through linking to surface anchoring proteins, or cellulosome complexes may bind to the cell surface by novel mechanisms other than the other used SLH domains. In addition, we found that animal-gut metagenomes are substantially enriched with novel glycosyl hydrolases.
The biomass degradation systems identified through our large-scale search are organized into an easy-to-use database GASdb at, which should be useful to both experimental and computational biofuel researchers.
PMCID: PMC2838879  PMID: 20202206
11.  Genome Sequence of the Anaerobic, Thermophilic, and Cellulolytic Bacterium “Anaerocellum thermophilum” DSM 6725
Journal of Bacteriology  2009;191(11):3760-3761.
“Anaerocellum thermophilum” DSM 6725 is a strictly anaerobic bacterium that grows optimally at 75°C. It uses a variety of polysaccharides, including crystalline cellulose and untreated plant biomass, and has potential utility in biomass conversion. Here we report its complete genome sequence of 2.97 Mb, which is contained within one chromosome and two plasmids (of 8.3 and 3.6 kb). The genome encodes a broad set of cellulolytic enzymes, transporters, and pathways for sugar utilization and compared to those of other saccharolytic, anaerobic thermophiles is most similar to that of Caldicellulosiruptor saccharolyticus DSM 8903.
PMCID: PMC2681903  PMID: 19346307
12.  De novo computational prediction of non-coding RNA genes in prokaryotic genomes 
Bioinformatics  2009;25(22):2897-2905.
Motivation: The computational identification of non-coding RNA (ncRNA) genes represents one of the most important and challenging problems in computational biology. Existing methods for ncRNA gene prediction rely mostly on homology information, thus limiting their applications to ncRNA genes with known homologues.
Results: We present a novel de novo prediction algorithm for ncRNA genes using features derived from the sequences and structures of known ncRNA genes in comparison to decoys. Using these features, we have trained a neural network-based classifier and have applied it to Escherichia coli and Sulfolobus solfataricus for genome-wide prediction of ncRNAs. Our method has an average prediction sensitivity and specificity of 68% and 70%, respectively, for identifying windows with potential for ncRNA genes in E.coli. By combining windows of different sizes and using positional filtering strategies, we predicted 601 candidate ncRNAs and recovered 41% of known ncRNAs in E.coli. We experimentally investigated six novel candidates using Northern blot analysis and found expression of three candidates: one represents a potential new ncRNA, one is associated with stable mRNA decay intermediates and one is a case of either a potential riboswitch or transcription attenuator involved in the regulation of cell division. In general, our approach enables the identification of both cis- and trans-acting ncRNAs in partially or completely sequenced microbial genomes without requiring homology or structural conservation.
Availability: The source code and results are available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2773258  PMID: 19744996
13.  RepPop: a database for repetitive elements in Populus trichocarpa 
BMC Genomics  2009;10:14.
Populus trichocarpa is the first tree genome to be completed, and its whole genome is currently being assembled. No functional annotation about the repetitive elements in the Populus trichocarpa genome is currently available.
We predicted 9,623 repetitive elements in the Populus trichocarpa genome, and assigned functions to 3,075 of them (31.95%). The 9,623 repetitive elements cover ~40% of the current (partially) assembled genome. Among the 9,623 repetitive elements, 668 have copies only in the contigs that have not been assigned to one of the 19 chromosomes, while the rest all have copies in the partially assembled chromosomes.
All the predicted data are organized into an easy-to-use web-browsable database, RepPop. Various search capabilities are provided against the RepPop database. A Wiki system has been set up to facilitate functional annotation and curation of the repetitive elements by a community rather than just the database developer. The database RepPop will facilitate the assembling and functional characterization of the Populus trichocarpa genome.
PMCID: PMC2645430  PMID: 19134208
14.  Barcodes for genomes and applications 
BMC Bioinformatics  2008;9:546.
Each genome has a stable distribution of the combined frequency for each k-mer and its reverse complement measured in sequence fragments as short as 1000 bps across the whole genome, for 1
We found that for each genome, the majority of its short sequence fragments have highly similar barcodes, while sequence fragments with different barcodes typically correspond to genes that are horizontally transferred or highly expressed. This observation has led to new and more effective ways of addressing two challenging problems: metagenome binning and identification of horizontally transferred genes. Our barcode-based metagenome binning algorithm substantially improves the state of the art in terms of both binning accuracies and the scope of applicability. Other attractive properties of genome barcodes include (a) the barcodes have different and identifiable characteristics for different classes of genomes like prokaryotes, eukaryotes, mitochondria and plastids, and (b) barcode similarities are generally proportional to the genomes' phylogenetic closeness.
These and other properties of genome barcodes make them a new and effective tool for studying numerous genome and metagenome analysis problems.
PMCID: PMC2621371  PMID: 19091119
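The barcode described in the abstract above combines the frequency of each k-mer with that of its reverse complement, measured per sequence fragment. The sketch below illustrates that idea only; it is not the authors' implementation. The choice of k=4, the use of non-overlapping 1,000 bp fragments and all function names are assumptions made for illustration.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of an upper-case DNA string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Represent a k-mer and its reverse complement by one shared key."""
    return min(kmer, reverse_complement(kmer))


def fragment_barcode(fragment, k=4):
    """Combined relative frequency of each k-mer/reverse-complement pair."""
    fragment = fragment.upper()
    counts = Counter()
    for i in range(len(fragment) - k + 1):
        kmer = fragment[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip windows with ambiguous bases
            counts[canonical(kmer)] += 1
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()} if total else {}


def genome_barcode(sequence, k=4, fragment_length=1000):
    """One barcode row per consecutive, non-overlapping fragment."""
    last_start = len(sequence) - fragment_length
    return [fragment_barcode(sequence[s:s + fragment_length], k)
            for s in range(0, last_start + 1, fragment_length)]
```

Stacking the rows returned by `genome_barcode` gives a matrix with one row per fragment, so fragments whose rows differ markedly from the rest of the genome can be flagged for closer inspection.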
BMC Genomics  2008;9:36.
Mobile genetic elements (MGEs) play an essential role in genome rearrangement and evolution, and are widely used as an important genetic tool.
In this article, we present genetic maps of recently active Insertion Sequence (IS) elements, the simplest form of MGEs, for all sequenced cyanobacteria and archaea, predicted based on the previously identified ~1,500 IS elements. Our predicted IS maps are consistent with the NCBI annotations of the IS elements. By linking the predicted IS elements to various characteristics of the organisms under study and the organism's living conditions, we found that (a) the activities of IS elements heavily depend on the environments where the host organisms live; (b) the number of recently active IS elements in a genome tends to increase with the genome size; (c) the flanking regions of the recently active IS elements are significantly enriched with genes encoding DNA binding factors, transporters and enzymes; and (d) IS movements show no tendency to disrupt operonic structures.
These are the first genome-scale maps of IS elements with detailed structural information at the sequence level. These genetic maps of recently active IS elements, together with the observations above, should help to improve our understanding of how IS elements proliferate and how they are involved in the evolution of the host genomes.
PMCID: PMC2246112  PMID: 18218090
Nucleic Acids Research  2006;34(Web Server issue):W254-W257.
Systematic dissection of the sumoylation proteome is emerging as an appealing but challenging research topic because of the significant roles sumoylation plays in cellular dynamics and plasticity. Although several proteome-scale analyses have been performed to delineate potential sumoylatable proteins, the bona fide sumoylation sites still remain to be identified. Previously, we carried out a genome-wide analysis of the SUMO substrates in human nucleus using the putative motif ψ-K-X-E and evolutionary conservation. However, a highly specific predictor for in silico prediction of sumoylation sites in any individual organism is still urgently needed to guide experimental design. In this work, we present a computational system, SUMOsp (SUMOylation Sites Prediction), based on a manually curated dataset, integrating the results of two methods, GPS and MotifX, which were originally designed for phosphorylation site prediction. SUMOsp offers at least as good prediction performance as the only available method, SUMOplot, on a very large test set. We expect that the prediction results of SUMOsp combined with experimental verifications will propel our understanding of sumoylation mechanisms to a new level. SUMOsp has been implemented on a freely accessible web server at: .
PMCID: PMC1538802  PMID: 16845005
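The putative motif ψ-K-X-E mentioned in the abstract above can be scanned for with a short regular expression. This sketch illustrates only that consensus-motif scan, not the SUMOsp predictor, which combines GPS and MotifX. The residue set used for ψ (a hydrophobic residue) is an assumption, and the function name is hypothetical.

```python
import re

# Assumption: psi is taken to be one of these hydrophobic residues.
HYDROPHOBIC = "AILMFV"

# The lookahead allows overlapping matches to be reported.
MOTIF = re.compile(rf"(?=([{HYDROPHOBIC}]K[A-Z]E))")


def find_consensus_sites(protein):
    """Return (1-based lysine position, matched tetrapeptide) per motif hit."""
    protein = protein.upper()
    return [(m.start() + 2, m.group(1)) for m in MOTIF.finditer(protein)]
```

For example, `find_consensus_sites("MSLKDEAAVKTE")` reports the lysines at positions 4 and 10, inside the tetrapeptides LKDE and VKTE.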
Nucleic Acids Research  2005;33(Web Server issue):W184-W187.
Protein phosphorylation plays a fundamental role in most of the cellular regulatory pathways. Experimental identification of protein kinases' (PKs) substrates with their phosphorylation sites is labor-intensive and often limited by the availability and optimization of enzymatic reactions. Recently, large-scale analysis of the phosphoproteome by mass spectrometry (MS) has become a popular approach, but it is still difficult to distinguish the kinase-specific sites on the substrates experimentally. In this regard, the in silico prediction of phosphorylation sites with their specific kinases from proteins' primary sequences may provide guidelines for further experimental consideration and interpretation of MS phosphoproteomic data. A variety of such tools exist on the Internet and provide predictions for at most 30 PK subfamilies. We downloaded the verified phosphorylation sites from the public databases and curated the literature extensively for recently found phosphorylation sites. With the hypothesis that PKs in the same subfamily share similar consensus sequences/motifs/functional patterns on substrates, we clustered the 216 unique PKs into 71 PK groups, according to the BLAST results and protein annotations. Then, we applied the group-based phosphorylation scoring (GPS) method to the data set; here, we present a comprehensive PK-specific prediction server, GPS, which can predict kinase-specific phosphorylation sites from protein primary sequences for 71 different PK groups. GPS has been implemented in PHP and is available on a www server at .
PMCID: PMC1160154  PMID: 15980451
