The GPCR genes have a variety of exon-intron structures even though their proteins are all structurally homologous. We have examined all human GPCR genes with at least two functional protein isoforms, totaling 199, aiming to gain an understanding of what may have contributed to the large diversity of the exon-intron structures of the GPCR genes. The 199 genes have a total of 808 known protein splicing isoforms with experimentally verified functions. Our analysis reveals that 1,301 (80.6%) adjacent exon-exon pairs out of the total of 1,613 in the 199 genes have either exactly one exon skipped or the intron in-between retained in at least one of the 808 protein splicing isoforms. This observation has a statistical significance p-value of 2.051762e−09, assuming that the observed splicing isoforms are independent of the exon-intron structures. Our interpretation of this observation is that the exon boundaries of the GPCR genes are not randomly determined; instead they may be selected to facilitate specific alternative splicing for functional purposes.
PMCID: PMC3905620  PMID: 24467758
gene structure; alternative splicing; G protein-coupled receptor; GPCR; exon-intron structures
2.  Barcode Server: A Visualization-Based Genome Analysis System 
PLoS ONE  2013;8(2):e56726.
We have previously developed a computational method for representing a genome as a barcode image, which makes various genomic features visually apparent. We have demonstrated that this visual capability has made some challenging genome analysis problems relatively easy to solve. We have applied this capability to a number of challenging problems, including (a) identification of horizontally transferred genes, (b) identification of genomic islands with special properties and (c) binning of metagenomic sequences, and achieved highly encouraging results. These application results inspired us to develop this barcode-based genome analysis server for public service, which supports the following capabilities: (a) calculation of the k-mer based barcode image for a provided DNA sequence; (b) detection of sequence fragments in a given genome with distinct barcodes from those of the majority of the genome, (c) clustering of provided DNA sequences into groups having similar barcodes; and (d) homology-based search using Blast against a genome database for any selected genomic regions deemed to have interesting barcodes. The barcode server provides a job management capability, allowing processing of a large number of analysis jobs for barcode-based comparative genome analyses. The barcode server is accessible at
PMCID: PMC3574017  PMID: 23457606
4.  Optimal Mutation Sites for PRE Data Collection and Membrane Protein Structure Prediction 
NMR paramagnetic relaxation enhancement (PRE) measures long-range distances to isotopically labeled residues, providing useful constraints for protein structure prediction. The method usually requires labor-intensive conjugation of nitroxide labels to multiple locations on the protein, one at a time. Here a computational procedure, based on protein sequence and simple secondary structure models, is presented to facilitate optimal placement of the minimum number of labels needed to determine the correct topology of a helical transmembrane protein. A test on DsbB (4 helices) using just one label leads to correct topology prediction in four of five cases, with the predicted structures <6 Å from the native structure. Benchmark results using simulated PRE data show that we can generally predict the correct topology for five and six-to-seven helices using two and three labels, respectively, with an average success rate of 76% and structures of similar precision, showing promise for facilitating experimentally constrained structure prediction of membrane proteins.
PMCID: PMC3099474  PMID: 21481772
transmembrane helical protein; helix packing topology; solution NMR; paramagnetic relaxation enhancement; distance geometry; structure prediction
5.  Genomic Arrangement of Regulons in Bacterial Genomes 
PLoS ONE  2012;7(1):e29496.
Regulons, as groups of transcriptionally co-regulated operons, are the basic units of cellular response systems in bacterial cells. While the concept has been widely used in bacterial studies since it was first proposed in 1964, very little is known about how a regulon's component operons are arranged in a bacterial genome. We present a computational study to elucidate the organizational principles of regulons in a bacterial genome, based on the experimentally validated regulons of E. coli and B. subtilis. Our results indicate that (1) genomic locations of transcription factors (TFs) are under stronger evolutionary constraints than those of the operons they regulate, so changing a TF's genomic location will have a larger impact on the bacterium than changing the genomic position of any of its target operons; (2) operons of regulons are generally not uniformly distributed in the genome but tend to form a few closely located clusters, which generally consist of genes working in the same metabolic pathways; and (3) the global arrangement of the component operons of all the regulons in a genome tends to minimize a simple scoring function, indicating that the global arrangement of regulons follows simple organizational principles.
PMCID: PMC3250446  PMID: 22235300
6.  Computational prediction and experimental validation of novel markers for detection of STEC O157:H7 
AIM: To identify and assess novel markers for detection of Shiga toxin producing Escherichia coli (STEC) O157:H7 with an integrated computational and experimental approach.
METHODS: High-throughput NCBI BLAST (E-value cutoff e-5) was used to search all sequenced prokaryotic genomes for homologs of each gene encoded in each of the three strains of STEC O157:H7 with complete genomes, aiming to find genes unique to O157:H7 as its potential markers. To ensure that the identified markers from the three strains of STEC O157:H7 can serve as general markers for all the STEC O157:H7 strains, a genomic barcode approach was used to select the markers to minimize the possibility of choosing a marker gene that is part of a transposable element. Effectiveness of the predicted markers was then validated by running polymerase chain reaction (PCR) on 18 strains of O157:H7, with 5 additional genomes used as negative controls.
RESULTS: The BLAST search identified 20, 16 and 20 genes, respectively, in the three sequenced strains of STEC O157:H7, which had no homologs in any of the other prokaryotic genomes. Three genes, wzy, Z0372 and Z0344, common to the three gene lists, were selected based on the genomic barcode approach. PCR showed an identification accuracy of 100% on the 18 tested strains and the 5 controls.
CONCLUSION: The three identified novel markers, wzy, Z0372 and Z0344, are highly promising for the detection of STEC O157:H7, complementary to the known markers.
PMCID: PMC3080728  PMID: 21528067
Shiga toxin producing Escherichia coli O157:H7; Diagnosis; Marker genes; Infectious diseases
7.  A Comparative Analysis of Gene-Expression Data of Multiple Cancer Types 
PLoS ONE  2010;5(10):e13696.
A comparative study of public gene-expression data of seven types of cancers (breast, colon, kidney, lung, pancreatic, prostate and stomach cancers) was conducted with the aim of deriving marker genes, along with associated pathways, that are either common to multiple types of cancers or specific to individual cancers. The analysis results indicate that (a) each of the seven cancer types can be distinguished from its corresponding control tissue based on the expression patterns of a small number of genes, e.g., 2, 3 or 4; (b) the expression patterns of some genes can distinguish multiple cancer types from their corresponding control tissues, potentially serving as general markers for all or some groups of cancers; (c) the proteins encoded by some of these genes are predicted to be blood secretory, thus providing potential cancer markers in blood; (d) the numbers of differentially expressed genes across different cancer types in comparison with their control tissues correlate well with the five-year survival rates associated with the individual cancers; and (e) some metabolic and signaling pathways are abnormally activated or deactivated across all cancer types, while other pathways are more specific to certain cancers or groups of cancers. The novel findings of this study offer considerable insight into these seven cancer types and have the potential to provide exciting new directions for diagnostic and therapeutic development.
PMCID: PMC2965162  PMID: 21060876
8.  An integrated transcriptomic and computational analysis for biomarker identification in gastric cancer 
Nucleic Acids Research  2010;39(4):1197-1207.
This report describes an integrated study on identification of potential markers for gastric cancer in patients’ cancer tissues and sera based on: (i) genome-scale transcriptomic analyses of 80 paired gastric cancer/reference tissues and (ii) computational prediction of blood-secretory proteins supported by experimental validation. Our findings show that: (i) 715 and 150 genes exhibit significantly differential expressions in all cancers and early-stage cancers versus reference tissues, respectively; and a substantial percentage of the alteration is found to be influenced by age and/or by gender; (ii) 21 co-expressed gene clusters have been identified, some of which are specific to certain subtypes or stages of the cancer; (iii) the top-ranked gene signatures give better than 94% classification accuracy between cancer and the reference tissues, some of which are gender-specific; and (iv) 136 of the differentially expressed genes were predicted to have their proteins secreted into blood, 81 of which were detected experimentally in the sera of 13 validation samples and 29 found to have differential abundances in the sera of cancer patients versus controls. Overall, the novel information obtained in this study has led to identification of promising diagnostic markers for gastric cancer and can benefit further analyses of the key (early) abnormalities during its development.
PMCID: PMC3045610  PMID: 20965966
9.  Computational prediction of the osmoregulation network in Synechococcus sp. WH8102 
BMC Genomics  2010;11:291.
Osmotic stress is caused by sudden changes in the impermeable solute concentration around a cell, which induces instantaneous water flow in or out of the cell to balance the concentration. Very little is known about the detailed response mechanism to osmotic stress in marine Synechococcus, one of the major oxygenic phototrophic cyanobacterial genera that contribute greatly to the global CO2 fixation.
We present here a computational study of the osmoregulation network in response to hyperosmotic stress of Synechococcus sp. strain WH8102 using comparative genome analyses and computational prediction. In this study, we identified the key transporters, synthetases, signal sensor proteins and transcriptional regulator proteins, and found experimentally that 15 of the genes encoding these proteins showed significantly changed expression levels under a mild hyperosmotic stress.
From the predicted network model, we have made a number of interesting observations about WH8102. Specifically, we found that (i) the organism likely uses glycine betaine as the major osmolyte, and others such as glucosylglycerol, glucosylglycerate, trehalose, sucrose and arginine as the minor osmolytes, making it efficient and adaptable to its changing environment; and (ii) σ38, one of the seven types of σ factors, probably serves as a global regulator coordinating the osmoregulation network and the other relevant networks.
PMCID: PMC2874817  PMID: 20459751
10.  Barcodes for genomes and applications 
BMC Bioinformatics  2008;9:546.
Each genome has a stable distribution of the combined frequency for each k-mer and its reverse complement measured in sequence fragments as short as 1000 bps across the whole genome, for 1
We found that for each genome, the majority of its short sequence fragments have highly similar barcodes, while sequence fragments with different barcodes typically correspond to genes that are horizontally transferred or highly expressed. This observation has led to new and more effective ways of addressing two challenging problems: metagenome binning and identification of horizontally transferred genes. Our barcode-based metagenome binning algorithm substantially improves the state of the art in terms of both binning accuracy and scope of applicability. Other attractive properties of genome barcodes include (a) the barcodes have different and identifiable characteristics for different classes of genomes such as prokaryotes, eukaryotes, mitochondria and plastids, and (b) barcode similarities are generally proportional to the genomes' phylogenetic closeness.
These and other properties of genome barcodes make them a new and effective tool for studying numerous genome and metagenome analysis problems.
PMCID: PMC2621371  PMID: 19091119
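The barcode idea described in the abstract above can be illustrated with a minimal sketch. It assumes non-overlapping 1000-bp fragments and k = 4 by default; the function names (reverse_complement, canonical_kmers, barcode) and the choice of the lexicographically smaller string as the representative of each k-mer/reverse-complement pair are illustrative assumptions, not the authors' implementation.

```python
from itertools import product

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of a DNA string."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical_kmers(k):
    """One representative per {k-mer, reverse complement} pair, sorted."""
    reps = set()
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        reps.add(min(kmer, reverse_complement(kmer)))
    return sorted(reps)


def barcode(sequence, k=4, fragment_len=1000):
    """Return (columns, rows): one row of combined k-mer frequencies
    per non-overlapping fragment (hypothetical helper, for illustration)."""
    columns = canonical_kmers(k)
    index = {kmer: i for i, kmer in enumerate(columns)}
    rows = []
    for start in range(0, len(sequence) - fragment_len + 1, fragment_len):
        fragment = sequence[start:start + fragment_len].upper()
        counts = [0] * len(columns)
        total = 0
        for i in range(len(fragment) - k + 1):
            kmer = fragment[i:i + k]
            rep = min(kmer, reverse_complement(kmer))
            if rep in index:  # windows containing N or other symbols are skipped
                counts[index[rep]] += 1
                total += 1
        rows.append([c / total if total else 0.0 for c in counts])
    return columns, rows
```

Each row can then be rendered as one line of a grey-scale image, so fragments whose rows differ visibly from the rest of the genome stand out; how the published method scales or orders the columns is not stated here and is left open.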
Nucleic Acids Research  2008;37(Database issue):D459-D463.
We present a database DOOR (Database for prOkaryotic OpeRons) containing computationally predicted operons of all the sequenced prokaryotic genomes. All the operons in DOOR are predicted using our own prediction program, which was ranked to be the best among 14 operon prediction programs by a recent independent review. Currently, the DOOR database contains operons for 675 prokaryotic genomes, and supports a number of search capabilities to facilitate easy access and utilization of the information stored in it.
Querying the database: the database provides a search capability for a user to find desired operons and associated information through multiple querying methods.
Searching for similar operons: the database provides a search capability for a user to find operons that have similar composition and structure to a query operon.
Prediction of cis-regulatory motifs: the database provides a capability for motif identification in the promoter regions of a user-specified group of possibly coregulated operons, using motif-finding tools.
Operons for RNA genes: the database includes operons for RNA genes.
OperonWiki: the database provides a wiki page (OperonWiki) to facilitate interactions between users and the developer of the database.
We believe that DOOR provides a useful resource to many biologists working on bacteria and archaea, which can be accessed at
PMCID: PMC2686520  PMID: 18988623
BMC Genomics  2008;9:36.
Mobile genetic elements (MGEs) play an essential role in genome rearrangement and evolution, and are widely used as an important genetic tool.
In this article, we present genetic maps of recently active Insertion Sequence (IS) elements, the simplest form of MGEs, for all sequenced cyanobacteria and archaea, predicted based on the previously identified ~1,500 IS elements. Our predicted IS maps are consistent with the NCBI annotations of the IS elements. By linking the predicted IS elements to various characteristics of the organisms under study and the organism's living conditions, we found that (a) the activities of IS elements heavily depend on the environments where the host organisms live; (b) the number of recently active IS elements in a genome tends to increase with the genome size; (c) the flanking regions of the recently active IS elements are significantly enriched with genes encoding DNA binding factors, transporters and enzymes; and (d) IS movements show no tendency to disrupt operonic structures.
These are the first genome-scale maps of IS elements with detailed structural information at the sequence level. These genetic maps of recently active IS elements and the several interesting observations would help to improve our understanding of how IS elements proliferate and how they are involved in the evolution of the host genomes.
PMCID: PMC2246112  PMID: 18218090
BMC Genomics  2007;8:156.
Phosphorus is an essential element for all life forms. However, it is limiting in most ecological environments that cyanobacteria inhabit. Elucidation of the phosphorus assimilation pathways in cyanobacteria will further our understanding of the physiology and ecology of this important group of microorganisms. However, a systematic study of the Pho regulon, the core of the phosphorus assimilation pathway in a cyanobacterium, is hitherto lacking.
We have predicted and analyzed the Pho regulons in 19 sequenced cyanobacterial genomes using a highly effective scanning algorithm that we have previously developed. Our results show that different cyanobacterial species/ecotypes may encode diverse sets of genes responsible for the utilization of various sources of phosphorus, ranging from inorganic phosphate and phosphodiesters to phosphonates. Unlike in E. coli, some cyanobacterial genes that are directly involved in phosphorus assimilation seem not to be under the regulation of the regulator SphR (orthologue of PhoB in E. coli) in some species/ecotypes. On the other hand, SphR binding sites are found for genes known to play important roles in other biological processes. These genes might serve as bridging points to coordinate phosphorus assimilation and other biological processes. More interestingly, in three cyanobacterial genomes where no sphR gene is encoded, our results show that there is virtually no functional SphR binding site, suggesting that transcription regulators probably play an important role in retaining their binding sites.
The Pho regulons in cyanobacteria are highly diversified to adapt to their respective living environments. The phosphorus assimilation pathways in cyanobacteria are probably tightly coupled to a number of other important biological processes. The loss of a regulator may lead to the rapid loss of its binding sites in a genome.
PMCID: PMC1906773  PMID: 17559671
Nucleic Acids Research  2007;35(7):2125-2140.
Functional classification of genes represents a fundamental problem to many biological studies. Most of the existing classification schemes are based on the concepts of homology and orthology, which were originally introduced to study gene evolution but might not be the most appropriate for gene function prediction, particularly at a high resolution level. We have recently developed a scheme for hierarchical classification of genes (HCGs) in prokaryotes. In the HCG scheme, the functional equivalence relationships among genes are first assessed through a careful application of both sequence similarity and genomic neighborhood information; and genes are then classified into a hierarchical structure of clusters, where genes in each cluster are functionally equivalent at some resolution level, and the level of resolution goes higher as the clusters become increasingly smaller traveling down the hierarchy. The HCG scheme is validated through comparisons with the taxonomy of the prokaryotic genomes, Clusters of Orthologous Groups (COGs) of genes and the Pfam system. We have applied the HCG scheme to 224 complete prokaryotic genomes, and constructed an HCG database consisting of a forest of 5339 multi-level and 15 770 single-level trees of gene clusters covering ∼93% of the genes of these 224 genomes. The validation results indicate that the HCG scheme not only captures the key features of the existing classification schemes but also provides a much richer organization of genes which can be used for functional prediction of genes at higher resolution and to help reveal the evolutionary trace of the genes.
PMCID: PMC1874638  PMID: 17353185
Nucleic Acids Research  2006;35(1):288-298.
We have carried out a systematic analysis of the contribution of a set of selected features that include three new features to the accuracy of operon prediction. Our analyses have led to a number of new insights about operon prediction, including that (i) different features have different levels of discerning power when used on adjacent gene pairs with different ranges of intergenic distance, (ii) certain features are universally useful for operon prediction while others are more genome-specific and (iii) the prediction reliability of operons is dependent on intergenic distances. Based on these new insights, our newly developed operon-prediction program achieves more accurate operon prediction than the previous ones, and it uses features that are most readily available from genomic sequences. Our prediction results indicate that our (non-linear) decision tree-based classifier can predict operons in a prokaryotic genome very accurately when a substantial number of operons in the genome are already known. For example, the prediction accuracy of our program can reach 90.2 and 93.7% on Bacillus subtilis and Escherichia coli genomes, respectively. When no such information is available, our (linear) logistic function-based classifier can reach the prediction accuracy at 84.6 and 83.3% for E.coli and B.subtilis, respectively.
PMCID: PMC1802555  PMID: 17170009
Nucleic Acids Research  2006;34(3):1050-1065.
Deciphering the regulatory networks encoded in the genome of an organism represents one of the most interesting and challenging tasks in the post-genome sequencing era. As an example of this problem, we have predicted a detailed model for the nitrogen assimilation network in cyanobacterium Synechococcus sp. WH 8102 (WH8102) using a computational protocol based on comparative genomics analysis and mining experimental data from related organisms that are relatively well studied. This computational model is in excellent agreement with the microarray gene expression data collected under ammonium-rich versus nitrate-rich growth conditions, suggesting that our computational protocol is capable of predicting biological pathways/networks with high accuracy. We then refined the computational model using the microarray data, and proposed a new model for the nitrogen assimilation network in WH8102. An intriguing discovery from this study is that nitrogen assimilation affects the expression of many genes involved in photosynthesis, suggesting a tight coordination between nitrogen assimilation and photosynthesis processes. Moreover, for some of these genes, this coordination is probably mediated by NtcA through the canonical NtcA promoters in their regulatory regions.
PMCID: PMC1363776  PMID: 16473855
Nucleic Acids Research  2005;33(16):5156-5171.
We have developed a new method for prediction of cis-regulatory binding sites and applied it to predicting NtcA regulated genes in cyanobacteria. The algorithm rigorously utilizes concurrence information of multiple binding sites in the upstream region of a gene and that in the upstream regions of its orthologues in related genomes. A probabilistic model was developed for the evaluation of prediction reliability so that the prediction false positive rate could be well controlled. Using this method, we have predicted multiple new members of the NtcA regulons in nine sequenced cyanobacterial genomes, and showed that the false positive rates of the predictions have been reduced by an average of 40-fold compared to the conventional methods. A detailed analysis of the predictions in each genome showed that a significant portion of our predictions are consistent with previously published results about individual genes. Intriguingly, NtcA promoters are found for many genes involved in various stages of photosynthesis. Although photosynthesis is known to be tightly coordinated with nitrogen assimilation, very little is known about the underlying mechanism. We postulate for the first time that these genes serve as the regulatory points to orchestrate these two important processes in a cyanobacterial cell.
PMCID: PMC1214546  PMID: 16157864
Nucleic Acids Research  2005;33(9):2822-2837.
We present a computational method for the prediction of functional modules encoded in microbial genomes. In this work, we have also developed a formal measure to quantify the degree of consistency between the predicted and the known modules, and have carried out statistical significance analysis of consistency measures. We first evaluate the functional relationship between two genes from three different perspectives—phylogenetic profile analysis, gene neighborhood analysis and Gene Ontology assignments. We then combine the three different sources of information in the framework of Bayesian inference, and we use the combined information to measure the strength of gene functional relationship. Finally, we apply a threshold-based method to predict functional modules. By applying this method to Escherichia coli K12, we have predicted 185 functional modules. Our predictions are highly consistent with the previously known functional modules in E.coli. The application results have demonstrated that our approach is highly promising for the prediction of functional modules encoded in a microbial genome.
PMCID: PMC1130488  PMID: 15901854
Nucleic Acids Research  2004;32(2):551-561.
Residual dipolar coupling (RDC) represents one of the most exciting emerging NMR techniques for protein structure studies. However, solving a protein structure using RDC data alone is still a highly challenging problem. We report here a computer program, RDC-PROSPECT, for protein structure prediction based on a structural homolog or analog of the target protein in the Protein Data Bank (PDB), which best aligns with the 15N–1H RDC data of the protein recorded in a single ordering medium. Since RDC-PROSPECT uses only RDC data and predicted secondary structure information, its performance is virtually independent of sequence similarity between a target protein and its structural homolog/analog, making it applicable to protein targets beyond the scope of current protein threading techniques. We have tested RDC-PROSPECT on all 15N–1H RDC data (representing 43 proteins) deposited in the BioMagResBank (BMRB) database. The program correctly identified structural folds for 83.7% of the target proteins, and achieved an average alignment accuracy of 98.1% residues within a four-residue shift.
PMCID: PMC373331  PMID: 14744980
Nucleic Acids Research  2003;31(19):5582-5589.
Massive amounts of gene expression data are generated using microarrays for functional studies of genes, and gene expression data clustering is a useful tool for studying the functional relationship among genes in a biological process. We have developed a computer package EXCAVATOR for clustering gene expression profiles based on our new framework for representing gene expression data as a minimum spanning tree. EXCAVATOR uses a number of rigorous and efficient clustering algorithms. This program has a number of unique features, including capabilities for: (i) data-constrained clustering; (ii) identification of genes with similar expression profiles to pre-specified seed genes; (iii) cluster identification from a noisy background; (iv) computational comparison between different clustering results of the same data set. EXCAVATOR can be run from a Unix/Linux/DOS shell, from a Java interface or from a Web server. The clustering results can be visualized as colored figures and 2-dimensional plots. Moreover, EXCAVATOR provides a wide range of options for data formats, distance measures, objective functions, clustering algorithms, methods to choose the number of clusters, etc. The effectiveness of EXCAVATOR has been demonstrated on several experimental data sets. Its performance compares favorably against the popular K-means clustering method in terms of clustering quality and computing time.
PMCID: PMC206478  PMID: 14500821
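The general idea of clustering through a minimum spanning tree, mentioned in the abstract above, can be sketched as follows. This is a simplified illustration and not EXCAVATOR's algorithm: it assumes Euclidean distance between expression profiles, builds the tree with Prim's algorithm, and forms clusters by removing the longest tree edges. The function names minimum_spanning_tree and mst_clusters are hypothetical.

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph.
    Returns a list of (weight, i, j) tree edges."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known connection to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the closest vertex that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges


def mst_clusters(points, num_clusters):
    """Cut the (num_clusters - 1) longest MST edges and label the
    resulting connected components with integers 0, 1, 2, ..."""
    edges = sorted(minimum_spanning_tree(points))
    kept = edges[:max(0, len(edges) - (num_clusters - 1))]
    root = list(range(len(points)))  # union-find forest over kept edges

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x

    for _, i, j in kept:
        root[find(i)] = find(j)
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(points))]
```

For example, mst_clusters(profiles, 3) would split a list of equal-length numeric profiles into three groups. Cutting the longest edges is only one possible objective; the abstract notes that the package offers several objective functions and distance measures, which this sketch does not attempt to reproduce.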
