
Results 1-18 (18)

1.  Genomic heterogeneity of multiple synchronous lung cancer 
Nature Communications  2016;7:13200.
Multiple synchronous lung cancers (MSLCs) present a clinical dilemma as to whether individual tumours represent intrapulmonary metastases or independent tumours. In this study we analyse genomic profiles of 15 lung adenocarcinomas and one regional lymph node metastasis from 6 patients with MSLC. All 15 lung tumours demonstrate distinct genomic profiles, suggesting all are independent primary tumours, which are consistent with comprehensive histopathological assessment in 5 of the 6 patients. Lung tumours of the same individuals are no more similar to each other than are lung adenocarcinomas of different patients from TCGA cohort matched for tumour size and smoking status. Several known cancer-associated genes have different mutations in different tumours from the same patients. These findings suggest that in the context of identical constitutional genetic background and environmental exposure, different lung cancers in the same individual may have distinct genomic profiles and can be driven by distinct molecular events.
Some patients present with multiple lung tumours but it is unclear whether these are metastases or individual lesions. Here, the authors use genomics techniques to demonstrate in six patients that multiple tumours have individual genetic profiles and represent separate tumours.
PMCID: PMC5078731  PMID: 27767028
2.  Comprehensive characterization of the genomic alterations in human gastric cancer 
International journal of cancer  2014;137(1):86-95.
Gastric cancer is one of the most prevalent and aggressive cancers worldwide, and its molecular mechanism remains largely elusive. Here we report the genomic landscape of human primary gastric adenocarcinoma, based on the complete genome sequences of five pairs of cancer and matching normal samples. In total, 103,464 somatic point mutations, including 407 nonsynonymous ones, were identified, and the most recurrent mutations were harbored by mucins (MUC3A and MUC12) and transcription factors (ZNF717, ZNF595 and TP53). We detected 679 genomic rearrangements, which affect 355 protein-coding genes, and 76 genes show copy number changes. By mapping the boundaries of the rearranged regions to the folded three-dimensional structure of human chromosomes, we determined that 79.6% of the chromosomal rearrangements happen among DNA fragments in close spatial proximity, especially when the two endpoints are in a similar replication phase. We present evidence that microhomology-mediated break-induced replication was a mechanism inducing ~40.9% of the identified genomic changes in gastric tumors. Our data analyses revealed potential integrations of Helicobacter pylori DNA into the gastric cancer genomes. Overall, a large set of novel genomic variations was detected in these gastric cancer genomes, which may be essential to the study of the genetic basis and molecular mechanism of gastric tumorigenesis.
PMCID: PMC4776643  PMID: 25422082
gastric cancer; next-generation sequencing; bioinformatics; genomic variations; cancer mutations
3.  Revisiting operons: an analysis of the landscape of transcriptional units in E. coli 
BMC Bioinformatics  2015;16:356.
Bacterial operons are considerably more complex than previously thought. In particular, their components are defined dynamically rather than statically, as had been assumed. Here we present a computational study of the landscape of the transcriptional units (TUs) of E. coli K12, revealed by the available genomic and transcriptomic data, providing a new understanding of the complexity of the TUs encoded in the genome of E. coli K12 as a whole.
Results and conclusion
Our main findings include that (i) different TUs may overlap with each other by sharing common genes, giving rise to clusters of overlapped TUs (TUCs) along the genomic sequence; (ii) the intergenic regions in front of the first gene of each TU tend to have more conserved sequence motifs than those of the other genes inside the TU, suggesting that TUs each have their own promoters; (iii) the terminators associated with the 3’ ends of TUCs tend to be Rho-independent terminators, substantially more often than terminators of TUs that end inside a TUC; and (iv) the functional relatedness of adjacent gene pairs in individual TUs is higher than those in TUCs, suggesting that individual TUs are more basic functional units than TUCs.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-015-0805-8) contains supplementary material, which is available to authorized users.
PMCID: PMC4634151  PMID: 26538447
Operon; Transcriptional unit; Promoter; Terminator; Bacteria
4.  DMINDA: an integrated web server for DNA motif identification and analyses 
Nucleic Acids Research  2014;42(Web Server issue):W12-W19.
DMINDA (DNA motif identification and analyses) is an integrated web server for DNA motif identification and analyses, which is accessible at This web site is freely available to all users and there is no login requirement. This server provides a suite of cis-regulatory motif analysis functions on DNA sequences, which are important to elucidation of the mechanisms of transcriptional regulation: (i) de novo motif finding for a given set of promoter sequences along with statistical scores for the predicted motifs derived based on information extracted from a control set, (ii) scanning motif instances of a query motif in provided genomic sequences, (iii) motif comparison and clustering of identified motifs, and (iv) co-occurrence analyses of query motifs in given promoter sequences. The server is powered by a backend computer cluster with over 150 computing nodes, and is particularly useful for motif prediction and analyses in prokaryotic genomes. We believe that DMINDA, as a new and comprehensive web server for cis-regulatory motif finding and analyses, will benefit the genomic research community in general and prokaryotic genome researchers in particular.
PMCID: PMC4086085  PMID: 24753419
5.  DOOR 2.0: presenting operons and their functions through dynamic and integrated views 
Nucleic Acids Research  2013;42(Database issue):D654-D659.
We have recently developed a new version of the DOOR operon database, DOOR 2.0, which is available online at and will be updated on a regular basis. DOOR 2.0 contains genome-scale operons for 2072 prokaryotes with complete genomes, three times the number of genomes covered in the previous version published in 2009. DOOR 2.0 has a number of new features, compared with its previous version, including (i) more than 250 000 transcription units, experimentally validated or computationally predicted based on RNA-seq data, providing a dynamic functional view of the underlying operons; (ii) an integrated operon-centric data resource that provides not only operons for each covered genome but also their functional and regulatory information such as their cis-regulatory binding sites for transcription initiation and termination, gene expression levels estimated based on RNA-seq data and conservation information across multiple genomes; (iii) a high-performance web service for online operon prediction on user-provided genomic sequences; (iv) an intuitive genome browser to support visualization of user-selected data; and (v) a keyword-based Google-like search engine for finding the needed information intuitively and rapidly in this database.
PMCID: PMC3965076  PMID: 24214966
6.  Elucidation of How Cancer Cells Avoid Acidosis through Comparative Transcriptomic Data Analysis 
PLoS ONE  2013;8(8):e71177.
The rapid growth of cancer cells fueled by glycolysis produces large amounts of protons in cancer cells, which trigger mechanisms to transport them out, hence leading to increased acidity in their extracellular environments. It has been well established that the increased acidity will induce cell death of normal cells but not cancer cells. The main question we address here is how cancer cells deal with the increased acidity to avoid the activation of apoptosis. We have carried out a comparative analysis of transcriptomic data of six solid cancer types, breast, colon, liver, two lung (adenocarcinoma, squamous cell carcinoma) and prostate cancers, and proposed a model of how cancer cells utilize a few mechanisms to keep the protons outside of the cells. The model consists of a number of previously studied (well or partially) mechanisms for transporting out the excess protons, such as through the monocarboxylate transporters, V-ATPases, NHEs and the one facilitated by carbonic anhydrases. In addition, we propose a new mechanism that neutralizes protons through the conversion of glutamate to γ-aminobutyrate, which consumes one proton per reaction. We hypothesize that these processes are regulated by cancer-related conditions such as hypoxia and growth factors and by the pH levels, making these encoded processes not available to normal cells under acidic conditions.
PMCID: PMC3743895  PMID: 23967163
7.  CINPER: An Interactive Web System for Pathway Prediction for Prokaryotes 
PLoS ONE  2012;7(12):e51252.
We present a web-based network-construction system, CINPER (CSBL INteractive Pathway BuildER), to assist a user in building a user-specified gene network for a prokaryotic organism in an intuitive manner. CINPER builds a network model based on different types of information provided by the user and stored in the system. CINPER’s prediction process has four steps: (i) collection of template networks based on (partially) known pathways of related organism(s) from the SEED or BioCyc database and the published literature; (ii) construction of an initial network model based on the template networks using the P-Map program; (iii) expansion of the initial model, based on the association information derived from operons, protein-protein interactions, co-expression modules and phylogenetic profiles; and (iv) computational validation of the predicted models based on gene expression data. To facilitate easy application, CINPER provides an interactive visualization environment for a user to enter, search and edit relevant data and for the system to display (partial) results and prompt for additional data. Evaluation of CINPER on 17 well-studied pathways in the MetaCyc database shows that the program achieves an average recall rate of 76% and an average precision rate of 90% on the initial models, and a higher average recall rate of 87% and an average precision rate of 28% on the final models. The reduced precision rate in the final models versus the initial models reflects the reality that the final models have large numbers of novel genes that have no experimental evidence and hence are not yet collected in the MetaCyc database. To demonstrate the usefulness of this server, we have predicted an iron homeostasis gene network of Synechocystis sp. PCC6803 using the server. The predicted models along with the server can be accessed at
PMCID: PMC3517448  PMID: 23236458
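The recall and precision figures quoted in entry 7 can be illustrated with a minimal sketch. It assumes a predicted pathway and a curated reference pathway are each represented as a set of gene identifiers; the function name and the gene names are hypothetical and are not taken from CINPER itself.

```python
def precision_recall(predicted, reference):
    """Return (precision, recall) of a predicted gene set against a reference set."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    # Precision: fraction of predicted genes that are in the reference pathway.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of reference genes that the prediction recovered.
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical example: the prediction recovers 3 of 4 curated genes
# but also proposes 3 genes that are absent from the curated pathway.
reference = {"geneA", "geneB", "geneC", "geneD"}
predicted = {"geneA", "geneB", "geneC", "geneX", "geneY", "geneZ"}
precision, recall = precision_recall(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case, adding novel genes raises the size of the prediction without raising the overlap, which is the same effect the abstract gives as the reason for the lower precision of the expanded final models.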
8.  The percentage of bacterial genes on leading versus lagging strands is influenced by multiple balancing forces 
Nucleic Acids Research  2012;40(17):8210-8218.
The majority of bacterial genes are located on the leading strand, and the percentage of such genes has a large variation across different bacteria. Although some explanations have been proposed, these are at most partial explanations as they cover only small percentages of the genes and do not even consider the ones biased toward the lagging strand. We have carried out a computational study on 725 bacterial genomes, aiming to elucidate other factors that may have influenced the strand location of genes in a bacterium. Our analyses suggest that (i) genes of some functional categories such as ribosome have higher preferences to be on the leading strands; (ii) genes of some functional categories such as transcription factor have higher preferences to be on the lagging strands; (iii) there is a balancing force that tends to keep genes from all moving to the leading and more efficient strand; and (iv) the percentage of leading-strand genes in a bacterium can be accurately explained based on the numbers of genes in the functional categories outlined in (i) and (ii), genome size and gene density, indicating that these numbers implicitly contain the information about the percentage of genes on the leading versus lagging strand in a genome.
PMCID: PMC3458553  PMID: 22735706
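The quantity studied in entry 8, the percentage of genes on the leading strand, can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: a single circular chromosome, known origin (`ori`) and terminus (`ter`) coordinates, and one coordinate plus a strand sign per gene. All names and numbers are hypothetical.

```python
def leading_strand_fraction(genes, ori, ter, genome_length):
    """Fraction of genes co-oriented with replication on a circular chromosome.

    genes: iterable of (position, strand) pairs, strand being '+' or '-'.
    The replichore running from ori to ter in increasing coordinates is
    replicated in the '+' direction; the other replichore in the '-' direction.
    """
    forward_span = (ter - ori) % genome_length
    leading = 0
    total = 0
    for position, strand in genes:
        offset = (position - ori) % genome_length
        on_forward_replichore = offset < forward_span
        # A gene is on the leading strand when its strand matches the
        # direction in which the replication fork passes through it.
        if (strand == "+") == on_forward_replichore:
            leading += 1
        total += 1
    return leading / total if total else 0.0


# Hypothetical 1000-bp genome with ori at 0 and ter at 500.
genes = [(100, "+"), (300, "-"), (600, "-"), (900, "-")]
fraction = leading_strand_fraction(genes, ori=0, ter=500, genome_length=1000)
print(f"{fraction:.0%} of genes on the leading strand")
```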
10.  dbCAN: a web resource for automated carbohydrate-active enzyme annotation 
Nucleic Acids Research  2012;40(Web Server issue):W445-W451.
Carbohydrate-active enzymes (CAZymes) are very important to the biotech industry, particularly the emerging biofuel industry because CAZymes are responsible for the synthesis, degradation and modification of all the carbohydrates on Earth. We have developed a web resource, dbCAN (, to provide a capability for automated CAZyme signature domain-based annotation for any given protein data set (e.g. proteins from a newly sequenced genome) submitted to our server. To accomplish this, we have explicitly defined a signature domain for every CAZyme family, derived based on the CDD (conserved domain database) search and literature curation. We have also constructed a hidden Markov model to represent the signature domain of each CAZyme family. These CAZyme family-specific HMMs are our key contribution and the foundation for the automated CAZyme annotation.
PMCID: PMC3394287  PMID: 22645317
11.  A Comparative Study of Gene-Expression Data of Basal Cell Carcinoma and Melanoma Reveals New Insights about the Two Cancers 
PLoS ONE  2012;7(1):e30750.
A comparative analysis of genome-scale transcriptomic data of two types of skin cancers, melanoma and basal cell carcinoma in comparison with other cancer types, was conducted with the aim of identifying key regulatory factors that either cause or contribute to the aggressiveness of melanoma, while basal cell carcinoma generally remains a mild disease. Multiple cancer-related pathways such as cell proliferation, apoptosis, angiogenesis, cell invasion and metastasis, are considered, but our focus is on energy metabolism, cell invasion and metastasis pathways. Our findings include the following. (a) Both types of skin cancers use both glycolysis and increased oxidative phosphorylation (electron transfer chain) for their energy supply. (b) Advanced melanoma shows substantial up-regulation of key genes involved in fatty acid metabolism (β-oxidation) and oxidative phosphorylation, with aerobic metabolism being far more efficient than anaerobic glycolysis, providing a source of the energetics necessary to support the rapid growth of this cancer. (c) While advanced melanoma is similar to pancreatic cancer in terms of the activity level of genes involved in promoting cell invasion and metastasis, the main metastatic form of basal cell carcinoma is substantially reduced in this activity, partially explaining why this cancer type has been considered as far less aggressive. Our method of using comparative analyses of transcriptomic data of multiple cancer types focused on specific pathways provides a novel and highly effective approach to cancer studies in general.
PMCID: PMC3266277  PMID: 22295108
12.  Integration of sequence-similarity and functional association information can overcome intrinsic problems in orthology mapping across bacterial genomes 
Nucleic Acids Research  2011;39(22):e150.
Existing methods for orthologous gene mapping suffer from two general problems: (i) they are computationally too slow and their results are difficult to interpret for automated large-scale applications when based on phylogenetic analyses; or (ii) they are too prone to making mistakes in dealing with complex situations involving horizontal gene transfers and gene fusion due to the lack of a sound basis when based on sequence similarity information. We present a novel algorithm, Global Optimization Strategy (GOST), for orthologous gene mapping through combining sequence similarity and contextual (working partners) information, using a combinatorial optimization framework. Genome-scale applications of GOST show substantial improvements over the predictions by three popular sequence similarity-based orthology mapping programs. Our analysis indicates that our algorithm overcomes the intrinsic issues faced by sequence similarity-based methods, when orthology mapping involves gene fusions and horizontal gene transfers. Our program runs as efficiently as the most efficient sequence similarity-based algorithm in the public domain. GOST is freely downloadable at
PMCID: PMC3239196  PMID: 21965536
13.  SEAS: A System for SEED-Based Pathway Enrichment Analysis 
PLoS ONE  2011;6(7):e22556.
Pathway enrichment analysis represents a key technique for analyzing high-throughput omic data, and it can help to link individual genes or proteins found to be differentially expressed under specific conditions to well-understood biological pathways. We present here a computational tool, SEAS, for pathway enrichment analysis over a given set of genes in a specified organism against the pathways (or subsystems) in the SEED database, a popular pathway database for bacteria. SEAS maps a given set of genes of a bacterium to pathway genes covered by SEED through gene ID and/or orthology mapping, and then calculates the statistical significance of the enrichment of each relevant SEED pathway by the mapped genes. Our evaluation of SEAS indicates that the program provides highly reliable pathway mapping results and identifies more organism-specific pathways than similar existing programs. SEAS is publicly released under the GPL license agreement and freely available at
PMCID: PMC3142180  PMID: 21799897
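Entry 13 states that SEAS calculates the statistical significance of pathway enrichment but the abstract does not name the test. A common choice for this kind of analysis is the one-sided hypergeometric test, sketched below as an assumption rather than as SEAS's actual method; the function and parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(total_genes, pathway_genes, selected_genes, overlap):
    """One-sided hypergeometric p-value: P(X >= overlap).

    total_genes:    number of genes in the organism (background)
    pathway_genes:  number of background genes annotated to the pathway
    selected_genes: number of genes in the input set
    overlap:        number of input genes that fall in the pathway
    """
    max_overlap = min(pathway_genes, selected_genes)
    denominator = comb(total_genes, selected_genes)
    numerator = sum(
        comb(pathway_genes, k) * comb(total_genes - pathway_genes, selected_genes - k)
        for k in range(overlap, max_overlap + 1)
    )
    return numerator / denominator


# Hypothetical example: 4000 genes, a 40-gene pathway, 100 input genes, 8 of them in the pathway.
p = enrichment_p_value(4000, 40, 100, 8)
print(f"p = {p:.3g}")
```

When many pathways are tested at once, the resulting p-values would normally be corrected for multiple testing before being reported.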
14.  KOBAS 2.0: a web server for annotation and identification of enriched pathways and diseases 
Nucleic Acids Research  2011;39(Web Server issue):W316-W322.
High-throughput experimental technologies often identify dozens to hundreds of genes related to, or changed in, a biological or pathological process. From these genes one wants to identify biological pathways that may be involved and diseases that may be implicated. Here, we report a web server, KOBAS 2.0, which annotates an input set of genes with putative pathways and disease relationships based on mapping to genes with known annotations. It allows for both ID mapping and cross-species sequence similarity mapping. It then performs statistical tests to identify statistically significantly enriched pathways and diseases. KOBAS 2.0 incorporates knowledge across 1327 species from 5 pathway databases (KEGG PATHWAY, PID, BioCyc, Reactome and Panther) and 5 human disease databases (OMIM, KEGG DISEASE, FunDO, GAD and NHGRI GWAS Catalog). KOBAS 2.0 can be accessed at
PMCID: PMC3125809  PMID: 21715386
15.  Computational prediction of the osmoregulation network in Synechococcus sp. WH8102 
BMC Genomics  2010;11:291.
Osmotic stress is caused by sudden changes in the impermeable solute concentration around a cell, which induces instantaneous water flow in or out of the cell to balance the concentration. Very little is known about the detailed response mechanism to osmotic stress in marine Synechococcus, one of the major oxygenic phototrophic cyanobacterial genera that contribute greatly to the global CO2 fixation.
We present here a computational study of the osmoregulation network in response to hyperosmotic stress of Synechococcus sp strain WH8102 using comparative genome analyses and computational prediction. In this study, we identified the key transporters, synthetases, signal sensor proteins and transcriptional regulator proteins, and found experimentally that of these proteins, 15 genes showed significantly changed expression levels under a mild hyperosmotic stress.
From the predicted network model, we have made a number of interesting observations about WH8102. Specifically, we found that (i) the organism likely uses glycine betaine as the major osmolyte, and others such as glucosylglycerol, glucosylglycerate, trehalose, sucrose and arginine as the minor osmolytes, making it efficient and adaptable to its changing environment; and (ii) σ38, one of the seven types of σ factors, probably serves as a global regulator coordinating the osmoregulation network and the other relevant networks.
PMCID: PMC2874817  PMID: 20459751
16.  Genes and (Common) Pathways Underlying Drug Addiction 
Drug addiction is a serious worldwide problem with strong genetic and environmental influences. Different technologies have revealed a variety of genes and pathways underlying addiction; however, each individual technology can be biased and incomplete. We integrated 2,343 items of evidence from peer-reviewed publications between 1976 and 2006 linking genes and chromosome regions to addiction by single-gene strategies, microarray, proteomics, or genetic studies. We identified 1,500 human addiction-related genes and developed KARG (, the first molecular database for addiction-related genes with extensive annotations and a friendly Web interface. We then performed a meta-analysis of 396 genes that were supported by two or more independent items of evidence to identify 18 molecular pathways that were statistically significantly enriched, covering both upstream signaling events and downstream effects. Five molecular pathways significantly enriched for all four different types of addictive drugs were identified as common pathways which may underlie shared rewarding and addictive actions, including two new ones, GnRH signaling pathway and gap junction. We connected the common pathways into a hypothetical common molecular network for addiction. We observed that fast and slow positive feedback loops were interlinked through CAMKII, which may provide clues to explain some of the irreversible features of addiction.
Author Summary
Drug addiction has become one of the most serious problems in the world. It has been estimated that genetic factors contribute to 40%–60% of the vulnerability to drug addiction, and environmental factors provide the remainder. What are the genes and pathways underlying addiction? Is there a common molecular network underlying addiction to different abusive substances? Is there any network property that may explain the long-lived and often irreversible molecular and structural changes after addiction? These important questions were traditionally studied experimentally. The explosion of genomic and proteomic data in recent years both enabled and necessitated bioinformatic studies of addiction. We integrated data derived from multiple technology platforms and collected 2,343 items of evidence linking genes and chromosome regions to addiction. We identified 18 statistically significantly enriched molecular pathways. In particular, five of them were common for four types of addictive drugs, which may underlie shared rewarding and addictive actions, including two new ones, GnRH signaling pathway and gap junction. We connected the common pathways into a hypothetical common molecular network for addiction. We observed that fast and slow positive feedback loops were interlinked through CAMKII, which may provide clues to explain some of the irreversible features of addiction.
PMCID: PMC2174978  PMID: 18179280
17.  Genome comparison using Gene Ontology (GO) with statistical testing 
BMC Bioinformatics  2006;7:374.
Automated comparison of complete sets of genes encoded in two genomes can provide insight on the genetic basis of differences in biological traits between species. Gene ontology (GO) is used as a common vocabulary to annotate genes for comparison. Current approaches calculate the fold of unweighted or weighted differences between two species at the high-level GO functional categories. However, to ensure the reliability of the differences detected, it is important to evaluate their statistical significance. It is also useful to search for differences at all levels of GO.
We propose a statistical approach to find reliable differences between the complete sets of genes encoded in two genomes at all levels of GO. The genes are first assigned GO terms from BLAST searches against genes with known GO assignments, and for each GO term the abundance of genes in the two genomes is compared using a chi-squared test followed by false discovery rate (FDR) correction. We applied this method to find statistically significant differences between two cyanobacteria, Synechocystis sp. PCC6803 and Anabaena sp. PCC7120. We then studied how the set of identified differences vary when different BLAST cutoffs are used. We also studied how the results vary when only subsets of the genes were used in the comparison of human vs. mouse and that of Saccharomyces cerevisiae vs. Schizosaccharomyces pombe.
There is a surprising lack of statistical approaches for comparing complete genomes at all levels of GO. With the rapid increase of the number of sequenced genomes, we hope that the approach we proposed and tested can make valuable contribution to comparative genomics.
PMCID: PMC1569881  PMID: 16901353
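The procedure in entry 17, a per-GO-term chi-squared test followed by FDR correction, can be sketched in a few lines. The sketch assumes a 2x2 table per GO term (genes with and without the term in each genome), a Pearson statistic without continuity correction, and the Benjamini-Hochberg procedure for FDR; the abstract does not specify these details, and all names and counts below are hypothetical.

```python
from math import erfc, sqrt


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic and p-value (1 d.o.f.) for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0, 1.0  # degenerate table: no evidence of a difference
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of chi-squared with one degree of freedom.
    return stat, erfc(sqrt(stat / 2))


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts: (genes with term, genes without term) in genome A, then genome B.
tables = {"GO:term1": (20, 80, 40, 60), "GO:term2": (10, 90, 12, 88)}
raw = {term: chi2_2x2(*counts)[1] for term, counts in tables.items()}
adjusted = dict(zip(raw, benjamini_hochberg(list(raw.values()))))
for term in tables:
    print(term, f"p={raw[term]:.3g}", f"q={adjusted[term]:.3g}")
```

Because every GO term at every level is tested, the correction step matters: a term is reported as different between the two genomes only if its adjusted value falls below the chosen FDR threshold.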
18.  KOBAS server: a web-based platform for automated annotation and pathway identification 
Nucleic Acids Research  2006;34(Web Server issue):W720-W724.
There is an increasing need to automatically annotate a set of genes or proteins (from genome sequencing, DNA microarray analysis or protein 2D gel experiments) using controlled vocabularies and identify the pathways involved, especially the statistically enriched pathways. We have previously demonstrated the KEGG Orthology (KO) as an effective alternative controlled vocabulary and developed a standalone KO-Based Annotation System (KOBAS). Here we report a KOBAS server with a friendly web-based user interface and enhanced functionalities. The server can support input by nucleotide or amino acid sequences or by sequence identifiers in popular databases and can annotate the input with KO terms and KEGG pathways by BLAST sequence similarity or by direct ID mapping to genes with known annotations. The server can then identify both frequent and statistically enriched pathways, offering a choice of four statistical tests and the option of multiple testing correction. The server also has a ‘User Space’ in which frequent users may store and manage their data and results online. We demonstrate the usability of the server by finding statistically enriched pathways in a set of upregulated genes in Alzheimer's Disease (AD) hippocampal cornu ammonis 1 (CA1). KOBAS server can be accessed at .
PMCID: PMC1538915  PMID: 16845106
