
Results 1-6 (6)

author:("Yin, yanbian")
1.  Computational analyses of transcriptomic data reveal the dynamic organization of the Escherichia coli chromosome under different conditions 
Nucleic Acids Research  2013;41(11):5594-5603.
The circular chromosome of Escherichia coli has been suggested to fold into a collection of sequentially consecutive domains, genes in each of which tend to be co-expressed. It has also been suggested that such domains, forming a partition of the genome, are dynamic with respect to the physiological conditions. However, little is known about which DNA segments of the E. coli genome form these domains and what determines the boundaries of these domain segments. We present a computational model here to partition the circular genome into consecutive segments, theoretically suggestive of the physically folded supercoiled domains, along with a method for predicting such domains under specified conditions. Our model is based on a hypothesis that the genome of E. coli is partitioned into a set of folding domains so that the total number of unfoldings of these domains in the folded chromosome is minimized, where a domain is unfolded when a biological pathway, consisting of genes encoded in this DNA segment, is being activated transcriptionally. Based on this hypothesis, we have predicted seven distinct sets of such domains along the E. coli genome for seven physiological conditions, namely exponential growth, stationary growth, anaerobiosis, heat shock, oxidative stress, nitrogen limitation and SOS responses. These predicted folding domains are highly stable statistically and are generally consistent with the experimental data of DNA binding sites of the nucleoid-associated proteins that assist the folding of these domains, as well as genome-scale protein occupancy profiles, hence supporting our proposed model. Our study established for the first time a strong link between a folded E. coli chromosomal structure and the encoded biological pathways and their activation frequencies.
PMCID: PMC3675479  PMID: 23599001
2.  The percentage of bacterial genes on leading versus lagging strands is influenced by multiple balancing forces 
Nucleic Acids Research  2012;40(17):8210-8218.
The majority of bacterial genes are located on the leading strand, and the percentage of such genes has a large variation across different bacteria. Although some explanations have been proposed, these are at most partial explanations as they cover only small percentages of the genes and do not even consider the ones biased toward the lagging strand. We have carried out a computational study on 725 bacterial genomes, aiming to elucidate other factors that may have influenced the strand location of genes in a bacterium. Our analyses suggest that (i) genes of some functional categories such as ribosome have higher preferences to be on the leading strands; (ii) genes of some functional categories such as transcription factor have higher preferences on the lagging strands; (iii) there is a balancing force that tends to keep genes from all moving to the leading and more efficient strand and (iv) the percentage of leading-strand genes in a bacterium can be accurately explained based on the numbers of genes in the functional categories outlined in (i) and (ii), genome size and gene density, indicating that these numbers implicitly contain the information about the percentage of genes on the leading versus lagging strand in a genome.
PMCID: PMC3458553  PMID: 22735706
3.  dbCAN: a web resource for automated carbohydrate-active enzyme annotation 
Nucleic Acids Research  2012;40(Web Server issue):W445-W451.
Carbohydrate-active enzymes (CAZymes) are very important to the biotech industry, particularly the emerging biofuel industry because CAZymes are responsible for the synthesis, degradation and modification of all the carbohydrates on Earth. We have developed a web resource, dbCAN, to provide a capability for automated CAZyme signature domain-based annotation for any given protein data set (e.g. proteins from a newly sequenced genome) submitted to our server. To accomplish this, we have explicitly defined a signature domain for every CAZyme family, derived based on the CDD (conserved domain database) search and literature curation. We have also constructed a hidden Markov model to represent the signature domain of each CAZyme family. These CAZyme family-specific HMMs are our key contribution and the foundation for the automated CAZyme annotation.
PMCID: PMC3394287  PMID: 22645317
4.  Integration of sequence-similarity and functional association information can overcome intrinsic problems in orthology mapping across bacterial genomes 
Nucleic Acids Research  2011;39(22):e150.
Existing methods for orthologous gene mapping suffer from two general problems: (i) they are computationally too slow and their results are difficult to interpret for automated large-scale applications when based on phylogenetic analyses; or (ii) they are too prone to making mistakes in dealing with complex situations involving horizontal gene transfers and gene fusion due to the lack of a sound basis when based on sequence similarity information. We present a novel algorithm, Global Optimization Strategy (GOST), for orthologous gene mapping through combining sequence similarity and contextual (working partners) information, using a combinatorial optimization framework. Genome-scale applications of GOST show substantial improvements over the predictions by three popular sequence similarity-based orthology mapping programs. Our analysis indicates that our algorithm overcomes the intrinsic issues faced by sequence similarity-based methods, when orthology mapping involves gene fusions and horizontal gene transfers. Our program runs as efficiently as the most efficient sequence similarity-based algorithm in the public domain. GOST is freely downloadable.
PMCID: PMC3239196  PMID: 21965536
5.  Insights into plant biomass conversion from the genome of the anaerobic thermophilic bacterium Caldicellulosiruptor bescii DSM 6725 
Nucleic Acids Research  2011;39(8):3240-3254.
Caldicellulosiruptor bescii DSM 6725 utilizes various polysaccharides and grows efficiently on untreated high-lignin grasses and hardwood at an optimum temperature of ∼80°C. It is a promising anaerobic bacterium for studying high-temperature biomass conversion. Its genome contains 2666 protein-coding sequences organized into 1209 operons. Expression of 2196 genes (83%) was confirmed experimentally. At least 322 genes appear to have been obtained by lateral gene transfer (LGT). Putative functions were assigned to 364 conserved/hypothetical protein (C/HP) genes. The genome contains 171 and 88 genes related to carbohydrate transport and utilization, respectively. Growth on cellulose led to the up-regulation of 32 carbohydrate-active (CAZy), 61 sugar transport, 25 transcription factor and 234 C/HP genes. Some C/HPs were overproduced on cellulose or xylan, suggesting their involvement in polysaccharide conversion. A unique feature of the genome is enrichment with genes encoding multi-modular, multi-functional CAZy proteins organized into one large cluster, the products of which are proposed to act synergistically on different components of plant cell walls and to aid the ability of C. bescii to convert plant biomass. The high duplication of CAZy domains coupled with the ability to acquire foreign genes by LGT may have allowed the bacterium to rapidly adapt to changing plant biomass-rich environments.
PMCID: PMC3082886  PMID: 21227922
6.  SPD—a web-based secreted protein database 
Nucleic Acids Research  2004;33(Database Issue):D169-D173.
With the improved secreted protein prediction approach and comprehensive data sources, including Swiss-Prot, TrEMBL, RefSeq, Ensembl and CBI-Gene, we have constructed secretomes of human, mouse and rat, with a total of 18 152 secreted proteins. All the entries are ranked according to the prediction confidence. They were further annotated via a proteome annotation pipeline that we developed. We also set up a secreted protein classification pipeline and classified our predicted secreted proteins into different functional categories. To make the dataset more convincing and comprehensive, nine reference datasets are also integrated, such as the secreted proteins from the Gene Ontology Annotation (GOA) system at the European Bioinformatics Institute, and the vertebrate secreted proteins from Swiss-Prot. All these entries were grouped via a TribeMCL based clustering pipeline. We have constructed a web-based secreted protein database, which has been made publicly available. Users can browse the database via a GO assignment or chromosomal-location-based interface. Moreover, text query and sequence similarity search are also provided, and the sequence and annotation data can be downloaded freely from the SPD website.
PMCID: PMC540047  PMID: 15608170
