Search tips
Search criteria

Results 1-11 (11)

Clipboard (0)

Select a Filter Below

Year of Publication
Document Types
author:("Yan, koos-Kiu")
1.  A statistical framework for modeling gene expression using chromatin features and application to modENCODE datasets 
Genome Biology  2011;12(2):R15.
We develop a statistical framework to study the relationship between chromatin features and gene expression. This can be used to predict gene expression of protein coding genes, as well as microRNAs. We demonstrate the prediction in a variety of contexts, focusing particularly on the modENCODE worm datasets. Moreover, our framework reveals the positional contribution around genes (upstream or downstream) of distinct chromatin features to the overall prediction of expression levels.
PMCID: PMC3188797  PMID: 21324173
2.  Architecture of the human regulatory network derived from ENCODE data 
Nature  2012;489(7414):91-100.
Transcription factors (TFs) bind in a combinatorial fashion to specify the on-and-off states of genes; the ensemble of these binding events forms a regulatory network, constituting the wiring diagram for a cell. To examine the principles of the human transcriptional regulatory network, we determined the genomic binding information of 119 TFs in 458 ChIP-Seq experiments. We found the combinatorial, co-association of TFs to be highly context specific: distinct combinations of factors bind at specific genomic locations. In particular, there are significant differences in the binding proximal and distal to genes. We organized all the TF binding into a hierarchy and integrated it with other genomic information (e.g. miRNA regulation), forming a dense meta-network. Factors at different levels have different properties: for instance, top-level TFs more strongly influence expression and middle-level ones co-regulate targets to mitigate information-flow bottlenecks. Moreover, these co-regulations give rise to many enriched network motifs -- e.g. noise-buffering feed-forward loops. Finally, more connected network components are under stronger selection and exhibit a greater degree of allele-specific activity (i.e., differential binding to the two parental alleles). The regulatory information obtained in this study will be crucial for interpreting personal genome sequences and understanding basic principles of human biology and disease.
PMCID: PMC4154057  PMID: 22955619
4.  Construction and Analysis of an Integrated Regulatory Network Derived from High-Throughput Sequencing Data 
PLoS Computational Biology  2011;7(11):e1002190.
We present a network framework for analyzing multi-level regulation in higher eukaryotes based on systematic integration of various high-throughput datasets. The network, namely the integrated regulatory network, consists of three major types of regulation: TF→gene, TF→miRNA and miRNA→gene. We identified the target genes and target miRNAs for a set of TFs based on the ChIP-Seq binding profiles, the predicted targets of miRNAs using annotated 3′UTR sequences and conservation information. Making use of the system-wide RNA-Seq profiles, we classified transcription factors into positive and negative regulators and assigned a sign for each regulatory interaction. Other types of edges such as protein-protein interactions and potential intra-regulations between miRNAs based on the embedding of miRNAs in their host genes were further incorporated. We examined the topological structures of the network, including its hierarchical organization and motif enrichment. We found that transcription factors downstream of the hierarchy distinguish themselves by expressing more uniformly at various tissues, have more interacting partners, and are more likely to be essential. We found an over-representation of notable network motifs, including a FFL in which a miRNA cost-effectively shuts down a transcription factor and its target. We used data of C. elegans from the modENCODE project as a primary model to illustrate our framework, but further verified the results using other two data sets. As more and more genome-wide ChIP-Seq and RNA-Seq data becomes available in the near future, our methods of data integration have various potential applications.
Author Summary
The precise control of gene expression lies at the heart of many biological processes. In eukaryotes, the regulation is performed at multiple levels, mediated by different regulators such as transcription factors and miRNAs, each distinguished by different spatial and temporal characteristics. These regulators are further integrated to form a complex regulatory network responsible for the orchestration. The construction and analysis of such networks is essential for understanding the general design principles. Recent advances in high-throughput techniques like ChIP-Seq and RNA-Seq provide an opportunity by offering a huge amount of binding and expression data. We present a general framework to combine these types of data into an integrated network and perform various topological analyses, including its hierarchical organization and motif enrichment. We find that the integrated network possesses an intrinsic hierarchical organization and is enriched in several network motifs that include both transcription factors and miRNAs. We further demonstrate that the framework can be easily applied to other species like human and mouse. As more and more genome-wide ChIP-Seq and RNA-Seq data are going to be generated in the near future, our methods of data integration have various potential applications.
PMCID: PMC3219617  PMID: 22125477
5.  Integrative Analysis of the Caenorhabditis elegans Genome by the modENCODE Project 
Gerstein, Mark B. | Lu, Zhi John | Van Nostrand, Eric L. | Cheng, Chao | Arshinoff, Bradley I. | Liu, Tao | Yip, Kevin Y. | Robilotto, Rebecca | Rechtsteiner, Andreas | Ikegami, Kohta | Alves, Pedro | Chateigner, Aurelien | Perry, Marc | Morris, Mitzi | Auerbach, Raymond K. | Feng, Xin | Leng, Jing | Vielle, Anne | Niu, Wei | Rhrissorrakrai, Kahn | Agarwal, Ashish | Alexander, Roger P. | Barber, Galt | Brdlik, Cathleen M. | Brennan, Jennifer | Brouillet, Jeremy Jean | Carr, Adrian | Cheung, Ming-Sin | Clawson, Hiram | Contrino, Sergio | Dannenberg, Luke O. | Dernburg, Abby F. | Desai, Arshad | Dick, Lindsay | Dosé, Andréa C. | Du, Jiang | Egelhofer, Thea | Ercan, Sevinc | Euskirchen, Ghia | Ewing, Brent | Feingold, Elise A. | Gassmann, Reto | Good, Peter J. | Green, Phil | Gullier, Francois | Gutwein, Michelle | Guyer, Mark S. | Habegger, Lukas | Han, Ting | Henikoff, Jorja G. | Henz, Stefan R. | Hinrichs, Angie | Holster, Heather | Hyman, Tony | Iniguez, A. Leo | Janette, Judith | Jensen, Morten | Kato, Masaomi | Kent, W. James | Kephart, Ellen | Khivansara, Vishal | Khurana, Ekta | Kim, John K. | Kolasinska-Zwierz, Paulina | Lai, Eric C. | Latorre, Isabel | Leahey, Amber | Lewis, Suzanna | Lloyd, Paul | Lochovsky, Lucas | Lowdon, Rebecca F. | Lubling, Yaniv | Lyne, Rachel | MacCoss, Michael | Mackowiak, Sebastian D. | Mangone, Marco | McKay, Sheldon | Mecenas, Desirea | Merrihew, Gennifer | Miller, David M. | Muroyama, Andrew | Murray, John I. | Ooi, Siew-Loon | Pham, Hoang | Phippen, Taryn | Preston, Elicia A. | Rajewsky, Nikolaus | Rätsch, Gunnar | Rosenbaum, Heidi | Rozowsky, Joel | Rutherford, Kim | Ruzanov, Peter | Sarov, Mihail | Sasidharan, Rajkumar | Sboner, Andrea | Scheid, Paul | Segal, Eran | Shin, Hyunjin | Shou, Chong | Slack, Frank J. | Slightam, Cindie | Smith, Richard | Spencer, William C. | Stinson, E. O. | Taing, Scott | Takasaki, Teruaki | Vafeados, Dionne | Voronina, Ksenia | Wang, Guilin | Washington, Nicole L. | Whittle, Christina M. | Wu, Beijing | Yan, Koon-Kiu | Zeller, Georg | Zha, Zheng | Zhong, Mei | Zhou, Xingliang | Ahringer, Julie | Strome, Susan | Gunsalus, Kristin C. | Micklem, Gos | Liu, X. Shirley | Reinke, Valerie | Kim, Stuart K. | Hillier, LaDeana W. | Henikoff, Steven | Piano, Fabio | Snyder, Michael | Stein, Lincoln | Lieb, Jason D. | Waterston, Robert H.
Science (New York, N.Y.)  2010;330(6012):1775-1787.
We systematically generated large-scale data sets to improve genome annotation for the nematode Caenorhabditis elegans, a key model organism. These data sets include transcriptome profiling across a developmental time course, genome-wide identification of transcription factor–binding sites, and maps of chromatin organization. From this, we created more complete and accurate gene models, including alternative splice forms and candidate noncoding RNAs. We constructed hierarchical networks of transcription factor–binding and microRNA interactions and discovered chromosomal locations bound by an unusually large number of transcription factors. Different patterns of chromatin composition and histone modification were revealed between chromosome arms and centers, with similarly prominent differences between autosomes and the X chromosome. Integrating data types, we built statistical models relating chromatin, transcription factor binding, and gene expression. Overall, our analyses ascribed putative functions to most of the conserved genome.
PMCID: PMC3142569  PMID: 21177976
6.  The Spread of Scientific Information: Insights from the Web Usage Statistics in PLoS Article-Level Metrics 
PLoS ONE  2011;6(5):e19917.
The presence of web-based communities is a distinctive signature of Web 2.0. The web-based feature means that information propagation within each community is highly facilitated, promoting complex collective dynamics in view of information exchange. In this work, we focus on a community of scientists and study, in particular, how the awareness of a scientific paper is spread. Our work is based on the web usage statistics obtained from the PLoS Article Level Metrics dataset compiled by PLoS. The cumulative number of HTML views was found to follow a long tail distribution which is reasonably well-fitted by a lognormal one. We modeled the diffusion of information by a random multiplicative process, and thus extracted the rates of information spread at different stages after the publication of a paper. We found that the spread of information displays two distinct decay regimes: a rapid downfall in the first month after publication, and a gradual power law decay afterwards. We identified these two regimes with two distinct driving processes: a short-term behavior driven by the fame of a paper, and a long-term behavior consistent with citation statistics. The patterns of information spread were found to be remarkably similar in data from different journals, but there are intrinsic differences for different types of web usage (HTML views and PDF downloads versus XML). These similarities and differences shed light on the theoretical understanding of different complex systems, as well as a better design of the corresponding web applications that is of high potential marketing impact.
PMCID: PMC3095621  PMID: 21603617
7.  Measuring the Evolutionary Rewiring of Biological Networks 
PLoS Computational Biology  2011;7(1):e1001050.
We have accumulated a large amount of biological network data and expect even more to come. Soon, we anticipate being able to compare many different biological networks as we commonly do for molecular sequences. It has long been believed that many of these networks change, or “rewire”, at different rates. It is therefore important to develop a framework to quantify the differences between networks in a unified fashion. We developed such a formalism based on analogy to simple models of sequence evolution, and used it to conduct a systematic study of network rewiring on all the currently available biological networks. We found that, similar to sequences, biological networks show a decreased rate of change at large time divergences, because of saturation in potential substitutions. However, different types of biological networks consistently rewire at different rates. Using comparative genomics and proteomics data, we found a consistent ordering of the rewiring rates: transcription regulatory, phosphorylation regulatory, genetic interaction, miRNA regulatory, protein interaction, and metabolic pathway network, from fast to slow. This ordering was found in all comparisons we did of matched networks between organisms. To gain further intuition on network rewiring, we compared our observed rewirings with those obtained from simulation. We also investigated how readily our formalism could be mapped to other network contexts; in particular, we showed how it could be applied to analyze changes in a range of “commonplace” networks such as family trees, co-authorships and linux-kernel function dependencies.
Author Summary
Biological networks represent various types of molecular organizations in a cell. During evolution, molecules have been shown to change at varying rates. Therefore, it is important to investigate the evolution of biological networks in terms of network rewiring. Understanding how biological networks evolve could eventually help explain the general mechanism of cellular system. In the past decade, a large amount of high-throughput experiments have helped to unravel the different types of networks in a number of species. Recent studies have provided evolutionary rate calculations on individual networks and observed different rewiring rates between them. We have chosen a systematic approach to compare rewiring rate differences among the common types of biological networks utilizing experimental data across species. Our analysis shows that regulatory networks generally evolve faster than non-regulatory collaborative networks. Our analysis also highlights future applications of the approach to address other interesting biological questions.
PMCID: PMC3017101  PMID: 21253555
8.  Analysis of Combinatorial Regulation: Scaling of Partnerships between Regulators with the Number of Governed Targets 
PLoS Computational Biology  2010;6(5):e1000755.
Through combinatorial regulation, regulators partner with each other to control common targets and this allows a small number of regulators to govern many targets. One interesting question is that given this combinatorial regulation, how does the number of regulators scale with the number of targets? Here, we address this question by building and analyzing co-regulation (co-transcription and co-phosphorylation) networks that describe partnerships between regulators controlling common genes. We carry out analyses across five diverse species: Escherichia coli to human. These reveal many properties of partnership networks, such as the absence of a classical power-law degree distribution despite the existence of nodes with many partners. We also find that the number of co-regulatory partnerships follows an exponential saturation curve in relation to the number of targets. (For E. coli and Bacillus subtilis, only the beginning linear part of this curve is evident due to arrangement of genes into operons.) To gain intuition into the saturation process, we relate the biological regulation to more commonplace social contexts where a small number of individuals can form an intricate web of connections on the internet. Indeed, we find that the size of partnership networks saturates even as the complexity of their output increases. We also present a variety of models to account for the saturation phenomenon. In particular, we develop a simple analytical model to show how new partnerships are acquired with an increasing number of target genes; with certain assumptions, it reproduces the observed saturation. Then, we build a more general simulation of network growth and find agreement with a wide range of real networks. Finally, we perform various down-sampling calculations on the observed data to illustrate the robustness of our conclusions.
Author Summary
A regulatory network consists of regulators such as transcription factors or kinases that control the expression or activity of their target genes. Almost always, there are multiple regulators partnering together to control their targets. Compared to more commonplace contexts, these regulators can be thought of as managers in a social or corporate setting controlling their common subordinates. One interesting question that we address here in this study is how the number of governing regulators scales with the number of governed targets. We build and analyze co-regulation (co-transcription and co-phosphorylation) networks that describe partnerships between regulators controlling common genes. We use a simple framework across five species that demonstrate a wide range of evolution: Escherichia coli to human. The analysis reveals many properties of partnership networks and shows that the number of co-regulatory partnerships follows an exponential saturation curve with the number of targets. To gain more intuition, we explore more commonplace contexts and find that exponential saturation relationship also exists in several social networks. Finally, we propose a simple model to explain this relationship that also exists in a simulated evolutionary environment.
PMCID: PMC2877725  PMID: 20523742
9.  Improved Reconstruction of In Silico Gene Regulatory Networks by Integrating Knockout and Perturbation Data 
PLoS ONE  2010;5(1):e8121.
We performed computational reconstruction of the in silico gene regulatory networks in the DREAM3 Challenges. Our task was to learn the networks from two types of data, namely gene expression profiles in deletion strains (the ‘deletion data’) and time series trajectories of gene expression after some initial perturbation (the ‘perturbation data’). In the course of developing the prediction method, we observed that the two types of data contained different and complementary information about the underlying network. In particular, deletion data allow for the detection of direct regulatory activities with strong responses upon the deletion of the regulator while perturbation data provide richer information for the identification of weaker and more complex types of regulation. We applied different techniques to learn the regulation from the two types of data. For deletion data, we learned a noise model to distinguish real signals from random fluctuations using an iterative method. For perturbation data, we used differential equations to model the change of expression levels of a gene along the trajectories due to the regulation of other genes. We tried different models, and combined their predictions. The final predictions were obtained by merging the results from the two types of data. A comparison with the actual regulatory networks suggests that our approach is effective for networks with a range of different sizes. The success of the approach demonstrates the importance of integrating heterogeneous data in network reconstruction.
PMCID: PMC2811182  PMID: 20126643
10.  Parameters of proteome evolution from histograms of amino-acid sequence identities of paralogous proteins 
Biology Direct  2007;2:32.
The evolution of the full repertoire of proteins encoded in a given genome is mostly driven by gene duplications, deletions, and sequence modifications of existing proteins. Indirect information about relative rates and other intrinsic parameters of these three basic processes is contained in the proteome-wide distribution of sequence identities of pairs of paralogous proteins.
We introduce a simple mathematical framework based on a stochastic birth-and-death model that allows one to extract some of this information and apply it to the set of all pairs of paralogous proteins in H. pylori, E. coli, S. cerevisiae, C. elegans, D. melanogaster, and H. sapiens. It was found that the histogram of sequence identities p generated by an all-to-all alignment of all protein sequences encoded in a genome is well fitted with a power-law form ~ p-γ with the value of the exponent γ around 4 for the majority of organisms used in this study. This implies that the intra-protein variability of substitution rates is best described by the Gamma-distribution with the exponent α ≈ 0.33. Different features of the shape of such histograms allow us to quantify the ratio between the genome-wide average deletion/duplication rates and the amino-acid substitution rate.
We separately measure the short-term ("raw") duplication and deletion rates rdup∗, rdel∗ which include gene copies that will be removed soon after the duplication event and their dramatically reduced long-term counterparts rdup, rdel. High deletion rate among recently duplicated proteins is consistent with a scenario in which they didn't have enough time to significantly change their functional roles and thus are to a large degree disposable. Systematic trends of each of the four duplication/deletion rates with the total number of genes in the genome were analyzed. All but the deletion rate of recent duplicates rdel∗ were shown to systematically increase with Ngenes. Abnormally flat shapes of sequence identity histograms observed for yeast and human are consistent with lineages leading to these organisms undergoing one or more whole-genome duplications. This interpretation is corroborated by our analysis of the genome of Paramecium tetraurelia where the p-4 profile of the histogram is gradually restored by the successive removal of paralogs generated in its four known whole-genome duplication events.
PMCID: PMC2246104  PMID: 18039386
11.  Upstream plasticity and downstream robustness in evolution of molecular networks 
Gene duplication followed by the functional divergence of the resulting pair of paralogous proteins is a major force shaping molecular networks in living organisms. Recent species-wide data for protein-protein interactions and transcriptional regulations allow us to assess the effect of gene duplication on robustness and plasticity of these molecular networks.
We demonstrate that the transcriptional regulation of duplicated genes in baker's yeast Saccharomyces cerevisiae diverges fast so that on average they lose 3% of common transcription factors for every 1% divergence of their amino acid sequences. The set of protein-protein interaction partners of their protein products changes at a slower rate exhibiting a broad plateau for amino acid sequence similarity above 70%. The stability of functional roles of duplicated genes at such relatively low sequence similarity is further corroborated by their ability to substitute for each other in single gene knockout experiments in yeast and RNAi experiments in a nematode worm Caenorhabditis elegans. We also quantified the divergence rate of physical interaction neighborhoods of paralogous proteins in a bacterium Helicobacter pylori and a fly Drosophila melanogaster. However, in the absence of system-wide data on transcription factors' binding in these organisms we could not compare this rate to that of transcriptional regulation of duplicated genes.
For all molecular networks studied in this work we found that even the most distantly related paralogous proteins with amino acid sequence identities around 20% on average have more similar positions within a network than a randomly selected pair of proteins. For yeast we also found that the upstream regulation of genes evolves more rapidly than downstream functions of their protein products. This is in accordance with a view which puts regulatory changes as one of the main driving forces of the evolution. In this context a very important open question is to what extent our results obtained for homologous genes within a single species (paralogs) carries over to homologous proteins in different species (orthologs).
PMCID: PMC385226  PMID: 15070432

Results 1-11 (11)