1.  Loregic: A Method to Characterize the Cooperative Logic of Regulatory Factors 
PLoS Computational Biology  2015;11(4):e1004132.
The topology of the gene-regulatory network has been extensively analyzed. Now, given the large amount of available functional genomic data, it is possible to go beyond this and systematically study regulatory circuits in terms of logic elements. To this end, we present Loregic, a computational method integrating gene expression and regulatory network data, to characterize the cooperativity of regulatory factors. Loregic uses all 16 possible two-input-one-output logic gates (e.g. AND or XOR) to describe triplets of two factors regulating a common target. We attempt to find the gate that best matches each triplet’s observed gene expression pattern across many conditions. We make Loregic available as a general-purpose tool. We validate it with known yeast transcription-factor knockout experiments. Next, using human ENCODE ChIP-Seq and TCGA RNA-Seq data, we are able to demonstrate how Loregic characterizes complex circuits involving both proximally and distally regulating transcription factors (TFs) and also miRNAs. Furthermore, we show that MYC, a well-known oncogenic driving TF, can be modeled as acting independently from other TFs (e.g., using OR gates) but antagonistically with repressing miRNAs. Finally, we inter-relate Loregic’s gate logic with other aspects of regulation, such as indirect binding via protein-protein interactions, feed-forward loop motifs and global regulatory hierarchy.
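The gate-matching idea described in this abstract can be illustrated with a short sketch. It is not Loregic's implementation: it assumes expression values have already been binarized to 0/1 per condition, and the function names (`gate_table`, `best_gate`) and the scoring by simple agreement fraction are hypothetical choices made for illustration.

```python
from itertools import product

# The four possible input pairs for two regulatory factors, in a fixed order.
INPUTS = list(product((0, 1), repeat=2))  # (0,0), (0,1), (1,0), (1,1)

def gate_table(gate_id):
    """Return the truth table of gate `gate_id` (0..15) as a dict.

    Bit i of gate_id is the gate's output for the i-th pair in INPUTS,
    so the 16 ids enumerate every two-input-one-output gate.
    """
    return {pair: (gate_id >> i) & 1 for i, pair in enumerate(INPUTS)}

def best_gate(rf1, rf2, target):
    """Score all 16 gates against binarized samples.

    rf1, rf2, target are equal-length sequences of 0/1 values, one per
    condition. Returns (gate_id, agreement), where agreement is the
    fraction of conditions the gate predicts correctly. Ties are broken
    in favor of the lowest gate id (an arbitrary choice).
    """
    samples = list(zip(rf1, rf2, target))
    scores = []
    for gate_id in range(16):
        table = gate_table(gate_id)
        hits = sum(table[(a, b)] == t for a, b, t in samples)
        scores.append((hits / len(samples), gate_id))
    agreement, gate_id = max(scores, key=lambda s: (s[0], -s[1]))
    return gate_id, agreement
```

With this encoding, gate 8 is AND, gate 14 is OR and gate 6 is XOR. A real analysis would also need a principled binarization step and a threshold for calling a triplet gate-consistent, neither of which is shown here.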
Author Summary
Gene expression is controlled by various gene regulatory factors. These factors work cooperatively, forming a complex genome-wide regulatory circuit. Disruption of regulatory cooperativity may lead to abnormal gene expression, as seen in cancer. Traditional experimental methods, however, can only identify small-scale regulatory activity. Thus, to systematically understand the cooperativity between and among different types of regulatory factors, we need efficient and systematic computational methods. Regulatory circuits have been suggested to behave analogously to electronic circuits, in which a wide variety of electronic elements work in coordination to function correctly. Recently, an increasing amount of next generation sequencing data has provided a great resource to study regulatory activity. Thus, we developed a general-purpose computational method using logic-circuit models from electronics and applied it to a human leukemia dataset, identifying the genome-wide cooperativity of transcription factors and microRNAs.
PMCID: PMC4401777  PMID: 25884877
2.  Comparative Analysis of the Transcriptome across Distant Species 
Gerstein, Mark B. | Rozowsky, Joel | Yan, Koon-Kiu | Wang, Daifeng | Cheng, Chao | Brown, James B. | Davis, Carrie A | Hillier, LaDeana | Sisu, Cristina | Li, Jingyi Jessica | Pei, Baikang | Harmanci, Arif O. | Duff, Michael O. | Djebali, Sarah | Alexander, Roger P. | Alver, Burak H. | Auerbach, Raymond | Bell, Kimberly | Bickel, Peter J. | Boeck, Max E. | Boley, Nathan P. | Booth, Benjamin W. | Cherbas, Lucy | Cherbas, Peter | Di, Chao | Dobin, Alex | Drenkow, Jorg | Ewing, Brent | Fang, Gang | Fastuca, Megan | Feingold, Elise A. | Frankish, Adam | Gao, Guanjun | Good, Peter J. | Guigó, Roderic | Hammonds, Ann | Harrow, Jen | Hoskins, Roger A. | Howald, Cédric | Hu, Long | Huang, Haiyan | Hubbard, Tim J. P. | Huynh, Chau | Jha, Sonali | Kasper, Dionna | Kato, Masaomi | Kaufman, Thomas C. | Kitchen, Robert R. | Ladewig, Erik | Lagarde, Julien | Lai, Eric | Leng, Jing | Lu, Zhi | MacCoss, Michael | May, Gemma | McWhirter, Rebecca | Merrihew, Gennifer | Miller, David M. | Mortazavi, Ali | Murad, Rabi | Oliver, Brian | Olson, Sara | Park, Peter J. | Pazin, Michael J. | Perrimon, Norbert | Pervouchine, Dmitri | Reinke, Valerie | Reymond, Alexandre | Robinson, Garrett | Samsonova, Anastasia | Saunders, Gary I. | Schlesinger, Felix | Sethi, Anurag | Slack, Frank J. | Spencer, William C. | Stoiber, Marcus H. | Strasbourger, Pnina | Tanzer, Andrea | Thompson, Owen A. | Wan, Kenneth H. | Wang, Guilin | Wang, Huaien | Watkins, Kathie L. | Wen, Jiayu | Wen, Kejia | Xue, Chenghai | Yang, Li | Yip, Kevin | Zaleski, Chris | Zhang, Yan | Zheng, Henry | Brenner, Steven E. | Graveley, Brenton R. | Celniker, Susan E. | Gingeras, Thomas R | Waterston, Robert
Nature  2014;512(7515):445-448.
PMCID: PMC4155737  PMID: 25164755
3.  VarSim: a high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications 
Bioinformatics  2014;31(9):1469-1471.
Summary: VarSim is a framework for assessing alignment and variant calling accuracy in high-throughput genome sequencing through simulation or real data. In contrast to simulating a random mutation spectrum, it synthesizes diploid genomes with germline and somatic mutations based on a realistic model. This model leverages information such as previously reported mutations to make the synthetic genomes biologically relevant. VarSim simulates and validates a wide range of variants, including single nucleotide variants, small indels and large structural variants. It is an automated, comprehensive compute framework supporting parallel computation and multiple read simulators. Furthermore, we developed a novel map data structure to validate read alignments, a strategy to compare variants binned in size ranges and a lightweight, interactive, graphical report to visualize validation results with detailed statistics. Thus far, it is the most comprehensive validation tool for secondary analysis in next generation sequencing.
Availability and implementation: Code in Java and Python along with instructions to download the reads and variants is at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4410653  PMID: 25524895
4.  Understanding Modularity in Molecular Networks Requires Dynamics 
Science signaling  2009;2(81):pe44.
The era of genome sequencing has produced long lists of the molecular parts from which cellular machines are constructed. A fundamental goal in systems biology is to understand how cellular behavior emerges from the interaction in time and space of genetically encoded molecular parts, as well as non-genetically encoded small molecules. Networks provide a natural framework for the organization and quantitative representation of all the available data about molecular interactions. The structural and dynamic properties of molecular networks have been the subject of intense research. Despite major advances, bridging network structure to dynamics – and therefore to behavior – remains challenging. A key concept of modern engineering that recurs in the functional analysis of biological networks is modularity. Most approaches to molecular network analysis rely to some extent on the assumption that molecular networks are modular – that is, they are separable and can be studied to some degree in isolation. We describe recent advances in the analysis of modularity in biological networks, focusing on the increasing realization that a dynamic perspective is essential to grouping molecules into modules and determining their collective function.
PMCID: PMC4243459  PMID: 19638611
5.  Transcriptional Landscape of the Prenatal Human Brain 
Miller, Jeremy A. | Ding, Song-Lin | Sunkin, Susan M. | Smith, Kimberly A | Ng, Lydia | Szafer, Aaron | Ebbert, Amanda | Riley, Zackery L. | Aiona, Kaylynn | Arnold, James M. | Bennet, Crissa | Bertagnolli, Darren | Brouner, Krissy | Butler, Stephanie | Caldejon, Shiella | Carey, Anita | Cuhaciyan, Christine | Dalley, Rachel A. | Dee, Nick | Dolbeare, Tim A. | Facer, Benjamin A. C. | Feng, David | Fliss, Tim P. | Gee, Garrett | Goldy, Jeff | Gourley, Lindsey | Gregor, Benjamin W. | Gu, Guangyu | Howard, Robert E. | Jochim, Jayson M. | Kuan, Chihchau L. | Lau, Christopher | Lee, Chang-Kyu | Lee, Felix | Lemon, Tracy A. | Lesnar, Phil | McMurray, Bergen | Mastan, Naveed | Mosqueda, Nerick F. | Naluai-Cecchini, Theresa | Ngo, Nhan-Kiet | Nyhus, Julie | Oldre, Aaron | Olson, Eric | Parente, Jody | Parker, Patrick D. | Parry, Sheana E. | Player, Allison Stevens | Pletikos, Mihovil | Reding, Melissa | Royall, Joshua J. | Roll, Kate | Sandman, David | Sarreal, Melaine | Shapouri, Sheila | Shapovalova, Nadiya V. | Shen, Elaine H. | Sjoquist, Nathan | Slaughterbeck, Clifford R. | Smith, Michael | Sodt, Andy J. | Williams, Derric | Zöllei, Lilla | Fischl, Bruce | Gerstein, Mark B. | Geschwind, Daniel H. | Glass, Ian A. | Hawrylycz, Michael J. | Hevner, Robert F. | Huang, Hao | Jones, Allan R. | Knowles, James A. | Levitt, Pat | Phillips, John W. | Sestan, Nenad | Wohnoutka, Paul | Dang, Chinh | Bernard, Amy | Hohmann, John G. | Lein, Ed S.
Nature  2014;508(7495):199-206.
The anatomical and functional architecture of the human brain is largely determined by prenatal transcriptional processes. We describe an anatomically comprehensive atlas of mid-gestational human brain, including de novo reference atlases, in situ hybridization, ultra-high resolution magnetic resonance imaging (MRI) and microarray analysis on highly discrete laser microdissected brain regions. In developing cerebral cortex, transcriptional differences are found between different proliferative and postmitotic layers, wherein laminar signatures reflect cellular composition and developmental processes. Cytoarchitectural differences between human and mouse have molecular correlates, including species differences in gene expression in subplate, although surprisingly we find minimal differences between the inner and human-expanded outer subventricular zones. Both germinal and postmitotic cortical layers exhibit fronto-temporal gradients, with particular enrichment in frontal lobe. Finally, many neurodevelopmental disorder and human evolution-related genes show patterned expression, potentially underlying unique features of human cortical formation. These data provide a rich, freely-accessible resource for understanding human brain development.
PMCID: PMC4105188  PMID: 24695229
Human brain; Transcriptome; Microarray; Development; Gene expression; Evolution
6.  Architecture of the human regulatory network derived from ENCODE data 
Nature  2012;489(7414):91-100.
Transcription factors (TFs) bind in a combinatorial fashion to specify the on-and-off states of genes; the ensemble of these binding events forms a regulatory network, constituting the wiring diagram for a cell. To examine the principles of the human transcriptional regulatory network, we determined the genomic binding information of 119 TFs in 458 ChIP-Seq experiments. We found the combinatorial, co-association of TFs to be highly context specific: distinct combinations of factors bind at specific genomic locations. In particular, there are significant differences in the binding proximal and distal to genes. We organized all the TF binding into a hierarchy and integrated it with other genomic information (e.g. miRNA regulation), forming a dense meta-network. Factors at different levels have different properties: for instance, top-level TFs more strongly influence expression and middle-level ones co-regulate targets to mitigate information-flow bottlenecks. Moreover, these co-regulations give rise to many enriched network motifs -- e.g. noise-buffering feed-forward loops. Finally, more connected network components are under stronger selection and exhibit a greater degree of allele-specific activity (i.e., differential binding to the two parental alleles). The regulatory information obtained in this study will be crucial for interpreting personal genome sequences and understanding basic principles of human biology and disease.
PMCID: PMC4154057  PMID: 22955619
7.  Performance comparison of whole-genome sequencing platforms 
Nature biotechnology  2011;30(1):78-82.
Whole-genome sequencing is becoming commonplace, but the accuracy and completeness of variant calling by the most widely used platforms from Illumina and Complete Genomics have not been reported. Here we sequenced the genome of an individual with both technologies to a high average coverage of ~76×, and compared their performance with respect to sequence coverage and calling of single-nucleotide variants (SNVs), insertions and deletions (indels). Although 88.1% of the ~3.7 million unique SNVs were concordant between platforms, there were tens of thousands of platform-specific calls located in genes and other genomic regions. In contrast, 26.5% of indels were concordant between platforms. Target enrichment validated 92.7% of the concordant SNVs, whereas validation by genotyping array revealed a sensitivity of 99.3%. The validation experiments also suggested that >60% of the platform-specific variants were indeed present in the genome. Our results have important implications for understanding the accuracy and completeness of the genome sequencing platforms.
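The concordance figures quoted in this abstract compare two sets of variant calls. A minimal sketch of that kind of comparison follows. It assumes each variant is represented as a `(chrom, pos, ref, alt)` tuple and defines concordance as shared calls divided by the union of calls; both the representation and the function names are assumptions for illustration, and the study's exact definition may differ.

```python
def concordance(calls_a, calls_b):
    """Fraction of unique variants (union of both call sets) found in both."""
    a, b = set(calls_a), set(calls_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

def platform_specific(calls_a, calls_b):
    """Return (calls only in A, calls only in B) as two sets."""
    a, b = set(calls_a), set(calls_b)
    return a - b, b - a
```

Exact tuple matching is adequate for SNVs, but indels often have several equivalent representations, so a real comparison would normalize them first.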
PMCID: PMC4076012  PMID: 22178993
8.  Sixty years of genome biology 
Genome Biology  2013;14(4):113.
Sixty years after Watson and Crick published the double helix model of DNA's structure, thirteen members of Genome Biology's Editorial Board select key advances in the field of genome biology subsequent to that discovery.
PMCID: PMC3663092  PMID: 23651518
9.  Epigenetic repression of miR-31 disrupts androgen receptor homeostasis and contributes to prostate cancer progression 
Cancer research  2012;73(3):1232-1244.
Androgen receptor (AR) signaling plays a critical role in prostate cancer (PCA) pathogenesis. Yet, the regulation of AR signaling remains elusive. Even with stringent androgen deprivation therapy, AR signaling persists. Here, our data suggest that there is a complex interaction between the expression of the tumor suppressor miRNA, miR-31, and AR signaling. We examined primary and metastatic PCA and found that miR-31 expression was reduced as a result of promoter hypermethylation and, importantly, the levels of miR-31 expression were inversely correlated with the aggressiveness of the disease. As the expression of AR and miR-31 was inversely correlated in the cell lines, our study further suggested that miR-31 and AR could mutually repress each other. Upregulation of miR-31 effectively suppressed AR expression through multiple mechanisms and inhibited PCA growth in vivo. Notably, we found that miR-31 targeted AR directly at a site located in the coding region, which was commonly mutated in PCA. Additionally, miR-31 suppressed cell cycle regulators, including E2F1, E2F2, EXO1, FOXM1, and MCM2. Together, our findings suggest a novel AR regulatory mechanism mediated through miR-31 expression. The downregulation of miR-31 may disrupt cellular homeostasis and contribute to the evolution and progression of PCA. We provide implications for epigenetic treatment and support clinical development of detecting miR-31 promoter methylation as a novel biomarker.
PMCID: PMC3563734  PMID: 23233736
prostate cancer; androgen receptor; miR-31; DNA hypermethylation; biomarker
10.  Molecular Characterization of Neuroendocrine Prostate Cancer and Identification of New Drug Targets 
Cancer discovery  2011;1(6):487-495.
Neuroendocrine prostate cancer (NEPC) is an aggressive subtype of prostate cancer that most commonly evolves from preexisting prostate adenocarcinoma (PCA). Using Next Generation RNA-sequencing and oligonucleotide arrays, we profiled 7 NEPC, 30 PCA, and 5 benign prostate tissue (BEN), and validated findings on tumors from a large cohort of patients (37 NEPC, 169 PCA, 22 BEN) using IHC and FISH. We discovered significant overexpression and gene amplification of AURKA and MYCN in 40% of NEPC and 5% of PCA, respectively, and evidence that they cooperate to induce a neuroendocrine phenotype in prostate cells. There was dramatic and enhanced sensitivity of NEPC (and MYCN overexpressing PCA) to Aurora kinase inhibitor therapy both in vitro and in vivo, with complete suppression of neuroendocrine marker expression following treatment. We propose that alterations in Aurora kinase A and N-myc are involved in the development of NEPC, and future clinical trials will help determine the efficacy of Aurora kinase inhibitor therapy.
PMCID: PMC3290518  PMID: 22389870
neuroendocrine prostate cancer; aurora kinase A; n-myc; drug targets
11.  The GENCODE pseudogene resource 
Genome Biology  2012;13(9):R51.
Pseudogenes have long been considered as nonfunctional genomic sequences. However, recent evidence suggests that many of them might have some form of biological activity, and the possibility of functionality has increased interest in their accurate annotation and integration with functional genomics data.
As part of the GENCODE annotation of the human genome, we present the first genome-wide pseudogene assignment for protein-coding genes, based on both large-scale manual annotation and in silico pipelines. A key aspect of this coupled approach is that it allows us to identify pseudogenes in an unbiased fashion as well as untangle complex events through manual evaluation. We integrate the pseudogene annotations with the extensive ENCODE functional genomics information. In particular, we determine the expression level, transcription-factor and RNA polymerase II binding, and chromatin marks associated with each pseudogene. Based on their distribution, we develop simple statistical models for each type of activity, which we validate with large-scale RT-PCR-Seq experiments. Finally, we compare our pseudogenes with conservation and variation data from primate alignments and the 1000 Genomes project, producing lists of pseudogenes potentially under selection.
At one extreme, some pseudogenes possess conventional characteristics of functionality; these may represent genes that have recently died. On the other hand, we find interesting patterns of partial activity, which may suggest that dead genes are being resurrected as functioning non-coding RNAs. The activity data of each pseudogene are stored in an associated resource, psiDR, which will be useful for the initial identification of potentially functional pseudogenes.
PMCID: PMC3491395  PMID: 22951037
12.  The real cost of sequencing: higher than you think! 
Genome Biology  2011;12(8):125.
Advances in sequencing technology have led to a sharp decrease in the cost of 'data generation'. But is this sufficient to ensure cost-effective and efficient 'knowledge generation'?
PMCID: PMC3245608  PMID: 21867570
Bioinformatics; costs of sequencing; data analysis; experimental design; next-generation sequencing; sample collection
13.  A systematic survey of loss-of-function variants in human protein-coding genes 
Science (New York, N.Y.)  2012;335(6070):823-828.
Genome sequencing studies indicate that all humans carry many genetic variants predicted to cause loss of function (LoF) of protein-coding genes, suggesting unexpected redundancy in the human genome. Here we apply stringent filters to 2,951 putative LoF variants obtained from 185 human genomes to determine their true prevalence and properties. We estimate that human genomes typically contain ~100 genuine LoF variants with ~20 genes completely inactivated. We identify rare and likely deleterious LoF alleles, including 26 known and 21 predicted severe disease-causing variants, as well as common LoF variants in non-essential genes. We describe functional and evolutionary differences between LoF-tolerant and recessive disease genes, and a method for using these differences to prioritize candidate genes found in clinical sequencing studies.
PMCID: PMC3299548  PMID: 22344438
14.  Genomic Analysis of the Hydrocarbon-Producing, Cellulolytic, Endophytic Fungus Ascocoryne sarcoides 
PLoS Genetics  2012;8(3):e1002558.
The microbial conversion of solid cellulosic biomass to liquid biofuels may provide a renewable energy source for transportation fuels. Endophytes represent a promising group of organisms, as they are a mostly untapped reservoir of metabolic diversity. They are often able to degrade cellulose, and they can produce an extraordinary diversity of metabolites. The filamentous fungal endophyte Ascocoryne sarcoides was shown to produce potential-biofuel metabolites when grown on a cellulose-based medium; however, the genetic pathways needed for this production are unknown and the lack of genetic tools makes traditional reverse genetics difficult. We present the genomic characterization of A. sarcoides and use transcriptomic and metabolomic data to describe the genes involved in cellulose degradation and to provide hypotheses for the biofuel production pathways. In total, almost 80 biosynthetic clusters were identified, including several previously found only in plants. Additionally, many transcriptionally active regions outside of genes showed condition-specific expression, offering more evidence for the role of long non-coding RNA in gene regulation. This is one of the highest quality fungal genomes and, to our knowledge, the only thoroughly annotated and transcriptionally profiled fungal endophyte genome currently available. The analyses and datasets contribute to the study of cellulose degradation and biofuel production and provide the genomic foundation for the study of a model endophyte system.
Author Summary
A renewable source of energy is a pressing global need. The biological conversion of lignocellulose to biofuels by microorganisms presents a promising avenue, but few organisms have been studied thoroughly enough to develop the genetic tools necessary for rigorous experimentation. The filamentous-fungal endophyte A. sarcoides produces metabolites when grown on a cellulose-based medium that include eight-carbon volatile organic compounds, which are potential biofuel targets. Here we use broadly applicable methods including genomics, transcriptomics, and metabolomics to explore the biofuel production of A. sarcoides. These data were used to assemble the genome into 16 scaffolds, to thoroughly annotate the cellulose-degradation machinery, and to make predictions for the production pathway for the eight-carbon volatiles. Extremely high expression of the gene swollenin when grown on cellulose highlights the importance of accessory proteins in addition to the enzymes that catalyze the breakdown of the polymers. Correlation of the production of the eight-carbon biofuel-like metabolites with the expression of lipoxygenase pathway genes suggests the catabolism of linoleic acid as the mechanism of eight-carbon compound production. This is the first fungal genome to be sequenced in the family Helotiaceae, and A. sarcoides was isolated as an endophyte, making this work also potentially useful in fungal systematics and the study of plant–fungus relationships.
PMCID: PMC3291568  PMID: 22396667
15.  Genome-wide analysis of chromatin features identifies histone modification sensitive and insensitive yeast transcription factors 
Genome Biology  2011;12(11):R111.
We propose a method to predict yeast transcription factor targets by integrating histone modification profiles with transcription factor binding motif information. It shows improved predictive power compared to a binding motif-only method. We find that transcription factors cluster into histone-sensitive and -insensitive classes. The target genes of histone-sensitive transcription factors have stronger histone modification signals than those of histone-insensitive ones. The two classes also differ in tendency to interact with histone modifiers, degree of connectivity in protein-protein interaction networks, position in the transcriptional regulation hierarchy, and in a number of additional features, indicating possible differences in their transcriptional regulation mechanisms.
PMCID: PMC3334597  PMID: 22060676
16.  Predicting protein ligand binding motions with the conformation explorer 
BMC Bioinformatics  2011;12:417.
Knowledge of the structure of proteins bound to known or potential ligands is crucial for biological understanding and drug design. Often the 3D structure of the protein is available in some conformation, but binding the ligand of interest may involve a large scale conformational change which is difficult to predict with existing methods.
We describe how to generate ligand binding conformations of proteins that move by hinge bending, the largest class of motions. First, we predict the location of the hinge between domains. Second, we apply an Euler rotation to one of the domains about the hinge point. Third, we compute a short-time dynamical trajectory using Molecular Dynamics to equilibrate the protein and ligand and correct unnatural atomic positions. Fourth, we score the generated structures using a novel fitness function which favors closed or holo structures. By iterating the second through fourth steps we systematically minimize the fitness function, thus predicting the conformational change required for small ligand binding for five well studied proteins.
We demonstrate that the method in most cases successfully predicts the holo conformation given only an apo structure.
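The second step of the procedure above, rotating one domain about the hinge point, can be sketched as follows. This is a simplified illustration rather than the authors' code: it applies a single axis-angle rotation (Rodrigues' formula) instead of a full Euler-angle parameterization, and the function name and argument layout are hypothetical. Hinge prediction, the molecular dynamics equilibration and the fitness scoring are not shown.

```python
import math

def rotate_about_hinge(points, hinge, axis, angle):
    """Rotate 3D points by `angle` radians about `axis` passing through `hinge`.

    points: iterable of (x, y, z) atom coordinates of the moving domain.
    hinge:  (x, y, z) location of the hinge point.
    axis:   direction of the rotation axis (normalized internally).
    """
    norm = math.sqrt(sum(c * c for c in axis))
    if norm == 0:
        raise ValueError("axis must be non-zero")
    kx, ky, kz = (c / norm for c in axis)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    rotated = []
    for p in points:
        # Shift coordinates so the hinge sits at the origin.
        vx, vy, vz = p[0] - hinge[0], p[1] - hinge[1], p[2] - hinge[2]
        # Rodrigues: v*cos + (k x v)*sin + k*(k . v)*(1 - cos)
        dot = kx * vx + ky * vy + kz * vz
        cx = ky * vz - kz * vy
        cy = kz * vx - kx * vz
        cz = kx * vy - ky * vx
        rx = vx * cos_a + cx * sin_a + kx * dot * (1 - cos_a)
        ry = vy * cos_a + cy * sin_a + ky * dot * (1 - cos_a)
        rz = vz * cos_a + cz * sin_a + kz * dot * (1 - cos_a)
        # Shift back to the original frame.
        rotated.append((rx + hinge[0], ry + hinge[1], rz + hinge[2]))
    return rotated
```

Because the hinge is translated to the origin before rotating, atoms lying on the rotation axis stay fixed while the rest of the domain swings around it, which is the behavior a hinge-bending move needs.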
PMCID: PMC3354956  PMID: 22032721
17.  The genomic complexity of primary human prostate cancer 
Nature  2011;470(7333):214-220.
Prostate cancer is the second most common cause of male cancer deaths in the United States. Here we present the complete sequence of seven primary prostate cancers and their paired normal counterparts. Several tumors contained complex chains of balanced rearrangements that occurred within or adjacent to known cancer genes. Rearrangement breakpoints were enriched near open chromatin, androgen receptor and ERG DNA binding sites in the setting of the ETS gene fusion TMPRSS2-ERG, but inversely correlated with these regions in tumors lacking ETS fusions. This observation suggests a link between chromatin or transcriptional regulation and the genesis of genomic aberrations. Three tumors contained rearrangements that disrupted CADM2, and four harbored events disrupting either PTEN (unbalanced events), a prostate tumor suppressor, or MAGI2 (balanced events), a PTEN interacting protein not previously implicated in prostate tumorigenesis. Thus, genomic rearrangements may arise from transcriptional or chromatin aberrancies to engage prostate tumorigenic mechanisms.
PMCID: PMC3075885  PMID: 21307934
18.  Mapping copy number variation by population scale genome sequencing 
Nature  2011;470(7332):59-65.
Genomic structural variants (SVs) are abundant in humans, differing from other variation classes in extent, origin, and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs (i.e., copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analyzing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differences in the size spectra of SVs originating from distinct formation mechanisms, and constructed a map of SV hotspots formed by common mechanisms. Our analytical framework and SV map serve as a resource for sequencing-based association studies.
PMCID: PMC3077050  PMID: 21293372
19.  Integrative Analysis of the Caenorhabditis elegans Genome by the modENCODE Project 
Gerstein, Mark B. | Lu, Zhi John | Van Nostrand, Eric L. | Cheng, Chao | Arshinoff, Bradley I. | Liu, Tao | Yip, Kevin Y. | Robilotto, Rebecca | Rechtsteiner, Andreas | Ikegami, Kohta | Alves, Pedro | Chateigner, Aurelien | Perry, Marc | Morris, Mitzi | Auerbach, Raymond K. | Feng, Xin | Leng, Jing | Vielle, Anne | Niu, Wei | Rhrissorrakrai, Kahn | Agarwal, Ashish | Alexander, Roger P. | Barber, Galt | Brdlik, Cathleen M. | Brennan, Jennifer | Brouillet, Jeremy Jean | Carr, Adrian | Cheung, Ming-Sin | Clawson, Hiram | Contrino, Sergio | Dannenberg, Luke O. | Dernburg, Abby F. | Desai, Arshad | Dick, Lindsay | Dosé, Andréa C. | Du, Jiang | Egelhofer, Thea | Ercan, Sevinc | Euskirchen, Ghia | Ewing, Brent | Feingold, Elise A. | Gassmann, Reto | Good, Peter J. | Green, Phil | Gullier, Francois | Gutwein, Michelle | Guyer, Mark S. | Habegger, Lukas | Han, Ting | Henikoff, Jorja G. | Henz, Stefan R. | Hinrichs, Angie | Holster, Heather | Hyman, Tony | Iniguez, A. Leo | Janette, Judith | Jensen, Morten | Kato, Masaomi | Kent, W. James | Kephart, Ellen | Khivansara, Vishal | Khurana, Ekta | Kim, John K. | Kolasinska-Zwierz, Paulina | Lai, Eric C. | Latorre, Isabel | Leahey, Amber | Lewis, Suzanna | Lloyd, Paul | Lochovsky, Lucas | Lowdon, Rebecca F. | Lubling, Yaniv | Lyne, Rachel | MacCoss, Michael | Mackowiak, Sebastian D. | Mangone, Marco | McKay, Sheldon | Mecenas, Desirea | Merrihew, Gennifer | Miller, David M. | Muroyama, Andrew | Murray, John I. | Ooi, Siew-Loon | Pham, Hoang | Phippen, Taryn | Preston, Elicia A. | Rajewsky, Nikolaus | Rätsch, Gunnar | Rosenbaum, Heidi | Rozowsky, Joel | Rutherford, Kim | Ruzanov, Peter | Sarov, Mihail | Sasidharan, Rajkumar | Sboner, Andrea | Scheid, Paul | Segal, Eran | Shin, Hyunjin | Shou, Chong | Slack, Frank J. | Slightam, Cindie | Smith, Richard | Spencer, William C. | Stinson, E. O. | Taing, Scott | Takasaki, Teruaki | Vafeados, Dionne | Voronina, Ksenia | Wang, Guilin | Washington, Nicole L. | Whittle, Christina M. | Wu, Beijing | Yan, Koon-Kiu | Zeller, Georg | Zha, Zheng | Zhong, Mei | Zhou, Xingliang | Ahringer, Julie | Strome, Susan | Gunsalus, Kristin C. | Micklem, Gos | Liu, X. Shirley | Reinke, Valerie | Kim, Stuart K. | Hillier, LaDeana W. | Henikoff, Steven | Piano, Fabio | Snyder, Michael | Stein, Lincoln | Lieb, Jason D. | Waterston, Robert H.
Science (New York, N.Y.)  2010;330(6012):1775-1787.
We systematically generated large-scale data sets to improve genome annotation for the nematode Caenorhabditis elegans, a key model organism. These data sets include transcriptome profiling across a developmental time course, genome-wide identification of transcription factor–binding sites, and maps of chromatin organization. From this, we created more complete and accurate gene models, including alternative splice forms and candidate noncoding RNAs. We constructed hierarchical networks of transcription factor–binding and microRNA interactions and discovered chromosomal locations bound by an unusually large number of transcription factors. Different patterns of chromatin composition and histone modification were revealed between chromosome arms and centers, with similarly prominent differences between autosomes and the X chromosome. Integrating data types, we built statistical models relating chromatin, transcription factor binding, and gene expression. Overall, our analyses ascribed putative functions to most of the conserved genome.
PMCID: PMC3142569  PMID: 21177976
20.  The Reality of Pervasive Transcription 
PLoS Biology  2011;9(7):e1000625.
Despite recent controversies, the evidence that the majority of the human genome is transcribed into RNA remains strong.
PMCID: PMC3134446  PMID: 21765801
21.  Analysis of genomic variation in non-coding elements using population-scale sequencing data from the 1000 Genomes Project 
Nucleic Acids Research  2011;39(16):7058-7076.
In the human genome, it has been estimated that considerably more sequence is under natural selection in non-coding regions [such as transcription-factor binding sites (TF-binding sites) and non-coding RNAs (ncRNAs)] compared to protein-coding ones. However, less attention has been paid to them. To study selective pressure on non-coding elements, we use next-generation sequencing data from the recently completed pilot phase of the 1000 Genomes Project, which, compared to traditional methods, allows for the characterization of a full spectrum of genomic variations, including single-nucleotide polymorphisms (SNPs), short insertions and deletions (indels) and structural variations (SVs). We develop a framework for combining these variation data with non-coding elements, calculating various population-based metrics to compare classes and subclasses of elements, and developing element-aware aggregation procedures to probe the internal structure of an element. Overall, we find that TF-binding sites and ncRNAs are less selectively constrained for SNPs than coding sequences (CDSs), but more constrained than a neutral reference. We also determine that the relative amounts of constraint for the three types of variations are, in general, correlated, but there are some differences: counter-intuitively, TF-binding sites and ncRNAs are more selectively constrained for indels than for SNPs, compared to CDSs. After inspecting the overall properties of a class of elements, we analyze selective pressure on subclasses within an element class, and show that the extent of selection is associated with the genomic properties of each subclass. We find, for instance, that ncRNAs with higher expression levels tend to be under stronger purifying selection, and the actual regions of TF-binding motifs are under stronger selective pressure than the corresponding peak regions. 
Further, we develop element-aware aggregation plots to analyze selective pressure across the linear structure of an element, with the confidence intervals evaluated using both simple bootstrapping and block bootstrapping techniques. We find, for example, that both micro-RNAs (particularly the seed regions) and their binding targets are under stronger selective pressure for SNPs than their immediate genomic surroundings. In addition, we demonstrate that substitutions in TF-binding motifs inversely correlate with site conservation, and SNPs unfavorable for motifs are under more selective constraints than favorable SNPs. Finally, to further investigate intra-element differences, we show that SVs have the tendency to use distinctive modes and mechanisms when they interact with genomic elements, such as enveloping whole gene(s) rather than disrupting them partially, as well as duplicating TF motifs in tandem.
PMCID: PMC3167619  PMID: 21596777
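As a rough illustration of the two resampling schemes named in the abstract above, the following Python sketch computes percentile confidence intervals for the mean of a per-position signal, once with a simple bootstrap and once with a moving-block bootstrap. It is not the authors' code: the function names, the block size, the number of resamples and the toy values are all assumptions made for illustration.

```python
import random


def _percentile_interval(stats, alpha):
    """Return the (alpha/2, 1 - alpha/2) percentile interval of a list of statistics."""
    stats = sorted(stats)
    n = len(stats)
    lo_idx = min(max(int((alpha / 2) * n), 0), n - 1)
    hi_idx = min(max(int((1 - alpha / 2) * n) - 1, 0), n - 1)
    return stats[lo_idx], stats[hi_idx]


def simple_bootstrap_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Simple bootstrap: resample individual positions with replacement."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_resamples):
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    return _percentile_interval(means, alpha)


def block_bootstrap_ci(values, block_size, n_resamples=1000, alpha=0.05, seed=0):
    """Moving-block bootstrap: resample contiguous runs of positions."""
    n = len(values)
    if not 1 <= block_size <= n:
        raise ValueError("block_size must be between 1 and len(values)")
    rng = random.Random(seed)
    n_starts = n - block_size + 1      # number of admissible block start positions
    n_blocks = -(-n // block_size)     # ceiling division: blocks needed to cover n
    means = []
    for _ in range(n_resamples):
        sample = []
        for _ in range(n_blocks):
            start = rng.randrange(n_starts)
            sample.extend(values[start:start + block_size])
        sample = sample[:n]            # trim so each resample has n positions
        means.append(sum(sample) / n)
    return _percentile_interval(means, alpha)


# Hypothetical per-position variant densities along an aligned element.
toy_signal = [0.9, 1.0, 1.1, 0.4, 0.3, 0.35, 0.8, 1.0, 0.95, 1.05]
simple_ci = simple_bootstrap_ci(toy_signal)
block_ci = block_bootstrap_ci(toy_signal, block_size=3)
```

Resampling whole blocks keeps neighbouring positions together, so any correlation between adjacent positions is carried into each resample. The simple bootstrap treats positions as independent.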
22.  The CRIT framework for identifying cross patterns in systems biology and application to chemogenomics 
Genome Biology  2011;12(3):R32.
Biological data is often tabular but finding statistically valid connections between entities in a sequence of tables can be problematic - for example, connecting particular entities in a drug property table to gene properties in a second table, using a third table associating genes with drugs. Here we present an approach (CRIT) to find connections such as these and show how it can be applied in a variety of genomic contexts including chemogenomics data.
PMCID: PMC3129682  PMID: 21453526
23.  Diverse Roles and Interactions of the SWI/SNF Chromatin Remodeling Complex Revealed Using Global Approaches 
PLoS Genetics  2011;7(3):e1002008.
A systems understanding of nuclear organization and events is critical for determining how cells divide, differentiate, and respond to stimuli and for identifying the causes of diseases. Chromatin remodeling complexes such as SWI/SNF have been implicated in a wide variety of cellular processes including gene expression, nuclear organization, centromere function, and chromosomal stability, and mutations in SWI/SNF components have been linked to several types of cancer. To better understand the biological processes in which chromatin remodeling proteins participate, we globally mapped binding regions for several components of the SWI/SNF complex throughout the human genome using ChIP-Seq. SWI/SNF components were found to lie near regulatory elements integral to transcription (e.g. 5′ ends, RNA Polymerases II and III, and enhancers) as well as regions critical for chromosome organization (e.g. CTCF, lamins, and DNA replication origins). Interestingly we also find that certain configurations of SWI/SNF subunits are associated with transcripts that have higher levels of expression, whereas other configurations of SWI/SNF factors are associated with transcripts that have lower levels of expression. To further elucidate the association of SWI/SNF subunits with each other as well as with other nuclear proteins, we also analyzed SWI/SNF immunoprecipitated complexes by mass spectrometry. Individual SWI/SNF factors are associated with their own family members, as well as with cellular constituents such as nuclear matrix proteins, key transcription factors, and centromere components, implying a ubiquitous role in gene regulation and nuclear function. We find an overrepresentation of both SWI/SNF-associated regions and proteins in cell cycle and chromosome organization. 
Taken together the results from our ChIP and immunoprecipitation experiments suggest that SWI/SNF facilitates gene regulation and genome function more broadly and through a greater diversity of interactions than previously appreciated.
Author Summary
Genetic information and programming are not entirely contained in DNA sequence but are also governed by chromatin structure. Gaining a greater understanding of chromatin remodeling complexes can bridge gaps between processes in the genome and the epigenome and can offer insights into diseases such as cancer. We identified targets of the chromatin remodeling complex, SWI/SNF, on a genome-wide scale using ChIP-Seq. We also identify proteins that co-purify with its various components via immunoprecipitation combined with mass spectrometry. By integrating these newly-identified regions with a combination of novel and published data sources, we identify pathways and cellular compartments in which SWI/SNF plays a major role as well as discern general characteristics of SWI/SNF target sites. Our parallel evaluations of multiple SWI/SNF factors indicate that these subunits are found in highly dynamic and combinatorial assemblies. Our study presents the first genome-wide and unified view of multiple SWI/SNF components and also provides a valuable resource to the scientific community as an important data source to be integrated with future genomic and epigenomic studies.
PMCID: PMC3048368  PMID: 21408204
24.  Deciphering Protein Kinase Specificity through Large-Scale Analysis of Yeast Phosphorylation Site Motifs 
Science Signaling  2010;3(109):ra12.
Phosphorylation is a universal mechanism for regulating cell behavior in eukaryotes. Although protein kinases are known to target short linear sequence motifs on their substrates, the rules for kinase substrate recognition are not completely understood. We used a rapid peptide screening approach to determine consensus phosphorylation site motifs targeted by 61 of the 122 kinases in Saccharomyces cerevisiae. Correlation of these motifs with kinase primary sequence has uncovered previously unappreciated rules for determining specificity within the kinase family, including a residue determining P−3 Arg specificity among members of the CMGC group of kinases. Furthermore, computational scanning of the yeast proteome enabled the prediction of thousands of new kinase-substrate relationships. We experimentally verified several candidate substrates of the Prk1 family of kinases in vitro and in vivo, and we identified a protein substrate of the kinase Vhs1. Together, these results elucidate how kinase catalytic domains recognize their phosphorylation targets and suggest general avenues for the identification of new kinase substrates across eukaryotes.
PMCID: PMC2846625  PMID: 20159853
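To make the idea of scanning a proteome for a consensus phosphorylation motif concrete, here is a minimal Python sketch that searches one protein sequence for serine or threonine residues with an arginine three positions upstream (the P−3 position). It is a deliberately simplified stand-in: the study derived its motifs from peptide screening, whereas the regular expression, the function name and the toy sequence below are assumptions chosen only for illustration.

```python
import re

# Hypothetical simplified motif: Arg at P-3, any residue at P-2 and P-1,
# and Ser or Thr as the phospho-acceptor at P0.
# The lookahead makes overlapping candidate sites visible to finditer.
P_MINUS_3_ARG = re.compile(r"(?=(R..[ST]))")


def find_candidate_sites(sequence):
    """Return (1-based acceptor position, acceptor residue) for each motif hit."""
    hits = []
    for match in P_MINUS_3_ARG.finditer(sequence.upper()):
        acceptor_index = match.start() + 3   # 0-based index of the S/T residue
        hits.append((acceptor_index + 1, sequence.upper()[acceptor_index]))
    return hits


# Toy sequence, not a real yeast protein.
toy_protein = "MRKASGRRLTP"
candidate_sites = find_candidate_sites(toy_protein)
```

A position-specific scoring matrix would capture graded residue preferences better than a regular expression, which only reports exact matches to one pattern.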
25.  Measuring the Evolutionary Rewiring of Biological Networks 
PLoS Computational Biology  2011;7(1):e1001050.
We have accumulated a large amount of biological network data and expect even more to come. Soon, we anticipate being able to compare many different biological networks as we commonly do for molecular sequences. It has long been believed that many of these networks change, or “rewire”, at different rates. It is therefore important to develop a framework to quantify the differences between networks in a unified fashion. We developed such a formalism based on analogy to simple models of sequence evolution, and used it to conduct a systematic study of network rewiring on all the currently available biological networks. We found that, similar to sequences, biological networks show a decreased rate of change at large time divergences, because of saturation in potential substitutions. However, different types of biological networks consistently rewire at different rates. Using comparative genomics and proteomics data, we found a consistent ordering of the rewiring rates: transcription regulatory, phosphorylation regulatory, genetic interaction, miRNA regulatory, protein interaction, and metabolic pathway network, from fast to slow. This ordering was found in all comparisons we did of matched networks between organisms. To gain further intuition on network rewiring, we compared our observed rewirings with those obtained from simulation. We also investigated how readily our formalism could be mapped to other network contexts; in particular, we showed how it could be applied to analyze changes in a range of “commonplace” networks such as family trees, co-authorships and linux-kernel function dependencies.
Author Summary
Biological networks represent various types of molecular organization in a cell. During evolution, molecules have been shown to change at varying rates. Therefore, it is important to investigate the evolution of biological networks in terms of network rewiring. Understanding how biological networks evolve could eventually help explain the general mechanisms of cellular systems. In the past decade, a large number of high-throughput experiments have helped to unravel the different types of networks in a number of species. Recent studies have provided evolutionary rate calculations on individual networks and observed different rewiring rates between them. We have chosen a systematic approach to compare rewiring rate differences among the common types of biological networks utilizing experimental data across species. Our analysis shows that regulatory networks generally evolve faster than non-regulatory collaborative networks. Our analysis also highlights future applications of the approach to address other interesting biological questions.
PMCID: PMC3017101  PMID: 21253555
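One simple way to picture a rewiring rate is to compare the edges of two matched networks over their shared (for example, orthologous) nodes and divide the fraction of changed edges by the divergence time. The Python sketch below does exactly that. It is an assumption-laden illustration, not the formalism of the paper: the function names, the edge-set representation and the normalization by the union of edges are all hypothetical choices.

```python
def _undirected(edges):
    """Normalize edges to unordered pairs, dropping self-loops."""
    return {frozenset(edge) for edge in edges if len(set(edge)) == 2}


def rewiring_rate(edges_a, edges_b, shared_nodes, divergence_time):
    """Fraction of edges gained or lost among shared nodes, per unit time."""
    if divergence_time <= 0:
        raise ValueError("divergence_time must be positive")
    shared = set(shared_nodes)
    # Keep only edges whose two endpoints are both present in both organisms.
    a = {e for e in _undirected(edges_a) if e <= shared}
    b = {e for e in _undirected(edges_b) if e <= shared}
    union = a | b
    if not union:
        return 0.0
    changed = a ^ b                      # edges present in exactly one network
    return len(changed) / len(union) / divergence_time


# Toy networks over three shared nodes; divergence time in arbitrary units.
network_a = {("a", "b"), ("b", "c")}
network_b = {("a", "b"), ("a", "c")}
rate = rewiring_rate(network_a, network_b, {"a", "b", "c"}, divergence_time=100)
```

This naive per-time fraction does not correct for the saturation at large divergence times that the abstract describes, so it would underestimate rates for distantly related organisms.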
