Results 1-18 (18)
 

1.  FunSeq2: a framework for prioritizing noncoding regulatory variants in cancer 
Genome Biology  2014;15(10):480.
Identification of noncoding drivers from thousands of somatic alterations in a typical tumor is a difficult and unsolved problem. We report a computational framework, FunSeq2, to annotate and prioritize these mutations. The framework combines an adjustable data context integrating large-scale genomics and cancer resources with a streamlined variant-prioritization pipeline. The pipeline has a weighted scoring system combining: inter- and intra-species conservation; loss- and gain-of-function events for transcription-factor binding; enhancer-gene linkages and network centrality; and per-element recurrence across samples. We further highlight putative drivers with information specific to a particular sample, such as differential expression. FunSeq2 is available from funseq2.gersteinlab.org.
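As a rough illustration of what a weighted scoring system over the feature classes listed above might look like, here is a minimal Python sketch. It is not the FunSeq2 implementation: the feature names, the weight values, and the helper functions `weighted_score` and `prioritize` are all hypothetical and chosen only to show the idea of summing weights for the features a variant exhibits and ranking variants by that sum.

```python
# Hypothetical feature names and weights (assumptions for illustration only;
# the abstract does not give the actual features' names or weight values).
WEIGHTS = {
    "conserved": 1.0,            # inter-/intra-species conservation
    "tf_motif_loss": 0.5,        # loss-of-function event for TF binding
    "tf_motif_gain": 0.5,        # gain-of-function event for TF binding
    "enhancer_gene_link": 0.25,  # variant lies in an enhancer linked to a gene
    "network_central": 0.75,     # linked gene is central in a network
    "recurrent_element": 1.0,    # element is recurrently mutated across samples
}


def weighted_score(features, weights=WEIGHTS):
    """Sum the weights of all features that are present (truthy) for a variant."""
    return sum(w for name, w in weights.items() if features.get(name, False))


def prioritize(variants, weights=WEIGHTS):
    """Return variants sorted from highest to lowest weighted score."""
    return sorted(
        variants,
        key=lambda v: weighted_score(v["features"], weights),
        reverse=True,
    )
```

A caller would pass a list of dictionaries, each with an identifier and a `features` mapping, and read the top of the returned list as the highest-priority candidates under these assumed weights.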
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-014-0480-5) contains supplementary material, which is available to authorized users.
doi:10.1186/s13059-014-0480-5
PMCID: PMC4203974  PMID: 25273974
2.  Architecture of the human regulatory network derived from ENCODE data 
Nature  2012;489(7414):91-100.
Transcription factors (TFs) bind in a combinatorial fashion to specify the on-and-off states of genes; the ensemble of these binding events forms a regulatory network, constituting the wiring diagram for a cell. To examine the principles of the human transcriptional regulatory network, we determined the genomic binding information of 119 TFs in 458 ChIP-Seq experiments. We found the combinatorial co-association of TFs to be highly context specific: distinct combinations of factors bind at specific genomic locations. In particular, there are significant differences in the binding proximal and distal to genes. We organized all the TF binding into a hierarchy and integrated it with other genomic information (e.g. miRNA regulation), forming a dense meta-network. Factors at different levels have different properties: for instance, top-level TFs more strongly influence expression and middle-level ones co-regulate targets to mitigate information-flow bottlenecks. Moreover, these co-regulations give rise to many enriched network motifs -- e.g. noise-buffering feed-forward loops. Finally, more connected network components are under stronger selection and exhibit a greater degree of allele-specific activity (i.e., differential binding to the two parental alleles). The regulatory information obtained in this study will be crucial for interpreting personal genome sequences and understanding basic principles of human biology and disease.
doi:10.1038/nature11245
PMCID: PMC4154057  PMID: 22955619
4.  Integrative Annotation of Variants from 1092 Humans: Application to Cancer Genomics 
Science (New York, N.Y.)  2013;342(6154):1235587.
Interpreting variants, especially noncoding ones, in the increasing number of personal genomes is challenging. We used patterns of polymorphisms in functionally annotated regions in 1092 humans to identify deleterious variants; then we experimentally validated candidates. We analyzed both coding and noncoding regions, with the former corroborating the latter. We found regions particularly sensitive to mutations (“ultrasensitive”) and variants that are disruptive because of mechanistic effects on transcription-factor binding (that is, “motif-breakers”). We also found variants in regions with higher network centrality tend to be deleterious. Insertions and deletions followed a similar pattern to single-nucleotide variants, with some notable exceptions (e.g., certain deletions and enhancers). On the basis of these patterns, we developed a computational tool (FunSeq), whose application to ~90 cancer genomes reveals nearly a hundred candidate noncoding drivers.
doi:10.1126/science.1235587
PMCID: PMC3947637  PMID: 24092746
5.  Learning to swim in a sea of genomic data 
Genome Biology  2013;14(12):315.
A report on the 63rd American Society of Human Genetics (ASHG) meeting held in Boston, USA, 22–26 October 2013.
doi:10.1186/gb4144
PMCID: PMC4053704  PMID: 24314026
6.  VAT: a computational framework to functionally annotate variants in personal genomes within a cloud-computing environment 
Bioinformatics  2012;28(17):2267-2269.
Summary: The functional annotation of variants obtained through sequencing projects is generally assumed to be a simple intersection of genomic coordinates with genomic features. However, complexities arise for several reasons, including the differential effects of a variant on alternatively spliced transcripts, as well as the difficulty in assessing the impact of small insertions/deletions and large structural variants. Taking these factors into consideration, we developed the Variant Annotation Tool (VAT) to functionally annotate variants from multiple personal genomes at the transcript level as well as obtain summary statistics across genes and individuals. VAT also allows visualization of the effects of different variants, integrates allele frequencies and genotype data from the underlying individuals and facilitates comparative analysis between different groups of individuals. VAT can either be run through a command-line interface or as a web application. Finally, in order to enable on-demand access and to minimize unnecessary transfers of large data files, VAT can be run as a virtual machine in a cloud-computing environment.
Availability and Implementation: VAT is implemented in C and PHP. The VAT web service, Amazon Machine Image, source code and detailed documentation are available at vat.gersteinlab.org.
Contact: lukas.habegger@yale.edu or mark.gerstein@yale.edu
Supplementary Information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts368
PMCID: PMC3426844  PMID: 22743228
7.  Interpretation of Genomic Variants Using a Unified Biological Network Approach 
PLoS Computational Biology  2013;9(3):e1002886.
The decreasing cost of sequencing is leading to a growing repertoire of personal genomes. However, we are lagging behind in understanding the functional consequences of the millions of variants obtained from sequencing. Global system-wide effects of variants in coding genes are particularly poorly understood. It is known that while variants in some genes can lead to diseases, complete disruption of other genes, called ‘loss-of-function tolerant’, is possible with no obvious effect. Here, we build a systems-based classifier to quantitatively estimate the global perturbation caused by deleterious mutations in each gene. We first survey the degree to which gene centrality in various individual networks and a unified ‘Multinet’ correlates with the tolerance to loss-of-function mutations and evolutionary conservation. We find that functionally significant and highly conserved genes tend to be more central in physical protein-protein and regulatory networks. However, this is not the case for metabolic pathways, where the highly central genes have more duplicated copies and are more tolerant to loss-of-function mutations. Integration of three-dimensional protein structures reveals that the correlation with centrality in the protein-protein interaction network is also seen in terms of the number of interaction interfaces used. Finally, combining all the network and evolutionary properties allows us to build a classifier distinguishing functionally essential and loss-of-function tolerant genes with higher accuracy (AUC = 0.91) than any individual property. Application of the classifier to the whole genome shows its strong potential for interpretation of variants involved in Mendelian diseases and in complex disorders probed by genome-wide association studies.
Author Summary
The number of personal genomes sequenced has grown rapidly over the last few years and is likely to grow further. In order to use the DNA sequence variants amongst individuals for personalized medicine, we need to understand the functional impact of these variants. Deleterious variants in genes can have a wide spectrum of global effects, ranging from fatal for essential genes to no obvious damaging effect for loss-of-function tolerant genes. The global effect of a gene mutation is largely governed by the diverse biological networks in which the gene participates. Since genes participate in many networks, no singular network captures the global picture of gene interactions. Here we integrate the diverse modes of gene interactions (regulatory, genetic, phosphorylation, signaling, metabolic and physical protein-protein interactions) to create a unified biological network. We then exploit the unique properties of loss-of-function tolerant and essential genes in this unified network to build a computational model that can predict global perturbation caused by deleterious mutations in all genes. Our model can distinguish between these two gene sets with high accuracy and we further show that it can be used for interpretation of variants involved in Mendelian diseases and in complex disorders probed by genome-wide association studies.
doi:10.1371/journal.pcbi.1002886
PMCID: PMC3591262  PMID: 23505346
8.  A systematic survey of loss-of-function variants in human protein-coding genes 
Science (New York, N.Y.)  2012;335(6070):823-828.
Genome sequencing studies indicate that all humans carry many genetic variants predicted to cause loss of function (LoF) of protein-coding genes, suggesting unexpected redundancy in the human genome. Here we apply stringent filters to 2,951 putative LoF variants obtained from 185 human genomes to determine their true prevalence and properties. We estimate that human genomes typically contain ~100 genuine LoF variants with ~20 genes completely inactivated. We identify rare and likely deleterious LoF alleles, including 26 known and 21 predicted severe disease-causing variants, as well as common LoF variants in non-essential genes. We describe functional and evolutionary differences between LoF-tolerant and recessive disease genes, and a method for using these differences to prioritize candidate genes found in clinical sequencing studies.
doi:10.1126/science.1215040
PMCID: PMC3299548  PMID: 22344438
9.  Computational study of drug binding to the membrane-bound tetrameric M2 peptide bundle from influenza A virus 
Biochimica et biophysica acta  2010;1808(2):530-537.
The M2 protein of influenza A virus performs the crucial function of transporting protons to the interior of virions enclosed in the endosome. Adamantane drugs, amantadine (AMN) and rimantadine (RMN), block the proton conduction in some strains, and have been used for the treatment and prophylaxis of influenza A infections. The structures of the transmembrane (TM) region of M2 that have been solved in micelles using NMR (residues 23-60) [Schnell and Chou (2008)] and by X-ray crystallography (residues 22-46) [Stouffer et al. (2008)] suggest different drug binding sites: external and internal for RMN and AMN, respectively. We have used molecular dynamics (MD) simulations to investigate the nature of the binding site and binding mode of adamantane drugs on the membrane-bound tetrameric M2-TM peptide bundles using as initial conformations the low-pH AMN-bound crystal structure, a high-pH model derived from the drug-free crystal structure, and the high-pH NMR structure. The MD simulations indicate that under both low- and high-pH conditions, AMN is stable inside the tetrameric bundle, spanning the region between residues Val27 and Gly34. At low pH the polar group of AMN is oriented toward the His37 gate while under high-pH conditions its orientation exhibits large fluctuations. The present MD simulations also suggest that AMN and RMN molecules do not show strong affinity to the external binding sites.
doi:10.1016/j.bbamem.2010.03.025
PMCID: PMC2975046  PMID: 20385097
molecular dynamics; simulations; amantadine; adamantane; transmembrane; ion channel
10.  Mapping copy number variation by population scale genome sequencing 
Nature  2011;470(7332):59-65.
Summary
Genomic structural variants (SVs) are abundant in humans, differing from other variation classes in extent, origin, and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs (i.e., copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analyzing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differences in the size spectra of SVs originating from distinct formation mechanisms, and constructed a map of SV hotspots formed by common mechanisms. Our analytical framework and SV map serve as a resource for sequencing-based association studies.
doi:10.1038/nature09708
PMCID: PMC3077050  PMID: 21293372
11.  Integrative Analysis of the Caenorhabditis elegans Genome by the modENCODE Project 
Gerstein, Mark B. | Lu, Zhi John | Van Nostrand, Eric L. | Cheng, Chao | Arshinoff, Bradley I. | Liu, Tao | Yip, Kevin Y. | Robilotto, Rebecca | Rechtsteiner, Andreas | Ikegami, Kohta | Alves, Pedro | Chateigner, Aurelien | Perry, Marc | Morris, Mitzi | Auerbach, Raymond K. | Feng, Xin | Leng, Jing | Vielle, Anne | Niu, Wei | Rhrissorrakrai, Kahn | Agarwal, Ashish | Alexander, Roger P. | Barber, Galt | Brdlik, Cathleen M. | Brennan, Jennifer | Brouillet, Jeremy Jean | Carr, Adrian | Cheung, Ming-Sin | Clawson, Hiram | Contrino, Sergio | Dannenberg, Luke O. | Dernburg, Abby F. | Desai, Arshad | Dick, Lindsay | Dosé, Andréa C. | Du, Jiang | Egelhofer, Thea | Ercan, Sevinc | Euskirchen, Ghia | Ewing, Brent | Feingold, Elise A. | Gassmann, Reto | Good, Peter J. | Green, Phil | Gullier, Francois | Gutwein, Michelle | Guyer, Mark S. | Habegger, Lukas | Han, Ting | Henikoff, Jorja G. | Henz, Stefan R. | Hinrichs, Angie | Holster, Heather | Hyman, Tony | Iniguez, A. Leo | Janette, Judith | Jensen, Morten | Kato, Masaomi | Kent, W. James | Kephart, Ellen | Khivansara, Vishal | Khurana, Ekta | Kim, John K. | Kolasinska-Zwierz, Paulina | Lai, Eric C. | Latorre, Isabel | Leahey, Amber | Lewis, Suzanna | Lloyd, Paul | Lochovsky, Lucas | Lowdon, Rebecca F. | Lubling, Yaniv | Lyne, Rachel | MacCoss, Michael | Mackowiak, Sebastian D. | Mangone, Marco | McKay, Sheldon | Mecenas, Desirea | Merrihew, Gennifer | Miller, David M. | Muroyama, Andrew | Murray, John I. | Ooi, Siew-Loon | Pham, Hoang | Phippen, Taryn | Preston, Elicia A. | Rajewsky, Nikolaus | Rätsch, Gunnar | Rosenbaum, Heidi | Rozowsky, Joel | Rutherford, Kim | Ruzanov, Peter | Sarov, Mihail | Sasidharan, Rajkumar | Sboner, Andrea | Scheid, Paul | Segal, Eran | Shin, Hyunjin | Shou, Chong | Slack, Frank J. | Slightam, Cindie | Smith, Richard | Spencer, William C. | Stinson, E. O. | Taing, Scott | Takasaki, Teruaki | Vafeados, Dionne | Voronina, Ksenia | Wang, Guilin | Washington, Nicole L. | Whittle, Christina M. 
| Wu, Beijing | Yan, Koon-Kiu | Zeller, Georg | Zha, Zheng | Zhong, Mei | Zhou, Xingliang | Ahringer, Julie | Strome, Susan | Gunsalus, Kristin C. | Micklem, Gos | Liu, X. Shirley | Reinke, Valerie | Kim, Stuart K. | Hillier, LaDeana W. | Henikoff, Steven | Piano, Fabio | Snyder, Michael | Stein, Lincoln | Lieb, Jason D. | Waterston, Robert H.
Science (New York, N.Y.)  2010;330(6012):1775-1787.
We systematically generated large-scale data sets to improve genome annotation for the nematode Caenorhabditis elegans, a key model organism. These data sets include transcriptome profiling across a developmental time course, genome-wide identification of transcription factor–binding sites, and maps of chromatin organization. From this, we created more complete and accurate gene models, including alternative splice forms and candidate noncoding RNAs. We constructed hierarchical networks of transcription factor–binding and microRNA interactions and discovered chromosomal locations bound by an unusually large number of transcription factors. Different patterns of chromatin composition and histone modification were revealed between chromosome arms and centers, with similarly prominent differences between autosomes and the X chromosome. Integrating data types, we built statistical models relating chromatin, transcription factor binding, and gene expression. Overall, our analyses ascribed putative functions to most of the conserved genome.
doi:10.1126/science.1196914
PMCID: PMC3142569  PMID: 21177976
12.  Segmental duplications in the human genome reveal details of pseudogene formation 
Nucleic Acids Research  2010;38(20):6997-7007.
Duplicated pseudogenes in the human genome are disabled copies of functioning parent genes. They result from block duplication events occurring throughout evolutionary history. Relatively recent duplications (with sequence similarity ≥90% and length ≥1 kb) are termed segmental duplications (SDs); here, we analyze the interrelationship of SDs and pseudogenes. We present a decision-tree approach to classify pseudogenes based on their (and their parents’) characteristics in relation to SDs. The classification identifies 140 novel pseudogenes and makes possible improved annotation for the 3172 pseudogenes located in SDs. In particular, it reveals that many pseudogenes in SDs likely did not arise directly from parent genes, but are the result of a multi-step process. In these cases, the initial duplication or retrotransposition of a parent gene gives rise to a ‘parent pseudogene’, followed by further duplication creating duplicated–duplicated or duplicated–processed pseudogenes, respectively. Moreover, we can precisely identify these parent pseudogenes by overlap with ancestral SD loci. Finally, a comparison of nucleotide substitutions per site in a pseudogene with its surrounding SD region allows us to estimate the time difference between duplication and disablement events, and this suggests that most duplicated pseudogenes in SDs were likely disabled around the time of the original duplication.
doi:10.1093/nar/gkq587
PMCID: PMC2978362  PMID: 20615899
13.  Using semantic web rules to reason on an ontology of pseudogenes 
Bioinformatics  2010;26(12):i71-i78.
Motivation: Recent years have seen the development of a wide range of biomedical ontologies. Notable among these is Sequence Ontology (SO) which offers a rich hierarchy of terms and relationships that can be used to annotate genomic data. Well-designed formal ontologies allow data to be reasoned upon in a consistent and logically sound way and can lead to the discovery of new relationships. The Semantic Web Rules Language (SWRL) augments the capabilities of a reasoner by allowing the creation of conditional rules. To date, however, formal reasoning, especially the use of SWRL rules, has not been widely used in biomedicine.
Results: We have built a knowledge base of human pseudogenes, extending the existing SO framework to incorporate additional attributes. In particular, we have defined the relationships between pseudogenes and segmental duplications. We then created a series of logical rules using SWRL to answer research questions and to annotate our pseudogenes appropriately. Finally, we were left with a knowledge base which could be queried to discover information about human pseudogene evolution.
Availability: The fully populated knowledge base described in this document is available for download from http://ontology.pseudogene.org. A SPARQL endpoint from which to query the dataset is also available at this location.
Contact: matthew.holford@yale.edu; mark.gerstein@yale.edu
doi:10.1093/bioinformatics/btq173
PMCID: PMC2881358  PMID: 20529940
14.  Artificial Transmembrane Oncoproteins Smaller than the Bovine Papillomavirus E5 Protein Redefine Sequence Requirements for Activation of the Platelet-Derived Growth Factor β Receptor 
Journal of Virology  2009;83(19):9773-9785.
The bovine papillomavirus E5 protein (BPV E5) is a 44-amino-acid homodimeric transmembrane protein that binds directly to the transmembrane domain of the platelet-derived growth factor (PDGF) β receptor and induces ligand-independent receptor activation. Three specific features of BPV E5 are considered important for its ability to activate the PDGF β receptor and transform mouse fibroblasts: a pair of C-terminal cysteines, a transmembrane glutamine, and a juxtamembrane aspartic acid. By using a new genetic technique to screen libraries expressing artificial transmembrane proteins for activators of the PDGF β receptor, we isolated much smaller proteins, from 32 to 36 residues, that lack all three of these features yet still dimerize noncovalently, specifically activate the PDGF β receptor via its transmembrane domain, and transform cells efficiently. The primary amino acid sequence of BPV E5 is virtually unrecognizable in some of these proteins, which share as few as seven consecutive amino acids with the viral protein. Thus, small artificial proteins that bear little resemblance to a viral oncoprotein can nevertheless productively interact with the same cellular target. We speculate that similar cellular proteins may exist but have been overlooked due to their small size and hydrophobicity.
doi:10.1128/JVI.00946-09
PMCID: PMC2748040  PMID: 19605488
15.  Computational analysis of membrane proteins: the largest class of drug targets 
Drug discovery today  2009;14(23-24):1130-1135.
Given the key roles of integral membrane proteins as transporters and channels, it is necessary to understand their structures and, hence, mechanisms and regulation at the molecular level. Membrane proteins represent ~30% of all proteins of currently sequenced genomes. Paradoxically, however, only ~2% of crystal structures deposited in the Protein Data Bank are of membrane proteins, and very few of these are at high resolution (better than 2 Å). The great disparity between our understanding of soluble proteins and our understanding of membrane proteins is because of the practical problems of working with membrane proteins – specifically, difficulties in expression, purification and crystallization. Thus, computational modeling has been utilized extensively to make crucial advances in understanding membrane protein structure and function.
doi:10.1016/j.drudis.2009.08.006
PMCID: PMC2796609  PMID: 19733256
16.  Probing Peptide Nanotube Self-Assembly at a Liquid-Liquid Interface with Coarse-Grained Molecular Dynamics 
Nano letters  2008;8(11):3626-3630.
Self-assembly at a liquid-liquid interface is a powerful experimental route to novel nanomaterials. We report herein a computational study of peptide nanotube formation at an oil-water interface. We probe interfacial self-assembly and nanotube formation of the cyclic octapeptide, cyclo [(-L-Trp-D-Leu-)4] as an illustrative example. Individual peptide rings are rapidly adsorbed at the liquid-liquid interface where they self-assemble. Monomeric and dimeric peptide rings lie with their molecular planes mostly parallel to the interface. Longer oligomeric nanotubes are increasingly tilted at the interface and grow by an Ostwald ripening mechanism to eventually align their tube axis parallel to the interface. The present results on nanotube assembly suggest that computation will be a useful complement to experiment in understanding the nature of self-assembly of nanomaterials at liquid-liquid interfaces.
doi:10.1021/nl801564m
PMCID: PMC2696305  PMID: 18855461
17.  Comprehensive analysis of the pseudogenes of glycolytic enzymes in vertebrates: the anomalously high number of GAPDH pseudogenes highlights a recent burst of retrotranspositional activity 
BMC Genomics  2009;10:480.
Background
Pseudogenes provide a record of the molecular evolution of genes. As glycolysis is such a highly conserved and fundamental metabolic pathway, the pseudogenes of glycolytic enzymes comprise a standardized genomic measuring stick and an ideal platform for studying molecular evolution. One of the glycolytic enzymes, glyceraldehyde-3-phosphate dehydrogenase (GAPDH), has already been noted to have one of the largest numbers of associated pseudogenes, among all proteins.
Results
We assembled the first comprehensive catalog of the processed and duplicated pseudogenes of glycolytic enzymes in many vertebrate model-organism genomes, including human, chimpanzee, mouse, rat, chicken, zebrafish, pufferfish, fruitfly, and worm (available at ). We found that glycolytic pseudogenes are predominantly processed, i.e. retrotransposed from the mRNA of their parent genes. Although each glycolytic enzyme plays a unique role, GAPDH has by far the most pseudogenes, perhaps reflecting its large number of non-glycolytic functions or its possession of a particularly retrotranspositionally active sub-sequence. Furthermore, the number of GAPDH pseudogenes varies significantly among the genomes we studied: none in zebrafish, pufferfish, fruitfly, and worm, 1 in chicken, 50 in chimpanzee, 62 in human, 331 in mouse, and 364 in rat. Next, we developed a simple method of identifying conserved syntenic blocks (consistently applicable to the wide range of organisms in the study) by using orthologous genes as anchors delimiting a conserved block between a pair of genomes. This approach showed that few glycolytic pseudogenes are shared between primate and rodent lineages. Finally, by estimating pseudogene ages using Kimura's two-parameter model of nucleotide substitution, we found evidence for bursts of retrotranspositional activity approximately 42, 36, and 26 million years ago in the human, mouse, and rat lineages, respectively.
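For readers unfamiliar with Kimura's two-parameter model mentioned above, the following Python sketch computes the standard K2P distance, K = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the proportions of transitional and transversional differences between two aligned sequences. It is a generic illustration, not the authors' pipeline: the function name `k2p_distance` and the choice to skip gaps and ambiguous bases are assumptions. Converting a distance to an age additionally requires a substitution rate, which is not given in this abstract and is therefore left out.

```python
import math

# Purine<->purine and pyrimidine<->pyrimidine changes are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned DNA sequences.

    Positions with gaps or ambiguous bases are skipped (an assumption
    made for this sketch).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")

    compared = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        if (a, b) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1

    if compared == 0:
        raise ValueError("no comparable positions")

    p = transitions / compared    # proportion of transitions (P)
    q = transversions / compared  # proportion of transversions (Q)
    term1 = 1 - 2 * p - q
    term2 = 1 - 2 * q
    if term1 <= 0 or term2 <= 0:
        raise ValueError("sequences too divergent for the K2P correction")

    return -0.5 * math.log(term1) - 0.25 * math.log(term2)
```

Applied to a pseudogene aligned against its parent gene, the returned value is the estimated number of substitutions per site under this model.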
Conclusion
Overall, we performed a consistent analysis of one group of pseudogenes across multiple genomes, finding evidence that most of them were created within the last 50 million years, subsequent to the divergence of rodent and primate lineages.
doi:10.1186/1471-2164-10-480
PMCID: PMC2770531  PMID: 19835609
18.  Pseudofam: the pseudogene families database 
Nucleic Acids Research  2008;37(Database issue):D738-D743.
Pseudofam (http://pseudofam.pseudogene.org) is a database of pseudogene families based on the protein families from the Pfam database. It provides resources for analyzing the family structure of pseudogenes including query tools, statistical summaries and sequence alignments. The current version of Pseudofam contains more than 125 000 pseudogenes identified from 10 eukaryotic genomes and aligned within nearly 3000 families (approximately one-third of the total families in PfamA). Pseudofam uses a large-scale parallelized homology search algorithm (implemented as an extension of the PseudoPipe pipeline) to identify pseudogenes. Each identified pseudogene is assigned to its parent protein family and subsequently aligned to each other by transferring the parent domain alignments from the Pfam family. Pseudogenes are also given additional annotation based on an ontology, reflecting their mode of creation and subsequent history. In particular, our annotation highlights the association of pseudogene families with genomic features, such as segmental duplications. In addition, pseudogene families are associated with key statistics, which identify outlier families with an unusual degree of pseudogenization. The statistics also show how the number of genes and pseudogenes in families correlates across different species. Overall, they highlight the fact that housekeeping families tend to be enriched with a large number of pseudogenes.
doi:10.1093/nar/gkn758
PMCID: PMC2686518  PMID: 18957444
