Chen, Rui | Mias, George I. | Li-Pook-Than, Jennifer | Jiang, Lihua | Lam, Hugo Y. K. | Chen, Rong | Miriami, Elana | Karczewski, Konrad J. | Hariharan, Manoj | Dewey, Frederick E. | Cheng, Yong | Clark, Michael J. | Im, Hogune | Habegger, Lukas | Balasubramanian, Suganthi | O'Huallachain, Maeve | Dudley, Joel T. | Hillenmeyer, Sara | Haraksingh, Rajini | Sharon, Donald | Euskirchen, Ghia | Lacroute, Phil | Bettinger, Keith | Boyle, Alan P. | Kasowski, Maya | Grubert, Fabian | Seki, Scott | Garcia, Marco | Whirl-Carrillo, Michelle | Gallardo, Mercedes | Blasco, Maria A. | Greenberg, Peter L. | Snyder, Phyllis | Klein, Teri E. | Altman, Russ B. | Butte, Atul | Ashley, Euan A. | Nadeau, Kari C. | Gerstein, Mark | Tang, Hua | Snyder, Michael
Cell
2012;148(6):1293-1307.
SUMMARY
Personalized medicine is expected to benefit from combining genomic information with regular monitoring of physiological states by multiple high-throughput methods. Here we present an integrative Personal Omics Profile (iPOP), an analysis that combines genomic, transcriptomic, proteomic, metabolomic, and autoantibody profiles from a single individual over a 14-month period. Our iPOP analysis revealed various medical risks, including Type II diabetes. It also uncovered extensive, dynamic changes in diverse molecular components and biological pathways across healthy and diseased conditions. Extremely high-coverage genomic and transcriptomic data, which provide the basis of our iPOP, revealed extensive heteroallelic changes during healthy and diseased states and an unexpected RNA editing mechanism. This study demonstrates that longitudinal iPOP can be used to interpret healthy and disease states by connecting genomic information with additional dynamic omics activity.
doi:10.1016/j.cell.2012.02.009
PMCID: PMC3341616
PMID: 22424236
The decreasing cost of sequencing is leading to a growing repertoire of personal genomes. However, we are lagging behind in understanding the functional consequences of the millions of variants obtained from sequencing. Global system-wide effects of variants in coding genes are particularly poorly understood. It is known that while variants in some genes can lead to diseases, complete disruption of other genes, called ‘loss-of-function tolerant’, is possible with no obvious effect. Here, we build a systems-based classifier to quantitatively estimate the global perturbation caused by deleterious mutations in each gene. We first survey the degree to which gene centrality in various individual networks and a unified ‘Multinet’ correlates with the tolerance to loss-of-function mutations and evolutionary conservation. We find that functionally significant and highly conserved genes tend to be more central in physical protein-protein and regulatory networks. However, this is not the case for metabolic pathways, where the highly central genes have more duplicated copies and are more tolerant to loss-of-function mutations. Integration of three-dimensional protein structures reveals that the correlation with centrality in the protein-protein interaction network is also seen in terms of the number of interaction interfaces used. Finally, combining all the network and evolutionary properties allows us to build a classifier distinguishing functionally essential and loss-of-function tolerant genes with higher accuracy (AUC = 0.91) than any individual property. Application of the classifier to the whole genome shows its strong potential for interpretation of variants involved in Mendelian diseases and in complex disorders probed by genome-wide association studies.
Author Summary
The number of personal genomes sequenced has grown rapidly over the last few years and is likely to grow further. In order to use the DNA sequence variants amongst individuals for personalized medicine, we need to understand the functional impact of these variants. Deleterious variants in genes can have a wide spectrum of global effects, ranging from fatal for essential genes to no obvious damaging effect for loss-of-function tolerant genes. The global effect of a gene mutation is largely governed by the diverse biological networks in which the gene participates. Since genes participate in many networks, no single network captures the global picture of gene interactions. Here we integrate the diverse modes of gene interactions (regulatory, genetic, phosphorylation, signaling, metabolic and physical protein-protein interactions) to create a unified biological network. We then exploit the unique properties of loss-of-function tolerant and essential genes in this unified network to build a computational model that can predict global perturbation caused by deleterious mutations in all genes. Our model can distinguish between these two gene sets with high accuracy and we further show that it can be used for interpretation of variants involved in Mendelian diseases and in complex disorders probed by genome-wide association studies.
doi:10.1371/journal.pcbi.1002886
PMCID: PMC3591262
PMID: 23505346
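The abstract above summarizes classifier performance as an AUC of 0.91. As a reminder of what that figure measures, the following is a minimal sketch of the rank-based AUC: the probability that a randomly chosen positive example (here, an essential gene) scores higher than a randomly chosen negative one (a loss-of-function tolerant gene). The function name, scores, and labels are illustrative assumptions, not data or code from the paper.

```python
def auc(scores, labels):
    """Rank-based AUC: chance that a random positive outscores a random negative.

    scores: classifier outputs (higher = more likely positive).
    labels: 1 for positives, 0 for negatives (hypothetical toy labels).
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: one positive (0.2) is ranked below one negative (0.8).
toy_scores = [0.9, 0.2, 0.8, 0.1]
toy_labels = [1, 1, 0, 0]
toy_auc = auc(toy_scores, toy_labels)  # 3 of 4 positive/negative pairs are ordered correctly
```

The pairwise loop is quadratic and is chosen here only for readability; a sort-based formulation gives the same value for large gene sets.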
Precise identification of RNA-coding regions and transcriptomes of eukaryotes is a significant problem in biology. Currently, eukaryote transcriptomes are analyzed using deep short-read sequencing experiments of complementary DNAs. The resulting short reads are then aligned against a genome and annotated junctions to infer biological meaning. Here we use long-read complementary DNA datasets for the analysis of a eukaryotic transcriptome and generate two large datasets in the human K562 and HeLa S3 cell lines. Both data sets comprised at least 4 million reads and had median read lengths greater than 500 bp. We show that annotation-independent alignments of these reads provide partial gene structures that are very much in line with annotated gene structures, 15% of which have not been obtained in a previous de novo analysis of short reads. For long-noncoding RNA (lncRNA) genes, however, we find an increased fraction of novel gene structures among our alignments. Other important aspects of transcriptome analysis, such as the description of cell type-specific splicing, can be performed in an accurate, reliable and completely annotation-free manner, making it ideal for the analysis of transcriptomes of newly sequenced genomes. Furthermore, we demonstrate that long-read sequences can be assembled into full-length transcripts with considerable success. Our method is applicable to all long-read sequencing technologies.
doi:10.1534/g3.112.004812
PMCID: PMC3583448
PMID: 23450794
RNA; Roche sequencing; human; splicing; transcriptome
Li, Guoliang | Ruan, Xiaoan | Auerbach, Raymond K. | Sandhu, Kuljeet Singh | Zheng, Meizhen | Wang, Ping | Poh, Huay Mei | Goh, Yufen | Lim, Joanne | Zhang, Jingyao | Sim, Hui Shan | Peh, Su Qin | Mulawadi, Fabianus Hendriyan | Ong, Chin Thing | Orlov, Yuriy L. | Hong, Shuzhen | Zhang, Zhizhuo | Landt, Steve | Raha, Debasish | Euskirchen, Ghia | Wei, Chia-Lin | Ge, Weihong | Wang, Huaien | Davis, Carrie | Fisher, Katherine | Mortazavi, Ali | Gerstein, Mark | Gingeras, Thomas | Wold, Barbara | Sun, Yi | Fullwood, Melissa J. | Cheung, Edwin | Liu, Edison | Sung, Wing-Kin | Snyder, Michael | Ruan, Yijun
Cell
2012;148(1-2):84-98.
Summary
Higher-order chromosomal organization for transcription regulation is poorly understood in eukaryotes. Using genome-wide Chromatin Interaction Analysis with Paired-End-Tag sequencing (ChIA-PET), we mapped long-range chromatin interactions associated with RNA polymerase II in human cells and uncovered widespread promoter-centered intra-genic, extra-genic and inter-genic interactions. These interactions further aggregated into higher-order clusters, wherein proximal and distal genes were engaged through promoter-promoter interactions. Most genes with promoter-promoter interactions were active and transcribed cooperatively, and some interacting promoters could influence each other, implying combinatorial complexity of transcriptional controls. Comparative analyses of different cell lines showed that cell-specific chromatin interactions could provide structural frameworks for cell-specific transcription, and suggested significant enrichment of enhancer-promoter interactions for cell-specific functions. Furthermore, genetically identified disease-associated non-coding elements were found to be spatially engaged with corresponding genes through long-range interactions. Overall, our study provides insights into transcription regulation by three-dimensional chromatin interactions for both housekeeping and cell-specific genes in human cells.
doi:10.1016/j.cell.2011.12.014
PMCID: PMC3339270
PMID: 22265404
Motivation: ChIP-seq and ChIP-chip experiments have been widely used to identify transcription factor (TF) binding sites and target genes. Conventionally, a fairly ‘simple’ approach is employed for target gene identification e.g. finding genes with binding sites within 2 kb of a transcription start site (TSS). However, this does not take into account the number of sites upstream of the TSS, their exact positioning or the fact that different TFs appear to act at different characteristic distances from the TSS.
Results: Here we propose a probabilistic model called target identification from profiles (TIP) that quantitatively measures the regulatory relationships between TFs and target genes. For each TF, our model builds a characteristic, averaged profile of binding around the TSS and then uses this to weight the sites associated with a given gene, providing a continuous-valued ‘regulatory’ score relating each TF and potential target. Moreover, the score can readily be turned into a ranked list of target genes and an estimate of significance, which is useful for case-dependent downstream analysis.
Conclusion: We show the advantages of TIP by comparing it to the ‘simple’ approach on several representative datasets, using motif occurrence and relationship to knock-out experiments as metrics of validation. Moreover, we show that the probabilistic model is not as sensitive to various experimental parameters (including sequencing depth and peak-calling method) as the simple approach; in fact, the lesser dependence on sequencing depth potentially utilizes the result of a ChIP-seq experiment in a more ‘cost-effective’ manner.
Contact: mark.gerstein@yale.edu
Supplementary Information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr552
PMCID: PMC3223362
PMID: 22039215
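The TIP abstract above describes building an averaged binding profile around the TSS and using it to weight the binding signal of each gene into a continuous regulatory score. The idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the published TIP implementation: the function name, the fixed-width signal vectors, the normalization of the profile into weights, and the z-score standardization are all choices made for this example, and the gene names and values are made up.

```python
from statistics import mean, pstdev


def profile_weighted_scores(signal):
    """Score genes by binding signal weighted with the average profile.

    signal: dict mapping gene name -> list of binding signal values sampled
            at the same fixed offsets around each gene's TSS (assumed layout).
    Returns a dict mapping gene name -> standardized (z-score) regulatory score.
    """
    genes = list(signal)
    width = len(signal[genes[0]])
    # Characteristic profile: mean signal at each offset across all genes.
    profile = [mean(signal[g][i] for g in genes) for i in range(width)]
    total = sum(profile)
    if total == 0:
        raise ValueError("no binding signal in any gene")
    weights = [p / total for p in profile]
    # Raw score: profile-weighted sum of each gene's own signal.
    raw = {g: sum(w * x for w, x in zip(weights, signal[g])) for g in genes}
    values = list(raw.values())
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return {g: 0.0 for g in genes}
    return {g: (raw[g] - mu) / sd for g in genes}


# Hypothetical signal at five offsets centered on the TSS.
toy_signal = {
    "geneA": [0, 4, 8, 4, 0],   # strong binding at the TSS
    "geneB": [1, 1, 1, 1, 1],   # flat background
    "geneC": [0, 0, 1, 0, 0],   # weak binding at the TSS
}
toy_scores = profile_weighted_scores(toy_signal)
ranked_targets = sorted(toy_scores, key=toy_scores.get, reverse=True)
```

Sorting by score gives the ranked target list mentioned in the abstract; turning the standardized scores into significance estimates would need a null model, which this sketch does not attempt.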
Beltran, Himisha | Rickman, David S. | Park, Kyung | Chae, Sung Suk | Sboner, Andrea | MacDonald, Theresa Y. | Wang, Yuwei | Sheikh, Karen L. | Terry, Stéphane | Tagawa, Scott T | Dhir, Rajiv | Nelson, Joel B. | de la Taille, Alexandre | Allory, Yves | Gerstein, Mark B. | Perner, Sven | Pienta, Kenneth J. | Chinnaiyan, Arul M. | Wang, Yuzhuo | Collins, Colin C. | Gleave, Martin E. | Demichelis, Francesca | Nanus, David M. | Rubin, Mark A.
Neuroendocrine prostate cancer (NEPC) is an aggressive subtype of prostate cancer that most commonly evolves from preexisting prostate adenocarcinoma (PCA). Using Next Generation RNA-sequencing and oligonucleotide arrays, we profiled 7 NEPC, 30 PCA, and 5 benign prostate tissue (BEN), and validated findings on tumors from a large cohort of patients (37 NEPC, 169 PCA, 22 BEN) using IHC and FISH. We discovered significant overexpression and gene amplification of AURKA and MYCN in 40% of NEPC and 5% of PCA, respectively, and evidence that they cooperate to induce a neuroendocrine phenotype in prostate cells. There was dramatic and enhanced sensitivity of NEPC (and MYCN overexpressing PCA) to Aurora kinase inhibitor therapy both in vitro and in vivo, with complete suppression of neuroendocrine marker expression following treatment. We propose that alterations in Aurora kinase A and N-myc are involved in the development of NEPC, and future clinical trials will help determine the efficacy of Aurora kinase inhibitor therapy.
doi:10.1158/2159-8290.CD-11-0130
PMCID: PMC3290518
PMID: 22389870
neuroendocrine prostate cancer; aurora kinase A; n-myc; drug targets
Yip, Kevin Y | Cheng, Chao | Bhardwaj, Nitin | Brown, James B | Leng, Jing | Kundaje, Anshul | Rozowsky, Joel | Birney, Ewan | Bickel, Peter | Snyder, Michael | Gerstein, Mark
Background
Transcription factors function by binding different classes of regulatory elements. The Encyclopedia of DNA Elements (ENCODE) project has recently produced binding data for more than 100 transcription factors from about 500 ChIP-seq experiments in multiple cell types. While this large amount of data creates a valuable resource, it is nonetheless overwhelmingly complex and simultaneously incomplete since it covers only a small fraction of all human transcription factors.
Results
As part of the consortium effort in providing a concise abstraction of the data for facilitating various types of downstream analyses, we constructed statistical models that capture the genomic features of three paired types of regions by machine-learning methods: firstly, regions with active or inactive binding; secondly, those with extremely high or low degrees of co-binding, termed HOT and LOT regions; and finally, regulatory modules proximal or distal to genes. From the distal regulatory modules, we developed computational pipelines to identify potential enhancers, many of which were validated experimentally. We further associated the predicted enhancers with potential target transcripts and the transcription factors involved. For HOT regions, we found a significant fraction of transcription factor binding without clear sequence motifs and showed that this observation could be related to strong DNA accessibility of these regions.
Conclusions
Overall, the three pairs of regions exhibit intricate differences in chromosomal locations, chromatin features, factors that bind them, and cell-type specificity. Our machine learning approach enables us to identify features potentially general to all transcription factors, including those not included in the data.
doi:10.1186/gb-2012-13-9-r48
PMCID: PMC3491392
PMID: 22950945
Pei, Baikang | Sisu, Cristina | Frankish, Adam | Howald, Cédric | Habegger, Lukas | Mu, Xinmeng Jasmine | Harte, Rachel | Balasubramanian, Suganthi | Tanzer, Andrea | Diekhans, Mark | Reymond, Alexandre | Hubbard, Tim J | Harrow, Jennifer | Gerstein, Mark B
Background
Pseudogenes have long been considered as nonfunctional genomic sequences. However, recent evidence suggests that many of them might have some form of biological activity, and the possibility of functionality has increased interest in their accurate annotation and integration with functional genomics data.
Results
As part of the GENCODE annotation of the human genome, we present the first genome-wide pseudogene assignment for protein-coding genes, based on both large-scale manual annotation and in silico pipelines. A key aspect of this coupled approach is that it allows us to identify pseudogenes in an unbiased fashion as well as untangle complex events through manual evaluation. We integrate the pseudogene annotations with the extensive ENCODE functional genomics information. In particular, we determine the expression level, transcription-factor and RNA polymerase II binding, and chromatin marks associated with each pseudogene. Based on their distribution, we develop simple statistical models for each type of activity, which we validate with large-scale RT-PCR-Seq experiments. Finally, we compare our pseudogenes with conservation and variation data from primate alignments and the 1000 Genomes project, producing lists of pseudogenes potentially under selection.
Conclusions
At one extreme, some pseudogenes possess conventional characteristics of functionality; these may represent genes that have recently died. On the other hand, we find interesting patterns of partial activity, which may suggest that dead genes are being resurrected as functioning non-coding RNAs. The activity data of each pseudogene are stored in an associated resource, psiDR, which will be useful for the initial identification of potentially functional pseudogenes.
doi:10.1186/gb-2012-13-9-r51
PMCID: PMC3491395
PMID: 22951037
Dong, Xianjun | Greven, Melissa C | Kundaje, Anshul | Djebali, Sarah | Brown, James B | Cheng, Chao | Gingeras, Thomas R | Gerstein, Mark | Guigó, Roderic | Birney, Ewan | Weng, Zhiping
Background
Previous work has demonstrated that chromatin feature levels correlate with gene expression. The ENCODE project enables us to further explore this relationship using an unprecedented volume of data. Expression levels from more than 100,000 promoters were measured using a variety of high-throughput techniques applied to RNA extracted by different protocols from different cellular compartments of several human cell lines. ENCODE also generated the genome-wide mapping of eleven histone marks, one histone variant, and DNase I hypersensitivity sites in seven cell lines.
Results
We built a novel quantitative model to study the relationship between chromatin features and expression levels. Our study not only confirms that the general relationships found in previous studies hold across various cell lines, but also makes new suggestions about the relationship between chromatin features and gene expression levels. We found that expression status and expression levels can be predicted by different groups of chromatin features, both with high accuracy. We also found that expression levels measured by CAGE are better predicted than by RNA-PET or RNA-Seq, and different categories of chromatin features are the most predictive of expression for different RNA measurement methods. Additionally, PolyA+ RNA is overall more predictable than PolyA- RNA among different cell compartments, and PolyA+ cytosolic RNA measured with RNA-Seq is more predictable than PolyA+ nuclear RNA, while the opposite is true for PolyA- RNA.
Conclusions
Our study provides new insights into transcriptional regulation by analyzing chromatin features in different cellular contexts.
doi:10.1186/gb-2012-13-9-r53
PMCID: PMC3491397
PMID: 22950368
Advances in sequencing technology have led to a sharp decrease in the cost of 'data generation'. But is this sufficient to ensure cost-effective and efficient 'knowledge generation'?
doi:10.1186/gb-2011-12-8-125
PMCID: PMC3245608
PMID: 21867570
Bioinformatics; costs of sequencing; data analysis; experimental design; next-generation sequencing; sample collection
A National Institutes of Health (NIH) workshop was convened in Bethesda, MD on September 26–27, 2011, with representative scientific leaders in the field of proteomics and its applications to clinical settings. The main purpose of this workshop was to articulate ways in which the biomedical research community can capitalize on recent technology advances and synergize with ongoing efforts to advance the field of human proteomics. This executive summary and the following full report describe the main discussions and outcomes of the workshop.
doi:10.1186/1559-0275-9-6
PMCID: PMC3388576
PMID: 22583803
The study of the developing brain has begun to shed light on the underpinnings of both early and adult onset neuropsychiatric disorders. Neuroimaging of the human brain across developmental time points and the use of model animal systems have combined to reveal brain systems and gene products that may play a role in autism spectrum disorders, attention deficit hyperactivity disorder, obsessive compulsive disorder and many other neurodevelopmental conditions. However, precisely how genes may function in human brain development and how they interact with each other leading to psychiatric disorders is unknown. Because of an increasing understanding of neural stem cells and how the nervous system subsequently develops from these cells, we now have the ability to study disorders of the nervous system in a new way: by rewinding and reviewing the development of human neural cells. Induced pluripotent stem cells (iPSCs), developed from mature somatic cells, have allowed the development of specific cells in patients to be observed in real-time. Moreover, they have allowed some neuronal-specific abnormalities to be corrected with pharmacological intervention in tissue culture. These exciting advances based on the use of iPSCs hold great promise for understanding, diagnosing and, possibly, treating psychiatric disorders. Specifically, examination of iPSCs from typically developing individuals will reveal how basic cellular processes and genetic differences contribute to individually unique nervous systems. Moreover, by comparing iPSCs from typically developing individuals and patients, differences at stem cell stages, through neural differentiation, and into the development of functional neurons may be identified that will reveal opportunities for intervention.
The application of such techniques to early onset neuropsychiatric disorders is still on the horizon but has become a reality of current research efforts as a consequence of the revelations of many years of basic developmental neurobiological science.
doi:10.1111/j.1469-7610.2010.02348.x
PMCID: PMC3124336
PMID: 21204834
Wu, Jia Qian | Seay, Montrell | Schulz, Vincent P. | Hariharan, Manoj | Tuck, David | Lian, Jin | Du, Jiang | Shi, Minyi | Ye, Zhijia | Gerstein, Mark | Snyder, Michael P. | Weissman, Sherman | Copenhaver, Gregory P.
A critical problem in biology is understanding how cells choose between self-renewal and differentiation. To generate a comprehensive view of the mechanisms controlling early hematopoietic precursor self-renewal and differentiation, we used systems-based approaches and murine EML multipotential hematopoietic precursor cells as a primary model. EML cells give rise to a mixture of self-renewing Lin-SCA+CD34+ cells and partially differentiated non-renewing Lin-SCA-CD34− cells in a cell autonomous fashion. We identified and validated the HMG box protein TCF7 as a regulator in this self-renewal/differentiation switch that operates in the absence of autocrine Wnt signaling. We found that Tcf7 is the most down-regulated transcription factor when CD34+ cells switch into CD34− cells, using RNA–Seq. We subsequently identified the target genes bound by TCF7, using ChIP–Seq. We show that TCF7 and RUNX1 (AML1) bind to each other's promoter regions and that TCF7 is necessary for the production of the short isoforms, but not the long isoforms of RUNX1, suggesting that TCF7 and the short isoforms of RUNX1 function coordinately in regulation. Tcf7 knock-down experiments and Gene Set Enrichment Analyses suggest that TCF7 plays a dual role in promoting the expression of genes characteristic of self-renewing CD34+ cells while repressing genes activated in partially differentiated CD34− state. Finally a network of up-regulated transcription factors of CD34+ cells was constructed. Factors that control hematopoietic stem cell (HSC) establishment and development, cell growth, and multipotency were identified. These studies in EML cells demonstrate fundamental cell-intrinsic properties of the switch between self-renewal and differentiation, and yield valuable insights for manipulating HSCs and other differentiating systems.
Author Summary
The hematopoietic system has provided a leading model for stem cell studies, and there is great interest in elucidating the mechanisms that control the decision of HSC self-renewal and differentiation. This switch is important for understanding hematopoietic diseases and manipulating HSCs for therapeutic purposes. However, because HSCs are currently unable to proliferate extensively in vitro, this severely limits the types of biochemical analyses that can be performed; and, consequently, the mechanisms that control the decision between early-stage HSC self-renewal and differentiation remain unclear. Murine bone marrow derived EML multipotential hematopoietic precursor cells are ideal for studying the switch. EML cells can grow in large culture and give rise to a mixture of self-renewing Lin-SCA+CD34+ cells and partially differentiated non-renewing Lin-SCA-CD34− cells in a cell autonomous fashion. Using RNA–Sequencing and ChIP–Sequencing, we identified and validated the HMG box protein TCF7 as a regulator in this switch and find that it operates in the absence of canonical Wnt signaling. Together with RUNX1, TCF7 regulates a network of transcription factors that characterize the CD34+ cell state. This work serves as a model for studying mechanisms of autonomous and balanced cell fate choice and is ultimately valuable for manipulating HSCs.
doi:10.1371/journal.pgen.1002565
PMCID: PMC3297581
PMID: 22412390
Gianoulis, Tara A. | Griffin, Meghan A. | Spakowicz, Daniel J. | Dunican, Brian F. | Alpha, Cambria J. | Sboner, Andrea | Sismour, A. Michael | Kodira, Chinnappa | Egholm, Michael | Church, George M. | Gerstein, Mark B. | Strobel, Scott A. | Monchy, Sébastien
The microbial conversion of solid cellulosic biomass to liquid biofuels may provide a renewable energy source for transportation fuels. Endophytes represent a promising group of organisms, as they are a mostly untapped reservoir of metabolic diversity. They are often able to degrade cellulose, and they can produce an extraordinary diversity of metabolites. The filamentous fungal endophyte Ascocoryne sarcoides was shown to produce potential-biofuel metabolites when grown on a cellulose-based medium; however, the genetic pathways needed for this production are unknown and the lack of genetic tools makes traditional reverse genetics difficult. We present the genomic characterization of A. sarcoides and use transcriptomic and metabolomic data to describe the genes involved in cellulose degradation and to provide hypotheses for the biofuel production pathways. In total, almost 80 biosynthetic clusters were identified, including several previously found only in plants. Additionally, many transcriptionally active regions outside of genes showed condition-specific expression, offering more evidence for the role of long non-coding RNA in gene regulation. This is one of the highest quality fungal genomes and, to our knowledge, the only thoroughly annotated and transcriptionally profiled fungal endophyte genome currently available. The analyses and datasets contribute to the study of cellulose degradation and biofuel production and provide the genomic foundation for the study of a model endophyte system.
Author Summary
A renewable source of energy is a pressing global need. The biological conversion of lignocellulose to biofuels by microorganisms presents a promising avenue, but few organisms have been studied thoroughly enough to develop the genetic tools necessary for rigorous experimentation. The filamentous-fungal endophyte A. sarcoides produces metabolites when grown on a cellulose-based medium that include eight-carbon volatile organic compounds, which are potential biofuel targets. Here we use broadly applicable methods including genomics, transcriptomics, and metabolomics to explore the biofuel production of A. sarcoides. These data were used to assemble the genome into 16 scaffolds, to thoroughly annotate the cellulose-degradation machinery, and to make predictions for the production pathway for the eight-carbon volatiles. Extremely high expression of the gene swollenin when grown on cellulose highlights the importance of accessory proteins in addition to the enzymes that catalyze the breakdown of the polymers. Correlation of the production of the eight-carbon biofuel-like metabolites with the expression of lipoxygenase pathway genes suggests the catabolism of linoleic acid as the mechanism of eight-carbon compound production. This is the first fungal genome to be sequenced in the family Helotiaceae, and A. sarcoides was isolated as an endophyte, making this work also potentially useful in fungal systematics and the study of plant–fungus relationships.
doi:10.1371/journal.pgen.1002558
PMCID: PMC3291568
PMID: 22396667
With the recent advances in high-throughput RNA sequencing (RNA-Seq), biologists are able to measure transcription with unprecedented precision. One problem that can now be tackled is that of isoform quantification: here one tries to reconstruct the abundances of isoforms of a gene. We have developed a statistical solution for this problem, based on analyzing a set of RNA-Seq reads, and a practical implementation, available from archive.gersteinlab.org/proj/rnaseq/IQSeq, in a tool we call IQSeq (Isoform Quantification in next-generation Sequencing). Here, we present theoretical results which IQSeq is based on, and then use both simulated and real datasets to illustrate various applications of the tool. In order to measure the accuracy of an isoform-quantification result, one would try to estimate the average variance of the estimated isoform abundances for each gene (based on resampling the RNA-seq reads), and IQSeq has a particularly fast algorithm (based on the Fisher Information Matrix) for calculating this, achieving a speedup of times compared to brute-force resampling. IQSeq also calculates an information theoretic measure of overall transcriptome complexity to describe isoform abundance for a whole experiment. IQSeq has many features that are particularly useful in RNA-Seq experimental design, allowing one to optimally model the integration of different sequencing technologies in a cost-effective way. In particular, the IQSeq formalism integrates the analysis of different sample (i.e. read) sets generated from different technologies within the same statistical framework. It also supports a generalized statistical partial-sample-generation function to model the sequencing process. This allows one to have a modular, “plugin-able” read-generation function to support the particularities of the many evolving sequencing technologies.
doi:10.1371/journal.pone.0029175
PMCID: PMC3253133
PMID: 22238592
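The isoform quantification problem described in the IQSeq abstract above can be illustrated with a generic expectation-maximization (EM) estimate of isoform fractions from reads that are compatible with one or more isoforms. This is a textbook-style sketch, not IQSeq's statistical model: it assumes equal effective isoform lengths, takes read-to-isoform compatibility as given, and does not reproduce the Fisher-information variance calculation or the partial-sample-generation function. All names and the toy reads are hypothetical.

```python
def em_isoform_fractions(compat, n_isoforms, n_iter=200):
    """Estimate isoform fractions by EM.

    compat: list with one entry per read; each entry is a non-empty tuple of
            the isoform indices that read is compatible with.
    n_isoforms: number of isoforms of the gene.
    Returns a list of estimated fractions that sums to 1.
    """
    theta = [1.0 / n_isoforms] * n_isoforms
    for _ in range(n_iter):
        counts = [0.0] * n_isoforms
        # E-step: split each read among its compatible isoforms in
        # proportion to the current abundance estimates.
        for isoforms in compat:
            z = sum(theta[i] for i in isoforms)
            for i in isoforms:
                counts[i] += theta[i] / z
        # M-step: new fractions are the expected share of reads per isoform.
        theta = [c / len(compat) for c in counts]
    return theta


# Toy gene with two isoforms: 6 reads unique to isoform 0,
# 2 reads unique to isoform 1, and 4 reads shared by both.
toy_reads = [(0,)] * 6 + [(1,)] * 2 + [(0, 1)] * 4
toy_theta = em_isoform_fractions(toy_reads, n_isoforms=2)
```

In this toy case the shared reads are apportioned 3:1, following the ratio set by the uniquely assigned reads, so the estimate converges to fractions of 0.75 and 0.25.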
Open source and open data have been driving forces in bioinformatics in the past. However, privacy concerns may soon change the landscape, limiting future access to important data sets, including personal genomics data. Here we survey this situation in some detail, describing, in particular, how the large scale of the data from personal genomic sequencing makes it especially hard to share data, exacerbating the privacy problem. We also go over various aspects of genomic privacy: first, there is basic identifiability of subjects having their genome sequenced. However, even for individuals who have consented to be identified, there is the prospect of very detailed future characterization of their genotype, which, unanticipated at the time of their consent, may be more personal and invasive than the release of their medical records. We go over various computational strategies for dealing with the issue of genomic privacy. One can “slice” and reformat datasets to allow them to be partially shared while securing the most private variants. This is particularly applicable to functional genomics information, which can be largely processed without variant information. For handling the most private data there are a number of legal and technological approaches—for example, modifying the informed consent procedure to acknowledge that privacy cannot be guaranteed, and/or employing a secure cloud computing environment. Cloud computing in particular may allow access to the data in a more controlled fashion than the current practice of downloading and computing on large datasets. Furthermore, it may be particularly advantageous for small labs, given that the burden of many privacy issues falls disproportionately on them in comparison to large corporations and genome centers. Finally, we discuss how education of future genetics researchers will be important, with curriculums emphasizing privacy and data security. 
However, teaching personal genomics with identifiable subjects in the university setting will, in turn, create additional privacy issues and social conundrums.
doi:10.1371/journal.pcbi.1002278
PMCID: PMC3228779
PMID: 22144881
Accurate and efficient genome-wide detection of copy number variants (CNVs) is essential for understanding human genomic variation, genome-wide CNV association-type studies, cytogenetics research and diagnostics, and independent validation of CNVs identified by sequencing-based technologies. Numerous array-based platforms for CNV detection exist, utilizing array Comparative Genome Hybridization (aCGH), Single Nucleotide Polymorphism (SNP) genotyping, or both. We have quantitatively assessed the abilities of twelve leading genome-wide CNV detection platforms to accurately detect Gold Standard sets of CNVs in the genome of HapMap CEU sample NA12878, and found significant differences in performance. The technologies analyzed were the NimbleGen 4.2 M, 2.1 M and 3×720 K Whole Genome and CNV-focused arrays, the Agilent 1×1 M CGH and High Resolution and 2×400 K CNV and SNP+CGH arrays, the Illumina Human Omni1Quad array and the Affymetrix SNP 6.0 array. The Gold Standards used were a 1000 Genomes Project sequencing-based set of 3997 validated CNVs and an ultra high-resolution aCGH-based set of 756 validated CNVs. We found that sensitivity, total number, size range and breakpoint resolution of CNV calls were highest for CNV-focused arrays. Our results are important for cost-effective CNV detection and validation for both basic and clinical applications.
doi:10.1371/journal.pone.0027859
PMCID: PMC3227574
PMID: 22140474
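The sensitivity comparison described in the CNV platform abstract above can be illustrated with a minimal sketch. The abstract does not state how a platform call was matched to a Gold Standard CNV, so the 50% reciprocal-overlap rule below is an assumption, and the coordinates, function names and `min_overlap` parameter are hypothetical.

```python
from typing import List, Tuple

# (chromosome, start, end); assumes end > start for every interval.
CNV = Tuple[str, int, int]


def reciprocal_overlap(a: CNV, b: CNV) -> float:
    """Return the smaller of the two overlap fractions, or 0.0 if disjoint."""
    if a[0] != b[0]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[2] - a[1]), overlap / (b[2] - b[1]))


def sensitivity(gold: List[CNV], calls: List[CNV], min_overlap: float = 0.5) -> float:
    """Fraction of gold-standard CNVs matched by at least one call.

    The matching rule (reciprocal overlap >= min_overlap) is an assumption
    made for illustration, not the criterion used in the study.
    """
    if not gold:
        raise ValueError("gold standard set is empty")
    detected = sum(
        1
        for g in gold
        if any(reciprocal_overlap(g, c) >= min_overlap for c in calls)
    )
    return detected / len(gold)


# Toy data: only the first gold-standard CNV is matched under the 50% rule.
gold_standard = [("chr1", 100, 200), ("chr1", 500, 900), ("chr2", 50, 150)]
platform_calls = [("chr1", 110, 210), ("chr1", 850, 1200), ("chr3", 50, 150)]
print(sensitivity(gold_standard, platform_calls))
```

A stricter or looser `min_overlap` changes which calls count as detections, which is one reason sensitivity figures are only comparable when the matching rule is held fixed across platforms.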
Cheng, Chao | Yan, Koon-Kiu | Hwang, Woochang | Qian, Jiang | Bhardwaj, Nitin | Rozowsky, Joel | Lu, Zhi John | Niu, Wei | Alves, Pedro | Kato, Masaomi | Snyder, Michael | Gerstein, Mark | Price, Nathan D.
We present a network framework for analyzing multi-level regulation in higher eukaryotes based on systematic integration of various high-throughput datasets. The network, namely the integrated regulatory network, consists of three major types of regulation: TF→gene, TF→miRNA and miRNA→gene. We identified the target genes and target miRNAs for a set of TFs based on ChIP-Seq binding profiles, and predicted the targets of miRNAs using annotated 3′UTR sequences and conservation information. Making use of the system-wide RNA-Seq profiles, we classified transcription factors into positive and negative regulators and assigned a sign for each regulatory interaction. Other types of edges such as protein-protein interactions and potential intra-regulations between miRNAs based on the embedding of miRNAs in their host genes were further incorporated. We examined the topological structures of the network, including its hierarchical organization and motif enrichment. We found that transcription factors downstream of the hierarchy distinguish themselves by being expressed more uniformly across tissues, having more interacting partners, and being more likely to be essential. We found an over-representation of notable network motifs, including a feed-forward loop (FFL) in which a miRNA cost-effectively shuts down a transcription factor and its target. We used data of C. elegans from the modENCODE project as a primary model to illustrate our framework, but further verified the results using two other data sets. As more and more genome-wide ChIP-Seq and RNA-Seq data become available in the near future, our methods of data integration have various potential applications.
Author Summary
The precise control of gene expression lies at the heart of many biological processes. In eukaryotes, the regulation is performed at multiple levels, mediated by different regulators such as transcription factors and miRNAs, each distinguished by different spatial and temporal characteristics. These regulators are further integrated to form a complex regulatory network responsible for the orchestration. The construction and analysis of such networks is essential for understanding the general design principles. Recent advances in high-throughput techniques like ChIP-Seq and RNA-Seq provide an opportunity by offering a huge amount of binding and expression data. We present a general framework to combine these types of data into an integrated network and perform various topological analyses, including its hierarchical organization and motif enrichment. We find that the integrated network possesses an intrinsic hierarchical organization and is enriched in several network motifs that include both transcription factors and miRNAs. We further demonstrate that the framework can be easily applied to other species like human and mouse. As more and more genome-wide ChIP-Seq and RNA-Seq data are going to be generated in the near future, our methods of data integration have various potential applications.
doi:10.1371/journal.pcbi.1002190
PMCID: PMC3219617
PMID: 22125477
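The feed-forward loop motif mentioned in the integrated regulatory network abstract above can be sketched as a search over a directed edge set that mixes TF→gene, TF→miRNA and miRNA→gene edges. This is a minimal illustration of motif finding, not the enrichment analysis used in the paper (which would also need a randomized-network background); the node names and the function `find_feed_forward_loops` are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def find_feed_forward_loops(edges: Set[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Return sorted (x, y, z) triples where x->y, y->z and x->z are all edges."""
    # Map each regulator to the set of nodes it regulates.
    targets: Dict[str, Set[str]] = {}
    for src, dst in edges:
        targets.setdefault(src, set()).add(dst)

    loops = []
    for x, x_targets in targets.items():
        for y in x_targets:
            if y == x:
                continue  # ignore self-regulation
            for z in targets.get(y, set()):
                # z must be a third, distinct node that x also regulates directly.
                if z != x and z != y and z in x_targets:
                    loops.append((x, y, z))
    return sorted(loops)


# Toy network: a miRNA regulates a TF and also the TF's target gene.
example_edges = {
    ("miR_a", "TF_b"),    # miRNA -> TF
    ("TF_b", "gene_c"),   # TF -> gene
    ("miR_a", "gene_c"),  # miRNA -> gene
    ("TF_d", "gene_c"),   # unrelated TF -> gene edge
}
print(find_feed_forward_loops(example_edges))
```

Signs on the edges (positive or negative regulation, as assigned from RNA-Seq profiles in the abstract) could be stored in a dictionary keyed by edge if coherent and incoherent loops need to be separated.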
Cell
2010;143(4):639-650.
Summary
Natural small compounds comprise most cellular molecules and bind proteins as substrates, products, cofactors and ligands. However, a large scale investigation of in vivo protein-small metabolite interactions has not been performed. We developed a mass spectrometry assay for the large scale identification of in vivo protein-hydrophobic small metabolite interactions in yeast and analyzed compounds that bind ergosterol biosynthetic proteins and protein kinases. Many of these proteins bind small metabolites; a few interactions were previously known, but the vast majority are novel. Importantly, many key regulatory proteins such as protein kinases bind metabolites. Ergosterol was found to bind many proteins and may function as a general regulator. It is required for the activity of Ypk1, a mammalian AKT/SGK1 kinase homolog. Our study defines potential key regulatory steps in lipid biosynthetic pathways and suggests small metabolites may play a more general role as regulators of protein activity and function than previously appreciated.
doi:10.1016/j.cell.2010.09.048
PMCID: PMC3005334
PMID: 21035178
We propose a method to predict yeast transcription factor targets by integrating histone modification profiles with transcription factor binding motif information. It shows improved predictive power compared to a binding motif-only method. We find that transcription factors cluster into histone-sensitive and -insensitive classes. The target genes of histone-sensitive transcription factors have stronger histone modification signals than those of histone-insensitive ones. The two classes also differ in tendency to interact with histone modifiers, degree of connectivity in protein-protein interaction networks, position in the transcriptional regulation hierarchy, and in a number of additional features, indicating possible differences in their transcriptional regulation mechanisms.
doi:10.1186/gb-2011-12-11-r111
PMCID: PMC3334597
PMID: 22060676
Background
Knowledge of the structure of proteins bound to known or potential ligands is crucial for biological understanding and drug design. Often the 3D structure of the protein is available in some conformation, but binding the ligand of interest may involve a large scale conformational change which is difficult to predict with existing methods.
Results
We describe how to generate ligand binding conformations of proteins that move by hinge bending, the largest class of motions. First, we predict the location of the hinge between domains. Second, we apply an Euler rotation to one of the domains about the hinge point. Third, we compute a short-time dynamical trajectory using Molecular Dynamics to equilibrate the protein and ligand and correct unnatural atomic positions. Fourth, we score the generated structures using a novel fitness function which favors closed or holo structures. By iterating the second through fourth steps we systematically minimize the fitness function, thus predicting the conformational change required for small ligand binding for five well studied proteins.
Conclusions
We demonstrate that the method in most cases successfully predicts the holo conformation given only an apo structure.
doi:10.1186/1471-2105-12-417
PMCID: PMC3354956
PMID: 22032721
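The second step of the hinge-bending procedure in the abstract above, rotating one domain about the hinge point, can be sketched as follows. The abstract does not give the Euler angle convention, so the z-y-x convention used here is an assumption, and the coordinates, the type alias `Vec` and the function names are hypothetical. Hinge prediction, Molecular Dynamics equilibration and the fitness function are not shown.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]


def euler_matrix(alpha: float, beta: float, gamma: float) -> List[List[float]]:
    """Rotation matrix R = Rz(alpha) * Ry(beta) * Rx(gamma), angles in radians.

    The z-y-x convention is an assumption made for this sketch.
    """
    ca, sa = math.cos(alpha), math.sin(alpha)
    cb, sb = math.cos(beta), math.sin(beta)
    cg, sg = math.cos(gamma), math.sin(gamma)
    return [
        [ca * cb, ca * sb * sg - sa * cg, ca * sb * cg + sa * sg],
        [sa * cb, sa * sb * sg + ca * cg, sa * sb * cg - ca * sg],
        [-sb, cb * sg, cb * cg],
    ]


def rotate_domain(coords: Sequence[Vec], hinge: Vec, angles: Vec) -> List[Vec]:
    """Rotate every atom in coords about the hinge point by the Euler angles."""
    rot = euler_matrix(*angles)
    rotated = []
    for p in coords:
        # Shift so the hinge is the origin, rotate, then shift back.
        v = (p[0] - hinge[0], p[1] - hinge[1], p[2] - hinge[2])
        r = tuple(
            rot[i][0] * v[0] + rot[i][1] * v[1] + rot[i][2] * v[2] + hinge[i]
            for i in range(3)
        )
        rotated.append(r)
    return rotated


# Toy domain of two atoms; the first sits exactly on the hinge.
hinge_point = (1.0, 1.0, 0.0)
domain_atoms = [(1.0, 1.0, 0.0), (2.0, 1.0, 0.0)]
moved = rotate_domain(domain_atoms, hinge_point, (math.pi / 2, 0.0, 0.0))
print(moved)
```

An atom located at the hinge does not move, and distances from the hinge are preserved, which is what makes this a rigid-body motion of the domain.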
Nègre, Nicolas | Brown, Christopher D. | Ma, Lijia | Bristow, Christopher Aaron | Miller, Steven W. | Wagner, Ulrich | Kheradpour, Pouya | Eaton, Matthew L. | Loriaux, Paul | Sealfon, Rachel | Li, Zirong | Ishii, Haruhiko | Spokony, Rebecca F. | Chen, Jia | Hwang, Lindsay | Cheng, Chao | Auburn, Richard P. | Davis, Melissa B. | Domanus, Marc | Shah, Parantu K. | Morrison, Carolyn A. | Zieba, Jennifer | Suchy, Sarah | Senderowicz, Lionel | Victorsen, Alec | Bild, Nicholas A. | Grundstad, A. Jason | Hanley, David | MacAlpine, David M. | Mannervik, Mattias | Venken, Koen | Bellen, Hugo | White, Robert | Russell, Steven | Grossman, Robert L. | Ren, Bing | Gerstein, Mark | Posakony, James W. | Kellis, Manolis | White, Kevin P.
Nature
2011;471(7339):527-531.
Systematic annotation of gene regulatory elements is a major challenge in genome science. Direct mapping of chromatin modification marks and transcriptional factor binding sites genome-wide 1,2 has successfully identified specific subtypes of regulatory elements 3. In Drosophila several pioneering studies have provided genome-wide identification of Polycomb-Response Elements 4, chromatin states 5, transcription factor binding sites (TFBS) 6–9, PolII regulation 8, and insulator elements 10; however, comprehensive annotation of the regulatory genome remains a significant challenge. Here we describe results from the modENCODE cis-regulatory annotation project. We produced a map of the Drosophila melanogaster regulatory genome based on more than 300 chromatin immuno-precipitation (ChIP) datasets for eight chromatin features, five histone deacetylases (HDACs) and thirty-eight site-specific transcription factors (TFs) at different stages of development. Using these data we inferred more than 20,000 candidate regulatory elements and we validated a subset of predictions for promoters, enhancers, and insulators in vivo. We also identified nearly 2,000 genomic regions of dense TF binding associated with chromatin activity and accessibility. We discovered hundreds of new TF co-binding relationships and defined a TF network with over 800 potential regulatory relationships.
doi:10.1038/nature09990
PMCID: PMC3179250
PMID: 21430782
Transcription factor (TF) binding and histone modification (HM) are important for the precise control of gene expression. Hence, we constructed statistical models to relate these to gene expression levels in mouse embryonic stem cells. While both TF binding and HMs are highly ‘predictive’ of gene expression levels (in a statistical, but perhaps not strictly mechanistic, sense), we find they show distinct differences in the spatial patterning of their predictive strength: TF binding achieved the highest predictive power in a small DNA region centered at the transcription start sites of genes, while the HMs exhibited high predictive powers across a wide region around genes. Intriguingly, our results suggest that TF binding and HMs are redundant in a strict statistical sense for predicting gene expression. We also show that our TF and HM models are cell line specific; specifically, TF binding and HM are more predictive of gene expression in the same cell line, and the differential gene expression between cell lines is predictable by differential HMs. Finally, we found that the models trained solely on protein-coding genes are predictive of expression levels of microRNAs, suggesting that their regulation by TFs and HMs may share a similar mechanism to that for protein-coding genes.
doi:10.1093/nar/gkr752
PMCID: PMC3258143
PMID: 21926158
Background
Peptide Recognition Domains (PRDs) are commonly found in signaling proteins. They mediate protein-protein interactions by recognizing and binding short motifs in their ligands. Although a great deal is known about PRDs and their interactions, prediction of PRD specificities remains largely an unsolved problem.
Results
We present a novel approach to identifying the Specificity Determining Residues (SDRs) of PRDs. Our algorithm generalizes earlier information theoretic approaches to coevolution analysis, making them applicable to this problem. It leverages the growing wealth of binding data between PRDs and large numbers of random peptides, and searches for PRD residues that exhibit strong evolutionary covariation with some positions of the statistical profiles of bound peptides. The calculations involve only information from sequences, and thus can be applied to PRDs without crystal structures. We applied the approach to PDZ, SH3 and kinase domains, and evaluated the results using both residue proximity in co-crystal structures and verified binding specificity maps from mutagenesis studies.
Discussion
Our predictions were found to be strongly correlated with the physical proximity of residues, demonstrating the ability of our approach to detect physical interactions of the binding partners. Some high-scoring pairs were further confirmed to affect binding specificity using previous experimental results. Combining the covariation results also allowed us to predict binding profiles with higher reliability than two other methods that do not explicitly take residue covariation into account.
Conclusions
The general applicability of our approach to the three different domain families demonstrated in this paper suggests its potential in predicting binding targets and assisting the exploration of binding mechanisms.
doi:10.1186/1741-7007-9-53
PMCID: PMC3224579
PMID: 21835011
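The abstract above builds on information theoretic measures of covariation. A minimal sketch of the simplest such measure, mutual information between one domain residue position and one peptide position, is given below. This is the classical paired-column form that the abstract says the algorithm generalizes, not the authors' method, which works with statistical profiles of bound peptides; the toy columns and the function name are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def mutual_information(xs: Sequence[str], ys: Sequence[str]) -> float:
    """Mutual information in bits between two paired categorical columns.

    xs[i] is the residue at a given domain position in the i-th domain and
    ys[i] is the residue at a given position of that domain's bound peptide.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Toy columns: the domain residue fully determines the peptide residue.
domain_column = ["K", "K", "E", "E"]
covarying_peptide_column = ["D", "D", "R", "R"]
independent_peptide_column = ["D", "R", "D", "R"]

print(mutual_information(domain_column, covarying_peptide_column))
print(mutual_information(domain_column, independent_peptide_column))
```

With real alignments, raw mutual information is inflated by small sample sizes and by phylogenetic relatedness, so scores are usually compared against a background or corrected before positions are ranked.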
Berger, Michael F. | Lawrence, Michael S. | Demichelis, Francesca | Drier, Yotam | Cibulskis, Kristian | Sivachenko, Andrey Y. | Sboner, Andrea | Esgueva, Raquel | Pflueger, Dorothee | Sougnez, Carrie | Onofrio, Robert | Carter, Scott L. | Park, Kyung | Habegger, Lukas | Ambrogio, Lauren | Fennell, Timothy | Parkin, Melissa | Saksena, Gordon | Voet, Douglas | Ramos, Alex H. | Pugh, Trevor J. | Wilkinson, Jane | Fisher, Sheila | Winckler, Wendy | Mahan, Scott | Ardlie, Kristin | Baldwin, Jennifer | Simons, Jonathan W. | Kitabayashi, Naoki | MacDonald, Theresa Y. | Kantoff, Philip W. | Chin, Lynda | Gabriel, Stacey B. | Gerstein, Mark B. | Golub, Todd R. | Meyerson, Matthew | Tewari, Ashutosh | Lander, Eric S. | Getz, Gad | Rubin, Mark A. | Garraway, Levi A.
Nature
2011;470(7333):214-220.
Prostate cancer is the second most common cause of male cancer deaths in the United States. Here we present the complete sequence of seven primary prostate cancers and their paired normal counterparts. Several tumors contained complex chains of balanced rearrangements that occurred within or adjacent to known cancer genes. Rearrangement breakpoints were enriched near open chromatin, androgen receptor and ERG DNA binding sites in the setting of the ETS gene fusion TMPRSS2-ERG, but inversely correlated with these regions in tumors lacking ETS fusions. This observation suggests a link between chromatin or transcriptional regulation and the genesis of genomic aberrations. Three tumors contained rearrangements that disrupted CADM2, and four harbored events disrupting either PTEN (unbalanced events), a prostate tumor suppressor, or MAGI2 (balanced events), a PTEN interacting protein not previously implicated in prostate tumorigenesis. Thus, genomic rearrangements may arise from transcriptional or chromatin aberrancies to engage prostate tumorigenic mechanisms.
doi:10.1038/nature09744
PMCID: PMC3075885
PMID: 21307934