The gene expression programs that establish and maintain specific cell states in humans are controlled by thousands of transcription factors, cofactors and chromatin regulators. Misregulation of these gene expression programs can cause a broad range of diseases. Here we review recent advances in our understanding of transcriptional regulation and discuss how these have provided new insights into transcriptional misregulation in disease.
The key concepts of transcriptional control were established half a century ago in bacterial systems (Jacob and Monod, 1961). That pioneering work and many subsequent studies established that DNA binding transcription factors (also known as trans-factors) occupy specific DNA sequences at control elements (cis-elements) and recruit and regulate the transcription apparatus. In eukaryotic systems, there has been extensive study of specific transcription factors and their cofactors, the general transcription apparatus, and various chromatin regulators, leading to a present-day consensus model for selective gene control (Adelman and Lis, 2012; Bannister and Kouzarides, 2011; Bonasio et al., 2010; Conaway and Conaway, 2011; Fuda et al., 2009; Ho and Crabtree, 2010; Roeder, 2005; Spitz and Furlong, 2012; Taatjes, 2010; Zhou et al., 2012b).
Our knowledge of mammalian regulatory elements and the transcriptional and chromatin regulators that operate at these sites has increased considerably in the last decade. There have also been substantial advances in our understanding of the control of large portions of the gene expression program in embryonic stem cells (ESCs) and in a number of more differentiated cell types. In these relatively well-studied cells, for example, it is now understood that a small fraction of the hundreds of transcription factors that are present dominate the control of much of the active gene expression program (Graf, 2011; Ng and Surani, 2011; Orkin and Hochedlinger, 2011; Young, 2011).
The recent insights into control of cellular gene expression programs have had an important impact on our understanding of misregulation of gene expression in disease. Many different diseases and syndromes, including cancer, autoimmunity, neurological disorders, diabetes, cardiovascular disease and obesity, can be caused by mutations in regulatory sequences and in the transcription factors, cofactors, chromatin regulators and noncoding RNAs that interact with these regions. New insights into the global effects of some of these mutations have recently emerged. These insights alter our view of the underlying cause of some diseases, and are the primary focus of this review.
We begin with a brief review of the basic features of human genes and the fundamentals of gene regulation. This leads to a discussion of cellular gene expression programs and the mechanisms involved in global regulation of transcription. We then describe how recent advances in our understanding of the control of gene expression have led to new insights into the mechanisms involved in misregulation of gene expression in various human diseases and disorders.
A remarkable variety and number of genes are transcribed into protein-coding and non-coding RNA (ncRNA) species in mammalian cells (Table 1). The human genome is thought to contain approximately 20,000 protein-coding genes and at least as many ncRNA genes (Djebali et al., 2012). Functions have been determined or inferred for many of the protein-coding genes, but less is understood about the functions of the ncRNA genes. Many of the ncRNAs contribute to control of gene expression through modulation of transcriptional or post-transcriptional processes (Bartel, 2009; Ebert and Sharp, 2012; Lee, 2012; Orom and Shiekhattar, 2011; Rinn and Chang, 2012; Wright and Ciosk, 2012). For example, the miRNAs, which are the best-studied of the various classes of ncRNAs, fine-tune the levels of target mRNAs. Some of the long ncRNAs (lncRNAs) recruit chromatin regulators to specific regions of the genome and thereby modify gene expression. Others apparently have no function of their own but are simply the product of a transcriptional event that is itself regulatory (Latos et al., 2012).
Transcription factors typically regulate gene expression by binding enhancer elements and recruiting co-activators and RNA polymerase II to target genes (Lelli et al., 2012; Ong and Corces, 2011; Spitz and Furlong, 2012). Multiple transcription factors typically bind in a cooperative fashion to individual enhancers (Panne, 2008) and regulate transcription from the core promoters of nearby or distant genes through physical contacts that involve looping of the DNA between enhancers and the core promoters (Krivega and Dean, 2012). The core promoter elements, which include sites where transcription initiation occurs, can also be bound by certain transcription factors (Dikstein, 2011; Goodrich and Tjian, 2010).
Enhancers can be identified by profiling the locations of key transcriptional regulators genome-wide and testing whether these DNA elements are active in enhancer-reporter vectors, and a large population of embryonic stem cell (ESC) enhancers has been identified in this manner (Chen et al., 2008). Enhancers are occupied by nucleosomes with specific modifications and are sensitive to DNase treatment, and these features can be used to identify putative enhancers when the key transcriptional regulators are not known (Buecker and Wysocka, 2012; Thurman et al., 2012). Approximately 1 million putative enhancers have recently been identified in the human genome by using, in multiple cell types, a variety of high-throughput techniques that detect these features of enhancers (Dunham et al., 2012; Thurman et al., 2012). These putative enhancers provide a resource for identifying regions of the genome where sequence variation may impact factor binding and gene regulation, and thus contribute to disease. Recent studies suggest that a considerable portion of the genetic variation that is associated with disease occurs in these regulatory regions (Maurano et al., 2012).
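As a minimal illustration of how a catalog of putative enhancers can be used to flag variants that fall in regulatory DNA, the sketch below intersects variant positions with enhancer intervals. The function name, the half-open coordinate convention, and all coordinates and identifiers are made up for illustration; they are not taken from the studies cited above, and real analyses would typically rely on dedicated genomics tools.

```python
from bisect import bisect_right

def variants_in_enhancers(enhancers, variants):
    """Return IDs of variants whose position lies inside an enhancer interval.

    enhancers: list of (chrom, start, end), half-open [start, end),
               assumed non-overlapping within each chromosome.
    variants:  list of (chrom, pos, variant_id).
    """
    # Group intervals by chromosome and sort them by start coordinate.
    by_chrom = {}
    for chrom, start, end in enhancers:
        by_chrom.setdefault(chrom, []).append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        index[chrom] = (intervals, [start for start, _ in intervals])

    hits = []
    for chrom, pos, variant_id in variants:
        if chrom not in index:
            continue
        intervals, starts = index[chrom]
        # Rightmost interval that starts at or before the variant position.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and intervals[i][0] <= pos < intervals[i][1]:
            hits.append(variant_id)
    return hits

# Hypothetical toy data (not real genomic coordinates).
TOY_ENHANCERS = [("chr1", 1000, 1500), ("chr1", 4000, 4600), ("chr2", 200, 900)]
TOY_VARIANTS = [
    ("chr1", 1200, "var_a"),
    ("chr1", 3000, "var_b"),
    ("chr2", 899, "var_c"),
    ("chr2", 900, "var_d"),
    ("chr3", 50, "var_e"),
]

if __name__ == "__main__":
    print(variants_in_enhancers(TOY_ENHANCERS, TOY_VARIANTS))
```

Sorting the intervals once and using a binary search per variant keeps the lookup fast even for a catalog on the order of a million enhancers.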
Transcriptional regulation occurs at two interconnected levels: the first involves transcription factors and the transcription apparatus and the second chromatin and its regulators (Figure 1). We briefly discuss the fundamentals of transcriptional control in this order, noting recent advances and reviews where the reader can obtain more detailed information.
Transcription factors can be separated into two classes based on their regulatory responsibilities: control of initiation versus control of elongation (Adelman and Lis, 2012; Fuda et al., 2009; Rahl et al., 2010; Yankulov et al., 1994; Zhou et al., 2012b). This distinction is not absolute, as some transcription factors may contribute to control of both initiation and elongation. Transcription factors typically bind cofactors, which are protein complexes that contribute to activation (coactivators) and repression (corepressors) but do not have DNA-binding properties of their own. Most transcription factors are thought to contribute to transcription initiation and do so by recruiting coactivators. These coactivators include the Mediator complex, P300 and general transcription factors (Juven-Gershon and Kadonaga, 2010; Malik and Roeder, 2010; Sikorski and Buratowski, 2009; Taatjes, 2010). Recent studies have highlighted the importance of Mediator in integrating information from transcriptional activators, repressors, signaling pathways and other regulators during transcription initiation and during the switch to elongation (Berk, 2012; Borggrefe and Yue, 2011; Conaway and Conaway, 2011; Kagey et al., 2010; Kornberg, 2005; Lariviere et al., 2012; Malik and Roeder, 2010; Spaeth et al., 2011; Taatjes, 2010).
Once the recruited RNA polymerase II molecules initiate transcription, they generally transcribe a short distance, typically 20–50 bp, and then pause (Figure 1) (Adelman and Lis, 2012). This process is controlled by the pause control factors DSIF and NELF, which are physically associated with the paused RNA polymerase II molecules. The paused polymerases may transition to active elongation through pause release or they may ultimately terminate transcription with release of the small RNA species. Pause release and subsequent elongation occur through recruitment and activation of P-TEFb (Positive Transcription Elongation Factor b), which phosphorylates the paused polymerase and its associated pause control factors. P-TEFb can be brought to these sites in the form of a large complex called the super elongation complex (SEC) (Luo et al., 2012a; Smith et al., 2011b). Additional complexes, such as PAFc, also contribute to the regulation of elongation (Jaehning, 2010). Transcription factors such as c-Myc stimulate P-TEFb-mediated release of RNA polymerase II from these pause sites and thus contribute to the control of transcription elongation (Rahl et al., 2010).
Recent studies have provided new insights into cofactors that play important roles in DNA loop formation and maintenance, which are key to proper gene control. During transcription initiation, the DNA loop formed between enhancers and core promoter elements is stabilized by cohesin, which is recruited by the Nipbl cohesin-loading protein that is associated with Mediator (Kagey et al., 2010). The cohesin complex has circular dimensions capable of encircling two nucleosome-bound molecules of DNA. Reducing the levels of cohesin or Nipbl has the same adverse effect on transcription as reducing the levels of Mediator, so these cofactors apparently play a similarly important role in gene activity (Kagey et al., 2010). Although cohesin is recruited to active promoters, it also becomes associated with the DNA-binding factor CTCF, which has been implicated in formation of insulator elements. Thus, cohesin is thought to have roles in transcription activation at some genes and in silencing at others (Dorsett, 2011; Hadjur et al., 2009; Parelho et al., 2008; Phillips and Corces, 2009; Schmidt et al., 2010; Seitan and Merkenschlager, 2012; Wendt et al., 2008).
The fundamental unit of chromatin, the nucleosome, is regulated by protein complexes that can mobilize the nucleosome or modify its histone components (Figure 1). Gene activation is accompanied by recruitment of ATP-dependent chromatin remodeling complexes of the SWI/SNF family, which mobilize nucleosomes to facilitate access of the transcription apparatus and its regulators to DNA (Clapier and Cairns, 2009; Hargreaves and Crabtree, 2011). In addition, there is recruitment, by transcription factors and the transcription apparatus, of an array of histone modifying enzymes that acetylate, methylate, ubiquitinate and otherwise chemically modify nucleosomes in a stereotypical fashion across the span of each active gene (Bannister and Kouzarides, 2011; Campos and Reinberg, 2009; Gardner et al., 2011; Rando, 2012; Zhu et al., 2013). These modifications provide interaction surfaces for protein complexes that contribute to transcriptional control. Enzymes that remove these modifications are also typically present at the active genes, producing a highly dynamic process of chromatin modification as RNA polymerase is recruited and goes through the various steps of initiation and elongation of the RNA species.
Repressed genes are embedded in chromatin with modifications that are characteristic of specific repression mechanisms (Beisel and Paro, 2011; Cedar and Bergman, 2012; Jones, 2012; Moazed, 2009; Reyes-Turcu and Grewal, 2012). One type of repressed chromatin, which contains nucleosome modifications generated by the Polycomb complex (e.g., histone H3K27me3), is found at genes that are silent but poised for activation at some later stage of development and differentiation (Orkin and Hochedlinger, 2011; Young, 2011). Another type of repressed chromatin is found in regions of the genome that are fully silenced, such as that containing retrotransposons and other repetitive elements (Feng et al., 2010; Lejeune and Allshire, 2011). The mechanisms that silence this latter set of genes can involve both nucleosome modification (e.g., histone H3K9me3) and DNA methylation.
The set of genes that are transcribed largely defines the cell. The gene expression program of a specific cell type includes RNA species from genes that are active in most cells (housekeeping genes) and genes that are active predominantly in one or a limited number of cell types (cell-type-specific genes). In embryonic stem cells (ESCs), for example, at least 60% of the protein-coding genes are transcribed into full length mRNA species, but only a minority are cell-type-specific and thus defining for ESCs (Assou et al., 2007). Mammals contain hundreds and possibly thousands of cell types, and most of these have yet to be studied with respect to the set of transcripts they contain. Thus, the terms “housekeeping” and “cell-type-specific” are relative rather than absolute and have yet to be precisely defined. Furthermore, the “transcriptome” of specific cells, derived from high-throughput sequencing, does not show a distinct boundary between “active” and “silent” genes, but rather a broad distribution of RNA levels that ranges from less than one RNA molecule/gene/cell to millions of RNA molecules/gene/cell, and it is not clear what level is functionally sufficient for each RNA species.
The particular set of transcription factors that are expressed in any one cell type controls the selective transcription of a subset of genes by RNA polymerase II, thereby producing the gene expression program of the cell. Studies of the transcription factors that are key to establishing and maintaining specific cell states suggest that only a small number of the transcription factors that are expressed in cells are necessary to establish cell-type-specific gene expression programs (Figure 2). For example, although more than half of the ~1200 genes encoding transcription factors show some evidence of transcription in ESCs, only a few of these transcription factors are needed to reprogram a broad range of cell types into induced pluripotent stem cells (iPSCs) with features essentially indistinguishable from ESCs (Graf, 2011; Ng and Surani, 2011; Orkin and Hochedlinger, 2011; Yamanaka, 2012; Yeo and Ng, 2013; Young, 2011). These ESC transcription factors, which include Oct4, Sox2 and Nanog, are expressed at high levels, bind regulatory elements associated with most active ESC genes, are involved in Polycomb-mediated repression of genes that specify other cell types, and positively regulate their own gene expression through interconnected autoregulatory loops (Figure 3) (Young, 2011). Activation of these endogenous interconnected autoregulatory loops may be key to cellular reprogramming by introduction of exogenous transcription factors. Other cell types express cell-type-specific, or lineage-specific, master transcription factors that are likely to share these key properties of the ESC master transcription factors.
Most of the transcription factors that are key to control of cell state and that can act as reprogramming factors are thought to control transcription initiation at the genes they regulate. For example, the ESC transcription factors Oct4 and Nanog bind to the P300 and Mediator coactivators (Chen et al., 2008; Kagey et al., 2010), which can then drive the formation of open chromatin and recruitment of the transcription apparatus. Similarly, many of the transcription factors that can reprogram or trans-differentiate cells, including MyoD, C/EBPβ, HNF1α, HNF4α, BRN2 and GATA4, bind to at least one of these coactivators (Borggrefe and Yue, 2011).
Recent studies have revealed that certain transcription factors can exert a broad effect on the gene expression programs of cells through elongation control (Figure 4). The c-Myc transcription factor can stimulate increased elongation from essentially the entire active gene expression program in diverse cell types (Lin et al., 2012; Nie et al., 2012; Rahl et al., 2010). The transcription factor AIRE functions to expand the set of genes that undergo RNA polymerase II pause release in specialized thymic stromal cells, allowing expression of the broad spectrum of self-antigens necessary to induce immune tolerance (Abramson et al., 2010; Giraud et al., 2012; Oven et al., 2007; Zumer et al., 2011). In hematopoiesis, the TIF1γ transcription factor controls erythroid cell fate by interacting with P-TEFb and regulating transcription elongation at a specific set of target genes (Bai et al., 2010). Development generally appears to be dependent on proper elongation control; the transcription elongation factor Tcea3 (TFIIS) contributes to the ability of ESCs to respond appropriately to differentiation cues (Park et al., 2012) and mutations in the P-TEFb repressor HEXIM cause gross developmental defects (Nguyen et al., 2012).
The key themes that have emerged from recent studies in transcriptional control and that are highlighted here are the following. Sequence variation in enhancers plays an important role in misregulation of gene expression and disease. A small number of key transcription factors dominate control of gene expression programs. Some transcription factors regulate transcription initiation while other factors control elongation, and factors that control this latter step can have profound effects on cell state. The Mediator coactivator complex integrates signals from diverse regulators and recruits cohesin complexes to active genes, which in turn contributes to both chromatin looping and gene activity. Diverse chromatin regulators mobilize nucleosomes and dynamically modify nucleosomes during active gene transcription and in gene silencing, and some chromatin regulators are regulated by lncRNAs. These advances in our understanding of sequences involved in gene control, transcriptional circuitry, the transcription apparatus and chromatin regulation have led to new insights into the mechanisms involved in misregulation of gene expression in various human diseases and disorders. We discuss some of these below.
Many diseases and syndromes are associated with mutations in regulatory regions and in transcription factors, cofactors, chromatin regulators and noncoding RNAs (Table S1). These mutations can contribute to cancer, autoimmunity, neurological disorders, developmental syndromes, diabetes, cardiovascular disease and obesity, among others. We highlight here several insights into disease mechanisms that have emerged from advances in our understanding of gene regulation.
Recent studies have highlighted the link between disease-associated variants in regulatory DNA and breast cancer (Jiang et al., 2011), prostate cancer (Demichelis et al., 2012), colorectal cancer (Lubbe et al., 2012), renal cancer (Schodel et al., 2012), lung cancer (Liu et al., 2011), nasopharyngeal cancer (Yew et al., 2012) and melanoma (Huang et al., 2013; Horn et al., 2013). The genome instability that is a hallmark of cancer almost certainly contributes to further alterations of sequences in regulatory regions, which can promote tumor progression.
Mutations in transcription factors have long been known to contribute to tumorigenesis, and recent studies indicate that overexpressed oncogenic transcription factors can alter the core autoregulatory circuitry of the cell. The oncogenic transcription factor TAL1, which is overexpressed in approximately half of the cases of T-cell acute lymphoblastic leukemia (T-ALL), forms an interconnected autoregulatory loop with several key transcription factor partners, and this circuitry contributes to the sustained activation of the TAL1-regulated oncogenic program (Sanda et al., 2012). Thus, high levels of TAL1 produce a modified autoregulatory circuitry that drives the oncogenic program in T-ALL.
Most tumor cells depend on the transcription factor c-Myc for their growth and proliferation (Littlewood et al., 2012). MYC is the most frequently amplified oncogene and the elevated expression of its gene product is associated with tumor aggression and poor clinical outcome. Elevated levels of c-Myc can promote tumorigenesis in a wide range of tissues. In tumor cells expressing high levels of c-Myc, the transcription factor accumulates in the promoter regions of most active genes, recruits the transcription elongation factor P-TEFb, and causes transcriptional amplification, producing increased levels of transcripts within the cell’s gene expression program (Lin et al., 2012; Nie et al., 2012). Thus, rather than binding and regulating a new set of genes when overexpressed, c-Myc amplifies the output of the existing gene expression program (Figure 4). These results suggest that transcriptional amplification reduces rate-limiting constraints for tumor cell growth and proliferation.
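The distinction between amplifying an existing program and activating new genes can be made concrete with a toy calculation. The sketch below is only an illustration under strong simplifying assumptions: gene names and transcript levels are invented, and a single uniform amplification factor is assumed, which the studies cited above do not claim.

```python
def amplify(expression, factor):
    """Scale the transcript level of every gene by the same factor.

    Genes that are silent (level 0) stay silent, so the set of active
    genes is unchanged; only the output of active genes increases.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    return {gene: level * factor for gene, level in expression.items()}

def active_genes(expression):
    """Return the set of genes with a nonzero transcript level."""
    return {gene for gene, level in expression.items() if level > 0}

# Hypothetical transcript levels (arbitrary units) for three made-up genes.
TOY_PROGRAM = {"gene_a": 10.0, "gene_b": 200.0, "gene_c": 0.0}

if __name__ == "__main__":
    amplified = amplify(TOY_PROGRAM, 2.0)
    print(sorted(active_genes(TOY_PROGRAM)) == sorted(active_genes(amplified)))
    print(sum(TOY_PROGRAM.values()), sum(amplified.values()))
```

In this toy model the identity of the expressed genes is the same before and after amplification, while total transcript output rises, which is the qualitative behavior described for cells with elevated c-Myc.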
Mutations in the Mediator coactivator complex have recently been implicated in the development of various tumors. Uterine leiomyomas, or fibroids, are benign tumors that affect millions of women. The MED12 gene is altered in the majority of uterine leiomyomas and its expression is absent in many uterine leiomyosarcomas, the malignant counterparts of leiomyomas (Makinen et al., 2011a; Makinen et al., 2011b). MED12 mutations also occur frequently in prostate cancer (Barbieri et al., 2012). MED12 is part of the CDK module of the Mediator complex, and the CDK8 subunit of this module has been reported to act as an oncogene in both colon cancer and melanoma (Firestein et al., 2008; Kapoor et al., 2010; Morris et al., 2008). Mediator has roles in gene activation and repression, and can function both in transcription initiation and elongation, so further study is needed to establish how Mediator mutations contribute to these tumors. Alterations in cohesin expression and function have been noted in some cancer cells and there is speculation that cohesin misregulation may also contribute to development of various cancers, but direct evidence for a role of cohesin in cancer remains to be established (Mannini and Musio, 2011; Xu et al., 2011).
Mutations in a variety of chromatin regulators have been implicated in development of cancer cells, and the normal functions of these regulators provide some clues to the mechanisms involved in altered gene expression. Loss of function mutations in several nucleosome remodeling proteins, including ARID1A, SMARCA4 (BRG1) and SMARCB1 (INI1), are associated with multiple types of cancer (Dawson and Kouzarides, 2012; Hargreaves and Crabtree, 2011; Tsai and Baylin, 2011; Wilson and Roberts, 2011), suggesting that defects in mobilizing nucleosomes near the promoters of active genes are involved. Similarly, various mutations in the Polycomb components EZH2 and SUZ12 and in the DNA methylation apparatus occur in multiple cancers, suggesting that in these instances it is the loss of proper gene silencing that contributes to tumorigenesis (Cedar and Bergman, 2012; Jones, 2012; Margueron and Reinberg, 2011; Mills, 2010). The majority of malignant melanomas overexpress SetDB1, a histone H3K9 methyltransferase that can contribute to gene activation or silencing, and this causes deregulation of HOX genes and accelerates melanoma (Ceol et al., 2011).
Gene fusions with the chromatin regulator MLL in leukemias are now known to alter transcription elongation (Luo et al., 2012b; Marschalek, 2010; Slany, 2009; Smith et al., 2011a). Several translocation partners of MLL are components of a super elongation complex (SEC) that includes P-TEFb and ELL proteins, which have also been shown to control transcription elongation (Lin et al., 2011; Lin et al., 2010; Luo et al., 2012a; Smith et al., 2011b). It is thought that translocation of any of the SEC subunits to the amino-terminal domain of MLL abnormally stabilizes the localization of the SEC at MLL target genes, including HOXA9 and HOXA10. This leads to excessive stimulation of RNA polymerase II into productive elongation at these genomic loci, which in turn contributes to aggressive acute leukemia.
Specific lncRNAs have recently been implicated in cancer progression. The ANRIL lncRNA mediates transcriptional repression of members of the INK4a/ARF/INK4b locus, which encode tumor suppressors whose repression is associated with various cancers (Aguilo et al., 2011; Popov and Gil, 2010). ANRIL functions by recruiting polycomb repressive complexes 1 and 2 (PRC1 and PRC2) and misregulation of ANRIL may lead to abnormal silencing of tumor suppressors and thus contribute to cancer progression (Kotake et al., 2011; Yap et al., 2010). Interestingly, genome-wide association studies have identified numerous polymorphisms that affect the expression and processing of ANRIL and are associated with increased susceptibility to an increasing variety of disease states, including multiple types of cancer, coronary artery disease and type 2 diabetes (Pasmant et al., 2011; Harismendy et al., 2011).
Mutations in the autoimmune regulator (AIRE) protein cause type I autoimmune polyendocrinopathy syndrome. AIRE is a transcription factor whose role in promoting transcriptional elongation at genes with paused RNA polymerase II in the thymus explains why loss of AIRE function leads to autoimmune disease. Self-reactive T cells are normally eliminated during maturation in the thymus, due to the specialized ability of thymic stromal cells, and in particular medullary epithelial cells (MECs), to transcribe a large repertoire of genes encoding peripheral tissue antigens (Kyewski and Klein, 2006). This ectopic gene expression is controlled in large part by AIRE, which is expressed almost exclusively in MECs. Mice and humans with an AIRE gene defect express only a fraction of the peripheral tissue antigens and develop immune infiltrates and autoantibodies directed at multiple peripheral tissues (Akirav et al., 2011; Gardner et al., 2009; Mathis and Benoist, 2009; Metzger and Anderson, 2011). Recent studies have shown that AIRE interacts with P-TEFb and influences transcription elongation in primary MECs (Abramson et al., 2010; Giraud et al., 2012; Oven et al., 2007; Zumer et al., 2011). AIRE is physically associated with all the active genes in MECs, but has its greatest effect on genes that do not experience pause release in its absence (Figure 4). These results are consistent with the idea that AIRE causes the release of RNA polymerase II molecules that are nonproductively paused at the promoters of a broad spectrum of genes that are otherwise expressed only in peripheral tissues. Thus, in MECs, AIRE’s function is to expand the set of genes that undergo RNA polymerase II pause release.
Misregulation of the immune response transcriptional regulator NF-κB has been linked to inflammatory and autoimmune diseases, improper immune development and cancer. NF-κB is found in most cell types and is involved in cellular responses to stimuli such as infection and stress (Hayden and Ghosh, 2012). The transcription factor controls genes involved in inflammation, and is chronically active in inflammatory diseases such as inflammatory bowel disease, arthritis, sepsis, gastritis, asthma and atherosclerosis. Although most research into the mechanism of transcriptional activation by this and other regulators has focused on coactivator recruitment, evidence that NF-κB interacts with BRD4 and P-TEFb suggests that this ubiquitous regulator plays a role in elongation control at inflammatory genes during immune and stress responses (Barboric et al., 2001; Huang et al., 2009; Nowak et al., 2008). This view is supported by evidence that inhibitors of BRD4, which contributes to recruiting active P-TEFb, suppress expression of key inflammatory genes in activated macrophages and confer protection against lipopolysaccharide-induced endotoxic shock and bacteria-induced sepsis (Nicodeme et al., 2010).
Mutations in various components of the Mediator coactivator have been linked to a variety of neurological disorders and other developmental deficiencies (Ding et al., 2008; Goh and Grants, 2012; Hashimoto et al., 2011; Kaufmann et al., 2010; Leal et al., 2009; Philibert et al., 2007; Risheg et al., 2007; Rump et al., 2011; Schwartz et al., 2007; Zhou et al., 2012a). Mutations in MED23 alter the interaction between enhancer-bound transcription factors and Mediator, leading to transcriptional dysregulation of mitogen-responsive immediate-early genes that affect brain development and plasticity. A similar defect in immediate-early gene expression is observed in cells from patients with another intellectual disability, Opitz-Kaveggia syndrome, which is caused by MED12 mutations. It would not be surprising to find that additional Mediator mutations contribute to neurological disorders, given the role of this coactivator in integrating information from transcriptional activators, repressors, and signaling pathways.
Heterozygous germline mutations in components of the SWI/SNF chromatin remodeling complex were recently identified in patients with various neurological syndromes whose common features are severe intellectual disability and speech delay (Hoyer et al., 2012; Santen et al., 2012a; Santen et al., 2012b; Tsurusaki et al., 2012; Van Houdt et al., 2012). These mutations were found in SMARCB1, SMARCA4, SMARCA2, SMARCE1, ARID1A and ARID1B. It has been suggested that up to 3% of unexplained intellectual disability may be caused by mutations in genes encoding SWI/SNF components (Santen et al., 2012b). It is interesting to note that the ARID1B component of human SWI/SNF interacts with elongin C (Li et al., 2010), a component of the SIII transcription elongation factor, which enhances transcription elongation by suppressing transient pausing of RNA polymerase II (Aso et al., 1995). Thus, alterations in SWI/SNF complexes have the potential to affect both chromatin remodeling and transcription elongation.
Cohesinopathies are characterized by a wide variety of developmental defects, including growth and mental retardation, limb deformities, and craniofacial anomalies (Bose and Gerton, 2010; Liu and Krantz, 2008). This broad spectrum of phenotypes is now thought to be due to reduced cohesin loading and cohesin function in gene expression during development. A variety of cohesinopathies have been described, including Cornelia de Lange Syndrome and Roberts Syndrome, in which patients have mutations in the cohesin loading protein NIPBL or the proteins that constitute the cohesin complex. With recent evidence for roles of cohesin complexes in regulation of gene expression and DNA looping (Kagey et al., 2010; Kawauchi et al., 2009; Liu et al., 2009; Schaaf et al., 2009; Seitan et al., 2011), it has become apparent that these deficiencies lead to defects in transcriptional regulation and probably in the overall structure of chromatin in the nucleus of disease cells.
Diabetes mellitus is a group of metabolic diseases in which a person has elevated blood sugar, either because the pancreas fails to produce adequate amounts of insulin, or because cells do not respond properly to the insulin that is produced. Mutations in pancreatic master transcription factors and the sequences they bind have been implicated in diabetes. The gene expression programs of pancreatic cells appear to be controlled by a small set of key transcription factors, including HNF1α, HNF1β, HNF4α, PDX1 and NeuroD1, some of which contribute to the interconnected autoregulatory circuitry of these cells (Odom et al., 2004). Mutations in any of these factors can result in various forms of maturity-onset diabetes of the young (MODY) (Maestro et al., 2007; Malecki, 2005). These mutations almost certainly have a deleterious effect on the interconnected autoregulatory circuitry formed by these factors and their target genes. The frequency of single nucleotide polymorphisms (SNPs) that are linked to defects in glucose homeostasis and diabetes is greatly enriched in the binding sites for these transcription factors (Maurano et al., 2012). This observation indicates that perturbations that affect the regulatory circuitry of pancreatic cells may contribute to diabetes. It also suggests that previously undiscovered regulatory networks and network architectures may be uncovered by incorporating information about disease-associated genetic variants and knowledge of the binding sites of diverse transcription factors.
Misregulated development of the cardiovascular system is among the most common classes of congenital birth defects, and diseases of the cardiovascular system are among the most prevalent clinical issues for adult populations (Bruneau, 2008; Kathiresan and Srivastava, 2012; Roger et al., 2012). It is well established that loss-of-function mutations in certain transcription factors cause various cardiovascular deficiencies (Table S1), but new studies have highlighted the role that mutations in ncRNA species can play in cardiovascular diseases. Specific miRNAs have been implicated in both the promotion and inhibition of differentiation into cardiac lineages, cardiac hypertrophy, vascular differentiation and erythropoiesis (Han et al., 2011; Papageorgiou et al., 2012; Small and Olson, 2011). MicroRNAs have also been linked to causative and protective roles in multiple types of cardiovascular disease, including arrhythmia, fibrosis, hypertrophy due to high pressure and misregulation of cardiac energy metabolism (Callis et al., 2009; Care et al., 2007; Grueter et al., 2012; Luo et al., 2008; Thum et al., 2008; van Rooij et al., 2007; van Rooij et al., 2008; Yang et al., 2007). MicroRNAs are thought to fine-tune gene expression, and thus the alterations in these cases are thought to lead to deficiencies in fine-tuning the cardiovascular gene expression program.
Several concepts have emerged from recent studies of gene expression programs in healthy and diseased cells. Genetic variation may contribute to disease largely through misregulation of gene expression. Mutations in the transcription factors that control cell state may impact the autoregulatory loops that are at the core of cellular regulatory circuitry, leading to the loss of a normal healthy cell state. Some transcription factors control RNA polymerase II pause release and elongation and, when their expression or function is altered, can produce aggressive tumor cells (c-Myc) or some forms of autoimmunity (AIRE). Mutations in the coactivator complexes that integrate information from many transcription factors and contribute to DNA looping can cause a broad spectrum of developmental diseases. Alterations in specific chromatin regulators can contribute to development of cancer and many other diseases. Misregulation of noncoding RNAs can also contribute to disease. Additional insights into the role of transcriptional misregulation in human disease will require improved genome annotation, knowledge of the DNA sequences whose alterations contribute to disease, identification of the key transcriptional regulators of all cells of medical relevance, and further understanding of the roles of cofactors, chromatin regulators and ncRNAs.
It is essential to improve human genome annotation in order to more fully understand gene expression programs and their regulation, and thus gene misregulation in disease. As a first step, it would be ideal to identify the complete set of protein-coding and non-coding genes and to ascertain which of these are actively transcribed in specific cell types. There are considerable challenges associated with defining the complete set of expressed genes in any one type of mammalian cell. Such characterization has traditionally required large numbers of cells, and for most primary cell types it is challenging to obtain a homogeneous population of cells. While protein-coding genes can be recognized, at least in part, by the presence of a coding sequence, it is challenging to produce a complete and accurate annotation of ncRNA genes due to limitations in the read length of widely used sequencing technologies and the short lifetime of many ncRNAs. Nonetheless, recent studies have identified a vast number and variety of ncRNAs in human cells, so there is promise that improved human genome annotation is at hand (Djebali et al., 2012). Furthermore, new technologies allow investigators to monitor RNA polymerase II molecules that are actively engaged in transcription (Core et al., 2008). Such approaches have recently provided evidence that most long ncRNA (lncRNA) species are the product of divergent transcription from the promoters of active protein-coding genes (Sigova et al., 2012). This suggests that most active protein-coding genes in humans are actually divergently transcribed mRNA/lncRNA gene pairs.
Knowledge of the sequence variation that contributes to disease is being gained at a rapid pace and this will improve our understanding of disease mechanisms and lead to new approaches to disease diagnosis and therapy. Several lines of evidence suggest that much of genetic variation contributes to disease through misregulation of gene expression. A substantial portion of the genomic sequences that are under positive selection are thought to be regulatory (Grossman et al., 2010). Disease-associated single nucleotide polymorphisms (SNPs) are enriched in regulatory regions (Ernst et al., 2011; Hindorff et al., 2009; Maurano et al., 2012). Many recent studies have identified links between disease-associated variants in regulatory DNA and a broad spectrum of human diseases, including cancer (Demichelis et al., 2012; Huang et al., 2013; Horn et al., 2013; Jiang et al., 2011; Liu et al., 2011; Lubbe et al., 2012; Schodel et al., 2012; Yew et al., 2012), congenital heart disease (Zhao et al., 2012), inflammatory lung disease (Han et al., 2012), multiple sclerosis (Alcina et al., 2012), Alzheimer’s disease (Gaj et al., 2012), abdominal aortic aneurysm (Bown et al., 2011), amyotrophic lateral sclerosis (Iida et al., 2011) and coronary artery disease (Harismendy et al., 2011). Disease-associated genetic variants in regulatory regions are most often found in regions that are utilized in a cell type-specific manner and are associated with diseases of the corresponding cell type (Ernst et al., 2011; Maurano et al., 2012). Thus, cell type-specific enhancer use can explain how genetic variants produce tissue-specific diseases. Disease-associated genetic variants can exist in regulatory regions that are very distant from the genes they control, but knowledge of the nature of loops between such distal enhancers and their target genes can explain how these distant variants affect a specific gene and its biological functions (Li et al., 2012; Maurano et al., 2012).
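The enrichment of disease-associated variants in regulatory regions described above can be made concrete with a small calculation. The following is a minimal sketch, assuming a simple hypergeometric model in which disease-associated SNPs are treated as a random draw from all tested SNPs; it is not the statistical method used in the cited studies, and the function name and all counts are invented for illustration.

```python
from math import comb


def hypergeom_upper_tail(total_snps, regulatory_snps, disease_snps, overlap):
    """P(X >= overlap) under a hypergeometric null model.

    total_snps      -- all SNPs tested (hypothetical count)
    regulatory_snps -- SNPs that fall in annotated regulatory regions
    disease_snps    -- SNPs associated with the disease
    overlap         -- disease-associated SNPs that are also regulatory
    """
    upper = min(regulatory_snps, disease_snps)
    denominator = comb(total_snps, disease_snps)
    # math.comb returns 0 when k > n, so impossible terms drop out.
    numerator = sum(
        comb(regulatory_snps, k) * comb(total_snps - regulatory_snps, disease_snps - k)
        for k in range(overlap, upper + 1)
    )
    return numerator / denominator


# Toy numbers, not real data: 20 SNPs, 5 regulatory, 4 disease-associated,
# 3 of which lie in regulatory regions.
p_value = hypergeom_upper_tail(20, 5, 4, 3)
fold_enrichment = (3 / 4) / (5 / 20)
print(f"fold enrichment = {fold_enrichment:.1f}, p = {p_value:.4f}")
```

In practice such a test would be run per cell type, since the text notes that disease-associated variants tend to fall in regulatory regions used by the cell type relevant to the disease.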
Genetic investigations and reprogramming studies suggest that only a small number of the hundreds of transcription factors that are expressed in cells are essential for establishing and maintaining the regulatory networks that produce specific cell states. If this holds true for most cell types, then it would be ideal to identify the key transcription factors for all cell types of medical relevance. It should be possible to identify these transcription factors if they have features identified for their counterparts in well-studied cells: relatively high expression, occupancy of enhancers associated with a large fraction of active genes, and formation of interconnected autoregulatory loops. Discovering how gene expression programs are controlled in many different cell types should lead to further understanding of regulatory circuitry, facilitate cellular reprogramming, and accelerate the new field of regenerative medicine.
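The three features listed above can be read as a simple filter over candidate factors. The following is a minimal sketch under stated assumptions: the `Factor` record, the thresholds `min_expression` and `min_fraction`, the factor names, and all values are hypothetical, and "interconnected" is approximated here as mutual occupancy with at least one other candidate. It is meant to illustrate the logic, not to reproduce any published pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class Factor:
    name: str
    expression: float  # expression level, arbitrary units (toy values)
    occupied_genes: set = field(default_factory=set)      # active genes whose enhancers it occupies
    binds_factor_genes: set = field(default_factory=set)  # names of factor genes it occupies


def candidate_key_factors(factors, active_genes, min_expression, min_fraction):
    """Return sorted names of factors meeting all three illustrative criteria."""
    if not active_genes:
        return []
    # Criteria 1 and 2: high expression and broad enhancer occupancy.
    passing = [
        f for f in factors
        if f.expression >= min_expression
        and len(f.occupied_genes & active_genes) / len(active_genes) >= min_fraction
    ]
    selected = []
    for f in passing:
        # Criterion 3a: the factor occupies its own gene.
        autoregulated = f.name in f.binds_factor_genes
        # Criterion 3b: mutual occupancy with at least one other passing factor.
        interconnected = any(
            other.name in f.binds_factor_genes and f.name in other.binds_factor_genes
            for other in passing
            if other.name != f.name
        )
        if autoregulated and interconnected:
            selected.append(f.name)
    return sorted(selected)


# Toy data, invented for illustration only.
active = {"g1", "g2", "g3", "g4"}
toy = [
    Factor("TF_A", 90.0, {"g1", "g2", "g3"}, {"TF_A", "TF_B"}),
    Factor("TF_B", 80.0, {"g2", "g3", "g4"}, {"TF_A", "TF_B"}),
    Factor("TF_C", 85.0, {"g1"}, {"TF_C"}),                          # narrow occupancy
    Factor("TF_D", 5.0, {"g1", "g2", "g3", "g4"}, {"TF_A", "TF_D"}),  # low expression
]
result = candidate_key_factors(toy, active, min_expression=50.0, min_fraction=0.5)
print(result)
```

With real data, the thresholds would need to be chosen per cell type, and occupancy would come from genome-wide binding measurements rather than hand-written sets.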
Cofactors and chromatin regulators are generally expressed in most cell types, but mutations in these genes often produce diseases or syndromes that exhibit tissue-specific phenotypes (Table S1). For example, defects in Mediator subunits contribute to nonsyndromic intellectual disability, Charcot-Marie-Tooth disease, Opitz-Kaveggia and Lujan syndromes, and infantile cerebral and cerebellar atrophy, and have been implicated in prostate cancer, in deficiencies in systemic energy homeostasis (Grueter et al., 2012) and in altered hair-cycling (Nakajima et al., 2012; Oda et al., 2012). Improved understanding of the interactions between cofactors and transcription factors, and of the mechanisms by which these complex apparatuses integrate information, will be valuable for understanding the mechanisms that produce tissue-specific phenotypes. Similarly, it will be important to further understand the collaboration between the transcription apparatus and chromatin regulators in global control of gene expression programs. Recent efforts to target chromatin regulators for cancer therapy (Dawson et al., 2012) would benefit from a fuller understanding of the regulatory mechanisms and pathways that are impacted by these potential therapeutics.
Our future understanding of disease and the advance of personalized medicine will benefit from models of human transcriptional regulatory circuitry that integrate information about regulatory sequences and the key transcription factors, cofactors, chromatin regulators and ncRNAs that operate at regulatory sites. The development of these models should thus be among the priorities of biomedical research.
Table S1. Genes encoding transcription factors, cofactors, chromatin regulators and noncoding RNAs implicated in human disease
Our description of the themes highlighted in this review benefitted from discussions with Karen Adelman, Jay Bradner, Gerald Crabtree, Rudolf Jaenisch, Ian Krantz, Lee Lawton, David Levens, John Lis, Alex Marson, Matthias Merkenschlager, Alan Mullen, Duncan Odom, David Price, Peter Rahl, Robert Roeder, Ali Shilatifard, Phil Sharp, Alla Sigova, Alexander Stark, Dylan Taatjes and Leonard Zon. We thank David Orlando for help with data collation and analysis.