Mutations in any genome may lead to phenotype characteristics that determine the ability of an individual to cope with environmental challenges. In studies of human biology, among the most interesting are the phenotype characteristics that determine responses to drug treatments, responses to infections, or predisposition to specific inherited diseases. Most of the research in this field has focused on the effects of mutations on the final gene products, peptides, and their alterations. Considerably less attention has been given to mutations that may affect the regulatory mechanism(s) of gene expression, although these may also affect phenotype characteristics. In this study we present a pilot analysis of mutations observed in the regulatory regions of 24,667 human RefSeq genes. Our study reveals that, out of eight studied mutation types, “insertions” are the only type that alters predicted transcription factor binding sites (TFBSs) in a statistically significant manner. We also find that 25 families of TFBSs have been altered by mutations in a statistically significant manner in the promoter regions we considered. Moreover, we find that the related transcription factors are prominent in, for example, processes related to intracellular signaling; cell fate; morphogenesis of organs and epithelium; development of the urogenital system, epithelium, and tube; and neuron fate commitment. Our study highlights the significance of studying mutations within the regulatory regions of genes and opens the way for further detailed investigations on this topic, particularly on the downstream affected pathways.
Mutations in any genome may lead to phenotype characteristics that determine ability of an individual to cope with environmental challenges (Kopp and Hermisson, 2009). In studies of human biology there is a continuous effort to identify mutations and this data can be freely accessed from public repositories such as HapMap (Belmont et al., 2005), dbSNP (Sayers et al., 2010), and the 1000 Genomes project (Altshuler et al., 2010). It is well known that phenotypic specificities determine how individuals react to drugs (Batist et al., 2011; Callahan and Abercrombie, 2011), to infections (González-Hernández et al., 2010), or how predisposed they are to inherited diseases such as cystic fibrosis (Gu et al., 2009; Antigny et al., 2011), Huntington’s disease (Roze, 2011), galactosemia (Bennett, 2010), et cetera. Most of the research in this field has been focused on the studies of non-synonymous mutation effects on the final gene products, peptides, and their alterations (Kumar et al., 2009). Several databases have been developed in order to facilitate the study of these genetic variations and their consequences. Resources such as the Human Gene Mutation Database (HGMD; Stenson et al., 2009) and the Online Mendelian Inheritance in Man (OMIM) database (Amberger et al., 2009) collect mutations occurring across the entire human genome, whereas other repositories such as the FLCN Gene Database (Lim et al., 2010) are locus specific.
However, considerably less attention has been given to mutations that may affect the regulatory mechanisms of gene expression, although these may also affect phenotype characteristics. Previous studies of the interactions between mutations and regulatory processes have associated mutations within the promoter regions of certain key genes with an increased susceptibility to different disorders, such as pancreatic cancer (Hamacher et al., 2009), type 2 diabetes (Song et al., 2009), myelodysplastic syndrome (Ma et al., 2010), and idiopathic pulmonary arterial hypertension (Yu et al., 2009). Polymorphisms in the intragenic or regulatory regions may influence transcription factor (TF) binding to DNA or may affect gene splicing (Heckmann et al., 2010; Kasowski et al., 2010). A number of resources have been compiled to provide easier insight into the effects of mutations on transcription and gene regulation, such as those related to protein coding genes (Conde et al., 2006; Kim et al., 2008) or to miRNA (Bao et al., 2007; Hariharan et al., 2009; Alexiou et al., 2010; Schmeier et al., 2011).
Here we performed a large-scale pilot study of mutations within the regulatory regions of 24,667 human RefSeq (Pruitt et al., 2009) genes. Our study reveals for the first time that, out of eight studied mutation types, “insertions” is the only type that alters the predicted transcription factor binding sites (TFBSs) in a statistically significant manner. We also identified 25 families of TFBSs that have been altered by mutations in a statistically significant manner in the promoter regions we considered. The related TFs are prominent in processes related to intracellular signaling, cell fate, epithelium, morphogenesis of organs and epithelium, development of the urogenital system, epithelium and tube, neuron fate commitment, et cetera. These observations highlight the significance of studying mutations within the regulatory regions of genes and open the way for further detailed studies on this topic, particularly on the downstream affected pathways.
We analyzed promoter regions of 24,667 human genes from RefSeq for the presence of TFBSs using Transfac Professional ver. 11.4 (Matys et al., 2006) and we predicted 1,077,742 TFBSs in these promoter regions. At the same time we found 343,024 mutations associated with the same promoter regions based on data from dbSNP. Of these, 122,023 TFBSs were altered by 104,514 mutations. Details of the mapped mutations and TFBSs by chromosome are provided in Tables S1 and S2 in Supplementary Material, respectively.
We analyzed the following eight mutation types: “Single,” “Insertion,” “Deletion,” “In-del,” “Multiple Nucleotide Polymorphism” (MNP), “Mixed,” “Named,” and “Microsatellite.” “Single” represents a single nucleotide variation, in which all observed alleles are single nucleotides (there can be 2, 3, or 4 alleles). “Microsatellite” corresponds to the situation in which the observed allele from dbSNP is a variation in the count of short tandem repeats. “Named” represents polymorphisms in the presence of complex structures such as transposons, e.g., (Alu)/-. “Mixed” corresponds to a cluster containing submissions from multiple classes. “MNP” represents the situation in which the alleles are all of the same length, and that length is >1. “Insertion” corresponds to an insertion relative to the reference assembly. “Deletion” corresponds to a deletion relative to the reference assembly. “In-del” corresponds to the situation in which both insertions and deletions relative to the reference genome were found at a particular position.
Our analysis suggests that only the insertion type of mutation alters TFBSs in a statistically significant manner (Table 1). On the other hand, the analysis of TFBSs altered by mutations indicates that 25 TFBS types are altered in a statistically significant manner. We associated TFs with these 25 TFBS types and found that they comprise: HNF3 alpha, Pax-2, Pax-3, Pax-4, Pax-5, Pax-6, AIRE, PLZF, myogenin/NF-1, ZNF219, FOX factors, STAT1, CHX10, HNF3 beta, c-Maf, Tax/CREB, FOXP1, MyoD, “c-Ets-1 p54,” KROX, DEAF1, VDR, CAR, PXR, FAC1, PPARalpha:RXRalpha, PPARgamma:RXRalpha, and Spz1. We also found that the initiator element, “Muscle initiator sequences-19,” is altered in a statistically significant manner (Table 2). Details, including the UniProt IDs of the TFs (Jain et al., 2009), are provided in Table S3 in Supplementary Material.
We used GeneMANIA (Warde-Farley et al., 2010) to analyze the processes in which the TFs that bind to the above-mentioned 25 TFBS types potentially exert their effects. We found (Table S4 in Supplementary Material) that, in addition to activities usually associated with TF functioning, they are also prominent in processes related to intracellular signaling, cell fate, epithelium, morphogenesis of organs and epithelium, development of the urogenital system, epithelium and tube, neuron fate commitment, response to nutrient levels, developmental processes, and response to extracellular stimulus.
The global analysis of mutations within TFBSs makes space for more detailed insights into potential effects that such mutations can produce. For example, a mutation within a transcription initiation regulatory region can make one of the following downstream effects:
Studying such different scenarios in particular cases may provide insights into mutation effects (see Heckmann et al., 2010). Schmeier et al. (2011) have developed a database that provides a possibility to explore such potential effects of interactions of mutations with TFBSs in the promoter regions of miRNAs.
Because TFBS sequences are degenerate, one can assume that many SNPs that overlap TFBSs will not affect the ability of a TF to bind them. At the same time, larger mutations (such as insertions, deletions, MNPs, etc.) are more likely to significantly affect the affinity of a TF for the modified TFBS, and are likely to destroy the TFBS. Negative selection would therefore reduce the frequency of long polymorphisms within TFBSs, and for this reason such long polymorphisms are not expected to be significantly enriched within TFBSs. This is why our finding that “insertions” are statistically significant is an interesting one. Further research is needed to suggest the reason behind this observation.
We observed that TFBSs of two TF families, PAX and FOX, are prominently associated with mutations. TFs from the PAX family are associated with tissue-specific gene expression and linked to the development of specific tissues, including kidney and optic nerves (PAX-2; Lindoso et al., 2009); ear, eye, and facial development (PAX-3; Zhang et al., 2012); pancreatic islet beta cells (PAX-4; Collombat et al., 2009); B-cell differentiation, as well as neural development and spermatogenesis (PAX-5; Decker et al., 2009); and the development of eyes and sensory organs and of certain neural and epidermal tissues (PAX-6; Guo et al., 2010; Rowan et al., 2010). The Forkhead box (FOX) family of TFs is implicated in processes of embryonic development, cell growth, proliferation, and cell differentiation (Hannenhalli and Kaestner, 2009).
While insertions were the only mutation type that altered TFBSs in a statistically significant manner, the other seven types could be considered to be the result of uniform random changes to the genome.
The results obtained provide the foundation to investigate potentially affected pathways that are controlled by the TFs binding the most affected TFBSs and provide us links to important biological processes related to these TFs.
Our analysis revealed that “insertion” is the only statistically significant type of mutation in the predicted TFBSs in the regulatory regions of the human genes that we explored. We also singled out 25 TFBS families that, in our analysis, appear to be altered by mutations in a statistically significant manner. These findings open the possibility of further exploring the individual effects of the altered TFBSs and their downstream and upstream regulation networks, which can pave the way for insights into the pathways and diseases potentially affected.
We extracted the promoter regions of 24,667 human genes from RefSeq (Pruitt et al., 2009). Promoters covered the region [−1000, +500] relative to the 5′ end of the gene. Human genome version hg19 from the UCSC Genome Browser database (Fujita et al., 2011) was used.
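The promoter window described above can be sketched as follows. This is a minimal illustration, not the authors' extraction code: the function name `promoter_window`, the `Promoter` tuple, the 0-based half-open coordinate convention, and the exact handling of the boundary positions are all assumptions made for the example. The only details taken from the text are the [−1000, +500] window and its orientation relative to the 5′ end of the gene, which means the window must be mirrored for genes on the minus strand.

```python
from typing import NamedTuple


class Promoter(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    strand: str


def promoter_window(chrom, tss, strand, upstream=1000, downstream=500):
    """Return the [-upstream, +downstream) window around a gene's 5' end.

    `tss` is the 0-based position of the first transcribed base (assumed
    convention). On the minus strand "upstream" means larger coordinates,
    so the window is mirrored around `tss`.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream + 1, tss + upstream + 1
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    # Clip at the chromosome start so the window never goes negative.
    return Promoter(chrom, max(start, 0), end, strand)


# Hypothetical genes used only to exercise the function.
plus_gene = promoter_window("chr1", 5000, "+")
minus_gene = promoter_window("chr1", 5000, "-")
```

Both windows are 1,500 bp long; the plus-strand window spans 4000 to 5500 and the minus-strand window spans 4501 to 6001 in these assumed coordinates.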
We downloaded 33,026,121 SNPs from the UCSC Genome Browser database (Fujita et al., 2011). These SNPs are derived from dbSNP build 132 (Sayers et al., 2010) and are available on the hg19 assembly of the human genome. The resulting set contains 319,820 polymorphisms. Based on genomic coordinates, we used custom Perl scripts to identify the SNPs that overlap promoter sequences, as well as those that alter predicted binding sites.
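The coordinate-based overlap step can be illustrated with a short sketch. The original analysis used custom Perl scripts that are not shown in the article, so the Python below is a stand-in under stated assumptions: intervals are 0-based and half-open, and the names `build_index` and `overlapping` are hypothetical. The same routine works for promoters or for predicted TFBSs, since both are just named intervals on a chromosome.

```python
from bisect import bisect_left
from collections import defaultdict


def build_index(intervals):
    """Index (chrom, start, end, name) intervals by chromosome.

    Coordinates are assumed to be 0-based and half-open: [start, end).
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, name in intervals:
        by_chrom[chrom].append((start, end, name))
    index = {}
    for chrom, items in by_chrom.items():
        items.sort()
        starts = [s for s, _, _ in items]
        max_len = max(e - s for s, e, _ in items)
        index[chrom] = (starts, items, max_len)
    return index


def overlapping(index, chrom, start, end):
    """Return names of indexed intervals that overlap [start, end).

    An indexed interval [s, e) overlaps when s < end and e > start.
    Because no interval is longer than max_len, only intervals with
    s >= start - max_len + 1 can still reach past `start`.
    """
    if chrom not in index:
        return []
    starts, items, max_len = index[chrom]
    lo = bisect_left(starts, start - max_len + 1)
    hi = bisect_left(starts, end)
    return [name for s, e, name in items[lo:hi] if e > start]


# Hypothetical promoters and a hypothetical single-base SNP.
promoters = [
    ("chr1", 4000, 5500, "geneA"),
    ("chr1", 5400, 6900, "geneB"),
    ("chr2", 100, 1600, "geneC"),
]
promoter_index = build_index(promoters)
hits = overlapping(promoter_index, "chr1", 5450, 5451)
```

Here `hits` contains both `geneA` and `geneB`, because the two hypothetical promoters overlap each other at that position. A zero-length insertion (start equal to end) is reported only when it falls strictly inside an interval.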
We used the Transfac Professional database ver. 11.4 (Matys et al., 2006) and its associated Match program to map all binding sites of vertebrate TFs to the promoter regions. We used high quality matrices and the optimized threshold setting for “minimum FP.” This yields TFBS predictions with a presumed minimal number of false positives.
For the TFs that correspond to the above-mentioned 25 TFBS types, we used the GeneMANIA program (Warde-Farley et al., 2010) to identify potentially enriched GO categories associated with these TFs.
To find overrepresented types of mutations within all TFBSs, we applied the right-sided Fisher’s exact test to contingency tables (an example is shown in Table 3) with Bonferroni correction for multiple testing. For each mutation type, we calculated the total number of mutations of the considered type that overlapped any TFBS or fell outside of any predicted TFBS. As a background we used the total number of mutations of all other types that altered any TFBS or fell outside of any predicted TFBS. To find TFBSs altered significantly by any particular type of mutation, we applied Fisher’s exact test to contingency tables (an example is shown in Table 4) with Bonferroni correction for multiple testing. For each TF, we calculated the total number of different mutations altering any TFBS for the given TF or falling outside of such TFBSs. As a background we used the total number of TFBSs of all other TFs containing or not containing any mutations.
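The enrichment test for mutation types can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the counts are invented toy values (not taken from Table 3), and the 2×2 layout assumed here is [[type in TFBS, type outside TFBS], [other types in TFBS, other types outside TFBS]]. The right-sided p-value is the hypergeometric probability of seeing at least as many in-TFBS mutations of the given type as observed, with the table margins fixed.

```python
import math


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_right_sided(a, b, c, d):
    """Right-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) for a hypergeometric X with the margins fixed.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    log_denom = _log_comb(total, row1)
    log_terms = [
        _log_comb(col1, k) + _log_comb(total - col1, row1 - k) - log_denom
        for k in range(a, min(row1, col1) + 1)
    ]
    # Sum in log space so large counts do not underflow.
    peak = max(log_terms)
    p = math.exp(peak) * sum(math.exp(t - peak) for t in log_terms)
    return min(1.0, p)


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return {name: min(1.0, p * m) for name, p in p_values.items()}


# Toy counts, invented for illustration only:
# (type in TFBS, type outside, other types in TFBS, other types outside)
toy_counts = {
    "insertion": (30, 70, 200, 800),
    "deletion": (18, 82, 212, 788),
}
raw = {name: fisher_right_sided(*cells) for name, cells in toy_counts.items()}
adjusted = bonferroni(raw)
```

With eight mutation types, as in the study, the Bonferroni factor would be 8; the toy example uses two types only to keep the table small. For large real counts, a library routine such as SciPy's `scipy.stats.fisher_exact` with `alternative="greater"` would be a reasonable alternative to the hand-written sum.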
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The Supplementary Material for this article can be found online at: http://www.frontiersin.org/Statistical_Genetics_and_Methodology/10.3389/fgene.2012.00100/abstract