|Home | About | Journals | Submit | Contact Us | Français|
YY1 is an epigenetic regulator for a large number of mammalian genes. While performing genome-wide YY1 binding motif searches, we discovered that the olfactory receptor (OLFR) genes have an unusual cluster of YY1 binding sites within their coding regions. The statistical significance of this observation was further analyzed.
About 45% of the olfactory genes in the mouse have a range of 4-8 YY1 binding sites within their respective 1 kb coding regions. Statistical analyses indicate that this enrichment of YY1 motifs has likely been driven by unknown selection pressures at the DNA level, but not serendipitously by some peptides enriched within the OLFR genes. Similar patterns are also detected in the OLFR genes of all mammals analyzed, but not in the OLFR genes of the fish lineage, suggesting a mammal-specific phenomenon.
YY1, or YY1-related transcription factors, may help regulate olfactory receptor genes. Furthermore, the protein-coding regions of vertebrate genes can contain cis-regulatory elements for transcription factor binding as well as codons.
The transcription factor YY1 is a Gli-Kruppel type zinc finger protein that is highly conserved from insects through vertebrates . YY1 can function as an activator, repressor, or initiator depending upon the other regulatory elements in the region . YY1 also interacts with a variety of proteins including components of RNA polymerase II complex, transcription factors, and histone-modifying complexes [3-5]. According to genome-wide surveys, about 10% of all human genes contain YY1 binding motifs in their promoter regions . Functionally, YY1 is involved in many biological processes, including embryonic development, cell cycle progression, apoptosis, B cell development, Polycomb group Gene (PcG)-mediated repression, genomic imprinting, and X chromosomal inactivation [2,3,7]. YY1 was also initially identified as a factor controlling the transcriptional activity of the murine retrotransposon 'Intracisternal A Particle' . Since then, many retroposons, including SINE, LINE, and endogenous retrovirus families, have been shown to contain YY1 binding sites in their promoter regions [3,4]. Due to this ubiquitous presence of YY1 binding sites in genome-wide repeats, YY1 has also been regarded as a surveillance gene that is responsible for repressing transcriptional background noise from these repeats .
The olfactory receptor (OLFR) genes of mammals encode short, single coding exon, G protein-coupled receptors that are responsible for sensing a large number of air-borne scents . This gene family is comprised of over 800 and 1,300 gene members in human and mouse respectively, forming the largest gene family in mammalian genomes [10-13]. The aquatic vertebrates, the teleost fish lineage, also have a similar odorant receptor gene family . However, the odorant receptor (OR) family of the fish lineage consists of a much smaller number of genes than that of the mammals, and these OR genes are also much more diverse in sequence identity than those of mammals. Mammalian OLFR genes are divided into Class I and Class II groups based on sequence identity . Class II genes make up ~90% of OLFRs and are thought to have expanded during the transition to land-based living.
In mammals these olfactory receptors presumably expanded due to the selective advantage conferred by a well developed sense of smell . While mice and other mammals retain function and expression of almost all OLFRs, the majority of these are pseudogenized in humans . The mammalian OLFR genes are highly tissue-specific and are expressed primarily in the olfactory epithelium though a subset expresses in a chemosensory role in other tissues such as kidney and sperm [17-19]. Furthermore, only one copy (allele) out of all 1,000 OLFR genes is selected and expressed in each neuron cell of the olfactory tissue . The unusual transcriptional control of the OLFR gene family is likely mediated through unknown trans-acting factors . The tissue-specific nature of their expression coupled with their widespread duplication requires a mitotically-stable global silencing mechanism in all cell types. According to recent studies, potential cis-regulatory elements recruiting these trans factors are hypothesized to be located within the protein-coding regions of the OLFR genes rather than their surrounding genomic regions [22,23].
While performing genome-wide searches of the DNA-binding motifs of YY1, we discovered that the mammalian OLFR genes contain unusual clusters of YY1 binding sites within their protein-coding regions, whereas most YY1 binding sites are solitary and upstream of a regulated gene. In this study, we further analyzed the significance of this discovery with several bioinformatic and statistical measures, which will be described below. Specifically we test whether the presence of the YY1 binding sites could be explained by DNA sequence or amino acid motif conservation.
YY1 is predicted to be a global epigenetic regulator based on its ubiquitous expression and interaction with many histone-modifying enzymes . As part of the efforts exploring this possibility, we have performed several series of YY1 binding motif searches using the genome sequences of mammals (human, mouse, and cow) [24-26]. Here we first scanned the genome sequences of human and mouse using a Position Weight Matrix (PWM)-based Perl script. Repetitive elements are known to contain YY1 binding sites so the RepeatMasked genome was used . Later, the results of these searches were visualized using the repeat-masked Custom Track of the UCSC genome browser . While inspecting global localization patterns of YY1 binding motifs in each genome, we noticed that clusters of YY1 binding motifs are co-localized with the genomic regions harboring olfactory receptor (OLFR) genes. The mammalian OLFR genes show a single coding exon structure, and they are also localized as gene clusters in specific regions of mammalian chromosomes. One such example is shown using the 100-kb genomic region from Mmu 7 chromosome (Figure (Figure1).1). This figure shows a representative sample of 5 OLFR genes, and the locations of these genes coincide with those of YY1 binding motifs. Each OLFR gene appears to contain a range of 4 to 8 YY1 binding motifs within its 1-kb-long Open Reading Frame (ORF). As expected, the coding regions correlate well to the placental mammal conservation plot, a default track available on the UCSC genome browser. Also, the identified YY1 binding motifs appear to be random in location within the coding regions but do show a bias in orientation with respect to the direction of OLFR gene transcription, which will be described later.
The unusual clustering of YY1 binding motifs within the protein-coding regions of the OLFR genes was further analyzed to test if this pattern is unique to only the OLFR genes or also found in the ORFs of other genes. For this analysis, the entire set of the mouse mRNA database was scanned with the PWM-based Perl program to identify YY1 binding motifs. The number of the identified YY1 binding motifs within a given DNA sequence corresponding to the transcript was further divided by the size of the mRNA sequence, yielding a YY1 density score. The mouse mRNA sequences (total number = 20,191) were subsequently binned based on their relative YY1 density scores (the X-axis on Figure Figure2a).2a). The Y-axes of figure figure22 represents the number of genes within a given range of the YY1 density score. This analysis indicated that the majority of the non-olfactory mouse mRNA sequences (19,083 sequences) were distributed evenly and randomly throughout the varying ranges of the YY1 density score (0 to 0.114) (Figure (Figure2a).2a). In contrast, our detailed inspection revealed that about half of the OLFR gene set (1,108) show very high YY1 density scores. To better visualize this unusual pattern, we separated only the OLFR gene set from the rest of the mRNA set, and derived another histogram (Figure (Figure2b).2b). As shown in Figure Figure2b,2b, about 45% (or 496 of 1,108) of the OLFR genes have YY1 density scores ranging from 0.032 to 0.094, which are equivalent to 4 to 8 (or more) YY1 binding sites per a 1-kb mRNA sequence. In contrast, a scan of the 1 kb upstream regions of 20,419 Refseq genes revealed that only 10% of these putative promoter containing regions had a density greater than 4 YY1 binding sites.
We also repeated the above analyses using the mRNA data sets derived from human, cow, and zebrafish transcriptomes to test the evolutionary conservation of this unusual clustering of YY1 binding motifs within OLFR genes (Figure (Figure2c2c &2d). The results from human and cow (data not shown) mRNA data sets also showed a pattern consistent with the mouse data set: an unusual clustering of YY1 binding motifs within many OLFR genes. In humans we found 57% (or 215 of 380) of OLFRs contain more than 4 YY1 binding sites while percentage in cow rose to 82% (or 740 of 900). It is important to note that the total number of the human OLFR genes (380) in the figure is smaller than those of the other mammals since a large fraction of the human OLFR genes are known to have become pseudogenes in recent evolutionary times and we removed all genes annotated as hypothetical. In contrast, the OLFR genes of zebrafish do not show a similar pattern to mammals. The total 25 OLFR genes of zebrafish show as wide a range of the YY1 density scores as seen in the other non-OLFR genes (Figure (Figure2d).2d). These results confirm that the unusual clustering of YY1 binding motifs within the OLFR genes is a feature of the mammalian lineage.
We performed two different series of analyses to test the functional significance of the observed clustering of YY1 binding motifs within the mammalian OLFR genes. First, we tested whether some peptide motifs enriched within the protein sequences of the OLFR genes are responsible for a spurious display of the YY1 binding motifs in the nucleotide sequences of the OLFR genes. If some peptide motifs are enriched then we expect a significant difference in the number of those motifs found in OLFR genes versus the rest of the proteome. For this test, we identified 49 individual dipeptides that can be encoded by the core motifs (CCAT or ATGG) of the YY1 binding consensus sequence, and determined if any of these dipeptides were over-represented in the protein sequences of the OLFR genes (Figure (Figure3).3). We first calculated the frequencies of all 400 possible combinations of dipeptides using all other protein sequences except OLFR genes of the mouse to derive a reference set of the expected frequencies for the 400 dipeptides. In parallel, we also separately calculated the frequencies of the 400 dipeptides, including the 49 dipeptides that can encode the YY1 core motif, using only the protein sequences of the mouse OLFR genes. The frequencies derived from the OLFR genes (Observed Values) were compared with the reference set (Expected Values) via the Z-score to identify any over-represented dipeptides within the OLFR genes (Figure (Figure4a)4a) (Table (Table1).1). This global comparison did not immediately identify any dipeptides that are unusually enriched within the protein sequences of the OLFR genes. However, according to the detailed analyses using the Z-score values, two dipeptides (PL and AI) among the 49 dipeptides for the YY1 core motif showed significant enrichment (P < 0.01, Figure Figure4b).4b). Even so, this enrichment is limited to the 4 bp YY1 core binding site and not the complete 10 bp motif which we used to calculate YY1 density in OLFR genes. Furthermore, the reverse-translated sequences of these two peptides (PL and AI) have 24 and 12 fold degeneracy, respectively. This means that only one out of 24 PL or 12 AI dipeptides may contain the actual core motif of the YY1 binding consensus sequence. Therefore, the detection of 4 to 8 YY1 binding motifs per one OLFR gene cannot be simply accounted for by the serendipitous overlap between the YY1 core motif and a subset of codon combinations encoding frequent peptide motifs of the OLFR genes.
As a second measure, we carefully analyzed the positions of the identified YY1 binding motifs within the ORFs of the OLFR genes (Figure (Figure5).5). Conservation in location of YY1 binding site would indicate conservation of the encoded dipeptide motif whereas a lack of conservation in motif location is consistent with selection for the presence of the motif at the DNA level. The full length YY1 binding motifs within the OLFR genes do not show any patterns in their relative positions to the protein sequences of the OLFR genes as would be detected by conserved dipeptide motifs. However, there is a bias in the orientation YY1 binding sites in OLFR genes. OLFR genes contain an average of 3.4 forward sites for every reverse site per 1 kilobase. Non-OLFR genes show only a slight bias of 1.2 reverse sites for every forward site per 1 kilobase. In contrast, a similar analysis on the members of the histone 4 gene family (Hist1 h4) resulted in a different outcome. Two potential YY1 binding motifs were found within the protein-coding regions of this gene family, but the relative positions and orientations of these two motifs are identical and fixed among all the members of this gene family in mouse. The two identified YY1 binding motifs also coincide with two conserved peptide motifs (AM, and IA). Given the high levels of sequence conservation detected within the members of the histone 4 family (95% sequence identity), it is uncertain whether the identified YY1 binding motifs are genuine cis-regulatory elements or simply reflecting the serendipitous sequence match between the YY1 core motifs and some combinations of codons encoding conserved peptides. This pattern is in stark contrast to the random patterns associated with the position of YY1 binding motifs in the OLFR genes. This random pattern supports our initial idea that the clusters of the YY1 binding motifs within the OLFR genes most likely have been formed and maintained as cis-regulatory elements by purifying selection at the DNA sequence level. The bias favoring forward orientation may be an indication that forward sites enhance the efficiency of YY1 suppression of olfactory receptor genes. It is conceivable that multiple YY1 binding sites have been selected in OLFR genes with a preference for directionality as well as number. Alternatively, the non-random placement and orientation in histone 4 genes along with their extreme conservation prohibits a determination of whether selection is occurring at the DNA level or the protein level.
In the current study, we have shown an unusual enrichment of YY1 DNA-binding motifs in the OLFR gene family of mammals. About half of the members of the OLFR gene family have a range of 4-8 YY1 binding motifs within their protein-coding regions (Figures (Figures11&2). Statistical analyses further confirmed that this enrichment of YY1 binding motifs is consistent with functional relevance and has likely been driven by unknown selection pressure at the DNA level. Overall, the current study suggests a potential role of YY1 or YY1-related transcription factors in the regulation of the mammalian OLFR genes. Also, this study provides further evidence that the protein-coding regions of vertebrate genes can both encode codon information and contain cis-regulatory elements for transcription factor binding.
The mammalian OLFR genes are expressed primarily in olfactory neurons, and only one single copy (allele) out of the entire 1,000 family members is expressed and functional in a given neuron cell . This highly tissue-specific expression pattern of the OLFR genes necessitates a global repression mechanism for the majority of the OLFR genes in neural cells and in the other cell types. This mechanism acts prior to, and is separate from the negative feedback which prevents bi- and multi-allelic expression in olfactory neurons [29,30]. Since unknown mechanisms are believed to repress a large number of the OLFR genes all the time in most of the cell types, it is likely that these repression mechanisms are mediated through epigenetic modifications. Since YY1 is a well-known epigenetic regulator in the animal genome, it is plausible to propose that the identified YY1 binding motifs may play some roles in the predicted repression mechanisms for the OLFR genes . In that regard, it is important to note that YY1 is one of the Polycomb group Gene (PcG) members found in both invertebrates and vertebrates . Furthermore, recent studies hinted at the possibility that PcG-mediated repression mechanisms might be involved in the regulation of the OLFR genes . In mutant mouse embryonic stem cells lacking the embryonic ectoderm development (EED) protein, some OLFR genes do not replicate asynchronously as they differentiate, suggesting a loss of the typical pattern of monoallelically expressed genes . According to the results from another recent study, some cis-regulatory elements responsible for selecting one active OLFR copy in a given neuron cell are predicted to be located within the protein-coding regions of the OLFR genes . This is intriguing and consistent with the observation of the current study in that some critical cis-elements are located within the protein-coding regions of the OLFR genes. The genetic code is optimal for containing multiple layers of encoded information within protein-coding regions . In sum, although further investigation is warranted, it is likely that the identified YY1 binding motifs are functionally important cis-regulatory elements for the regulation of the OLFR genes.
The YY1 binding motifs identified within the OLFR genes are unusual since they are localized within the protein-coding regions of these genes and are present at high density. This is quite different from the typical pattern in which transcription factor binding sites (cis-regulatory elements) are located in the genomic regions surrounding the protein-coding regions of genes. OLFR genes give a fitness advantage by duplication and divergence while remaining active, yet must also retain regulatory information. According to our statistical analyses (Figure (Figure33&4), the identified YY1 binding motifs within the OLFR genes likely represent evolutionarily selected cis-regulatory elements. Previous studies have shown high levels of YY1 binding affinity to the type of YY1 binding motifs found within OLFRs . Though the functionality of the identified YY1 binding motifs remains to be demonstrated in vivo, it is plausible that some functional constraints serve to maintain the YY1 binding motifs within the protein-coding regions of the OLFR genes. In one scenario these motifs might be linked to the sudden expansion of this gene family in mammalian genomes. The copy number of the OLFR genes has increased dramatically in recent evolutionary times in mammalian genomes, providing a large number of receptor proteins for airborne scents . The clustering of the OLFR genes in chromosomal regions also suggests this gene family may have been duplicated through in situ tandem duplications . Tandem duplication is known to be the most frequent mechanism in increasing copy numbers for gene families . However, it is not well understood how this mechanism carries over the proper information for the transcriptional regulation of duplicated gene copies. In the case of the mammalian OLFR genes, their protein-coding regions may have both information for codon and cis-regulatory elements so that the duplication of these genes would most likely guarantee their coding potential as well as associated transcriptional control. This might have been one functional constraint for the co-evolution of the YY1 binding motifs within the protein-coding potential of the OLFR genes.
The current study reports that an unusual enrichment of YY1 binding sites, 4-8 binding sites per gene, are located in the coding regions of olfactory receptor genes in mammals. According to statistical analyses, these YY1 binding sites most likely have been selected as cis-regulatory elements. Also, similar patterns are found in other mammals, but not in fish, suggesting a mammalian-specific phenomenon. This study further suggests YY1 or YY1-related transcription factors as regulators of mammalian OLFR genes.
A custom Perl script, matrix-bidirectional.pl, was run against mouse chromosome files available from Ensemble (version NCBIM37.49) ftp://ftp.ensembl.org/pub/current_fasta/mus_musculus/dna/ (see Additional file 1) . The program calculates score by matching a 10 bp window to a matrix of the likelihood for each position (Additional files 2 &3) along the 10 bp consensus sequence (the Position Weight Matrix, PWM). Each base pair is given a value equivalent to the decimal percentage of its match to the known YY1 binding motifs. The four base pair core, CCAT, is scored at 100% for each base plus 2. Flanking bases vary in score from 0 to 1 based upon their frequency in known YY1 binding motifs. The total score for each 10 bp window is calculated and compared to our cutoff score of 8.0 which indicates a good match. Our previous work revealed that scores above 8.0 correlate with good YY1 binding in vitro [1,35]. Output was generated in the WIG format for each chromosome with each YY1 binding motif score represented by start and end position and bar height corresponding to the PWM score match. Position weight matrix scores and location information were loaded into the University of California, Santa Cruz (UCSC) Genome Browser for visualization of YY1 location (Figure (Figure1)1) .
Our Perl script was run against the mouse, human and zebrafish mRNA available from NCBI ftp://ftp.ncbi.nih.gov/genomes/. Each gene was scored for number and quality of YY1 motifs. Results were sorted by a YY1 density score, the combined score of YY1 motifs divided by the length of the gene. Predicted and hypothetical genes were removed from the mouse yielding 40,009 total genes (19,818 removed, 20,191 remaining), with 19,083 non-olfactory and 1,108 olfactory genes. Hypothetical genes were removed from the human transcriptome, yielding 24,886 total genes with 24,506 non-olfactory and 380 olfactory genes. Hypothetical genes were removed from the zebrafish transcriptome, yielding 9,092 genes with 9,067 non-odorant genes and 25 odorant receptors. We removed the hypothetical and predicted genes because they may not contain complete ORFs. The upstream 1 kb regions of 20419 Refseq genes in the mouse was obtained from UCSC http://hgdownload.cse.ucsc.edu/downloads.html.
Histograms were made in Microsoft Excel 2007 by sorting the genes by YY1 density from high to low, then assigning a count to order the genes. Olfactory receptor genes were separated from non-olfactory receptor genes and the count numbers were used as position information to make a histogram which shows the distribution of OLFRs along the range of YY1 containing genes (Figure (Figure22).
The mouse proteome was downloaded from NCBI ftp://ftp.ncbi.nih.gov/genomes/M_musculus/protein/protein.fa.gz which contains 34,966 peptide sequences and nomenclature. Olfactory receptor proteins (1,178) were separated from non-olfactory proteins (33,788). A Perl script, dipep-singlefile.pl was used to generate a count of each of the 400 possible dipeptides in each of these groups (see Additional file 4).
A Z-test was performed according to the formula Z = (observed - μ)/δ where observed is the count of each possible dipeptide found in olfactory receptors, μ is the mean of the counts from the whole population of non-olfactory proteins, and δ is the standard deviation of the population count. Z-score units are given in standard deviations from the mean. Expected count was calculated by multiplying the frequency of each dipeptide from all non-olfactory receptor proteins by the total number of dipeptides seen in OLFR proteins. Figure Figure44 shows the plot generated in Microsoft Excel 2007 comparing the observed to expected ratio for all 400 possible dipeptides and the subset of 49 YY1 core-forming dipeptides which exhibits no difference in the distribution of the subset.
We found 21 of the 49 dipeptides which can make up a YY1 binding site had greater than expected values, but only 2 were over 3 standard deviations away from the mean (Table (Table1,1, Figure Figure4).4). Table Table11 shows only the dipeptides in which the observed count in olfactory genes was higher than the expected count in mouse.
Global alignment was done using ClustalW with 1,178 OLFR and 6 hist1 h4 amino acid sequences from mouse (Figure (Figure5)5) . All Perl scripts are available for download on our website at http://JooKimLab.lsu.edu.
YY1: Yin-Yang 1; SINE: Short interspersed element; LINE: Long interspersed element; OLFR: Olfactory receptor; OR: Odorant receptor; PWM: Position weight matrix; UCSC: University of California at Santa Cruz; ORF: Open reading frame; WIG: Wiggle format; NCBI: National center for biotechnology information.
CF carried out the coding, genome scanning, performed the statistical analyses, and participated in the design of the study. JK conceived of the study, and participated in its design and coordination. All authors read and approved of the final manuscript.
Position Weight Matrix Perl program. Perl program "matrix-bidirectional.pl.txt" uses position weight matrix files and fasta formatted input files to generate positional and scoring data for YY1 binding sites.
YY1 forward matrix. Position weight matrix file for "matrix-bidirectional.pl.txt" named "YY1_F.matrix".
YY1 reverse matrix. Position weight matrix file for "matrix-bidirectional.pl.txt" named "YY1_R.matrix".
Dipeptide count Perl program. Perl program "dipep-singlefile.txt.pl" requires fasta formatted input files to generate dipeptide counts.
We thank Drs. John Caprio, Mike Hellberg, and Andrew Whitehead for critically reading the manuscript. We also thank the anonymous reviewers for helpful comments. This research was supported by National Institutes of Health R01 GM66225 (J.K).