|Home | About | Journals | Submit | Contact Us | Français|
Members of the large ETS family of transcription factors (TFs) have highly similar DNA-binding domains (DBDs)—yet they have diverse functions and activities in physiology and oncogenesis. Some differences in DNA-binding preferences within this family have been described, but they have not been analysed systematically, and their contributions to targeting remain largely uncharacterized. We report here the DNA-binding profiles for all human and mouse ETS factors, which we generated using two different methods: a high-throughput microwell-based TF DNA-binding specificity assay, and protein-binding microarrays (PBMs). Both approaches reveal that the ETS-binding profiles cluster into four distinct classes, and that all ETS factors linked to cancer, ERG, ETV1, ETV4 and FLI1, fall into just one of these classes. We identify amino-acid residues that are critical for the differences in specificity between all the classes, and confirm the specificities in vivo using chromatin immunoprecipitation followed by sequencing (ChIP-seq) for a member of each class. The results indicate that even relatively small differences in in vitro binding specificity of a TF contribute to site selectivity in vivo.
We currently know very little about the molecular mechanisms that control tissue-specific gene expression, and about the variations in gene expression that underlie many pathological states, including cancer. This is in part due to the lack of information about the ‘second genetic code'—the binding specificities of transcription factors (TFs). Deciphering this regulatory code will allow us to explain observations (i.e. ChIP, expression profiling) based on biochemical principles. The ultimate aim is to read the genetic code of gene expression, that is, understand the expression of genes based on DNA sequence.
To begin to address these questions, we have in this work concentrated on the study of the large ETS family of TFs, whose members have diverse functions and activities in physiology and oncogenesis (Bartel et al, 2000; Sharrocks, 2001; Kumar-Sinha et al, 2008). The first ETS factor identified was ETS1, which was discovered as a homolog of the avian leukaemia virus E26 oncogene in 1983 (Leprince et al, 1983; Nunn et al, 1983). Subsequent analyses have identified a total of 27 and 26 ETS-family members in human and mouse genomes, respectively (Bult et al, 2008).
ETS factors have both developmental functions (Schober et al, 2005), and functions in differentiated tissues and cells (Bartel et al, 2000). They are critical for vasculogenesis/angiogenesis, hematopoiesis and neuronal development (Bartel et al, 2000; Vrieseling and Arber, 2006). Cellular responses to activated ETS factors include cell proliferation, differentiation and migration (Sharrocks, 2001; Schober et al, 2005), depending on the type and state of the responding cell.
Many ETS proteins, including ETV4 in mammals and Yan in Drosophila are transcriptional targets of signalling pathways (Schober et al, 2005; Vrieseling and Arber, 2006). Activity of ETS proteins can also be modulated directly by phosphorylation; members of the ETS and ELK subgroups of ETS factors mediate transcriptional responses to Ras/MAPK signalling pathways in species ranging from Caenorhabditis elegans to humans (Brunner et al, 1994; O'Neill et al, 1994; Beitel et al, 1995; Sharrocks, 2001). The mechanism of activation of the ELK factors in response to activation of Ras also appears to be conserved between species (Wasylyk et al, 1997).
Translocations altering the activity of several members of the ETS family are associated with multiple types of human cancer. In translocations observed in some cancer types, the ETS DNA-binding domain (DBD) is lost, and the ETS partner contributes a regulatory domain to another class of DBD (e.g. ETV6-RUNX1; Golub et al, 1995; Mavrothalassitis and Ghysdael, 2000). More commonly, the cancer-associated translocations result in fusion of a strong transcriptional activator domain to the ETS DBD (e.g. EWS fused to FLI1 or ERG in Ewing's sarcoma; Delattre et al, 1992; Sorensen et al, 1994) and/or overexpression of an ETS-family member due to introduction of a strong cis-regulatory element upstream of it (Tomlins et al, 2005, 2007). In fact, the most common known cancer-associated translocation is the TMPRSS2-ERG fusion, which introduces a strong androgen receptor (AR)-dependent regulatory element upstream of the ERG gene (Tomlins et al, 2005). Together with other translocations involving ETV1 and ETV4, over 50% of all prostate cancer cases display hyperactivity of ETS proteins (Kumar-Sinha et al, 2008).
All ETS factors share a conserved winged helix-turn-helix DBD of ~85 amino acids, and all analysed members of this family bind to a consensus DNA sequence containing a core 5′-GGA(A/T)-3′ motif (Karim et al, 1990; Nye et al, 1992). On the basis of phylogenetic analysis of the DBDs, the ETS family has been subdivided into 12 different subgroups (Laudet et al, 1999; Hollenhorst et al, 2007). Thus, although all ETS DBDs are relatively highly conserved, different ETS proteins might exhibit a preference for different flanking sequences to differentially bind to specific DNA sites, and thus regulate distinct biological processes. However, there exists no systematic and uniform analysis of ETS-binding specificities, and whether differences in binding specificity (if any) relate to targeting in vivo. An earlier analysis showed that there were differences between published motifs for different ETS-family members, but these differences did not reflect amino-acid features, and might be due to differences in the experimental methods used in different studies (Kielbasa et al, 2005).
In this work, we describe the first comprehensive genome-wide analysis of binding specificities of the ETS TF family. We find that the ETS-family DNA-binding specificities fall into four distinct classes, and confirm this finding by identifying the key DNA-contact amino acids that contribute to class specificity. We further perform ChIP-seq analyses for representative ETS factors to map the ETS-binding sites in vivo in Ewing's sarcoma, leukaemia and prostate cancer cells. These analyses provide a systematic genome-wide map of ETS DNA-binding specificities in vitro and in vivo. Remarkably, the genome-wide data reveal that even small differences in ETS DNA-binding preferences can contribute to in vivo targeting specificities.
To determine the binding specificities of the ETS factors, we first cloned all human and mouse ETS DBDs and human ETS full-length cDNAs (Figure 1; Supplementary Table S1). Two parallel methods were used to independently determine relative DNA sequence-specific binding affinities: high-throughput microwell-based TF DNA-binding specificity assay (Hallikas and Taipale, 2006; Hallikas et al, 2006) and protein-binding microarrays (PBMs; Berger et al, 2006). As these two strategies are based on different principles, they act to complement and cross-validate each other.
In the microwell-based assay, human and mouse ETS DBDs were expressed as fusion proteins to a Renilla luciferase enzyme. The TF-Renilla luciferase fusion proteins were incubated with biotinylated double-stranded oligonucleotide containing a sequence with high affinity to all known ETS factors in the presence of an excess of mismatched competitor oligos. The binding data were then analysed to produce a position weight matrix (PWM) of the TF-binding site.
Independent analysis of the mouse ETS family was carried out using PBMs, which allow determination of TF-binding specificities through sequence-specific binding of individual TFs directly to double-stranded DNA microarrays containing all possible 10-mer binding sites (Berger et al, 2006).
Both methods generated similar binding profiles (Figure 1; Supplementary Table S2), with all of the ETS factors binding to the previously described core GGA(A/T) motif. Of the 27 factors we studied, 13 had been previously analysed using different methods to yield a partial binding specificity. Our results were similar, but not identical, to these earlier studies, as described in the following sections.
We next analysed the differences in the obtained profiles to determine which ETS factors have similar binding specificities. For this purpose, we developed a computational method that allows determination of similarity between TF motifs using the minimum Kullback–Leibler divergence between all translations and reverse complementations of the multinomial distributions defined by the motifs. This analysis revealed that all ETS profiles were relatively similar to each other, and clearly divergent from publicly available non-ETS TF-binding profiles (Figure 2). The ETS-binding profiles fell into four distinct classes (Figure 2), containing 15, 8, 3 and 1 member(s), respectively. These classes were robustly identified using results either only from the microwell-based method (Figure 2), from only the PBM method (Figure 3A and B) or from the combination of the two (Supplementary Figure S1A). The classes were named according to their respective sizes, with class I being the largest group, containing the cancer-associated ETS factors ERG, ETV1, ETV4 and FLI1. Consistent with earlier results (Kielbasa et al, 2005), clustering analysis of ETS factors available from current databases and literature did not yield a clear classification of sites (Figure 3C; Supplementary Figure S3). However, the classes we obtained do show a clear relationship to groupings based on amino-acid features (see below and Discussion).
The main differences between our motifs are concentrated on the core +4 position and 5′ flanking base pairs. Although the consensus sequences of the ETS factors are relatively similar, many somewhat weaker sites are much more class specific or exclude one or more classes of ETS DBDs (Supplementary Figure S2). Only the difference between other ETS-family members analysed and the lone class IV factor SPDEF has been identified earlier (Oettgen et al, 2000).
In general, the class definitions derived using hierarchical clustering seemed to be largely sufficient to explain the differences between the ETS-family members. However, ETV6 and ETV7 appeared to have subtly different binding specificity at +4 compared with the other members of class II (Figure 1), and in this way resembled more the class III factors. We therefore propose subclassification of class II into class IIa containing the ELF-family factors, and class IIb comprising ETV6 and ETV7.
To analyse the molecular basis of the differences in ETS-binding specificities, we investigated the amino acid-DNA contacts in published crystal structures of ETS1, GABPA, ELK1, ELF3, SPI1 and SPDEF–DNA complexes (Kodandapani et al, 1996; Batchelor et al, 1998; Mo et al, 1998, 2000; Garvie et al, 2001; Verger and Duterque-Coquillaud, 2002; Pufall et al, 2005; Wang et al, 2005; Lamber et al, 2008; Agarkar et al, 2010). The invariant GGA core bases of the ETS-family profiles are consistent with the absolute conservation of two key DNA-contacting arginines in Helix3 (Figure 4A, black). Most of the differences in DNA-binding specificity at particular bases, in turn, correlate with corresponding changes in residues contacting DNA at or near these bases (Figure 4B). The preference of the lone class IV factor, SPDEF for T at +4 correlates with the presence of serine and glutamine at DNA-contact residues 9 and 11, respectively. Recent crystal structure analysis of SPDEF–DNA complex suggested that combination of these residues is responsible for the preference of T at +4 (Wang et al, 2005). We confirmed the importance of these two residues by mutagenesis followed by microwell-based DNA-binding specificity assay (Figure 4C).
Class I factors are characterized by low affinity to C in the −3 position. This change correlates with a substitution of a leucine that contacts DNA backbone at −2 and −3 with a phenylalanine or tyrosine (Figure 4B, red). Mutation of the leucine residue to either tyrosine or phenylalanine in the context of the class II factor ELF4-DBD resulted in a clear shift of specificity towards class I at −3 position, confirming the importance of this amino acid for the differences in specificity between class I and the other classes (Figure 4D).
Whereas Class IIa factors did not have major features that differentiated them from all other classes, class IIb factors displayed strong preference for A at +4. This change correlated with a substitution of a key tyrosine by a histidine. Mutation of this residue in ELF4-DBD increased binding to sequences containing A at +4, confirming the importance of this residue in class IIb specificity (Figure 4D).
Class III factors were characterized by preference of G and C at −2 and −1, respectively, strong preference for A at +4, and relatively strong binding of sites with a C at +5. Many amino-acid residues within the DBD are specific for class III (Figure 4B). As the tyrosine that affects specificity at +4 is replaced in class III by an asparagine, we first investigated the function of this amino-acid change. Mutating the tyrosine to an asparagine in ELF4-DBD led to a clear change of specificity towards that of class III both at +4 and +5 (Figure 4D).
Preference for G at −2 in class III could, in turn, be due to substitution of a tyrosine—which contacts the DNA backbone between bases −1 and −2—with a glycine or an alanine (Figure 4B). However, mutation of this residue in ELF4-DBD to either glycine or alanine had no impact on specificity. In contrast, mutation of a class III-specific glutamate that contacts water molecule between positions −1 and 1 (Kodandapani et al, 1996) to a glutamine residue led to a clear change of specificity of ELF4 towards class III at positions −1 and −2 (Figure 4D).
Taken together, these results indicate that the ETS DNA-binding specificities fall into four clearly distinct classes (Figures 2, 3A and B), and that the molecular mechanisms of the differences in four kinds of ETS-binding motifs are due to amino-acid divergences at defined DNA-contacting residues (marked 5, 9, 11 and 14 in Figure 4A). The observed changes in binding specificity are also consistent with crystal structures for the ETS-family members (see Figure 5; Kodandapani et al, 1996; Batchelor et al, 1998; Mo et al, 1998; Mo et al, 2000; Garvie et al, 2001; Pufall et al, 2005; Wang et al, 2005; Lamber et al, 2008; Agarkar et al, 2010).
To further validate the binding profiles, and to examine binding of ETS factors to DNA in vivo, we used ChIP-seq to determine occupied sites for each class of ETS factors in cell lines. Antibodies to endogenous proteins were used in all experiments. Class I and class IV factors ERG and SPDEF were analysed in the androgen-dependent prostate cancer cell line VCaP, and the class II and class III factors ELF1 and SPI1 were analysed in leukaemia cell lines Jurkat and HL60, respectively. In addition, we included in the analysis two oncogenic ETS-fusion proteins, EWS/ERG and EWS/FLI1, which were analysed in the Ewing's sarcoma cell lines CADO-ES1 and SK-N-MC, respectively. We also performed ChIP-seq using two additional antibodies in VCaP cells, histone H3 lysine 4 monomethylation to identify potential enhancers, and the non-ETS TF, AR, which served as a positive control. IgG was used as a negative control in all cell lines (see Materials and methods for details). Analysis of the results revealed between 2142 (ERG) and 98 290 (H3K4 monomethylation) significantly enriched regions (P<0.005), which we refer to as ‘peaks' hereafter (Supplementary Table S4; Supplementary data files S2, S3, S4, S5, S6, S7, S8, S9). Randomly selected ChIP-seq peaks were confirmed using ChIP-qPCR (see Materials and methods; Supplementary Table S5 for details).
For the ETS-fusion proteins EWS/FLI1 and EWS/ERG, the ChIP-seq peaks were located relatively evenly in the genome. The non-fusion ETS-family factors ERG, SPDEF and SPI1 demonstrated a relatively small enrichment near transcription start sites, whereas ELF1 displayed a very strong preference for binding near TSSs (Figure 6A; Supplementary Table S4). Consistent with earlier results, we also observed an enrichment of peaks within potential enhancer elements identified by proximity to nucleosomes where Histone H3 lysine 4 is monomethylated (Supplementary Table S4; Heintzman et al, 2007; Hollenhorst et al, 2009).
To determine whether the ChIP-seq peaks were near genes regulated by the respective ETS factors, we used RNAi to downregulate FLI1 in SK-N-MC Ewing's sarcoma cells. Two different siRNAs were used to rule out off-target effects (Echeverri et al, 2006), and two biological replicates were used for each siRNA to decrease noise. The affected genes included novel and previously known targets of FLI1 (Tirode et al, 2007; Supplementary Table S3). FLI ChIP-seq peaks were strongly enriched near transcription start sites of the FLI1-target genes (Figure 6B). Although the FLI1 peaks in general were distributed relatively evenly in the genome, the enrichment of FLI1 peaks near target genes was strongest very close to the TSS, suggesting that many of the more distal peaks have little impact on transcription, whereas more proximal peaks are more likely to affect gene regulation.
Given the relatively similar DNA-binding specificities of all the ETS-family members, an important question is whether the specificity of binding comes from the observed relatively small differences in protein–DNA binding affinity, or whether direct protein–protein interactions, or more complex chromatin-mediated effects dominate.
To address this, we analysed the ChIP-seq peak sequences to determine the relative enrichment of sequences that bind with high affinity to the different ETS classes in vitro. All ChIP-seq peak sequences were strongly enriched in matches to matrices representing specific ETS classes, and in each case, the most enriched class corresponded to the class of the factor analysed by ChIP-seq (Figure 6C).
To analyse the sequence characteristics of the peaks further, we selected sequences from the 150 most significant peaks that were narrower than 400 bp for each ChIP-seq experiment. Searching these sequences for overrepresented motifs using MEME (Bailey and Elkan, 1994) revealed a clear ETS-family signature in experiments analysing each ETS class. The bases characteristic for the different ETS classes were also confirmed by the MEME analysis (Figure 6D). However, consistent with earlier results for FLI1 (Gangwal et al, 2008), analysis of EWS/ERG and EWS/FLI1 peak sequences resulted in identification of a motif resembling GGAA microsatellite repeats (Supplementary Figure S8A). This result suggests that the EWS-fusion alters the binding site selectivity of the ETS proteins.
Although high-affinity in vitro sites were strongly enriched near the summits of the ChIP-seq peaks, only a small fraction of such sites in the whole genome were occupied in the cell lines tested. In ERG and SPDEF ChIP-seq peaks from VCaP, the fraction of high-affinity sites that were occupied was much higher within 500 bp of regions that were positive for histone H3 lysine 4 monomethylation, a known marker for nucleosomes that flank enhancer regions (Heintzman et al, 2007; Supplementary Table S4). These results suggest that accessibility of DNA is a major determinant of site occupancy.
We next analysed the overlap between the sites occupied by the ETS factors (Figure 7A). As expected, analysis of the same factor in different cells (ERG in VCaP and CADO-ES1) or different factors in the same (ERG and SPDEF both in VCaP) or similar (FLI1 and ERG in Ewing's sarcoma; ELF and SPI1 in leukaemia) type of cell appeared to increase overlap of the peaks. Still, the majority of the peaks were specific in all experiments, with overlap ranging between 0.9% (EWS/FLI1 and SPDEF) and 25.8% (ERG and SPDEF). Strikingly, overlap between ERG peaks from VCaP and the control AR peaks was much higher than for any other pair of factors; 44.4% of all ERG peaks in VCaP cells overlapped with AR peaks (Figure 7A). GO enrichment analysis revealed that regions near common AR and ERG peaks are enriched in genes involved in nucleosome and chromatin assembly (data not shown).
We further performed comparisons with earlier reports. In all, 171 (51.3%) of total 333 ELF1-specific occupancy regions discovered by ChIP-chip in Jurkat (Hollenhorst et al, 2007) overlapped with our ELF1 ChIP-seq peaks. For comparison, we also analysed overlap of our ELF1 ChIP-seq peaks with recent ChIP-seq data of the ETS-family members ETS1 and GABPA in Jurkat cells (Valouev et al, 2008; Hollenhorst et al, 2009) (see Materials and methods; Supplementary Table S4 for details). Earlier studies showed the redundant occupancy of the ETS-family members at promoter proximal regions (Hollenhorst et al, 2007, 2009). Indeed, comparisons between overlap results from top significant peaks of ETS1, GABPA and ELF1 showed higher overlap of peaks in promoter regions (Supplementary Table S4). Outside promoters, it appears that there is a higher overlap between class I factor ETS1 and GABPA peaks compared to overlap of either ETS1 or GABPA with the class II factor ELF1 peaks (Supplementary Table S4).
As many ETS-family members are known to form composite sites with other TFs, we also analysed the ERG peaks that overlapped with AR peaks to see whether the strong overlap between ERG and AR could be explained by the presence of such a composite site in the overlapping peaks. MEME analysis of the overlapping peaks yielded separate ETS and AR signatures, with no obvious composite site (Supplementary Figure S8B).
We report here the binding specificities of all human and mouse ETS-family TFs. Earlier phylogenetic analyses based on the ETS domains using the CLUSTALW algorithm have classified these factors into 12 different groups (Laudet et al, 1999; Hollenhorst et al, 2007). We report here that the ETS-domain DNA-binding specificities fall into only four major distinct classes, which we name classes I to IV based on the number of members in each class. Class II is further subdivided into classes IIa and IIb, based on a more subtle change in binding specificity within this group. The binding specificity classification we report here is broadly similar to that generated by aligning the ETS-domain peptide sequences from multiple species using the MUST program (Laudet et al, 1999) but differs from that generated by ClustalW using just human sequences by Hollenhorst et al (2007) or by us (Supplementary Figures S4 and S5). A tree that is more consistent with our binding specificity data was also obtained using the Prank algorithm (Supplementary Figure S6), which uses phylogeny-aware gap placement and scores insertions and deletions more accurately than other alignment programs (Löytynoja and Goldman, 2005, 2008). Alignment analysis of the ETS domains by Prank fails only to clearly identify the class II factors as a single group; instead, the class II members are placed in several branches. One possible explanation is that the class II represents the ancestral specificity from where the other classes diverged.
Of the four classes of the ETS factors, crystal structures exist for classes I, IIa, III and IV. Our results are consistent with these structures, and the changes in binding specificity correlate with changes in amino acids contacting DNA at or near the base pairs affected (Figure 5). Our results also suggest that further structural studies of ETS factors should concentrate on class IIb factors, and cocrystals between ETS factors and other TFs.
Earlier studies have suggested that further specificity exists within the class I ETS DBDs; Elk4 has been reported to have higher specificity towards sequence ACAGGATGT than Elk1 (reviewed in Verger and Duterque-Coquillaud, 2002). These earlier results are based on SELEX with only 11 and 20 sequences analysed (Shore and Sharrocks, 1995; Shore et al, 1996) and are not statistically significant (not shown). We do not observe significant differences between Elk1 and Elk4 DNA-binding specificity in either of the independently performed DNA-binding specificity assays, suggesting that the differences in binding observed earlier were related to changes in overall affinity of the ETS factors to DNA as opposed to effect on specificity of binding to different sequences. This interpretation is also supported by existing crystal structure data (Mo et al, 1998), which indicates that Tyr 65 of Elk4 makes base-specific contacts with both T and A at +4, whereas the corresponding tyrosine in Elk1 does not contact DNA at all. This should result in lower overall affinity of Elk1 to DNA.
The broad biological functions of the distinct classes of ETS factors do not appear to be clearly separate, for example class I, IIb and III factors are all involved in hematopoiesis, albeit at different steps (Bartel et al, 2000). Further class-specific functions could, however, be revealed by combining multiple knockouts within the same ETS class. Gain-of-function evidence does suggest that there is some class specificity in biological functions, as only class I ETS DBDs are found in cancer-associated fusion proteins. Further analysis of the critical targets of the oncogenic ETS factors is needed to reveal the basis of this selectivity.
Existing information seems to indicate that different ETS-family members have divergent binding specificities, and no clear subgroups could be identified based on clustering of the existing binding profiles (Figure 3C). Our results show that the previously observed differences are largely attributable to experimental variation and different methods used rather than actual differences in specificity. Similarly, individually analysed homeobox-TF-binding profiles were found to be more divergent compared with those reported in two systematic studies (Berger et al, 2008; Noyes et al, 2008). Given the diversity of the currently available binding profiles, even for the same factors (see Figure 3C), it is important that systematic approaches such as those described here are extended to all TFs.
These results also highlight the improved precision and accuracy that can be obtained in high-throughput experiments, where all experiments are similarly designed and carefully controlled. Protein–DNA interaction analyses necessarily measure a relatively large number of individual affinities. This makes the assay type and laboratory variation inherent in single-protein studies more apparent than comparison between high-throughput and single-protein studies in other fields (such as protein–protein interactions, where only one Kd is measured).
The differences between our completely independently performed microwell and PBM assays were relatively small, and comparison using either data set alone did not reveal more classes of factors than comparisons including all data (see Supplementary Figure S1A). However, principal component analysis of the matrices could be used to separate matrices based on the assay used (see Supplementary Figure S1B). Thus, there was a minor systematic difference between microwell and PBM-derived matrices, but its magnitude was so small that it is unlikely to affect downstream analyses.
The ChIP-seq experiments confirmed that the in vitro specificity analyses were relevant also for TF-binding in vivo (Figure 6C and D). We found that high-affinity sites were strongly enriched near the summits of the ChIP-seq peaks, and that a substantial fraction of high-affinity sites within accessible genomic regions were occupied (Supplementary Table S4). In addition, comparison of ChIP-seq data to expression profiling data revealed that only a small fraction of FLI1 ChIP-seq peaks are located in close proximity to genes that are strongly regulated by loss of FLI1. Whereas FLI1 peaks were located randomly with respect to transcription start sites, enrichment of FLI1 peaks near FLI1-target genes was clearly higher near the TSS (Figure 6B). It thus appears that most occupied sites have little if any effect on gene expression, and that proximity of the occupied site to a TSS is important in determining whether an ETS binding event affects transcription. However, it is well established that long-range enhancers have pivotal functions in mammalian gene regulation (Heintzman and Ren, 2009), and our results do not mean that all distal sites are non-functional.
Although overlap was observed between the different ETS classes, most of the peaks observed were specific for a given factor, underscoring the specificity of the ETS-family members. A notable exception was found in the VCaP prostate cancer cell line. Activity of two classes of TFs, the AR, and the class I ETS factors ERG, ETV1 and/or ETV4 are implicated in prostate cancer. We found here that a very large fraction (44.4%) of ERG occupied sites are close to sites occupied by AR, and 30.4% of these sites are also bound by SPDEF (not shown). These results suggest that the translocation/deletion that places a strong androgen-responsive element from the TMPRSS2 gene adjacent to the ERG-coding sequences will result in aberrant feed-forward regulation of target genes common to both AR and ERG (Figure 7B).
The human genome contains large number of TFs contributing to complex gene regulation, accurate developmental patterning and growth control (Messina et al, 2004). Many classes of TFs, including members of the ETS and HOX families have relatively similar binding specificities (Berger et al, 2008; Noyes et al, 2008). Despite similar specificities and overlapping expression patterns (Galang et al, 2004; Hollenhorst et al, 2004; Richardson et al, 2010; Supplementary Figure S7), loss-of-function studies have revealed that ETS-family TFs have very specific functions during development (Bartel et al, 2000). An important question is how such specificity is achieved despite relatively similar DNA-binding specificity.
We provide here direct evidence that even the relatively small differences in ETS-domain DNA-binding specificity affect in vivo site occupancy. Although the consensus sequences of the ETS factors are very similar, many somewhat weaker sites are much more class specific or exclude one or more classes of ETS DBDs (Supplementary Figure S2). Such selectivity is clearly evident also in our ChIP-seq analyses; we found clear enrichment of class I ETS-binding sites over binding sites of the other classes in regions immunoprecipitated by the class I ETS-family members ERG and FLI1. Similar enrichments were observed for all the other classes as well (Figure 6C). These results indicate that the ETS-class specificities reported here contribute to site selectivity of the ETS-family TFs in vivo.
Whereas it is possible that more sensitive methods could, in the future, be used to further subdivide ETS-domain DNA-binding specificities, such differences would necessarily be even smaller than those reported here. Thus, it appears that DNA-binding specificity differences alone cannot explain the full diversity of the ETS family, as there are 27 ETS TFs and four major classes of DNA-binding specificity. One common mechanism explaining how loss of similar proteins can cause different phenotypes is that their expression patterns are different. In mouse embryos, the ETS-family members show distinct but partially overlapping expression patterns (Supplementary Figure S7), suggesting that at least part of the functional specialization within the classes can be explained by the divergent expression patterns (Richardson et al, 2010). This hypothesis is also supported by knock-in experiments that show that the class III ETS factor SPIB can replace another class III factor SPI1 (SFPI1) in mouse myeloid development (Dahl et al, 2002; DeKoter et al, 2002). In contrast, the class I ETS factor ETS1 cannot rescue SFPI1 loss.
Another important mechanism to achieve specificity involves cooperative binding of ETS factors with other TFs. The protein-binding surfaces of ETS factors are different, and different ETS factors associate with different other TFs to bind distinct composite sites (Verger and Duterque-Coquillaud, 2002). For example, the class I factors ETS1, ELK1, ELK4 and FLI1 have different binding partners. ELK1 or ELK4 can bind DNA together with SRF (Dalton and Treisman, 1992; Cooper et al, 2007; Boros et al, 2009), FLI1 associates with SMAD3 (Ravasi et al, 2010) and ETS1 can bind to composite sites with PAX5 (Garvie et al, 2001) and RUNX1 (Hollenhorst et al, 2007, 2009). In the cases analysed, the composite sites are distinct from ETS consensus sequences either at the flanking regions, or even at the core region. ETS1 and PAX5 interact to recognize an element containing a modified ETS core sequence GGAG instead of the consensus GGA(A/T) (Fitzsimmons et al, 1996, 2001; Garvie et al, 2001).
Formation of the composite sites can affect also in vivo binding specificity, and this has been demonstrated in the case of ETS1/RUNX1 (Hollenhorst et al, 2009). Although such cooperative interactions could potentially explain the differences we observe in in vivo binding for the members of the different ETS classes, we did not find obvious motifs corresponding to other TFs in our MEME analysis of the ETS factors ERG, EWS/ERG, EWS/FLI1, ELF1, SPI1 and SPDEF. This suggests that these factors partner with multiple TFs in such a way that any given composite site is present in relatively small numbers—and thus cannot be detected by the algorithm used. Improvement of methods to systematically map such interactions between ETS-family members and other TFs is needed to fully understand in vivo specificity differences within each ETS class.
Taken together, in this work, we systematically analysed the ETS family of TF DNA-binding specificities for two species (human and mouse) with a single high-throughput assay (microwell based). The DNA-binding specificities of ETS-family TFs in the mouse genome were determined by two independent methods (microwell based and PBM). Both sets of in vitro data are consistent and reveal four clear subclasses of ETS DNA-binding preferences. We further dissected molecular basis for the specificity in DNA recognition by systematic site-directed mutagenesis of key amino acids in the ETS DBDs. Through ChIP-seq mapping of ETS-binding sites in different cell models, we found that the preferences observed for ETS DNA-binding in vitro can contribute to site selectivity in vivo on a genome-wide scale.
SK-N-MC cells were grown in EMEM, Jurkat, HL60, CADO-ES1 in RPMI1640, and COS1, 293T and VCaP in DMEM. All media were supplemented with penicillin/streptomycin and fetal bovine serum (10%).
Sequences coding for the human and mouse ETS domains with 10–25 amino acids of flanking sequence (with exception of ETS domains in N- or C-terminal regions) were cloned into pMAGIC1 (Li and Elledge, 2005) or directly to pGEN expression vector (Taipale et al, 2002) from Megaman cDNA library (Stratagene) and from mouse-pooled cDNAs (mouse 12.5 days embryonic and fetal brain cDNA library; a kind gift from Professor Tomi Mäkelä, University of Helsinki). The inserts in pMAGIC1 were transferred to pMAGIC-DEST vector containing C-terminal Renilla luciferase. All the human ETS full-length cDNAs were cloned into Gateway pDONR221 vector. For expression analyses, the clones were transferred into modified pDEST40 (Invitrogen) vectors containing C-terminal triple V5 or Renilla luciferase tags.
Validation of high-throughput data was performed as follows: TF-binding assays were performed using two different methods in two different laboratories. The ChIP analyses were validated using single ChIP-qPCR with different antibodies for 11–42 randomly selected peaks (Supplementary Table S5). In expression analysis, two different siRNAs for each factor were used to rule out off-target effects (Echeverri et al, 2006), and two biological replicates were used to decrease noise. Thirty-five randomly selected up- or downregulated genes were validated using qPCR (Supplementary Table S5).
Microwell-based TF DNA-binding assay was performed as described (Hallikas and Taipale, 2006). The method is based on competition between binding sites, and measures relative sequence-specific DNA-binding affinity of a TF. Briefly, TF-Renilla luciferase fusion proteins expressed in COS1 or 293T cells were incubated with competitor oligonucleotides indicated in the presence of a biotinylated oligonucleotide containing the ETS consensus-binding sequence (Forward: ACGCTAACCGGATATAACGCTA; Reverse: TAGCGTTATATCCGGTTAGCGT) (Nye et al, 1992; Woods et al, 1992; Hallikas et al, 2006). A scrambled oligonucleotide (Forward: ACGCTAAACAGTGTCAACGCTA; Reverse: TAGCGTTGACACTGTTTAGCGT) was used to control for non-sequence-specific DNA-binding affinity. Bound TF-Renilla luciferase activity was measured using a luminometer (BMG Fluostar Optima) and normalized to yield TF-binding positional weight matrix as described in Hallikas and Taipale (2006). DBDs were used to determine binding profiles, as initial experiments indicated that significant differences were not observed between profiles obtained using full-length proteins or DBDs for GABPα or ETS1 (Supplementary Figure S9A). Biotinylated oligonucleotides with GGAT core sequence were used as this allowed efficient assay in all classes of ETS factors. Control experiments indicated that use of GGAT core instead of GGAA did not markedly affect results (Supplementary Figure S9B). Microwell-based assay was performed using 3–6 replicate measurements for each competing DNA sequence (see Supplementary Table S6) for all factors. Replicate measurements were compared with each other using the novel algorithm described below. Results shown represent averages from all replicates that were within minimum Kullback–Leibler divergence of 0.5 (see below).
Independent analysis of the mouse ETS family was carried out using PBMs, which analyse binding of TFs to double-stranded DNA microarrays synthesized with all possible 10 bp DNA sequences (Berger et al, 2006). The ETS TF proteins were purified from Escherichia coli or from in vitro translation reactions. Binding reactions were performed with 39–100 nM (see Supplementary Table S1) of protein using PBM array design #015681 essentially as described in Berger et al (2006).
Comparison of binding profiles was performed using a novel algorithm that determines the similarity between TF motifs using the minimum Kullback–Leibler divergence between all translations and reverse complementations of the multinomial distributions defined by the motifs. Conceptually, the TF-motif divergence measures the information gained about the DNA sequence by knowledge of having binding sites for both of the two factors. The TF-motif divergence is defined as the minimum Kullback–Leibler divergence between all translations and reverse complementations of the multinomial distributions defined by the two TF motifs. The longer motif is inserted to a sequence with background distribution and the shorter motif is slid over the background/longer motif sequence. The KL divergence is computed between the multinomial distributions defined by (1) the shorter motif and (2) the part of the background/longer motif sequence overlapping the shorter motif. The same is repeated with the background/long motif sequence reverse complemented and the minimum of the KL divergences is taken. The TF-motif divergence is symmetric but does not fulfill the triangle inequality and thus is not a metric in the mathematical sense. The TF motifs are clustered with hierarchical average linkage clustering (Mahony and Benos, 2007) based on the TF-motif divergences. The TF-motif divergence bears similarity to earlier comparison strategies (Roepcke et al, 2005; Mahony and Benos, 2007) in the use of KL divergence but as far as we know, taking the minimum is a novel feature.
Four matrices representative of the different ETS classes were selected using affinity propagation clustering (Frey and Dueck, 2007). This method does not derive an average, but identifies the matrix that is most representative of each group (these were used as the class matrices in Figure 2). The exemplar preferences were uniform on all motifs and the common preference was selected to provide pre-chosen number of clusters.
ChIP analysis of VCaP, Jurkat, HL60, SK-N-MC and CADO-ES1 was performed as described earlier (Metivier et al, 2003) with minor modifications described in Robertson et al (2007). For details and antibodies used, please see Supplementary data.
A detailed description of ChIP-seq DNA library preparation and complexity estimation (Supplementary Table S4), peak calling (Audic and Claverie, 1997; Li et al, 2008; Nix et al, 2008; Laajala et al, 2009; Pepke et al, 2009), motif analysis by MEME (Bailey and Elkan, 1994) (Figure 6D; Supplementary Figures S8 and S11), motif enrichment in peaks and peak overlap analysis is included in the Supplementary data. Peak positions (NCBI36 coordinates) are in Supplementary data files S2, S3, S4, S5, S6, S7, S8, S9, and sequencing reads are publicly available at NCBI Sequence Read Archive under accession no. SRA014231.
For siRNA knockdown of EWS/FLI1 in SK-N-MC, the individual set of four siRNAs (Qiagen) against each gene were tested for knockdown efficiency by qRT–PCR, and two most effective single siRNA were used for further experiments (SI00387716 and SI00387730, Qiagen). The selected FLI1 siRNAs, or non-targeting control siRNA (Ctrl-control_1, SI03650325, Qiagen) were transfected into cells using HiPerFect Transfection Reagent (Cat. 301704, Qiagen). The final siRNA concentration was 10 nM. After 24 h, a second identical transfection was performed, and cells were harvested 48 h later for RNA isolation.
Before expression profiling, the efficiency of downregulation of the target gene and a set of known target genes were validated using real-time PCR (Supplementary Figure S10 and not shown). Expression profiling was performed using Affymetrix human genome U133plus2.0 arrays. A detailed description of array data analysis (Smyth, 2004; Wu et al, 2004; Falcon and Gentleman, 2007) is provided in the Supplementary data.
For ChIP experiments, enrichments of immunoprecipitated DNA were analysed by Roche LightCycler and Power SYBR Green Master Mix (Applied Biosystems). Relative enrichment of target DNA fragments was determined by calculating the immunoprecipitation efficiency above fragment-specific background (IgG control) followed by normalization to the occupancy level observed in control regions (see Supplementary Table S6 for primers and control regions used).
For expression analysis of RNAi knockdown efficiency, total RNA was reverse transcribed to cDNA using the High Capacity cDNA RT Kit (ABI), using 500 ng total RNA in a 20-μl reaction. The reactions were diluted 10-fold, and gene expression levels were determined from 1 to 3 μl of the reactions using qPCR as described above. Each gene was analysed at least in triplicate and normalized against endogenous β-actin control.
Since the publication of this paper, the authors have noticed an omission in the Supplementary Information. They now provide the accession numbers for the ChIP-seq and microarray data in the replacement Supplementary Figures and Methods. There was also some corruption of the word formatted files and these have been replaced as xls files. The files were corrected on 7 July 2010
We thank Ritva Nurmi and Sini Miettinen for technical assistance, and EMBL gene core, Uppsala SNP platform and Dr Outi Monni for massively parallel sequencing. This work was supported by the EU FP6 STREP project REGULATORY GENOMICS, Sigrid Jusélius Foundation, Finnish Cancer Organizations, Biocentrum Helsinki, Academy of Finland Center of Excellence in Translational Genome-Scale Biology, and the NEURO program of the Academy of Finland. MFB and MLB were supported by grant R01 HG003985 from NIH/NHGRI to MLB.
The authors declare that they have no conflict of interest.