Genomic SELEX is a discovery tool for genomic aptamers, which are genomically encoded functional domains in nucleic acid molecules that recognize and bind specific ligands. When genomic libraries are screened with RNA-binding proteins as baits and the selected pools are analyzed by high-throughput sequencing, Genomic SELEX enables the discovery of genomic RNA aptamers and the identification of RNA–protein interaction networks. Here we describe how to construct and analyze genomic libraries, how to choose baits for selections, how to perform the selection procedure and finally how to analyze the enriched sequences derived from deep sequencing. As a control procedure, we recommend performing a “Neutral” SELEX experiment in parallel to the selection, omitting the selection step. This control experiment provides a background signal for comparison with the positively selected pool. We also recommend deep sequencing the initial library so that enrichment can be measured in silico relative to the initial levels. Counter selection procedures, using modified or inactive baits, strengthen the binding specificity of the winning selected sequences.
Genomic aptamers are defined as functional domains within genomically encoded RNA or DNA molecules that recognize and bind ligands such as small molecules, metal ions or proteins. The term “aptamer” was initially used for short synthetic RNA molecules, which were selected in vitro starting from pools of random sequences. With the discovery of naturally occurring aptamers within bacterial riboswitches, the term aptamer was extended and now includes all the metabolite-binding domains within these regulatory modules. The term “genomic aptamer” further includes a very broad spectrum of regulatory domains within RNA molecules, which sense signals as diverse as temperature, salts, metabolites, peptides and proteins. These mostly structured domains also recognize target RNAs, RNA polymerases, ribosomes, spliceosomes and many other components of the gene expression machinery. It is very probable that many more genomic aptamers with affinity for very diverse ligands exist. As a consequence, it became necessary to develop methods for the de novo discovery of aptamers encoded within genomes.
“Systematic Evolution of Ligands with EXponential enrichment”, or SELEX, is a method that was used in the 1990s to analyze RNA–protein or DNA–protein interactions [4,5]. SELEX was originally developed for the isolation of nucleic acid molecules with high affinity for small molecules, such as amino acids, enzyme cofactors and antibiotics, or for proteins. In RNA aptamer selections, a DNA pool containing random sequences is typically synthesized artificially and subsequently transcribed to give the starting RNA population. The major difference between SELEX and Genomic SELEX is the starting pool used for the selection. SELEX begins with a library of synthetically derived random DNA molecules. Genomic SELEX starts from libraries derived from genomic DNA. As a consequence, only naturally occurring aptamers encoded in the screened genome will be identified. High-affinity binding RNAs are enriched from the initial pool through multiple rounds of binding of the RNAs to a given ligand (Fig. 1). The winning molecules can be cloned and sequenced individually if only a limited number of sequences are expected. However, depending on the bait used for the selection, a large number of sequences can be expected, making high-throughput sequencing the method of choice.
In the past decades, many functional RNAs have been discovered serendipitously or via strategic screens. The strategies employed for the discovery of functional RNAs most often involved computational predictions with subsequent demonstration of the existence of the predicted RNAs, or experimental approaches including massive sequencing of cDNAs or genetic screens [6–8]. These approaches have been very successful, but they have their limitations. Computational predictions rely primarily on conservation and structural stability as signals of an active molecule, thus constraining the range of possible predictions. Sequencing of cDNAs requires that the RNAs of interest be expressed under the conditions used for RNA extraction. Genomic SELEX, combined with high-throughput sequencing, can bypass both limitations and allows the discovery of new types of RNA molecules independent of their expression levels. The availability of genomic sequences and high-throughput sequencing transforms this technique into a discovery tool for genomic aptamers.
Genomic SELEX is both alternative and complementary to in vivo and in silico approaches. When using Genomic SELEX, the enriched sequences will not represent a full size functional RNA, but the ligand-binding domain, or “aptamer”, within the putative transcript. Genomic SELEX is most useful when searching for RNAs that might be seldom expressed, such as transcripts from silenced domains in heterochromatic regions of the genome or RNAs that are only transiently expressed during distinct phases of development. It is also often used to detect DNA sequences recognized, for example, by transcription factors . Additionally, SELEX was applied to select RNA aptamers targeting RNA molecules to foster the understanding of RNA–RNA interactions. Interestingly, the most competitive aptamers were not binding through sense/antisense pairings, but instead through 3D, non-Watson–Crick interactions [11,12]. Genomic SELEX is the method of choice for screening entire genomes for DNA or RNA aptamers that bind to global regulator proteins, or any other ligands that are common targets for cellular nucleic acid interactions.
The success of a Genomic SELEX experiment depends on the choice of bait. These can be small molecules or peptides or larger proteins. The most common baits used for Genomic SELEX are nucleic acid-binding proteins with diverse affinities and specificities. Promoter regions have often been characterized by performing in vitro selection with DNA-binding transcription factors . Recently, though, new roles of RNA at all levels of gene regulation are rapidly being discovered, and we are aware of only few targets of RNA-binding proteins. Some examples which have been associated with regulatory non-coding RNAs include proteins that are involved in chromatin remodeling, transcriptional and post-transcriptional silencing, and components of machines that participate in RNA processing, transport and translation [14–17]. The use of these proteins as bait truly exploits the potential of Genomic SELEX to analyze RNA–protein interaction networks.
For the discovery of novel riboswitch aptamers via SELEX, the choice of bait, and especially the immobilization method, might pose problems. The high-resolution structures of several riboswitch aptamers in complex with their ligands revealed that these metabolites are completely surrounded by the RNA making it almost impossible to immobilize the ligand via a linker . This has to be taken into account when using small molecules as bait.
In this section, we will outline the procedure and point out some of the crucial points for a successful Genomic SELEX screen. For a detailed, step-by-step protocol, see the Nature Protocols article by Lorenz et al., and for more information on library construction, see Singer et al.
As discussed in 1.4, different types of proteins can be used as bait in Genomic SELEX. Since Genomic SELEX is a method designed to identify aptamers that are specifically bound by a given bait, its purity is of crucial importance. Enrichment of aptamers binding any contaminants is to be avoided. Hence, special effort should be put into preparation of the pure protein. The protein of interest can be used in full-length or in mutated forms.
Translational fusion to different tags (e.g. His, Flag or GST) can facilitate purification of the bait and offer alternative approaches to the selection step. In this case, however, additional controls and counter selection steps have to be considered in order to avoid artifactual enrichment. Another important aspect of protein purification is maintenance of the protein's activity. Thus, storage and binding buffers that also allow proper RNA folding must be used. In most cases, it is desirable to have the protein in its active conformation. Therefore, assays to verify its activity in the given buffer conditions should be performed, when available. If no information on the buffer required for your protein of interest is available, near-physiological conditions represent a safe starting point.
The initial stage of preparing the library for Genomic SELEX is obtaining genomic DNA. The organism encoding potential targets is selected depending on the bait and the physiological context. However, to allow mapping and further analysis of the many sequences resulting from high-throughput sequencing, it is important to use DNA of an organism or a strain that is fully sequenced. A number of protocols for the isolation of genomic DNA applicable to different organisms are established and can be used for this purpose. In addition, genomic DNA for commonly used model organisms is nowadays commercially available and can be used, if preferred. In general, any source or desired protocol can be used for genomic DNA isolation, as long as the resulting DNA is of high quality and reliably represents the genome of interest.
In Genomic SELEX, two pairs of primers called “hyb”- and “fix”-primers are used (Fig. 2A). When designing primers for SELEX, standard guidelines should be followed: primer dimer formation and self-complementarity should be avoided, and the melting temperatures of the primers should be similar. Additionally, primer design for SELEX requires somewhat special considerations that are discussed below.
The first part of the library construction consists of first- and second-strand Klenow synthesis, in which hybREV and hybFOR primers are used, respectively. Both hyb-primers consist of a unique constant sequence region that is absent in the genome, followed by ~9 randomized nucleotides at the 3′ terminus (Fig. 2). These randomized regions serve to pick random genomic regions for inclusion in the library. In order to amplify the library, fix-primers are used. The fix-primers correspond perfectly to the 5′ constant regions of respective hyb-primers (18nt FOR, 17nt REV), with addition of the T7 promoter at the 5′ end of the fixFOR primer.
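The primer layout described above can be sketched in a few lines of Python. This is a minimal illustration and not the published primer set: the function names, the placeholder constant regions and the T7 promoter string (the commonly cited consensus, assumed here) are all assumptions, and N stands for a randomized position.

```python
# Hypothetical sketch of the hyb-/fix-primer layout; all sequences are placeholders.
T7_PROMOTER = "TAATACGACTCACTATAGGG"  # commonly cited T7 promoter consensus (assumed)

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq.upper()))


def absent_from_genome(constant, genome):
    """True if the constant region occurs on neither strand of the genome."""
    constant, genome = constant.upper(), genome.upper()
    return constant not in genome and reverse_complement(constant) not in genome


def design_primers(const_for, const_rev, n_random=9, promoter=T7_PROMOTER):
    """Build 5'->3' primer strings from two constant regions."""
    return {
        "hybFOR": const_for + "N" * n_random,  # constant region + random 3' end
        "hybREV": const_rev + "N" * n_random,
        "fixFOR": promoter + const_for,        # promoter added at the 5' end
        "fixREV": const_rev,
    }


# Placeholder constant regions of 18 nt (FOR) and 17 nt (REV); not real primers.
primers = design_primers("ACGTACGTACGTACGTAC", "TGCATGCATGCATGCAT")
```

A check such as absent_from_genome would typically be run against the full reference sequence before the oligonucleotides are ordered; the placeholder constants above are not suitable as real primers.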
Introduction of flanking sequences can potentially lead to artifacts during the selection procedure. When using short libraries, the flanking primers can become part of the structural motif of the aptamer. This issue was addressed by Shtatland et al. . Oligonucleotides complementary to flanking regions can be annealed prior to the selection step, thereby blocking the formation of an artificial structure. The same work describes another approach of switching flanking sequences every few SELEX cycles. Alternatively, primer-free Genomic SELEX was developed in which primers are removed prior to each selection step and reintroduced prior to the amplification step . Such alternatives of SELEX complicate and prolong the process, and can be excessive. However, in an attempt to select for short binding motifs (20–40nt), we strongly recommend using this expanded protocol, since selection for shorter sequences may increase the chances of incorporating flanking regions as a part of the binding motif.
Several factors can contribute to differences between the recovered high-throughput sequences and the reference genomic sequence, including genotype, PCR error and sequencing error. For this reason, correct primer sequences flanking the insert are essential for filtering correct, full-length sequences. The protocol by Lorenz et al. shows hyb-primers containing the restriction sites for switching the flanking sequence described by Shtatland et al. (see previous paragraph); however, these sites are not included in the fix-primers. This led to pronounced degeneracy of the recognition sequences and complicated the analysis of our deep sequencing results. Thus, we recommend using fix-primers that include the entire fixed region of the hyb-primers, to ensure the fidelity of the final primers and proper alignment of the whole insert.
In Genomic SELEX, the primary aim of constructing a genomic library is to give every potential genomic aptamer region equal candidacy for selection, and therefore have complete coverage of the genome and amplify it evenly. Additionally, the lengths of the fragments should correspond to those of the type of genomic aptamers being targeted. For example, in an attempt to identify novel non-coding RNAs of Escherichia coli binding to Hfq, Lorenz and colleagues aimed for selection of molecules ranging from 50 to 500nt in size, lengths corresponding to the range of known ncRNAs in E. coli .
After obtaining high quality genomic DNA, Klenow reactions follow (Fig. 2). The first strand is synthesized with the hybREV primer, which is annealed to the genomic DNA, preceding the extension reaction. In order to minimize the number of fragments flanked with the same sequence on both sides, the excess of hybREV should be thoroughly removed prior to second-strand synthesis with hybFOR. Size selection follows, according to the desired fragment lengths, usually via extraction of fragments of the desired size range from polyacrylamide gels. Preparation of the genomic library finishes with the introduction of the T7 promoter or other desirable promoter through PCR amplification using the fix-primers.
The genomic library used in Genomic SELEX should be representative of the genome. It is therefore advisable to test the quality of the library prior to the first SELEX cycle by performing a PCR with several primer pairs giving amplicons that correspond to the length of fragments in the library (Fig. 3A). The presence of any genomic region can be tested, however single-copy regions are easiest to evaluate. In addition, size variation can be checked by performing PCR using fixREV or fixFOR in combination with a gene specific primer (Fig. 3B and C). In a high quality library, such PCRs should result in a set of products covering a large range of sizes with single-nucleotide resolution. The library quality can also be confirmed by high-throughput sequencing and should be sequenced together with the enriched pools, as discussed later.
Some artifacts from the library construction will arise. For one, while the length distribution is ostensibly determined by the gel excision step, it should be noted that the fragments become shorter with subsequent rounds of selection. In the above-mentioned selection for Hfq-binding aptamers, the final pool averaged 65nt in length, owing to the fact that it is faster to amplify shorter inserts . Additionally, since genomes are not random, random priming will result in overrepresentation of regions whose flanking n-mers (n representing the length of the random region) are less frequent in the genome.
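To get a feeling for how uneven the n-mer landscape of a genome is, and hence how strongly random priming might be biased, one could tabulate n-mer frequencies in the reference sequence. The sketch below is only an illustration under simple assumptions: it counts overlapping n-mers on the given strand only, and the function names are hypothetical.

```python
from collections import Counter


def kmer_counts(genome, k=9):
    """Count every overlapping k-mer on the given strand of the genome."""
    genome = genome.upper()
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def kmer_frequencies(genome, k=9):
    """Return each observed k-mer's share of all k-mer positions."""
    counts = kmer_counts(genome, k)
    total = sum(counts.values())
    return {kmer: count / total for kmer, count in counts.items()}
```

For a full analysis the reverse strand would need to be counted as well, and the resulting frequencies could then be compared with the n-mers actually found next to the primers in the sequenced library.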
We also found in our E. coli genomic library that the in silico predicted melting temperatures were on average slightly lower than random primers, irrespective of the genome content (Fig. 4). This could be due to the preferred incorporation of more stable DNA–DNA interactions. As a result of biased priming, we saw differences in nucleotide content and stability of sequences as compared to the genomic average. The data nonetheless indicate that the genome was completely covered, although some regions were better covered than others. Therefore, in order for the results to have demonstrative significance, only thoroughly enriched sequences should be considered, and the enrichment should be measured as relative to the genomic library.
The first step of each round of Genomic SELEX is the construction of the RNA pool by in vitro transcription of the DNA library. This should be done with one radiolabeled nucleotide to track the quantities of RNA before and after selection. It is crucial that the template DNA is eliminated by thorough DNase treatment to avoid selection and subsequent enrichment of DNA fragments. Next, the selection, or binding step, follows. The exact condition in which binding is performed varies and is to be optimized for each protein. Moreover, it is possible to aim for a certain binding affinity through introduction of a known binder as a competitor, varying the RNA-to-protein ratio (10:1 is a good starting point), time of incubation and temperature of the binding reaction. The salt concentration of the buffer should still maintain proper activity of the bait protein.
Formed RNA–protein complexes are usually separated from unbound species through nitrocellulose membrane filtration [19,21]. There are also other methods to separate complexes and unbound nucleic acids, such as chromatographic separation or gel shift assays . An antibody can also be used to immobilize the bait. Similarly, an immobilization matrix coupled to an appropriate ligand can be used to separate complexes, if the bait protein is fused to a tag (Olga Bannikova and Andrea Barta, pers. comm.).
Following each round of selection, the recovery rate of RNA should be measured with scintillation counting . The amount of RNA recovered is an indicator of enrichment of binding species in the pool. If very competitive aptamers are desired, the stringency should be increased, for example by lowering the concentration of both RNA and protein in the buffer. As shown in Fig. 5, increasing stringency will not only homogenize the pool, but also provide a pronounced stratification of the binding species.
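The bookkeeping behind the recovery rate is simple arithmetic: the counts recovered after selection divided by the counts applied, both corrected for background. A minimal sketch follows; the function name, the background correction and the example counts are assumptions made for illustration and are not measured values.

```python
def recovery_fraction(cpm_recovered, cpm_input, cpm_background=0.0):
    """Fraction of labeled RNA recovered after selection, background-corrected."""
    net_input = cpm_input - cpm_background
    if net_input <= 0:
        raise ValueError("input counts must exceed background")
    net_recovered = max(cpm_recovered - cpm_background, 0.0)
    return net_recovered / net_input


# Made-up (input cpm, recovered cpm) pairs for three consecutive rounds.
rounds = [(100000, 500), (100000, 2000), (100000, 12000)]
fractions = [recovery_fraction(rec, inp) for inp, rec in rounds]
```

A rising fraction over successive rounds would be read as enrichment of binding species; a sudden drop after a change in stringency is expected and should not by itself be taken as failure.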
To avoid recovery of any unspecific aptamers, it is beneficial to introduce a counter selection step prior to the selection step. This refers to preclearing of the nucleic acid pool by applying it to the immobilization matrix (e.g. nitrocellulose, column, matrix coupled to ligands) to avoid enriching nucleic acid sequences that bind to the matrix. Additionally, performing a counter selection against inactive forms of the bait enables the detection of more specific aptamers and increases their discriminatory potential (see the examples of theophylline and streptomycin [25,26]). Counter selection steps will be useful when analyzing antagonistic baits.
After the separation of binding and non-binding sequences, the recovered RNAs are reverse transcribed. To minimize the loss of structured aptamers from the pool, the initial denaturation step via melting should not be omitted, and the reaction should be performed at elevated temperatures. Accordingly, the enzyme of choice should maintain its activity at temperatures between 50 and 60 °C. If the experimental design calls for the selection of longer aptamers, reverse transcription conditions should be adjusted in such a way that the synthesis of long cDNA from structured templates is possible and effective.
Each cycle of SELEX concludes with the amplification of the selected pool using the fixREV and fixFOR primers. Because the fidelity of the polymerase is imperfect, PCR can introduce mutations, which can be favored over the initial sequences in further selection steps. Enrichment of such artificial aptamers should be avoided in Genomic SELEX. Therefore, in order to minimize artificially introduced divergence among pools of different cycles, it is recommended that the polymerase has pronounced proofreading activity. In order to obtain a product amount optimal for subsequent cycles of SELEX, 7–10 cycles of PCR are recommended. Too many cycles should be avoided because the heterogeneity of the pool can result in dimerization of incomplete products when the primer-to-template ratio decreases, resulting in extended, chimeric products.
It is highly desirable to perform parallel repetitions of Genomic SELEX using the same selection method, and possibly several different methods of selection and immobilization of nucleic acid–protein complexes. The intersection of these multiple data sets will result in a more reliable aptamer pool. Optionally, parallel screens can be performed using different mutant forms of baits.
It was shown that the amplification steps of SELEX (in vitro transcription, reverse transcription and PCR, that is, everything but selection itself) introduce some additional requirements on the sequences . These constraints were evaluated by performing a Neutral SELEX experiment, in which the selection steps are excluded. Beginning with the same E. coli genomic library used to select for Hfq-binding sequences , 10 rounds of Neutral SELEX were performed, and the pools from each round were subjected to 454 sequencing. In comparison to the initial library, the sequences shifted towards higher free energy, suggesting that highly structured RNAs are at a disadvantage. There was also a trend towards a decrease in guanosines and an increase in adenosines. This experiment can easily be performed in parallel to any Genomic SELEX experiment in order to facilitate the subsequent analysis of signal vs. noise in the sequences.
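The nucleotide trends mentioned above can be followed with a simple composition comparison between the initial library and a Neutral SELEX pool. The sketch below is an illustration only: the function names are hypothetical, reads are assumed to be plain DNA strings, and the free energy comparison is not covered because it would require an RNA folding program.

```python
from collections import Counter

BASES = "ACGT"


def base_composition(sequences):
    """Return the fraction of each base over all sequences (A/C/G/T only)."""
    counts = Counter()
    for seq in sequences:
        counts.update(base for base in seq.upper() if base in BASES)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no A/C/G/T bases found")
    return {base: counts[base] / total for base in BASES}


def composition_shift(library_seqs, pool_seqs):
    """Per-base change in fraction from the library to the pool."""
    library = base_composition(library_seqs)
    pool = base_composition(pool_seqs)
    return {base: pool[base] - library[base] for base in BASES}
```

A negative value for G together with a positive value for A in composition_shift would correspond to the trend described for the Neutral SELEX pools.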
When performing Genomic SELEX with global regulator proteins, the final pool should contain genomic aptamers originating from many regions of the genome. The most interesting readouts of the experiment are the levels of enrichment of each aptamer, and how they compare to that of the initial pool. In order to fully accomplish this, high-throughput sequencing is necessary. Given that this is a costly undertaking, it is best to first clone and sequence some of the pool to ensure that the pool is diverse (i.e. the strongest binder did not compete away all other aptamers) and that the sequences are not significantly contaminated. If the final pool appears too homogeneous, a previous round of selection may suffice.
In addition to sequencing the final pool, it can be useful to sequence the initial genomic library. This information, along with sequence data from Neutral SELEX (see Section 2.3), can be used directly to determine the background nucleotide distributions and possible biases towards enrichment at particular positions or sequence regions (repetitive sequences, for example) .
454 sequencing has been used , with the primary advantage over Illumina sequencing being its ability to recover full-length sequences . The common approach to the Illumina short read problem is to find pairs of “peaks”, regions containing several short reads, presumably representing the ends of an experimental sequence. When a natural transcript is pulled down from an in vivo situation, one can annotate the peaks with the nearest known relevant genomic feature [28,29]. This is unsatisfactory for Genomic SELEX, since the sequences are not originating from natural transcripts. Additionally, many full-length sequences will have varying endpoints, which can provide interesting analyses, e.g. in silico boundary mapping of minimal required motifs (Fig. 6A). However, newer Illumina machines now support in situ reversal of templates to generate paired-end reads, potentially allowing full-length annotation of sequences in the analysis, offering orders-of-magnitude higher coverage at a good price.
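The idea of in silico boundary mapping can be illustrated with full-length reads that share a cluster but differ in their endpoints: the region common to all of them is a candidate for the minimal required motif. The following sketch assumes half-open (start, end) coordinates on one strand and uses a hypothetical function name; requiring every read to contain the core is a deliberately strict simplification, and a depth threshold may be more robust in practice.

```python
def common_core(reads):
    """Return the (start, end) interval shared by all reads, or None.

    reads: list of (start, end) tuples with half-open coordinates,
    all mapped to the same cluster and strand.
    """
    if not reads:
        return None
    start = max(read_start for read_start, _ in reads)
    end = min(read_end for _, read_end in reads)
    return (start, end) if start < end else None
```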
Mapping of high-throughput sequences to the corresponding genome should be done with a suffix array- or Burrows–Wheeler-based aligner. These algorithms are both faster and more sensitive than the popular BLAT. vmatch has been used; however, new short read alignment programs are being developed at a dizzying pace due to high demand. Many of these have been thoroughly reviewed by Flicek and Birney. Since 454 matches are of highly varying lengths, e-value matching statistics should be used to filter good matches. On the other hand, Illumina reads can be mapped without full alignment, for example with Bowtie or BWA. The impressive feature of these last two programs is that there is no need for extensive computing power: a full lane can be mapped on a single desktop machine in hours.
Clusters of sequences should then be generated. The most straightforward approach is to include any sequence in a group, based on overlap. When ranking the sequences by enrichment, it may suffice to simply use the read count for each cluster, however, in cases where repetitive sequences are counted, or reads are not mapped uniquely, some thought needs to be put into the ranking. The best solution is to sequence and map the genomic library and/or Neutral SELEX pools and compare the relative enrichment of the aptamers.
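Overlap-based clustering and enrichment relative to the library can be sketched as follows. This is one possible reading of the procedure, under stated assumptions: reads are given as (chromosome, start, end) tuples with half-open coordinates, strand is ignored unless it is folded into the chromosome key, and the pseudocount that guards against division by zero is an addition not taken from the text. All names are hypothetical.

```python
def cluster_reads(reads):
    """Merge overlapping mapped reads into clusters.

    reads: iterable of (chrom, start, end) with half-open coordinates.
    Returns a list of [chrom, start, end, read_count].
    """
    clusters = []
    for chrom, start, end in sorted(reads):
        last = clusters[-1] if clusters else None
        if last and last[0] == chrom and start < last[2]:
            last[2] = max(last[2], end)  # extend the current cluster
            last[3] += 1
        else:
            clusters.append([chrom, start, end, 1])
    return clusters


def relative_enrichment(selected_count, selected_total,
                        library_count, library_total, pseudocount=1.0):
    """Ratio of a cluster's read share in the selected pool to its
    share in the library (or Neutral SELEX) pool."""
    selected = (selected_count + pseudocount) / (selected_total + pseudocount)
    library = (library_count + pseudocount) / (library_total + pseudocount)
    return selected / library
```

Ranking clusters by relative_enrichment rather than by raw read count is what allows the priming and amplification biases of the library to be factored out.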
Specific RNA binding depends on both the sequence and the structure of the folded molecule. Interactions can often take place as a result of direct hydrogen bonding between the Watson–Crick groups of the aptamer and the ligand. Motif finding can provide insight into whether the SELEX pool has been enriched as the result of such sequence recognition. The problem of motif finding has received considerable attention for decades, yet it remains a complicated task. Aside from the minutiae of parameter tweaking, most algorithms are not designed to handle high-throughput quantities of sequences. Therefore, it is only practical to use the clustered sequences, and furthermore to limit the sequences to ones showing some enrichment. Most motif finders also allow the input of weights for each of the sequences, and absolute enrichment would be a good way to account for overrepresentation of motifs.
Another interesting statistic of genomic aptamers is their location with respect to genomic features. It is also important to take into account enrichment level. However, enrichment profiles of clusters are not always even. An outlying sequence can overlap a cluster by chance, extending it into a perhaps insignificant region of the sequence. A good trick to speeding up these analyses is to include an “enrichment sequence” parallel to each cluster (sometimes called a “signal map”), indicating the number of reads aligned at each position of each cluster (Fig. 6). This can be helpful in other analyses, such as finding enrichment of regions surrounding gene starts and stops, or finding maximally enriched subsequences of clusters. When annotating the clusters, for example, it is best to count base-by-base the enrichment depth within each feature, and annotate with the feature containing the majority of the enrichment depth.
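A per-position signal map and the depth-based annotation described above might look like the following sketch. Coordinates are assumed to be half-open, the function and parameter names are hypothetical, and "majority of the enrichment depth" is interpreted here as the feature with the largest summed depth, which is an assumption.

```python
def signal_map(cluster_start, cluster_end, reads):
    """Number of reads covering each position of a cluster.

    reads: iterable of (start, end) with half-open coordinates.
    """
    depth = [0] * (cluster_end - cluster_start)
    for start, end in reads:
        for pos in range(max(start, cluster_start), min(end, cluster_end)):
            depth[pos - cluster_start] += 1
    return depth


def annotate_by_depth(cluster_start, depth, features):
    """Name of the feature holding the largest share of the depth.

    features: iterable of (name, start, end). Returns None if no
    feature overlaps a covered position.
    """
    best_name, best_depth = None, 0
    for name, feat_start, feat_end in features:
        lo = max(feat_start, cluster_start) - cluster_start
        hi = min(feat_end, cluster_start + len(depth)) - cluster_start
        covered = sum(depth[lo:hi]) if hi > lo else 0
        if covered > best_depth:
            best_name, best_depth = name, covered
    return best_name
```

Because the depth is stored once per cluster, the same list can be reused for other questions, such as locating the maximally enriched subsequence or summing depth around gene starts and stops.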
When comparing relative levels of enrichment for specific genes, gene types or repeats, normalization is essential, since it is more likely for a read to map to a longer gene by chance than a shorter one. A normalization analysis was developed for the RNASeq method [29,36]. This was further developed in the Cufflinks automated alignment to assembly program . It is not hard to envision extending this into feature types, such as introns, exons, repeat regions and so on.
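One widely used length normalization from RNA-Seq analysis is reads per kilobase per million mapped reads (RPKM). Whether this is exactly the normalization used in the cited work is an assumption here; the sketch only shows the arithmetic, with hypothetical function and parameter names.

```python
def rpkm(read_count, feature_length, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    if feature_length <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total read count must be positive")
    return read_count * 1e9 / (feature_length * total_mapped_reads)
```

With this scaling, a 2000 nt gene with 200 reads and a 1000 nt gene with 100 reads receive the same value, so the longer gene is no longer favored simply for offering more positions to map to. The same function can be applied to feature types by summing read counts and lengths over all introns, exons or repeat regions.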
While many analyses can be run automatically, visualization can be invaluable (Fig. 6). The clusters can be neatly displayed in a genome browser, the easiest being the UCSC Genome Browser; alternatively, a custom browser can be constructed relatively easily with a package such as GBrowse [39,40].
To verify the biological significance of the sequences enriched by Genomic SELEX, further characterization of aptamers of interest follows. Since many novel transcripts will potentially be identified, thorough characterization of such transcripts using common biochemical methods is necessary. The main question is whether the isolated motifs are contained within expressed transcripts. This might be difficult to determine, as the period of expression might be difficult to find. It is helpful to test whether the sequences are conserved in related organisms and to search for promoters in the vicinity. If an antibody for the bait is available, co-immunoprecipitation is a good way to test the expression of the transcripts and their interaction with the bait in vivo. To determine the precise interaction site between the selected aptamer and the bait, chemical probing and footprinting will give reliable results.
Genomic SELEX is a valuable approach when screening for RNA domains that are targeted by nucleic acid-binding proteins with a broad range of binding partners. In the ongoing campaign to discover the many roles of RNA in regulatory networks, we envision that this technique will be a key player. The crucial advantage of Genomic SELEX is that it enables the discovery of transiently expressed RNAs that may be either self-regulating or under tight regulation due to their potency. Discovering several of these genomic aptamers would require painstaking preparations and many expensive runs of high-throughput sequencing in an in vivo context, if they can be pulled down at all.
Performing Genomic SELEX with an artificial genomic library is essential to reap the in vitro advantage, because preparations of total RNA will have an expression-level bias. Importantly, though, this means that sequences pulled down are not naturally generated. For this reason, the combined use of Genomic SELEX with an in vivo analog such as HITS–CLIP  can be symbiotic: CLIP sequences that overlap SELEX genomic aptamers are immediately confirmed as an in vivo binding partner, and SELEX genomic aptamers that overlap CLIP sequences help determine the competitiveness of the binding and also a potential binding subsequence.
There have been an incredible number of optimizations to the SELEX protocol in recent years , and nearly all of them are applicable to the Genomic SELEX method. However, since Genomic SELEX drastically decreases the diversity of the initial library, many of these biases will not play such a critical role in selectivity. Future optimizations for Genomic SELEX will be most fruitful in experimenting with selection assays, such as alternative immobilization and additional counter selection techniques, as well as differential analysis of multiple parallel selections against related proteins.
Work in our laboratory is funded by the Austrian Science Fund FWF Grant Z-72 to R.S., the doctoral program in RNA Biology W1207 to B.Z. and the Austrian GEN-AU Project D-110421-11 to R.S.