Studying genetic variation in the human genome is important for understanding phenotypes, diseases, drug responsiveness and the mechanisms of complex traits (6). For many applications, only a small part of the genome, such as specific genes or regulatory regions, is of interest (44). Current methods for selective enrichment of genomic regions followed by next-generation sequencing are based on PCR or hybridization approaches (15). These methods face size limitations, particularly when linking variations separated by more than a few hundred base pairs, and have further limitations in duplicated and repetitive regions.
The recombineering strategy presented here is useful for targeted isolation of genomic regions in a vector format that allows for rapid adaptation to functional analysis based on gene targeting (27) or transgenesis (30). A similar approach to isolate genomic regions in BACs has been published recently (46). We use fosmids, because they are easy to handle, stable, suitable for genomic structural variation studies (2) and preparation of targeting constructs. Most importantly, compared to BAC libraries, fosmid library construction requires much less genomic DNA, which is a major consideration when the source of DNA is a patient sample.
To increase the targeting efficiency and thereby the complexity of the pools from which a specific region can be retrieved, we engineered a new strain that allows for switching from unidirectional to bidirectional fosmid replication. In this way, we exploit the additional increase in recombineering efficiency that results from the higher fosmid copy number after TrfA induction. This improved the isolation of genomic regions of choice from complex fosmid pools. The very low levels of illegitimate recombination reduced the need to screen through a large number of clones to obtain the desired region. The number of recombinants varied between the captured loci, possibly reflecting the different replication speeds of the individual clones within the pools. Variability in the number of recombinants for several E. coli chromosomal locations has previously been correlated with the rate of replication of the regions (26).
Previously, a method to screen genomic libraries by recombineering was reported (47). However, this method does not appear to have been subsequently utilized, possibly because the complex counter-selection strategy imposed practical difficulties. Similarly, our previous experience with genomic cloning by recombineering (25) indicated certain practical limits to lambda Red recombination in complex backgrounds. Hence, we adapted a recombineering method to optimally sized pools of cloned genomic regions.
Fine-tuning the expression levels of the recombineering proteins not only improved the recovery of target clones but also likely contributed to the successful isolation of intact, highly repetitive regions. Indeed, previous work has shown that overexpression of Redγ from a plasmid can increase the total number of colonies, but the frequency of correct recombinant BACs was low (48). Transient RecA co-expression from a plasmid has previously been shown to enhance the total number of colonies surviving electroporation (32), but leaky expression of RecA could raise the basal level of unintended intramolecular rearrangements. For this reason, we expressed RecA from the genome, together with the Red operon, using the tightly controlled PRha promoter.
The extent of variation within human genomes is now being revealed by SNP maps and massively parallel sequencing (1–4). However, knowledge about the ‘haplotype phasing’ in different genomes has been scarce (8). Two recently published methods for genome-wide resolution of the haplotypes (49) pave the way to systematically study haplotype phasing in individual genomes and cell lines. Our approach is complementary to these studies and allows for the determination of SNP linkage, and therefore of disease susceptibility, throughout the selected regions covered by fosmid clones. In this way, we reconstructed haplotypes at loci on chromosomes 6, X and 10 of the H7 hES cell line. Comparative analysis between the H7 and Shef4 OCT4 haplotypes revealed differences at 12 SNP positions, and most of the identified indels (13 of 16) were cell-line specific. These variations were found in more than one independent clone and therefore represent true polymorphisms of the cell lines.
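The criterion used above, that a variation counts as a true polymorphism only when it is seen in more than one independent clone, can be sketched in a few lines of Python. This is an illustrative sketch only, not the analysis pipeline used in this study: the function name `confirmed_variants`, the clone identifiers and the variant tuples are all hypothetical.

```python
from collections import Counter

def confirmed_variants(clone_calls, min_clones=2):
    """Return the variants seen in at least `min_clones` independent clones.

    `clone_calls` maps a clone identifier to the variants called in that
    clone; each variant is a hashable tuple such as (position, ref, alt).
    """
    support = Counter()
    for variants in clone_calls.values():
        # Count each variant at most once per clone.
        support.update(set(variants))
    return {variant for variant, n in support.items() if n >= min_clones}

# Hypothetical example data (not taken from the study).
example_calls = {
    "clone_A": [(101, "A", "G"), (250, "C", "T")],
    "clone_B": [(101, "A", "G")],
    "clone_C": [(101, "A", "G"), (400, "G", "GA")],
}
confirmed = confirmed_variants(example_calls)
```

In this toy input, only the variant at position 101 is supported by more than one clone; the other two would be set aside as possible cloning or sequencing errors until confirmed independently.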
Whole-genome sequencing shows that structural variations smaller than 50 kb account for a large portion of the polymorphism identified in individual human genomes (1). Most of these events are enriched in or near repeated and segmentally duplicated regions, and difficulties in resolving them have been reported by different investigators (5). Using the targeted retrieval of clones, we were able to distinguish between highly similar sequences such as NANOG and its pseudogene NANOGP1. Once isolated, such regions can be further characterized by sequencing at very high depth. This allows the description of their polymorphisms at single-nucleotide resolution.
Exploring the impact of mutations and characterizing them as benign or disease associated can be achieved through gene targeting in stem cells (51) with isogenic constructs. Our approach permits the generation of such constructs with personal-genome-specific combinations of variations. The isogenicity of the flanking homologous sequences is an important issue. First, it could promote the targeting efficiency in human ES cells, as was shown for mouse ES cells (48). Second, bearing in mind that SNPs may influence transcription factor binding and gene expression (9), targeting with isogenic vectors should not disturb the existing genomic context. This will be useful for gene editing in stem cell-based therapies.
We identified two novel allele-specific SNPs located in regulatory regions on one of the X chromosomes in the H7 cell line at the MECP2/IRAK1 loci. The biological significance of these polymorphisms is not known. The whole-genome ENCODE analysis of the male H1 hES cell line indicates that the two SNPs are located in an enhancer and a promoter, where c-Myc and Pol2 bind, respectively. The SNPs lie in CpG dinucleotides and thus may influence the binding of regulatory proteins or the methylation status of the two alleles.
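Whether a SNP creates or destroys a CpG dinucleotide can be checked directly from the flanking sequence. The sketch below is a minimal illustration under stated assumptions: the function `snp_in_cpg`, the short sequence and the alleles are hypothetical and are not the MECP2/IRAK1 variants reported here.

```python
def snp_in_cpg(sequence, index, alleles):
    """Report, for each allele, whether a CpG dinucleotide spans `index`.

    `sequence` is the reference strand (5' to 3'), `index` is the 0-based
    SNP position, and `alleles` lists the bases observed at that position.
    """
    result = {}
    for allele in alleles:
        # Substitute the allele into the reference sequence.
        seq = (sequence[:index] + allele + sequence[index + 1:]).upper()
        # Any "CG" within this 3-base window necessarily includes `index`.
        window = seq[max(index - 1, 0): index + 2]
        result[allele] = "CG" in window
    return result

# Hypothetical example: a C/T SNP at the C of a CpG.
cpg_status = snp_in_cpg("TTACGTT", 3, ["C", "T"])
```

Here the C allele retains the CpG and the T allele removes it, so only one of the two alleles would remain a potential methylation site at that position.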
The high fidelity of Red/ET recombineering demonstrated in this and previous studies allows further scale-up of the method to a high-throughput liquid format (30) for simultaneous isolation of multiple loci. For example, the method can be used to develop screening assays for the isolation of regions affected by mobilized retrotransposons or other repetitive elements in personal genomes. Recently, numerous novel active retrotransposons were identified in the human genome (12). Although they are underrepresented in the reference sequence, they exist at low allele frequencies in the population and can be a source of disease-producing insertions.
This method can also simplify the acquisition of DNA regions from model organisms or from metagenomic studies of environmental samples. The approach is straightforward and does not require any special equipment or complicated computational analysis. Because it is flexible and has many potential applications, we recommend it to a wide range of researchers.