Genome-wide analysis of single nucleotide polymorphism (SNP) markers is an extremely efficient means for genetic mapping of mutations or traits in mice. However, this approach often defines a relatively large recombinant interval. To facilitate the refinement of this interval, we developed the program SNP2RFLP. This program can be used to identify region-specific SNPs in which the polymorphic nucleotide creates a restriction fragment length polymorphism (RFLP) that can be readily assayed at the benchtop using restriction enzyme digestion of SNP-containing PCR products. The program permits user-defined queries that maximize the informative markers for a particular application. This facilitates fine-mapping in a region containing a mutation of interest, which should prove valuable to the mouse genetics community. SNP2RFLP and further details are publicly available at http://genetics.bwh.harvard.edu/snp2rflp/
The positional cloning and characterization of mutations in the mouse is a powerful means for functional annotation of the mammalian genome. Many mouse gene mutations cause phenotypes that serve as models of human genetic disorders. Mapping and positional cloning of these mutations can accelerate our understanding of the mouse gene, its human ortholog, and the underlying etiology of the disorder. The use of single nucleotide polymorphism (SNP) markers has markedly facilitated genetic mapping because they are abundant throughout the genome and can be analyzed in a high-throughput manner using automated technology (Wang et al. 1998). However, mutation mapping analysis using a genome-wide SNP panel does not generally yield high-resolution localization (Moran et al. 2006), and “benchtop” technologies for fine-mapping using SNPs and microsatellite markers are often inefficient. We have developed a web-based tool we call SNP2RFLP, which can extract region-specific SNPs from the dbSNP database (Sherry et al. 1999) and identify those SNPs that would create restriction fragment length polymorphisms (RFLPs) when assayed by restriction enzyme digestion of SNP-containing PCR products. The inputs to SNP2RFLP are the two mouse strains used in the cross, the chromosomal region, and a user-defined set of restriction endonucleases. SNP2RFLP extracts the SNPs from dbSNP that are polymorphic between the two strains in the region in question. The program simulates a restriction digest of the SNP-containing sequences with each enzyme to determine whether the SNP creates an RFLP. Informative markers are then analyzed using Primer3 (Rozen and Skaletsky 2000), which finds suitable PCR primers surrounding the SNP. The output of SNP2RFLP is the set of informative SNPs that create RFLPs, together with the forward and reverse PCR primers for each. This information can then be used to readily perform the RFLP assays and further refine the region containing the mutation of interest.
A local PostgreSQL database was constructed to hold all mouse SNPs from the NCBI dbSNP (Mouse Build 126) along with their flanking sequences. The database contains 8 million unique mouse SNPs, with 200–400 bp of flanking sequence for each SNP. SNP-containing flanking sequences were analyzed by Primer3, which identifies optimal PCR primers surrounding each SNP that meet standardized criteria for product size, primer melting temperature (Tm) (~60°C), and GC content (~50%) (Rozen and Skaletsky 2000). These forward and reverse primers are stored in the database along with each SNP.
There are 68 million known strain genotypes for the SNPs in the database, which holds genotype data for 99 different mouse strains. Seventeen strains, including A/J, DBA/2J, 129S1/SvImJ, C3H/HeJ, BALB/cByJ, AKR/J, NZW/LacJ, CAST/EiJ, BTBR T+ tf/J, WSB/EiJ, FVB/NJ, NOD/LtJ, KK/HlJ, PWD/PhJ, MOLF/EiJ, C57BL/6J, and 129X1/SvJ, were interrogated using a high-density array and each has approximately 2–6 million SNP genotypes (Sherry et al. 1999). The other 82 strains have only on the order of hundreds or thousands of SNP genotypes.
Restriction digest simulation is done by scanning each SNP-containing sequence for the recognition sites of select restriction enzymes. A SNP is considered to result in an informative RFLP assay if an enzyme site is found in the sequence of one strain but not in the other strain due to the alteration of the restriction site by the polymorphism. The default enzymes are AluI, AflII, ClaI, DdeI, EcoRV, Fnu4HI, HaeIII, HhaI, HinfI, KpnI, MboI, MseI, MspI, PstI, PvuI, PvuII, RsaI, SacII, SalI, ScaI, ScrFI, and Sau96I. This list comprises efficient, frequently cutting restriction enzymes that have a high probability of providing a robust RFLP assay for any given SNP. In addition, the user can select an option that includes all the enzymes in the simulated restriction digest. Analysis of the number of restriction enzyme sites within a given amplicon is performed to avoid assays with very high complexity or very small size differences of restriction fragments. All the restriction enzymes and recognition sequences used by SNP2RFLP were obtained from the restriction enzyme database (REBASE) (Roberts et al. 2003).
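The informativeness test described above can be illustrated with a minimal Python sketch. It is not the SNP2RFLP source code: the function names, the single-base alleles, and the small enzyme subset are assumptions made for illustration, and only the top strand is scanned, which is sufficient here because the listed recognition sequences are palindromic.

```python
import re

# IUPAC ambiguity codes, needed for degenerate sites such as HinfI (GANTC).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

# Recognition sequences for a few of the default enzymes (illustrative subset).
SITES = {
    "AluI": "AGCT", "HaeIII": "GGCC", "HinfI": "GANTC",
    "MspI": "CCGG", "RsaI": "GTAC", "DdeI": "CTNAG",
}

def site_positions(seq, site):
    """Return 0-based start positions of every match, including overlapping ones."""
    pattern = re.compile("(?=" + "".join(IUPAC[base] for base in site) + ")")
    return [m.start() for m in pattern.finditer(seq.upper())]

def informative_enzymes(left_flank, allele_a, allele_b, right_flank, sites=SITES):
    """Map each informative enzyme to the allele it cuts.

    An enzyme is informative when a recognition site spanning the SNP
    position is present for exactly one of the two alleles.
    """
    snp = len(left_flank)  # index of the polymorphic base
    result = {}
    for enzyme, site in sites.items():
        cut = []
        for allele in (allele_a, allele_b):
            seq = left_flank + allele + right_flank
            spans_snp = any(p <= snp < p + len(site)
                            for p in site_positions(seq, site))
            cut.append(spans_snp)
        if cut[0] != cut[1]:
            result[enzyme] = allele_a if cut[0] else allele_b
    return result

# A C/T SNP: the C allele completes an AluI site (AGCT), the T allele does not.
print(informative_enzymes("TTTAG", "C", "T", "TAAA"))  # {'AluI': 'C'}
```

Requiring the site to span the SNP position, rather than merely occur somewhere in the flanking sequence, is what separates a polymorphic site from a constant one that cuts both alleles identically.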
To avoid nonspecific amplification for a given SNP, the surrounding sequence for each SNP was queried for the presence of known repetitive elements and simple and complex repeats using RepeatMasker, which “masks” these sequences with “N”s (http://www.repeatmasker.org/). The genomic locations of these premasked sequences are stored with each SNP so the user can decide whether to discard SNPs that fall in repeat regions.
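As a rough illustration of how masked positions could be used to flag SNPs in repeats, the Python sketch below locates runs of "N" in a masked flanking sequence and tests whether a SNP index falls inside one. The function names and the optional `margin` parameter are hypothetical, and the sketch assumes that every "N" in the sequence comes from masking rather than from an ambiguous base call.

```python
import re

def masked_intervals(seq):
    """Half-open (start, end) intervals covering each run of N in a masked sequence."""
    return [m.span() for m in re.finditer(r"N+", seq.upper())]

def snp_in_repeat(seq, snp_index, margin=0):
    """True if the SNP lies inside, or within `margin` bases of, a masked run."""
    return any(start - margin <= snp_index < end + margin
               for start, end in masked_intervals(seq))

masked = "ACGTNNNNNACGTACGT"
print(masked_intervals(masked))          # [(4, 9)]
print(snp_in_repeat(masked, 6))          # True
print(snp_in_repeat(masked, 12))         # False
```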
The input to SNP2RFLP is the two mouse strains used in the cross, the chromosomal region (as defined by base pairs), and a set of restriction endonucleases. A default list of 22 commonly used restriction endonucleases with frequently occurring recognition sites is used by SNP2RFLP to simulate restriction digestion, but additional enzymes can be selected from a list of 1300 endonucleases.
SNP2RFLP extracts the SNPs from dbSNP that are polymorphic between the two strains in the region in question. SNP2RFLP then simulates a restriction digest of the SNP-containing sequences with each selected enzyme to determine whether the SNP lies within one or more enzyme recognition sites and thereby creates an RFLP. That is, a SNP-containing sequence is scanned to see whether the recognition sequence for any particular enzyme contains the SNP and is found for one strain but not the other because the SNP alters the recognition sequence. If this is the case, the SNP is considered informative because the alleles can be distinguished by amplifying the region with PCR, digesting the products with the enzyme, and examining the resulting restriction pattern after agarose gel electrophoresis of the digested product (Fig. 1). Informative SNPs are listed and are accompanied by suggested oligonucleotide primer sequences for PCR amplification of the SNP (extracted from data stored in the database for each SNP), the position of the primers with respect to the SNP, and the number of enzyme recognition sites present in the amplified sequence. All of these data can be viewed in a web-based display (Fig. 2) or exported as a spreadsheet document.
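To make the expected gel pattern concrete, the following Python sketch computes the fragment sizes produced by complete digestion of a linear PCR product. It is an illustration under stated assumptions, not the program's implementation: the function name is hypothetical, the site is matched exactly (no degenerate bases), and `cut_offset` is the number of bases from the start of the site to the top-strand cut (2 for AluI, which cuts AG^CT).

```python
def fragment_lengths(amplicon, site, cut_offset):
    """Fragment sizes (bp, top strand) after complete digestion of a linear product."""
    amplicon = amplicon.upper()
    cuts = []
    start = amplicon.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = amplicon.find(site, start + 1)
    edges = [0] + cuts + [len(amplicon)]
    return [b - a for a, b in zip(edges, edges[1:])]

# Toy 34 bp amplicons differing only at the SNP (C versus T).
allele_c = "A" * 10 + "AGCT" + "T" * 20   # AluI site present
allele_t = "A" * 10 + "AGTT" + "T" * 20   # AluI site destroyed

print(fragment_lengths(allele_c, "AGCT", 2))  # [12, 22]
print(fragment_lengths(allele_t, "AGCT", 2))  # [34]
```

A homozygote for the cut allele would show the two smaller bands, a homozygote for the uncut allele a single full-length band, and a heterozygote all three. Additional constant sites in the amplicon add fragments common to both alleles, which is why the number of recognition sites in the amplified sequence is reported.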
We have used the SNP2RFLP service to assist in developing markers in our mapping of mutants in an ongoing ENU mutagenesis screen. In the process of mapping seven different recessive mutations, we have used 32 different RFLPs, which are summarized in Table 1. We primarily used RFLP markers identified with the default enzyme set. Twenty-three of these yielded easily interpreted results when digested with the prescribed enzyme, although three required additional optimization. Nine assays were not usable for mapping purposes: one did not give a discrete PCR product and eight failed to detect the RFLP predicted by the program. Overall, the program yielded easily implemented assays with 72% reliability (23 of 32).
Because the number of characterized SNPs and their distribution across the genome are highly variable, we have incorporated multiple options to control the output returned by SNP2RFLP to produce a useful number of informative markers. First, each SNP in the database has a validation status. NCBI’s dbSNP defines many different ways that a SNP can be validated. For simplicity, if the “display validated SNPs only” option is selected, SNPs that have no validation information are excluded. This reduces the number of informative SNPs in many cases, but gives higher confidence in the utility of those reported.
Second, there are occasions when no informative SNPs are reported between two strains in a specific region. It may be that there are indeed informative SNPs in the region but the genotype may be recorded in only one strain. If the “display SNPs recorded in only one strain” option is selected, then the restriction digest is simulated on the SNPs from the strain where the genotype is known and compared with that for the alternate allele. For those identified as potentially polymorphic, additional methods can be applied by the user to infer the genotype of each SNP in the other strain (such as comparison to haplotypes of well-characterized strains), or they can simply be empirically tested.
Third, as previously noted, SNPs in a repeat region of the genome are often difficult to amplify. The user can select an option for SNP2RFLP to discard SNPs that fall within repeats.
Finally, the desired density of SNP markers returned can be set. SNP2RFLP can be instructed to return all of the informative markers or a subset (e.g., 1 of every 5, 1 of every 10, etc.). This option allows the user to retrieve an adequate and manageable number of markers. As an example, suppose a genome-wide SNP scan of a cross between A/J and FVB/NJ reveals a candidate region on chromosome 13, 14.8–46.7 Mb, in which a particular mutation of interest may be located. Table 2 gives the numbers of informative SNPs in this region returned by SNP2RFLP when different options are selected. When all of the enzymes are selected and SNP2RFLP is instructed to keep all SNPs, it returns a large and perhaps unmanageable number of SNP markers. By restricting the type and number of enzymes and directing SNP2RFLP to report SNPs at a desired density, the user obtains a manageable and adequate number of SNP markers to facilitate fine-mapping of the candidate region containing the phenotype-causing mutation of interest.
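The density option can be pictured as simple subsampling of the position-ordered marker list. The Python sketch below assumes that "1 of every N" means keeping every Nth marker in chromosomal order; how SNP2RFLP actually selects the subset is not specified here, and the function name and marker identifiers are hypothetical.

```python
def thin_markers(markers, every=1):
    """Keep 1 of every `every` markers after sorting by chromosomal position.

    `markers` is a list of (marker_id, position_bp) tuples.
    """
    if every < 1:
        raise ValueError("every must be a positive integer")
    ordered = sorted(markers, key=lambda marker: marker[1])
    return ordered[::every]

# Twelve placeholder markers spaced 1 kb apart from 14.8 Mb.
markers = [("snp%d" % i, 14_800_000 + i * 1000) for i in range(12)]
print(len(thin_markers(markers)))      # 12 (all markers)
print(thin_markers(markers, every=5))  # markers 0, 5 and 10
```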
The SNP2RFLP interface can be accessed through the web at http://genetics.bwh.harvard.edu/snp2rflp. An instruction manual is available on the website or as supplementary data.
Wesley A. Beckstead, Department of Biology, Brigham Young University, Provo, UT 84602, USA.
Bryan C. Bjork, Genetics Division, Brigham and Women’s Hospital, Harvard Medical School, New Research Building, 77 Avenue Louis Pasteur, Boston, MA 02115, USA.
Rolf W. Stottmann, Genetics Division, Brigham and Women’s Hospital, Harvard Medical School, New Research Building, 77 Avenue Louis Pasteur, Boston, MA 02115, USA.
Shamil Sunyaev, Genetics Division, Brigham and Women’s Hospital, Harvard Medical School, New Research Building, 77 Avenue Louis Pasteur, Boston, MA 02115, USA.
David R. Beier, Genetics Division, Brigham and Women’s Hospital, Harvard Medical School, New Research Building, 77 Avenue Louis Pasteur, Boston, MA 02115, USA.