Figure 1 shows a simplified flow chart describing the primer selection algorithm. The algorithm was implemented in Perl as a program called uPrimer. uPrimer requires ~2 days on a 1.5 GHz Linux system to design primers for human or mouse genes.
Figure 1 A simplified flow chart describing the primer design algorithm.
The principal source of gene sequence information for this project is the NCBI protein database GenPept (http://www.ncbi.nlm.nih.gov/Entrez/). The corresponding DNA coding sequences were retrieved and redundant sequences were clustered using a program called DeRedund (8). Low complexity regions may contribute to primer cross-reactivity (15) and are therefore excluded using the DUST program (16). To further enhance sequence complexity, a primer sequence is rejected if it contains six or more contiguous identical residues, and no primer candidate is considered from sequence regions with ambiguous residues.
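The two sequence-level rules in the preceding paragraph (no run of six or more identical residues, no ambiguous residues) can be sketched as below. This is a minimal illustration in Python rather than the Perl of uPrimer; the function name is hypothetical, it assumes that any character other than A, C, G or T counts as ambiguous, and it does not reimplement DUST masking.

```python
import re

# Assumption: primers are plain DNA strings; anything other than
# A, C, G or T is treated as an ambiguous residue.
HOMOPOLYMER_RUN = re.compile(r"(.)\1{5,}")  # one residue repeated 6+ times in a row
UNAMBIGUOUS = re.compile(r"[ACGT]+")


def passes_complexity_filter(primer: str) -> bool:
    """Return True if the primer has only A/C/G/T and no run of >= 6 identical residues."""
    primer = primer.upper()
    if not UNAMBIGUOUS.fullmatch(primer):
        return False  # contains an ambiguous residue (or is empty)
    return HOMOPOLYMER_RUN.search(primer) is None
```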
Two kinds of priming reactions are commonly used in RT reactions: random priming and oligo(dT) priming. Oligo(dT) priming usually results in cDNA libraries enriched for mRNA and tends to over-represent the 3′ ends of transcripts. As the detection of different splice isoforms is one major goal in gene expression analysis, we expect to perform random priming in RT reactions. In general, maximum sensitivity in random priming lies close to the 5′ end of a coding sequence (8). Therefore coding regions were scanned from the 5′ end to the 3′ end until three qualified primer pairs had been picked.
To facilitate running multiple PCRs, all human and mouse primers are designed to have similar properties. All primers are 19–23 nt long, with a preferred length of 21 residues. This is long enough to permit generation of gene-specific primers, while reducing the potential for cross-reactivity and allowing cost-effective generation of large primer sets. The GC contents are also similar (35–65%) to ensure uniform priming. Because 3′ end residues contribute most to non-specific primer extension, especially if the binding of these residues is relatively stable (17), the algorithm evaluates the ΔG value for the last five residues at the 3′ end and a threshold value of –9 kcal/mol is adopted for primer rejection.
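The length and GC-content criteria above can be expressed as a short check. The sketch below is illustrative only: the function names are hypothetical, and the 3′ end stability test takes a precomputed ΔG value as input rather than deriving it. It also assumes that "rejection" means the 3′ end ΔG is more negative (more stable) than –9 kcal/mol, which is one reading of the text.

```python
def gc_fraction(primer: str) -> float:
    """Fraction of G and C residues in the primer."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def passes_basic_filters(primer: str,
                         min_len: int = 19, max_len: int = 23,
                         min_gc: float = 0.35, max_gc: float = 0.65) -> bool:
    """Apply the 19-23 nt length window and the 35-65% GC window from the text."""
    if not min_len <= len(primer) <= max_len:
        return False
    return min_gc <= gc_fraction(primer) <= max_gc


def three_prime_end_too_stable(end_delta_g: float, threshold: float = -9.0) -> bool:
    """Assumed reading: reject when the 3' pentamer dG (kcal/mol) is below the threshold."""
    return end_delta_g < threshold
```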
The melting temperature ($T_m$) determines the optimal annealing temperature. In recent years, significant progress has been made in accurately estimating the $T_m$ of oligonucleotides (18). The nearest neighbor method is to date the most accurate approach and is implemented by the following formula (18):
$$T_m = \frac{\Delta H^\circ}{\Delta S^\circ + R \ln(C_T/4)}$$
where $R$ is the gas constant [1.987 cal/(K·mol)], $C_T$ is the primer concentration, $\Delta H^\circ$ is the enthalpy change and $\Delta S^\circ$ is the entropy change. $\Delta H^\circ$ and $\Delta S^\circ$ are calculated using the published thermodynamic parameters (18). The entropy change is dependent on salt concentration, so an entropy correction is performed:
$$\Delta S^\circ = \Delta S^\circ(1\ \mathrm{M\ Na^+}) + 0.368 \times (N - 1) \times \ln[\mathrm{Na}^+_{eq}]$$
where $N$ is the length of the primer and $[\mathrm{Na}^+_{eq}]$ is the Na$^+$ equivalent concentration from all salts in a reaction. The default parameters for $T_m$ calculation are 250 nM primer and 0.15 M $\mathrm{Na}^+_{eq}$. Variations in primer and salt concentrations in other typical PCR conditions affect the $T_m$ values only slightly. All primer $T_m$ values are in the narrow range 60–63°C.
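As a worked illustration of the two formulas, the sketch below computes a melting temperature from total ΔH° and ΔS° values supplied by the caller. It uses the standard nearest-neighbor form, in which the denominator is ΔS° + R ln(CT/4) and the result is in kelvin, then converts to °C. The function names and the example ΔH° and ΔS° totals are hypothetical and chosen only for illustration; they are not taken from the published parameter tables.

```python
import math

R = 1.987  # gas constant in cal/(K*mol)


def salt_corrected_entropy(ds_1m_na: float, length: int, na_eq: float = 0.15) -> float:
    """Entropy correction: dS = dS(1 M Na+) + 0.368 * (N - 1) * ln[Na+eq]."""
    return ds_1m_na + 0.368 * (length - 1) * math.log(na_eq)


def melting_temperature_c(dh_cal: float, ds_1m_na: float, length: int,
                          primer_conc: float = 250e-9, na_eq: float = 0.15) -> float:
    """Tm in degrees Celsius.

    dh_cal      total enthalpy change in cal/mol (convert from kcal/mol first)
    ds_1m_na    total entropy change at 1 M Na+ in cal/(K*mol)
    primer_conc primer concentration C_T in mol/l (default 250 nM)
    na_eq       Na+ equivalent concentration in mol/l (default 0.15 M)
    """
    ds = salt_corrected_entropy(ds_1m_na, length, na_eq)
    tm_kelvin = dh_cal / (ds + R * math.log(primer_conc / 4))
    return tm_kelvin - 273.15


# Hypothetical totals for a 21-mer, for illustration only.
tm = melting_temperature_c(dh_cal=-160_000.0, ds_1m_na=-430.0, length=21)
print(f"Tm = {tm:.1f} C")
```

With these illustrative inputs the result falls inside a 60–65°C window; real values depend on the nearest-neighbor parameters summed for the actual primer sequence.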
Since PCR efficiency is decreased for very long amplicons, only short amplicons of 150–350 bp are considered during primer selection. Occasionally, if this requirement cannot be satisfied, a wider range of 100–800 bp is used. In general the larger amplicons are less attractive but they are included in the database because under some circumstances primer efficiency may not be the foremost consideration for the end user.
Mismatches are known to significantly reduce priming stability (22) and at times even a single mismatch can destabilize a significant length of DNA duplex (24). Therefore we expect contiguous base pairing to be one of the most important factors in duplex stability. Our principal filter for cross-reactivity is the rejection of primers containing contiguous residues that are also found in other sequences. An analysis of the distribution of lengths of contiguous residues shared by two or more sequences in the design space of mammalian coding regions showed that a filter cut-off rejecting perfect 15mer matches was the most stringent feasible filter (8). Non-unique 15mers can be efficiently identified by a software ‘hashing’ technique with 10mers as the basic hash keys (8). Every possible 15mer in a primer sequence is compared to both strands of all known sequences in the design space. The presence of a non-unique 15mer excludes a primer from further consideration. To further reduce cross-reactivity, BLAST searches for primer sequence similarity were carried out against all known sequences in the design space and qualified primers were required to have BLAST scores of less than 30 [these threshold values were recommended from previous studies (8)].
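The 15mer filter can be illustrated with a simple lookup structure. The sketch below is a simplification and not the authors' implementation: it stores whole 15mers in a Python set instead of using 10mers as hash keys, it assumes the index is built only from non-target sequences, and all function names are hypothetical.

```python
from typing import Iterable, Iterator, Set

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq: str, k: int = 15) -> Iterator[str]:
    """Yield every contiguous k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_kmer_index(sequences: Iterable[str], k: int = 15) -> Set[str]:
    """Collect the k-mers found on both strands of the given (non-target) sequences."""
    index: Set[str] = set()
    for seq in sequences:
        seq = seq.upper()
        for strand in (seq, reverse_complement(seq)):
            index.update(kmers(strand, k))
    return index


def has_shared_kmer(primer: str, index: Set[str], k: int = 15) -> bool:
    """True if any k-mer of the primer also occurs in the indexed sequences."""
    return any(kmer in index for kmer in kmers(primer.upper(), k))
```

A primer for which `has_shared_kmer` returns True would be excluded under the rule described above; a set of full 15mers trades memory for simplicity compared with a 10mer-keyed hash.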
Random priming in RT reactions results in a significant contribution of template from non-coding RNAs. To compensate for the abundance of these templates, more stringent filters were applied to minimize primer residues also found in non-coding RNAs.
The primer 3′ end residues are essential for controlling non-specific amplicons because DNA polymerase extension can be greatly reduced by mismatches (26). Therefore a more stringent filter should apply to cross-hybridization at the 3′ ends. In our algorithm the cross-hybridizing Tm for the 3′ end perfectly matched residues does not exceed 46°C; the Tm does not exceed 42°C when compared to non-coding RNA sequences.
Secondary structure in the primer or target can retard primer annealing, leading to reduced PCR efficiency. Although the prediction of primer secondary structure is still challenging at present, secondary structure is most likely to occur in regions of self-complementarity (28). To reduce self-complementarity, no contiguous 5mer match is allowed anywhere between a primer and its complementary sequence. To avoid picking primers from a sequence region with high likelihood of secondary structure, no contiguous 9mer match is allowed when a primer sequence is compared to the complementary strand of its cognate gene sequence. A BLAST similarity search for the primer sequence is also carried out on the complementary strand and the score is required to be less than 18.
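The 5mer and 9mer rules can be sketched as substring checks against a reverse complement. The sketch below assumes that "complementary sequence" means the reverse complement of the primer and that "complementary strand" means the reverse complement of the cognate gene sequence; the function names are hypothetical, and the BLAST step is not covered.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def shares_kmer(query: str, target: str, k: int) -> bool:
    """True if any contiguous k-mer of query occurs in target."""
    return any(query[i:i + k] in target for i in range(len(query) - k + 1))


def is_self_complementary(primer: str, k: int = 5) -> bool:
    """Assumed reading: a k-mer of the primer also occurs in the primer's reverse complement."""
    primer = primer.upper()
    return shares_kmer(primer, reverse_complement(primer), k)


def matches_complementary_strand(primer: str, gene: str, k: int = 9) -> bool:
    """True if a k-mer of the primer occurs on the complementary strand of its gene."""
    return shares_kmer(primer.upper(), reverse_complement(gene.upper()), k)
```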
The formation of products arising from primers serving as template (primer dimers) can deplete free primers and result in poor PCR yield. Primer dimers are a common cause of real-time PCR quantitation failures when DNA intercalating dyes (e.g. SYBR Green I) are used. To prevent primer homodimer formation, candidate primers are rejected if the four residues at the 3′ end of a primer could be found in its complementary sequence. Complementarity of the forward and reverse primers in a primer pair is examined in the same way to prevent detrimental heterodimer formation.
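The primer dimer rule can be sketched in the same style. The example below assumes that a primer's "complementary sequence" is its reverse complement, so a homodimer is flagged when the last four residues of a primer occur in the primer's own reverse complement, and a heterodimer when the 3′ end of either primer occurs in the reverse complement of its partner. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def three_prime_end_anneals(primer: str, partner: str, n: int = 4) -> bool:
    """True if the last n residues of primer can pair somewhere along partner."""
    return primer.upper()[-n:] in reverse_complement(partner.upper())


def forms_homodimer(primer: str, n: int = 4) -> bool:
    """Check a primer against itself."""
    return three_prime_end_anneals(primer, primer, n)


def forms_heterodimer(forward: str, reverse: str, n: int = 4) -> bool:
    """Check both 3' ends of a primer pair against the partner primer."""
    return (three_prime_end_anneals(forward, reverse, n)
            or three_prime_end_anneals(reverse, forward, n))
```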
Distribution of the rejected primers
15 562 332 primers were evaluated before 37 277 primer pairs were picked to cover 15 697 mouse genes. The very high rejection rate, 99.5%, reflects filter stringency. The distribution of the rejected mouse primers is shown in Figure 2. Among the rejected primers, 50.7% had too high or too low Tm values, 28.7% cross-hybridized to non-target genes, 19.8% were rejected because of sequence self-complementarity, 0.5% were from low complexity regions and 0.3% were rejected because of other properties (GC content and end stability).
Figure 2 Distribution of the rejected mouse primers. 15 562 332 primers were rejected during primer selection. They were rejected because they could not meet the primer selection criteria for melting temperature (Tm), cross-match to other sequences, sequence ...
The online primer database
Successfully designed human and mouse primers were imported into a MySQL database installed on a Linux server. A web-based interface was established to allow users to query the primer database, PrimerBank. Figure 3 shows the search page of the website. 147 404 primers were picked and included in PrimerBank to cover 16 293 human and 15 697 mouse genes. There are several ways to search for primers: by GenBank accession no., NCBI protein accession no., LocusLink ID, PrimerBank ID or Keyword (gene description). Batch primer retrieval is also available by entering multiple IDs at the same time. Detailed instructions are included in the Help page of the website. Because of the sequence redundancy in public sequence databases, PrimerBank uses LocusLink index files (29), updated weekly from ftp://ftp.ncbi.nih.gov/, to map gene accessions to gene loci and associate the gene information with the primers.
Figure 3 A screenshot of the web interface for PrimerBank. There are several ways to search for primers: GenBank accession no., NCBI protein accession no., LocusLink ID, PrimerBank ID or Keyword (gene description). PrimerBank currently contains 147 404 primers ...
Experimental evaluation of the primers
To evaluate the quality of the primers identified by the algorithm, 112 primer pairs representing 108 genes were tested in conventional RT–PCR and real-time PCR experiments. The primer information was retrieved from PrimerBank and is summarized in Supplementary Material Table S1. The genes were chosen because they had been shown to be expressed in mouse liver by microarray experiments and were of interest to local investigators (unpublished data). Some genes were from closely related gene families. Among them, 16 genes were from the cytochrome P450 family and five genes were from the Dok family.
The results for the 16 cytochrome P450 genes are included here as examples and the relevant primer information is summarized in Table 1. The cytochrome P450 genes are closely related and the sequence similarity is ~90% between some family members. Despite the high template homology, all 16 PCRs resulted in single specific amplicons, as determined by gel electrophoresis (Fig. 4A). All 16 P450 genes were also efficiently amplified in real-time PCR and the amplification plots indicated no obvious correlation between amplicon length and PCR efficiency (Fig. 5A). The melting curve analysis indicated single amplicons for 15 of these P450 genes (six examples shown in Fig. 5B). PCR specificity was confirmed by sequencing the PCR products.
Table 1 Primer information for 16 cytochrome P450 genes
Figure 4 Gel electrophoresis of PCR products. (A) PCR amplifications of 16 cytochrome P450 genes. Lane 1, 25 bp DNA ladder; lanes 2–17, 10 µl PCR products of P450 1a2, 2a5, 2b9, 2b13, 2c29, 2c38, 2c40, 2d26, 2e1, 2j5, 3a16, 3a25, 4a10, 4a12, 4a14 ...
Figure 5 Real-time PCR of cytochrome P450 genes. (A) PCR amplification plots for 16 cytochrome P450 genes. (B) Melting curves of six genes from cytochrome P450 families 1 and 2 (plotted as the first derivative of the absorbance with respect to temperature).
An analysis of PCR efficiency was also conducted by measuring the slope of a standard curve created from serially diluted templates. Six primer pairs with a range in predicted amplicon length of 152–347 bp were analyzed and yielded an efficiency of 96 ± 4%.
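A standard-curve efficiency estimate of this kind is conventionally obtained by fitting threshold cycle (Ct) against log10 of template amount and converting the slope with E = 10^(–1/slope) – 1. The sketch below assumes that convention, which the text does not spell out; the function names and the dilution data in the usage lines are hypothetical.

```python
import math
from typing import Sequence


def fit_slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Least-squares slope of y against x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def pcr_efficiency(template_amounts: Sequence[float], ct_values: Sequence[float]) -> float:
    """Efficiency as a fraction (1.0 = 100%) from a serial-dilution standard curve."""
    log_amounts = [math.log10(a) for a in template_amounts]
    slope = fit_slope(log_amounts, ct_values)  # Ct per 10-fold change in template
    return 10 ** (-1.0 / slope) - 1.0


# Hypothetical 10-fold dilution series, for illustration only.
amounts = [1000.0, 100.0, 10.0, 1.0]
cts = [20.0, 23.32, 26.64, 29.96]
print(f"efficiency = {pcr_efficiency(amounts, cts):.1%}")
```

A slope near –3.32 cycles per 10-fold dilution corresponds to a doubling of product each cycle, that is, an efficiency close to 100%.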
Among the 112 primer pairs tested, 106 detected their target genes in liver total RNA. A literature search indicated that five of the six undetected genes had been shown to be expressed in tissues other than liver (31). Thus total RNA from embryo, brain, kidney or testis was used to test primers designed for these genes (see Supplementary Material Table S1). In this case five primer pairs yielded single specific PCR products. Only one gene was not detected using the primer pairs we designed. Among the 106 genes detected in liver, all except one primer pair resulted in single specific amplicons on agarose gel (unpublished data). One primer pair yielded a minor band in addition to the desired major band (Fig. 4B). Sequencing indicated that this is a novel splice isoform that was not identified in GenBank.
The 112 primer pairs were also tested in real-time PCR experiments. Melting curve analysis (plotted as the first derivative of the absorbance with respect to temperature) indicated the presence of single PCR products in 104 PCRs. Six reactions resulted in bimodal first derivative plots, although single bands were observed by agarose gel. Sequencing results confirmed that these PCR products were homogeneous and correct, indicating that the observed heterogeneity in melting temperature was due to internal sequence inhomogeneity (e.g. independently melting blocks of high and low GC content) rather than amplicon contamination. In summary, 110 out of 112 primer pairs led to single specific PCR products yielding a primer design success rate of 98.2%.