

J Biomol Tech. 2003 March; 14(1): 9–16.
PMCID: PMC2279894

Simple Tests to Detect Errors in High-Throughput Genotype Data in the Molecular Laboratory


With the advent of high-density DNA marker data sets for the mouse and other model systems, 100 or more genotypes are routinely generated from large groups of mice. Issues of the accuracy and reliability of the genotyping are extremely important but often not addressed until genetic analysis is conducted. Simple tests that rely on the robust predictions arising from Mendelian genetics can be made quickly in the molecular laboratory as the data are generated, and require only a spreadsheet program. In this report, genotype data from 392 mice tested at 96 marker sites were analyzed for errors that are typical when handling large volumes of data generated in a repetitive process. The testing consisted of: (1) repeating the genotyping of approximately 1% of the samples; (2) examining the deviation from the expected segregation ratio (1:2:1) on a marker-by-marker basis; and (3) testing the correlation of the genotype at one marker with that at neighboring genetic markers on a chromosome. These three steps allowed analysis at the level of the microtiter plate, where errors are most likely to occur. A set of 96 dinucleotide repeat markers that are polymorphic between the C57BL/6J and DBA/2J mouse strains and can be multiplexed is reported for use in other genotyping projects.

Keywords: Hardy–Weinberg equilibrium, inbred lines, C57, DBA, multiplex marker set

Significant theoretical work is available in the literature on the problems created by mistyping or misclassification of genotypes for genetic analysis of both simple and complex traits. These efforts have generated analytical tools to estimate the effects of typing error on subsequent linkage analysis.1–5 Error filters can then be applied to the genotype data to minimize the interference with detecting linkage. These methods do not address errors in the data that can be corrected prior to linkage analysis to increase the accuracy of the genotypes obtained.

Genetic analysis based on the results of Mendel6 provides a framework with which to analyze genotypes when individuals in a large sample group are tested at multiple loci. Predictions based on Hardy–Weinberg equilibrium allow for a convenient method to compare genotype frequencies with the underlying allele frequencies7, 8 but can become difficult in populations in which breeding is not controlled by the experimenter. In the case of crosses between two strains of inbred mice, Hardy–Weinberg equilibrium is at its simplest (1:2:1 ratio) because allele frequencies are expected to be 0.5 for each genetic marker. We have taken advantage of these laws to describe simple methods for testing the reliability of genotype data in mouse studies prior to genetic analysis. The work of Sturtevant demonstrated that the correlation values between a reference marker and a test marker would diminish with increasing map distance.9 The combination of these tests provides a practical method for examining data as they are being generated.
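The expected decline of correlation with map distance can be illustrated with a short calculation. The sketch below is not taken from the study: it assumes the Haldane map function (no crossover interference), equal recombination in both sexes, and genotypes coded additively as the count of one parental allele. Under those assumptions the expected correlation between two markers in an F2 population is 1 − 2r, where r is the recombination fraction. The function names are illustrative.

```python
import math


def haldane_recombination_fraction(distance_cm: float) -> float:
    """Recombination fraction r for a map distance in cM.

    Assumes the Haldane map function (no interference); this is an
    assumption of the sketch, not something stated in the article.
    """
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def expected_f2_correlation(distance_cm: float) -> float:
    """Expected correlation of additively coded F2 genotypes at two markers."""
    r = haldane_recombination_fraction(distance_cm)
    return 1.0 - 2.0 * r


# Print the expected decline toward zero as map distance grows.
for d in (0, 10, 25, 50, 100):
    r = haldane_recombination_fraction(d)
    print(f"{d:>3} cM  r = {r:.3f}  expected correlation = {expected_f2_correlation(d):.3f}")
```

A curve of this shape gives a rough yardstick: observed correlations that fall well below it, or below zero, point to a problem with the allele calls rather than to biology.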


In connection with a study of age-related processes, genetic analysis was conducted on 392 F2 mice generated from a cross of C57BL/6J and DBA/2J parental strains. A small snip (approximately 2 mm) of the tail of each mouse was taken at weaning for genetic analysis. DNA was extracted by standard lysis with proteinase K digestion followed by phenol/Sevag extraction and ethanol precipitation.10 A portion of the purified DNA was diluted to 10 ng/μL into 96-well plates. Genotyping was carried out using markers from the MIT collection. Primers were purchased from Research Genetics, Inc. (Huntsville, AL), and the forward primer was fluorescently labeled for allele detection. See Table 1 for the markers used.

Table 1. Markers for Multiplexed Genotyping to Distinguish C57BL/6J and DBA/2J Mouse Alleles

The polymerase chain reaction (PCR) was carried out in a total volume of 10 μL containing 10 ng of template DNA, 2.5 mM MgCl2, 10 mM dNTPs, 0.04 mM spermidine, 0.5 U AmpliTaq Gold DNA polymerase (Applied Biosystems Inc., Foster City, CA), and the buffer supplied with the polymerase. Following denaturation at 95°C for 2 min, 35 cycles of PCR were carried out (45 s at 95°C, 45 s at 59°C, 60 s at 72°C). The samples were electrophoresed on an ABI 310 Genetic Analyzer (Applied Biosystems). Allele fragment sizes were determined by GeneScan software (Applied Biosystems). These sizes were converted to allele calls (either B or D for C57BL/6J or DBA/2J, respectively) with ABI Genotyper software (Applied Biosystems) and exported into an Excel spreadsheet (Microsoft Corporation, Redmond, WA).
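The conversion from fragment sizes to allele calls was done in Genotyper; the sketch below only illustrates the general idea and does not reproduce that software. The tolerance window, the two-letter genotype labels (BB, BD, DD), and the example allele sizes are all assumptions made for illustration.

```python
def call_allele(size_bp, b_size, d_size, tolerance_bp=1.0):
    """Return 'B', 'D', or None for a single fragment size.

    b_size and d_size are the expected parental allele sizes; the
    tolerance window is a hypothetical choice for this sketch.
    """
    if abs(size_bp - b_size) <= tolerance_bp:
        return "B"
    if abs(size_bp - d_size) <= tolerance_bp:
        return "D"
    return None


def call_genotype(peak_sizes, b_size, d_size, tolerance_bp=1.0):
    """Turn one sample's peak sizes into 'BB', 'BD', 'DD', or None."""
    alleles = {call_allele(s, b_size, d_size, tolerance_bp) for s in peak_sizes}
    if not alleles or None in alleles:
        return None  # no peaks, or a peak that matches neither allele
    if alleles == {"B"}:
        return "BB"
    if alleles == {"D"}:
        return "DD"
    return "BD"


# Hypothetical marker with a 150-bp B allele and a 162-bp D allele.
print(call_genotype([150.2], b_size=150, d_size=162))
print(call_genotype([150.1, 161.8], b_size=150, d_size=162))
```

Returning None for a peak outside both windows, instead of forcing a call, keeps unexpected fragments visible for manual review.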


Initial testing and selection of markers from the MIT collection were performed on a small set of mouse DNA samples to determine each marker's ability to fit into a multiplexed genotyping set. Eight samples (the two parental strains, an artificial heterozygote of parental DNA mixed in equal proportion, and five F2 mice, one selected from each of the 96-well plates) were used to establish the expected electrophoretic patterns of the homozygote and heterozygote alleles. The five mice from the F2 population served as the first test of the large-scale genotype production by comparing the genotypes generated in the marker selection phase with those generated during the high-throughput phase. The genotypes of all five mice were consistent in the two phases, indicating that there were no detectable shifts of the DNA samples during preparation of the 96-well trays. The final set of markers used is shown in Table 1 and is available on the website of the Center for Developmental and Health Genetics (CDHG).
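The comparison of repeat genotypes can be sketched in a few lines. This is a minimal illustration, assuming each data set is held as a mapping from a (sample, marker) pair to a genotype call; the sample names, marker names, and calls below are invented for the example.

```python
def find_discordant_calls(pilot, production):
    """List calls that differ between two genotyping runs.

    Both arguments map (sample, marker) -> genotype call. Only pairs
    present in both runs are compared.
    """
    discordant = []
    for key, pilot_call in pilot.items():
        production_call = production.get(key)
        if production_call is None:
            continue  # not typed in the production run; nothing to compare
        if production_call != pilot_call:
            sample, marker = key
            discordant.append((sample, marker, pilot_call, production_call))
    return discordant


# Hypothetical calls for two mice at two markers.
pilot = {
    ("mouse_A", "marker_1"): "BB",
    ("mouse_A", "marker_2"): "BD",
    ("mouse_B", "marker_1"): "DD",
}
production = {
    ("mouse_A", "marker_1"): "BB",
    ("mouse_A", "marker_2"): "DD",
    ("mouse_B", "marker_1"): "DD",
}
for row in find_discordant_calls(pilot, production):
    print(row)
```

A sample that is discordant at many markers suggests a shifted or swapped well, whereas a marker that is discordant across many samples suggests an allele-calling problem.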

Approximately 20% of the selected genetic markers could not be used owing to one of three problems: (1) the primers did not generate amplified product; (2) the observed allele sizes were different from those expected based on the MIT website and conflicted with the size of another marker in the same dye color; or (3) the primers worked poorly when amplified in the presence of two or three other markers in the multiplexed PCR reaction. The likelihood that a marker would fail was unpredictable, and replacement markers were chosen to maintain a map position and fit with the allele sizes of the other markers already in its set. The markers that could not be used are shown in Table 2 and on the CDHG website.
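The second failure mode, a size conflict within one dye color, lends itself to an automated check when candidate markers are assembled into a set. The sketch below is illustrative only: the marker names, dye labels, size ranges, and the 5-bp safety margin are hypothetical and are not values from Table 1 or Table 2.

```python
from itertools import combinations


def ranges_conflict(a, b, margin_bp=5.0):
    """True if two allele size ranges overlap or lie within margin_bp of each other."""
    return (a["min_bp"] - margin_bp <= b["max_bp"]
            and b["min_bp"] - margin_bp <= a["max_bp"])


def find_dye_set_conflicts(markers, margin_bp=5.0):
    """Return pairs of marker names that share a dye and have conflicting sizes."""
    conflicts = []
    for a, b in combinations(markers, 2):
        if a["dye"] == b["dye"] and ranges_conflict(a, b, margin_bp):
            conflicts.append((a["name"], b["name"]))
    return conflicts


# Hypothetical multiplex set: min_bp and max_bp span both parental alleles.
markers = [
    {"name": "m1", "dye": "blue", "min_bp": 100, "max_bp": 112},
    {"name": "m2", "dye": "blue", "min_bp": 110, "max_bp": 125},
    {"name": "m3", "dye": "blue", "min_bp": 150, "max_bp": 160},
    {"name": "m4", "dye": "green", "min_bp": 105, "max_bp": 115},
]
print(find_dye_set_conflicts(markers))
```

Running the same check with observed rather than published allele sizes would catch conflicts that only appear once a marker has been tested.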

Table 2. Failed Markers and Their Replacements

The second test examined the deviation of the genotype frequencies from the expected Mendelian proportions (1:2:1). A chi-square test of each marker was conducted after all mice were genotyped for that marker. The five markers on the X chromosome are hemizygous in the males (no heterozygous males) and were tested only in the females. Of all 96 markers tested, only two deviated from the expected proportion with a P value < 0.05. This number is close to the four or five deviations expected by chance alone (96 tests × 0.05 = 4.8), and uncorrected P values were used to minimize false negatives. Examination of the Genotyper files for these markers, D10mit14 and D10mit95, revealed that alleles were misgrouped during the allele-calling step. Marker D10mit14 consists of 191- and 185-bp alleles, and D10mit95 has alleles of 196 and 176 bp. The two larger alleles (196 and 191 bp) were accidentally grouped as one marker, and the two smaller alleles (185 and 176 bp) were grouped as the second marker. Assignment of the alleles to their proper marker restored correct allele frequencies for each of the suspect markers.
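The segregation test was run in a spreadsheet in this study; the same calculation can be sketched in Python for readers who prefer a script. The function name and the example counts are illustrative, not data from the study. With three genotype classes and fixed expected proportions the test has 2 degrees of freedom, and for 2 degrees of freedom the chi-square P value is exactly exp(−χ²/2), so no statistics library is needed.

```python
import math


def chi_square_1_2_1(n_bb, n_bd, n_dd):
    """Chi-square test of observed genotype counts against a 1:2:1 ratio.

    Returns (chi2, p_value). The P value uses the closed form for
    2 degrees of freedom: p = exp(-chi2 / 2).
    """
    total = n_bb + n_bd + n_dd
    if total == 0:
        raise ValueError("no genotypes to test")
    observed = (n_bb, n_bd, n_dd)
    expected = (total / 4.0, total / 2.0, total / 4.0)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, math.exp(-chi2 / 2.0)


# Hypothetical counts for 392 mice: one well-behaved marker, one skewed marker.
for label, counts in (("balanced", (98, 196, 98)), ("skewed", (150, 196, 46))):
    chi2, p = chi_square_1_2_1(*counts)
    flag = "CHECK ALLELE CALLS" if p < 0.05 else "ok"
    print(f"{label}: chi2 = {chi2:.2f}, P = {p:.3g}  {flag}")
```

Because roughly 5% of correct markers will fall below P = 0.05 by chance, a flagged marker is a prompt to inspect the allele-calling files, not proof of an error.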

The third test examined the correlation of allele status between markers on a chromosome. A detailed protocol is shown in Figure 1. The test was conducted by comparing the allele status of one marker with those of the other markers on a chromosome. Mouse chromosomes all have centromeres at one end (acrocentric), and the most centromeric marker was chosen as the reference. For most chromosomes the correlation declined asymptotically to zero as the map distance in centiMorgans increased between the reference and the test markers. This test was repeated for each chromosome using the marker most distal from the centromere as the reference marker because the correlation between two markers at opposite ends of the chromosome was already approaching zero (random segregation), especially for the larger chromosomes, and thus was uninformative. The uncorrected allele correlations for chromosomes 2, 7, 8, and 10 are shown in Figure 2A. For two of the chromosomes (chromosomes 2 and 8), the alleles were negatively correlated, as indicated by values less than zero (Fig. 2A). The negatively correlated markers were examined in the original genotype data and were discovered to have reversed allele calls based on the allele sizes of the parental strains. The entire set of values for chromosome 8 was below the abscissa because the reference marker was the miscalled marker. When corrected, the allele markers were all positively correlated, as shown in Figure 2B.
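The correlation test can also be scripted. The sketch below is a hedged illustration and not the protocol of Figure 1: it assumes genotypes are coded numerically as the number of D alleles (BB = 0, BD = 1, DD = 2), uses the Pearson correlation, and skips mice with a missing call at either marker. Marker names and genotype lists are invented.

```python
import math

# Assumed numeric coding: count of D alleles per genotype.
CODE = {"BB": 0, "BD": 1, "DD": 2}


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; NaN if undefined."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def correlations_with_reference(genotypes, reference):
    """Correlate every marker with the reference marker.

    genotypes maps marker name -> list of calls, one per mouse, same order.
    """
    ref_calls = genotypes[reference]
    result = {}
    for marker, calls in genotypes.items():
        if marker == reference:
            continue
        pairs = [(CODE[a], CODE[b]) for a, b in zip(ref_calls, calls)
                 if a in CODE and b in CODE]
        if len(pairs) < 2:
            result[marker] = float("nan")
            continue
        xs, ys = zip(*pairs)
        result[marker] = pearson(xs, ys)
    return result


def flag_negative(correlations):
    """Markers whose correlation with the reference is below zero."""
    return sorted(m for m, r in correlations.items() if r < 0)


# Hypothetical chromosome: 'near' matches the reference, 'flipped' has B and D swapped.
genotypes = {
    "ref": ["BB", "BD", "DD", "BD", "BB", "DD"],
    "near": ["BB", "BD", "DD", "BD", "BB", "DD"],
    "flipped": ["DD", "BD", "BB", "BD", "DD", "BB"],
}
corr = correlations_with_reference(genotypes, "ref")
print(corr)
print("possible reversed allele calls:", flag_negative(corr))
```

If every marker on a chromosome is flagged, the reference marker itself is the likely culprit, which is the pattern described above for chromosome 8.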

Figure 1. Stepwise error analysis from allele calls made by automated DNA fragment size detection software.
Figure 2. A: Correlation values between the most centromeric marker and the other markers on chromosomes 2, 7, 8, and 10 before correction. All markers on chromosome 8 appear to be negatively correlated with the most centromeric marker, whereas only the second ...


The techniques described herein can be used to detect gross errors in high-throughput genotyping, particularly in cases of simple expectations of genotype frequencies such as F2 populations generated from two inbred lines of the organism under analysis. The errors found in this study were due to two types of problems. First, misassignment of a DNA fragment (size in base pairs) during allele calling as an allele of one of the other markers in a dye set caused a skew in the distribution of alleles from the expected 1:2:1 ratio for the erroneous marker. Second, switching the assignment of the genotype (B or D) for a pair of alleles caused markers on a chromosome to appear to be negatively correlated. Both of these mistakes were easily detected and corrected before any genetic analysis of phenotypes was conducted, thus increasing the likelihood that high-quality data were available for analysis. It is important to conduct these tests at the level at which errors are likely to be made—in this case, the microwell plate level. These tests are easy to conduct, requiring widely available software, and can be carried out in the molecular laboratory as data from each chromosome are collected, rather than waiting until the entire data set is generated.


We thank Kate Anthony for expert technical assistance and Jeanne Spicer for website support. This study was supported by the National Institute on Aging (grant AG14731) of the National Institutes of Health. There are no known conflicts of interest on the part of any of the authors.


1. Freimer NB, Sandkuijl LA, Blower SM. Incorrect specification of marker allele frequencies: effects on linkage analysis. Am J Hum Genet 1993;52:1102–1110.
2. Goring HH, Terwilliger JD. Linkage analysis in the presence of errors I: complex-valued recombination fractions and complex phenotypes. Am J Hum Genet 2000;66:1095–1106.
3. Goring HH, Terwilliger JD. Linkage analysis in the presence of errors II: marker-locus genotyping errors modeled with hypercomplex recombination fractions. Am J Hum Genet 2000;66:1107–1118.
4. Goring HH, Terwilliger JD. Linkage analysis in the presence of errors III: marker loci and their map as nuisance parameters. Am J Hum Genet 2000;66:1298–1309.
5. Goring HH, Terwilliger JD. Linkage analysis in the presence of errors IV: joint pseudomarker analysis of linkage and/or linkage disequilibrium on a mixture of pedigrees and singletons when the mode of inheritance cannot be accurately specified. Am J Hum Genet 2000;66:1310–1327.
6. Mendel G. Versuche über Pflanzen-hybriden [Experiments in plant hybridization]. Verh Naturforsch Ver Abh Brunn 1865;IV:3–47.
7. Hardy GH. Mendelian proportions in a mixed population. Science 1908;28:49–50.
8. Weinberg W. Über den Nachweis der Vererbung beim Menschen. Jahresh Ver Vaterl Naturkd Wuerttemb 1908;64:368–382.
9. Sturtevant A. The linear arrangement of six sex-linked factors in Drosophila as shown by their mode of association. J Exp Zool 1913;14:43–59.
10. Sambrook J, Fritsch E, Maniatis T. Molecular Cloning: A Laboratory Manual. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press, 1989.
11. Blake JA, Richardson JE, Bult CJ, Kadin JA, Eppig JT, Mouse Genome Database Group. The Mouse Genome Database (MGD): the model organism database for the laboratory mouse. Nucleic Acids Res 2002;30:113–115.

Articles from Journal of Biomolecular Techniques : JBT are provided here courtesy of The Association of Biomolecular Resource Facilities