In this study, we have developed a maximum-likelihood approach to jointly estimate sample-specific dropout rates, locus-specific dropout rates, allele frequencies, and the inbreeding coefficient from only one nonreplicated set of microsatellite genotypes. Our algorithm can accurately recover the allelic dropout parameters, and an imputation strategy using the method provides an alternative to ignoring high empirical missing data rates or excluding samples and loci with large amounts of missing data. Investigators can then use the imputed data in subsequent analyses, such as in studies of genetic diversity or population structure, or in software that disallows missing values in the input data. We have demonstrated our approach using extensive analyses of an empirical data set and data sets simulated using parameter values chosen on the basis of the empirical example.
We have found that our method works well on simulated data. In particular, it performs well in estimating the sample-specific dropout rates γi
and locus-specific dropout rates γ
. Further, in the examples we have considered, it is reasonably robust to violations of the model assumptions owing to the existence of population structure or to sources of genotyping error other than allelic dropout. This robustness arises partly from the inclusion of the inbreeding coefficient ρ
in our model, which enables us to capture the deviation from HWE caused by multiple factors, such as true inbreeding, population structure, and genotyping errors. Because the various sources of deviation from HWE are incorporated into the single parameter ρ
, the estimation of ρ
itself is more sensitive to violation of model assumptions; therefore, it is important to be careful when interpreting the estimated value of ρ
, as it may reflect phenomena other than inbreeding. When data are simulated under our model, such as in the cases of F
= 0 and e
= 0, our method tends to slightly underestimate ρ
( and ), indicating that our MLEs are biased, at least for the inbreeding coefficient.
We can use simulation approaches to further explore the statistical properties of our estimates. To examine the consistency of the estimators, we performed two additional sets of simulations, in which we generated genotype data under our model with either different numbers of individuals N
or different numbers of loci L
). When L
is fixed, although estimates of the sample-specific dropout rates γi
are not affected by the value of N
, our estimates of the locus-specific dropout rates γ
and the inbreeding coefficient ρ
become closer to the true values as N
increases (Figure S4
). When N
is sufficiently large (e.g.
= 1600), the estimates of γ
are almost identical to the true values. If we instead fix N
and increase L
, then the estimates of γi
eventually approach the true values, while the estimates of γ
remain unaffected (Figure S5
). These results suggest, without a strict analytical proof, that our MLEs of the dropout rates and inbreeding coefficient are likely to be consistent.
For the Native American data, we can compare the estimated heterozygosities under our model with other data on similar populations. Wang et al. (2007)
studied microsatellites in 29 Native American populations, including 8 populations from regions that overlap those considered in our data. We reanalyzed these populations, 3 from Canada and 5 from Mexico, by calculating observed heterozygosity Ho
from the same 343 loci as were genotyped in our data. We obtained a mean Ho
of 0.670 with standard deviation 0.051 across 176 individuals in the pooled set of 8 populations. In comparison, mean Ho
across our 152 Native American samples is 0.590 (standard deviation = 0.137) before correcting for allelic dropout, substantially lower than in Wang et al. (2007)
, and it is 0.730 (standard deviation = 0.035) after correcting for allelic dropout, higher than in Wang et al. (2007)
. Several possible reasons can explain the imperfect agreement between our corrected heterozygosity and the estimate based on the Wang et al. (2007)
data. First, the sets of populations might differ in such factors as the extent of European admixture, so that they might truly differ in underlying heterozygosity. Second, the Wang et al. (2007)
data might have some allelic dropout as well, so that our Ho
estimates from those data underestimate the true values. Third, our method might have overcorrected the underestimation of Ho
; our simulations show that because we do not model genotyping errors from sources other than allelic dropout, the existence of such errors can lead to overestimation of Ho
). It is also possible that missing genotypes caused by factors other than allelic dropout could have been erroneously attributed to allelic dropout, leading to overestimation of dropout rates and hence to overcorrection of Ho
Our model assumes that all individuals are sampled from the same population with one set of allele frequencies and that inbreeding is constant across individuals and loci. We applied this assumption to the whole Native American data set as an approximation. However, evidence of population structure can be found by applying multidimensional scaling analysis to the Native American samples. As shown in Figure S6
, individuals from different populations tend to form different clusters, indicating that underlying allele frequencies and levels of inbreeding differ among populations. Although our simulations have found that estimation of allelic dropout rates is robust to the existence of population structure, estimation of allele frequencies and the inbreeding coefficient can become less accurate in structured populations. It would therefore have been preferable in our analysis to apply our method on each population instead of on the pooled data set; however, such an approach was impractical owing to the small sample sizes in individual populations. To address this problem, it might be possible to directly incorporate population structure into our model (e.g.
, Falush et al. 2003
), thereby enabling allele frequencies and inbreeding coefficients to differ across the subpopulations in a structured data set. Further, because samples from the same population are typically collected and genotyped as a group, full modeling of the population structure might allow for a correlation in dropout rates across individuals within a population.
An additional limitation of our approach is that during data analysis, we do not take into account the uncertainty inherent in estimating parameters. We first obtain the MLEs of allele frequencies
, allelic dropout rates
, and the inbreeding coefficient
and then create imputed data sets by drawing genotypes using
. This procedure is “improper” because it does not propagate the uncertainty inherent in parameter estimation (Little and Rubin 2002
). To obtain “proper” estimates, instead of using an EM algorithm to find the MLEs of the parameters, we could potentially use a Gibbs sampler or other Bayesian sampling methods to sample parameter values and then create imputed data sets using these sampled parameter sets. In such approaches, parameters sampled from their underlying distributions would be used for different imputations, instead of using the same MLEs for all imputations.
Finally, we have not compared our approach with methods that rely on replicate genotypes. While we expect that replicate genotypes will usually lead to more accurate estimates of model parameters, our method provides a general approach that is relatively flexible and accurate in the case that replicates cannot be obtained. Compared with existing models that assume HWE (e.g.
, Miller et al. 2002
; Johnson and Haydon 2007
), our model uses a more general assumption of inbreeding, and we also incorporate both sample-specific and locus-specific dropout rates. The general model increases the applicability of our method for analyzing diverse genotype data sets, such as those that have significant dropout caused by locus-specific factors (e.g.
, Buchan et al. 2005
). It is worth noting that HWE is the special case of ρ
= 0 in our inbreeding model; when it is sensible to assume HWE, we can simply initiate the EM algorithm with a value of ρ
= 0. This choice restricts the search for MLEs to the ρ
= 0 parameter subspace, because Equation 22 stays fixed at 0 in each EM iteration. Similarly, if we prefer to consider only sample-specific dropout rates (or only locus-specific dropout rates), then we can simply set the initial values of γ
to 0 for all loci (or initial values of γi
to 0 for all individuals). These choices also restrict the search to subspaces of the full parameter space. We have implemented these options in our software program MicroDrop, which provides flexibility for users to analyze their data with a variety of different assumptions.