

Theor Popul Biol. Author manuscript; available in PMC 2010 June 1.

Published in final edited form as:

Published online 2009 April 2. doi: 10.1016/j.tpb.2009.03.005

PMCID: PMC2722911

NIHMSID: NIHMS107010

The publisher's final edited version of this article is available at Theor Popul Biol


The coancestry coefficient, also known as the population structure parameter, is of great interest in population genetics. It can be thought of as the intraclass correlation of pairs of alleles within populations and it can serve as a measure of genetic distance between populations. For a general class of evolutionary models it determines the distribution of allele frequencies among populations. Under more restrictive models it can be regarded as the probability of identity by descent of any pair of alleles at a locus within a random mating population. In this paper we review estimation procedures that use the method of moments or are maximum likelihood under the assumption of normally distributed allele frequencies. We then consider the problem of testing hypotheses about this parameter. In addition to parametric and non-parametric bootstrap tests we present an asymptotically-distributed chi-square test. This test reduces to the contingency-table test for equal sample sizes across populations. Our new test appears to be more powerful than previous tests, especially for loci with multiple alleles. We apply our methods to HapMap SNP data to confirm that the coancestry coefficient for humans is strictly positive.

When populations occur as geographic isolates there is general population-genetic interest in quantifying the resulting degree of genetic differentiation. This differentiation may point to the actions of natural selection, for example (Akey et al., 2002), or it may reflect patterns of migration. In this second case, however, McCauley and Whitlock (1999) point to the need for caution. Other situations arise where the nature of population subdivision is not known but must be accommodated: for example, the probability that two people share the same forensic genetic profile depends on allele frequencies in the subpopulation to which they both belong, but estimated frequencies may be available only for a larger population. The remedy is to apply an adjustment for population structure (Weir, 2007). These and other applications of measures of population differentiation have been reviewed recently by Holsinger and Weir (2009). In this article we concentrate on estimating a relevant parameter and testing hypotheses about its value.

Our discussion is framed in terms of pairs of alleles for the same gene. Although genetic data for diploids are collected for genotypes, we will assume that populations are mating at random so that a sample of *n*/2 genotypes can be regarded as a sample of *n* copies of a gene, and we treat a copy as a sampling unit. We will use “allele” for the sampling unit, as in “a sample of *n* alleles,” and also as an alternative form of a gene, as in “a gene with *m* alleles.” The meaning will be clear in any context. For any two alleles in a population we work with a measure *θ*. This quantity is most generally defined as the correlation of alleles (Wright, 1931) although, as Holsinger and Weir (2009) emphasize, Wright was concerned with defining and using the parameter rather than with estimation or testing hypotheses. The essential nature of *θ* in capturing variation among populations is best revealed with the indicator variables of Cockerham (1969). If *x _{j}* is an indicator variable for the

We wish to draw inferences about *θ* from a set of sample allele frequencies. Estimation of *θ* and its analogs has been discussed widely in the literature (Robertson and Hill, 1984; Weir and Cockerham, 1984; Balding, 2003). The frequentist approaches of the method of moments and maximum likelihood as well as Bayesian methods have been used. The frequentist methods are computationally less intensive whereas the Bayesian approaches have the benefit of systematic incorporation of prior information. Robertson and Hill (1984) and Weir and Cockerham (1984) discussed moment estimators. Their methods do not account for linkage disequilibrium between loci in combining the information over loci, and the best way of combining information over loci depends on the magnitude of *θ* (Weir and Cockerham, 1984). Weir and Hill (2002) proposed a maximum likelihood estimator of *θ* assuming a multivariate normal distribution across populations for allele frequencies. Other authors (e.g. Holsinger, 1999; Balding and Nichols, 1997; Balding, 2003) have worked with a Dirichlet distribution. All authors invoke multinomial sampling within populations. In this paper we will discuss both moment and maximum likelihood estimators of *θ* and compare their performance using simulation studies.

For testing hypotheses, appeal has been made to bootstrapping (Dodds, 1986) and permutation methods (Roff and Bentzen, 1989; Raymond and Rousset, 1995). Weir and Hill (2002) found a chi-square test statistic by assuming normally-distributed allele frequencies. Li (1996) and Samanta (2006) gave alternative derivations of chi-square statistics. In this paper we propose two testing methods. The first approach is based on a parametric bootstrap method and works better for small sample sizes, whereas the second approach is based on the large sample properties of allele frequencies. Both approaches are defined for any number of allelic forms at a particular locus. We show that these new test procedures are better than existing testing methods in terms of power.

For a set of populations indexed by *i* we extend the definition of indicator variables to *x _{ij}* for the

A moment estimate of *θ* was given by Weir and Cockerham (1984) and, assuming normally-distributed allele frequencies, a maximum likelihood estimate was given by Weir and Hill (2002). Bayesian approaches were adopted by Balding and Nichols (1997), Holsinger (1999), and Nicholson et al. (2002). Another approach that regards *θ* as an overdispersion parameter, and builds on the work of Mosimann (1962) and Neerchal and Morel (2005), has been discussed by T. Tvedebrink (personal communication).

The moment estimate of Weir and Cockerham (1984) follows directly from Table 1. Using allele *A*_{1} frequencies only, there are two mean squares

$$\begin{array}{l}{\text{MSA}}_{1}=\frac{{\sum}_{i=1}^{r}{n}_{i}{({\stackrel{\sim}{p}}_{i1}-{\overline{p}}_{1})}^{2}}{r-1}\\ {\text{MSW}}_{1}=\frac{{\sum}_{i=1}^{r}{\sum}_{j=1}^{{n}_{i}}{\stackrel{\sim}{p}}_{i1}(1-{\stackrel{\sim}{p}}_{i1})}{{\sum}_{i=1}^{r}({n}_{i}-1)}\end{array}$$

and the estimate ${\widehat{\theta}}_{1}$ can be written as:

$${\widehat{\theta}}_{1}=\frac{{\text{MSA}}_{1}-{\text{MSW}}_{1}}{{\text{MSA}}_{1}+({n}_{c}-1){\text{MSW}}_{1}}$$

where
${n}_{c}=({S}_{1}-{S}_{2}/{S}_{1})/(r-1),\ {S}_{1}={\sum}_{i=1}^{r}{n}_{i},\ {S}_{2}={\sum}_{i=1}^{r}{n}_{i}^{2}$. Under the assumption that *θ* is the same for all alleles at a locus, implying that the alleles are selectively neutral and that mutation rates are the same for all alleles, Weir and Cockerham (1984) combined estimates over alleles by summing numerators and denominators separately. If alleles are indexed by *u* and there are *m* different alleles:

$$\widehat{\theta}=\frac{{\sum}_{u=1}^{m}({\text{MSA}}_{u}-{\text{MSW}}_{u})}{{\sum}_{u=1}^{m}[{\text{MSA}}_{u}+({n}_{c}-1){\text{MSW}}_{u}]}$$

If data are available from a series of loci *l* (*l* = 1, 2,…, *L*), there are mean squares for each allele at each locus and Weir and Cockerham (1984) combined estimates over loci to get a final moment estimate ${\widehat{\theta}}_{M}$:

$${\widehat{\theta}}_{M}=\frac{{\sum}_{l=1}^{L}{\sum}_{u=1}^{m}({\text{MSA}}_{lu}-{\text{MSW}}_{lu})}{{\sum}_{l=1}^{L}{\sum}_{u=1}^{m}[{\text{MSA}}_{lu}+({n}_{cl}-1){\text{MSW}}_{lu}]}$$

The “average” sample sizes *n _{cl}* are likely to be different at each locus.
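As an illustration only, the moment estimate above can be sketched in a few lines of Python. The function names and the input layout (one table of allele counts per locus) are hypothetical, and because Table 1 is not reproduced here the sketch assumes that the mean frequency of each allele is the sample-size-weighted mean over populations.

```python
def allele_mean_squares(counts):
    """Mean squares for one locus.

    counts[i][u] is the number of copies of allele u in the sample from
    population i. Returns ([(MSA_u, MSW_u), ...], n_c).
    """
    r = len(counts)                      # number of populations
    m = len(counts[0])                   # number of alleles at this locus
    n = [sum(row) for row in counts]     # sample sizes n_i, in alleles
    s1 = sum(n)
    s2 = sum(n_i * n_i for n_i in n)
    n_c = (s1 - s2 / s1) / (r - 1)       # "average" sample size
    squares = []
    for u in range(m):
        p_tilde = [counts[i][u] / n[i] for i in range(r)]
        # Assumption: p-bar is the sample-size-weighted mean frequency.
        p_bar = sum(counts[i][u] for i in range(r)) / s1
        msa = sum(n[i] * (p_tilde[i] - p_bar) ** 2 for i in range(r)) / (r - 1)
        msw = sum(n[i] * p_tilde[i] * (1 - p_tilde[i]) for i in range(r)) / (s1 - r)
        squares.append((msa, msw))
    return squares, n_c


def moment_theta(loci):
    """Sum numerators and denominators over alleles and over loci."""
    num = 0.0
    den = 0.0
    for counts in loci:
        squares, n_c = allele_mean_squares(counts)
        for msa, msw in squares:
            num += msa - msw
            den += msa + (n_c - 1) * msw
    return num / den
```

For two populations fixed for different alleles, `moment_theta([[[100, 0], [0, 100]]])` gives 1, while identical sample frequencies in every population give a slightly negative value, −1/(*n _{c}* − 1).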

It is difficult to derive expressions for the mean and variance of the moment estimate. Dodds (1986) and Weir (1996) suggested numerical resampling for obtaining the sampling distribution of ${\widehat{\theta}}_{M}$. Resampling over populations would change the structure of the data but resampling over loci exploits the assumption that (unlinked) loci provide independent replicates of the evolutionary process. Resampling was also used by Raymond and Rousset (1995). Jiang (1987) used a Taylor series expansion and approximate higher-order moments of sample allele frequencies to obtain the mean and variance of

$${\text{MSA}}_{1}\sim {p}_{1}(1-{p}_{1})[1+({n}_{c}-1)\theta ]{\chi}_{(r-1)}^{2},$$

and that the mean square MSW_{1} tends to a constant value of *p*_{1}(1 − *p*_{1})(1 − *θ*). These results allowed her to derive expressions for the mean and variance of ${\widehat{\theta}}_{M}$:

$$\mathcal{E}({\widehat{\theta}}_{M})=\theta -\frac{2(1-\theta )}{r-1}{\left(\frac{1+({n}_{c}-1)\theta}{{n}_{c}}\right)}^{2}$$

(1)

$$\text{Var}({\widehat{\theta}}_{M})=\frac{2{(1-\theta )}^{2}}{r-1}{\left(\frac{1+({n}_{c}-1)\theta}{{n}_{c}}\right)}^{2}.$$

(2)

The variance formula differs slightly from the variance of the intraclass correlation given by Fisher (1921), but is equal to that result for large sample sizes.
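Equations 1 and 2 are simple to evaluate numerically. The sketch below is a direct transcription; the function name is hypothetical and the arguments follow the symbols in the text (*θ*, *r*, *n _{c}*).

```python
def jiang_mean_var(theta, r, n_c):
    """Approximate mean (Equation 1) and variance (Equation 2) of the
    moment estimate for r populations and average sample size n_c."""
    scale = ((1 + (n_c - 1) * theta) / n_c) ** 2
    mean = theta - 2 * (1 - theta) / (r - 1) * scale
    var = 2 * (1 - theta) ** 2 / (r - 1) * scale
    return mean, var


# Example: five populations, 400 sampled alleles each, theta = 0.051.
mean, var = jiang_mean_var(0.051, 5, 400)
bias = mean - 0.051   # negative, so the estimate is biased downwards
```

In both expressions the factor [1 + (*n _{c}* − 1)*θ*]/*n _{c}* tends to *θ* as the sample size grows, so for large samples the approximate variance depends on the data only through *θ* and the number of populations.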

The maximum likelihood estimate of Weir and Hill (2002) follows from assuming the (*m* − 1) × 1 vector ${\stackrel{\sim}{\mathbf{p}}}_{i}$ = [

$$\stackrel{\sim}{\mathbf{P}}\sim \text{MVN}(\mathbf{P},\mathbf{V})$$

where

$$\stackrel{\sim}{\mathbf{P}}=\left[\begin{array}{l}{\stackrel{\sim}{\mathbf{p}}}_{1}\hfill \\ {\stackrel{\sim}{\mathbf{p}}}_{2}\hfill \\ \cdots \hfill \\ {\stackrel{\sim}{\mathbf{p}}}_{r}\hfill \end{array}\right],\phantom{\rule{0.38889em}{0ex}}\mathbf{P}=\left[\begin{array}{c}\mathbf{p}\\ \mathbf{p}\\ \cdots \\ \mathbf{p}\end{array}\right],\phantom{\rule{0.38889em}{0ex}}\mathbf{V}=\left[\begin{array}{llll}{\mathbf{V}}_{11}\hfill & {\mathbf{V}}_{12}\hfill & \cdots \hfill & {\mathbf{V}}_{1r}\hfill \\ {\mathbf{V}}_{21}\hfill & {\mathbf{V}}_{22}\hfill & \cdots \hfill & {\mathbf{V}}_{2r}\hfill \\ \cdots \hfill & \cdots \hfill & \cdots \hfill & \cdots \hfill \\ {\mathbf{V}}_{r1}\hfill & {\mathbf{V}}_{r2}\hfill & \cdots \hfill & {\mathbf{V}}_{rr}\hfill \end{array}\right]$$

The vectors * _{i}* and

$${\mathbf{V}}_{ii}={\phi}_{i}\left[\begin{array}{cccc}{p}_{1}(1-{p}_{1})& -{p}_{1}{p}_{2}& \cdots & -{p}_{1}{p}_{m-1}\\ -{p}_{1}{p}_{2}& {p}_{2}(1-{p}_{2})& \cdots & -{p}_{2}{p}_{m-1}\\ \cdots & \cdots & \cdots & \cdots \\ -{p}_{1}{p}_{m-1}& -{p}_{2}{p}_{m-1}& \cdots & {p}_{m-1}(1-{p}_{m-1})\end{array}\right]$$

where * _{i}* = [

From Corollary 1.7 and Theorem 3.5 of Serfling (1980), the quadratic form

$$\sum _{i=1}^{r}({\stackrel{\sim}{\mathbf{p}}}_{i}-{\mathbf{p}}_{i}{)}^{\prime}{V}_{ii}^{-1}({\stackrel{\sim}{\mathbf{p}}}_{i}-{\mathbf{p}}_{i})=\sum _{i=1}^{r}\sum _{u=1}^{m}\frac{1}{{p}_{u}}{\left(\frac{{\stackrel{\sim}{p}}_{iu}}{\sqrt{{\phi}_{i}}}-\frac{1}{r}\sum _{j=1}^{r}\frac{{\stackrel{\sim}{p}}_{ju}}{\sqrt{{\phi}_{j}}}\right)}^{2}$$

has a chi-square distribution, opening the way for both estimation and hypothesis testing. Substituting the sample allele frequencies in the denominators of this expression gives the statistic *T*

$$T=\sum _{i=1}^{r}\sum _{u=1}^{m}\frac{1}{{\overline{p}}_{u}}{\left(\frac{{\stackrel{\sim}{p}}_{iu}}{\sqrt{{\phi}_{i}}}-\frac{1}{r}\sum _{j=1}^{r}\frac{{\stackrel{\sim}{p}}_{ju}}{\sqrt{{\phi}_{j}}}\right)}^{2}$$

(3)

For estimation, closed-form expressions are possible only if all the ${\phi}_{i}$’s are the same. This will happen if all the sample sizes are equal,

$${\widehat{\theta}}_{N}=\frac{n}{(n-1)(r-1)(m-1)}\sum _{i=1}^{r}\sum _{u=1}^{m}\frac{{({\stackrel{\sim}{p}}_{iu}-{\overline{p}}_{u})}^{2}}{{\overline{p}}_{u}}-\frac{1}{n-1}$$

For large sample sizes this becomes (Hill and Robertson, 1984)

$${\widehat{\theta}}_{N}=\frac{1}{(r-1)(m-1)}\sum _{i=1}^{r}\sum _{u=1}^{m}\frac{{({\stackrel{\sim}{p}}_{iu}-{\overline{p}}_{u})}^{2}}{{\overline{p}}_{u}}$$

(4)

To the extent that *T* has a chi-square distribution, ${\widehat{\theta}}_{N}$ is an unbiased estimator of

$$\text{Var}({\widehat{\theta}}_{N})=\frac{2{[1+(n-1)\theta ]}^{2}}{{(n-1)}^{2}(m-1)(r-1)}\approx \frac{2{\theta}^{2}}{(r-1)(m-1)}$$

If data are available from *L* independent loci then the estimates are simply averaged over loci. If the *l ^{th}* locus has

$$\begin{array}{l}\text{Var}({\widehat{\theta}}_{N})=\frac{2{[1+(n-1)\theta ]}^{2}}{{(n-1)}^{2}(r-1){\sum}_{l=1}^{L}({m}_{l}-1)}\\ \approx \frac{2{\theta}^{2}}{(r-1){\sum}_{l=1}^{L}({m}_{l}-1)}\end{array}$$

(5)
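A minimal sketch of the normal-theory estimate and its variance follows, assuming equal sample sizes so that the mean frequency of each allele is the unweighted mean over populations. The function names and the input layout (a table of sample frequencies) are hypothetical.

```python
def theta_normal(freqs, n=None):
    """Normal-theory estimate for one locus with equal sample sizes.

    freqs[i][u] is the sample frequency of allele u in population i.
    With n given (alleles sampled per population) the finite-sample form
    is used; otherwise the large-sample form of Equation 4.
    Every allele must be present in at least one sample.
    """
    r = len(freqs)
    m = len(freqs[0])
    p_bar = [sum(row[u] for row in freqs) / r for u in range(m)]
    q = sum((row[u] - p_bar[u]) ** 2 / p_bar[u]
            for row in freqs for u in range(m))
    if n is None:
        return q / ((r - 1) * (m - 1))
    return n * q / ((n - 1) * (r - 1) * (m - 1)) - 1 / (n - 1)


def var_theta_normal(theta, r, n, alleles_per_locus):
    """Equation 5: variance when combining independent loci.

    alleles_per_locus lists the number of alleles m_l at each locus.
    """
    df = (r - 1) * sum(m_l - 1 for m_l in alleles_per_locus)
    return 2 * (1 + (n - 1) * theta) ** 2 / ((n - 1) ** 2 * df)
```

With five populations, 400 alleles per sample, ten two-allele loci and *θ* = 0.051, the exact form of Equation 5 is about 10% larger than the approximation 2*θ*^{2}/40, which shows how much the finite-sample term still matters at that sample size.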

We will restrict attention to the hypotheses that *θ* is either zero or greater than zero, *H*_{0}: *θ* = 0, *H*_{1}: *θ* > 0, and we make the distinction (Weir, 1996) between fixed and random population models. The fixed model takes the data as being from a set of populations for which no inferences about evolutionary history are to be drawn. Instead, inferences are drawn on just the sampled set of populations. As no genetic or evolutionary model is necessary a purely statistical approach is appropriate and a contingency table test for independence of allele frequencies and populations is a very direct procedure (Raymond and Rousset, 1995; Roff and Bentzen, 1989). From Table 2, the chi-square test statistic for independence is

$${X}^{2}=\sum _{i=1}^{r}\sum _{u=1}^{m}\frac{{n}_{i}{({\stackrel{\sim}{p}}_{iu}-{\overline{p}}_{u})}^{2}}{{\overline{p}}_{u}}$$

(6)

and this is assumed to have a chi-square distribution with (*r*−1)(*m*−1) df under the null hypothesis. Note that if the sample sizes are equal, *n _{i}* = =

The maximum likelihood approach in the estimation section invoked a random model in the sense that a distribution was assumed for allele frequencies over populations. There is an implicit evolutionary model that describes the relationships among populations resulting from a shared history. Under the hypothesis that *θ* is zero, ${\phi}_{i}$ = 1/

$$T=\sum _{i=1}^{r}\sum _{u=1}^{m}\frac{1}{{\overline{p}}_{u}}{\left(\sqrt{{n}_{i}}{\stackrel{\sim}{p}}_{iu}-\frac{1}{r}\sum _{j=1}^{r}\sqrt{{n}_{j}}{\stackrel{\sim}{p}}_{ju}\right)}^{2}$$

(7)

and (Appendix of Samanta, 2006) this is distributed as
${\chi}_{(r-1)(m-1)}^{2}$. Values of *T* can be added over independent loci and the df are also summed. For equal sample sizes, *T* in Equation 7 is the same as *X*^{2} in Equation 6. Going back to Equation 3, however, would allow hypotheses for values of *θ* other than *θ* = 0 to be tested.
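The two statistics in Equations 6 and 7 can be compared directly with a short sketch. The function names are hypothetical, and the pooled frequency of each allele is assumed to be the overall proportion of that allele in the combined sample.

```python
import math


def sample_summaries(counts):
    """counts[i][u] = copies of allele u sampled from population i."""
    n = [sum(row) for row in counts]
    total = sum(n)
    m = len(counts[0])
    p_tilde = [[c / n_i for c in row] for row, n_i in zip(counts, n)]
    # Assumption: pooled frequency = overall proportion of allele u.
    p_bar = [sum(row[u] for row in counts) / total for u in range(m)]
    return n, p_tilde, p_bar


def contingency_x2(counts):
    """Equation 6: contingency-table chi-square statistic."""
    n, p_tilde, p_bar = sample_summaries(counts)
    return sum(n[i] * (p_tilde[i][u] - p_bar[u]) ** 2 / p_bar[u]
               for i in range(len(counts)) for u in range(len(p_bar)))


def statistic_t(counts):
    """Equation 7: the statistic T under the hypothesis theta = 0."""
    n, p_tilde, p_bar = sample_summaries(counts)
    r = len(counts)
    total = 0.0
    for u in range(len(p_bar)):
        scaled = [math.sqrt(n[i]) * p_tilde[i][u] for i in range(r)]
        centre = sum(scaled) / r
        total += sum((s - centre) ** 2 for s in scaled) / p_bar[u]
    return total


def degrees_of_freedom(counts):
    return (len(counts) - 1) * (len(counts[0]) - 1)


# Three populations, 100 alleles each: the two statistics coincide.
counts = [[30, 70], [50, 50], [40, 60]]
x2 = contingency_x2(counts)
t = statistic_t(counts)
```

For this example both statistics equal 25/3 on 2 df. A *p*-value requires the upper tail of the chi-square distribution, available for instance as `scipy.stats.chi2.sf` if SciPy is installed.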

Li (1996) based a test on the analysis of variance framework shown in Table 1. For allele *A*_{1} she found that (*r* − 1) MSA_{1}/MSW_{1} was asymptotically distributed as
${\chi}_{(r-1)}^{2}$. Dodds (1986) used the moment estimate ${\widehat{\theta}}_{M}$ and bootstrapped over loci to generate an empirical distribution for the estimator. This non-parametric bootstrap leads to a one-sided 100(1 −
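Bootstrapping the moment estimate over loci might be sketched as follows. This is an assumed, generic implementation rather than the procedure of Dodds (1986): the function name is hypothetical, the inputs are the per-locus numerators and denominators of the moment estimate, and a simple percentile rule is used for the lower bound.

```python
import random


def bootstrap_lower_bound(nums, dens, alpha=0.05, n_boot=1000, seed=0):
    """One-sided percentile lower bound for the multi-locus moment estimate.

    nums[l] and dens[l] are the numerator and denominator contributions
    of locus l, i.e. the sums over alleles of (MSA - MSW) and of
    [MSA + (n_c - 1) MSW].
    """
    rng = random.Random(seed)
    n_loci = len(nums)
    estimates = []
    for _ in range(n_boot):
        # Resample loci with replacement, keeping each locus intact.
        picks = [rng.randrange(n_loci) for _ in range(n_loci)]
        estimates.append(sum(nums[k] for k in picks) /
                         sum(dens[k] for k in picks))
    estimates.sort()
    return estimates[int(alpha * n_boot)]
```

Under this sketch, the hypothesis *θ* = 0 would be rejected at level `alpha` when the returned bound is greater than zero.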

Another test procedure is based on parametric bootstrap resampling, developed specifically for small sample sizes. Under the null hypothesis of zero *θ* the observed allele frequencies ${\overline{p}}_{u}$ over the whole data set are maximum likelihood estimates of the parameters
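One way a parametric bootstrap test of this kind could be carried out is sketched below. It is an assumption-laden illustration, not the authors' procedure: the pooled allele frequencies are used as the multinomial parameters under the null hypothesis, the contingency statistic of Equation 6 is used as the test statistic, and the function names are hypothetical.

```python
import random


def contingency_x2(counts):
    """Equation 6; alleles with zero pooled frequency are skipped."""
    n = [sum(row) for row in counts]
    total = sum(n)
    x2 = 0.0
    for u in range(len(counts[0])):
        p_bar = sum(row[u] for row in counts) / total
        if p_bar == 0.0:
            continue
        x2 += sum(n_i * (row[u] / n_i - p_bar) ** 2
                  for row, n_i in zip(counts, n)) / p_bar
    return x2


def parametric_bootstrap_pvalue(counts, n_boot=199, seed=1):
    """Empirical p-value for H0: theta = 0.

    counts[i][u] = copies of allele u sampled from population i.
    """
    rng = random.Random(seed)
    n = [sum(row) for row in counts]
    total = sum(n)
    m = len(counts[0])
    # Pooled frequencies serve as the null multinomial parameters.
    p_bar = [sum(row[u] for row in counts) / total for u in range(m)]
    observed = contingency_x2(counts)
    exceed = 0
    for _ in range(n_boot):
        simulated = []
        for n_i in n:
            draws = rng.choices(range(m), weights=p_bar, k=n_i)
            simulated.append([draws.count(u) for u in range(m)])
        if contingency_x2(simulated) >= observed:
            exceed += 1
    # Counting the observed data set as one replicate keeps p above zero.
    return (exceed + 1) / (n_boot + 1)
```

Because the null distribution is generated by simulation at the actual sample sizes, a test of this form does not rely on the large-sample chi-square approximation.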

The moment estimator and the maximum likelihood estimator based on normal distribution of the allele frequencies were applied to the case of five populations that have evolved independently from a single population. We simulated data using a pure drift model. The simulation was for a single locus and 10 loci with *m* = 2, 3 or 4 alleles, all equally frequent initially. We assumed every population in the evolutionary processes had 500 individuals and we sampled 400 alleles for each population to estimate the coancestry coefficient *θ*. We consider three different current ages of all the populations, 11, 52 and 106 to give predicted *θ* values of 0.011, 0.051 and 0.101.
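The predicted values quoted above agree with the standard pure-drift expectation *θ* = 1 − (1 − 1/(2*N*))^*t* for *N* = 500 diploid individuals after *t* generations; that this is the formula the authors used is an inference from the numbers. The sketch below evaluates it and simulates one population under an assumed Wright-Fisher model with multinomial sampling. The function names are hypothetical.

```python
import random


def predicted_theta(n_individuals, generations):
    """Pure-drift expectation: 1 - (1 - 1/(2N))**t."""
    return 1.0 - (1.0 - 1.0 / (2 * n_individuals)) ** generations


def drift_sample(m, n_individuals, generations, n_sampled, rng):
    """Simulate one population and return sampled allele counts.

    The population starts with m equally frequent alleles. Each
    generation, 2N allele copies are drawn with replacement from the
    previous generation (Wright-Fisher drift, no mutation or migration).
    """
    size = 2 * n_individuals
    weights = [1.0] * m
    for _ in range(generations):
        draws = rng.choices(range(m), weights=weights, k=size)
        weights = [draws.count(u) for u in range(m)]
    sample = rng.choices(range(m), weights=weights, k=n_sampled)
    return [sample.count(u) for u in range(m)]


# Five independent populations of 500 individuals, 52 generations,
# 400 alleles sampled from each, three alleles at the locus.
rng = random.Random(2009)
counts = [drift_sample(3, 500, 52, 400, rng) for _ in range(5)]
```

The resulting `counts` table has one row per population and one column per allele, which is the layout a single-locus estimator of *θ* would take as input.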

The biases and sampling errors of the estimators were calculated using 1,000 replicates. Table 3 shows that both the moment estimator, ${\widehat{\theta}}_{M}$, and the maximum likelihood estimator based on normal distribution,

Figure 1 shows that the asymptotic distribution of the scaled *T* is indeed chi-square with appropriate degrees of freedom as assumed earlier. The figure shows that under different values of *θ* and either one or ten loci, histograms of 1,000 simulated values of *T*/[1+(*n* − 1)*θ*] are very similar to the density of central chi-square distributions. The non-parametric Kolmogorov-Smirnov test produces non-significant *p*-values for testing the hypotheses that the empirical distribution of *T* is a scaled central chi-square distribution for different values of *θ*.

The testing procedures described in this paper were applied to the same simulated data as for the study of estimation, and we present the results in Tables 4 and 5. For the two bootstrap tests we performed the tests at a 5% significance level and show in Table 4 that the empirical significance level of the parametric bootstrap is always close to 5%. In some situations for small numbers of loci (*L* = 5), the empirical significance level of the non-parametric bootstrap test exceeded the nominal level. The power of both bootstrap tests increases with *θ*, with the sample size, with the number of alleles per locus, and with the number of loci. Table 4 also shows that the parametric bootstrap method generally has the higher power and we recommend its use over the non-parametric bootstrap.

Table 4. Empirical powers of parametric and non-parametric bootstrap tests. *n* and *L* represent the number of sampled alleles in each population and the number of sampled loci, respectively.

Table 5. Empirical powers of the newly proposed chi-square test statistic compared with Li’s test procedure. *n* represents the number of sampled alleles in each population.

For large sample sizes we compared our new test statistic (Equation 7) with the test statistic proposed by Li (1996). The power of these two chi-square tests is shown in Table 5. For a 5% significance level, both tests have approximately 5% power when the null hypothesis is true, showing that the tests have a correct size. The power of the tests increases with the true value of *θ* and with the sample size. For loci with two alleles, the tests have similar power. For multiple alleles, the power of our new test increases with the number of alleles and we recommend its general use.

The International HapMap project (2005) generated two-allele SNP data from 270 people: Yoruba people in Ibadan, Nigeria (30 adult-and-both-parents trios), Japanese in Tokyo (45 unrelated individuals), Han Chinese in Beijing (45 unrelated individuals) and U.S. residents of northern and western European ancestry (30 trios). We applied our procedures to test whether there is positive coancestry in the human genome. We also used both moment and maximum likelihood estimators to estimate *θ* in these different human populations and to quantify the heterogeneity among genome regions.

We estimated the coancestry coefficient using only those SNPs that were found to be segregating in all population samples. Because of the sampling scheme and missing data at different loci, the numbers of alleles in the four samples differ, but the sample sizes are large enough for us to assume the same variance of allele frequencies among different populations. For maximum likelihood we used the estimator in Equation 4. Our estimates were calculated for all markers separately and also for all markers in all the 5 Mb windows centered on each SNP in the autosomal genome.

There is substantial variation among estimates over the genome, even among SNPs that are very close to each other. The single-locus estimates are distributed very much like the χ^{2} distribution with two or three degrees of freedom. The extreme noisiness in single-locus estimates is demonstrated in Table 6, where the standard errors of the values for each chromosome are seen to be about the same size as the means. The noisiness of single-locus estimates can be reduced by combining data from several adjacent markers. The distribution of these (approximately) 1,000-locus values is close to a normal distribution (Weir et al., 2006) as expected from the chi-square distribution tending to normality as the df increase. Table 6 shows that the chromosomal standard errors of the estimates have dropped substantially for 1,000 loci. Even for the relatively large window size of 5 Mb there is substantial variation in estimates along each chromosome. Table 6 shows positive values for the coancestry coefficient, in the range of 0.1 to 0.15, and the hypothesis *θ* = 0 is rejected.

The coancestry coefficient *θ* is of central importance in population genetics and there is widespread interest in estimating this quantity when genetic data are available from different populations. In this article we have considered a moment estimator and a maximum likelihood estimator assuming normally distributed allele frequencies. The overall similarity of the two estimates and the advantage of having a chi-square distribution suggests general use of the maximum likelihood estimate. The moment estimator would be preferred in situations of small samples. In any event, use of simple estimators such as
${s}_{p}^{2}/[\overline{p}(1-\overline{p})]$, where $\overline{p}$ and
${s}_{p}^{2}$ are the sample mean and variance of allele frequencies over populations, is not recommended.

The sampling properties of both estimators were addressed with analytic expressions for the mean and variance and with simulation studies. Both approaches showed that the biases of both the estimators of *θ* are relatively small in magnitude, and negative in direction. The biases and variances of the estimators of *θ* increase as the differentiation levels increase in a total population, i.e. they increase with *θ*. Evolutionary variation, as measured by *θ*, cannot be reduced by sampling design. Increasing the number of loci sampled has a stronger effect on reducing the sampling variance of both the estimators than increasing the number of individuals sampled.

There are many factors that control the power of statistical tests. The power of all the test procedures discussed here increases with the true value of *θ* and with the sample size. When the true value of *θ* is small, increasing the number of loci sampled has a stronger effect than increasing the number of individuals sampled. The power of the tests increases with the number of alleles per locus. Simulation studies show that our parametric bootstrap testing procedure has higher power than the non-parametric bootstrap. Li (1996) used the central limit theorem for approximating the distribution of allele frequencies as a normal distribution and proposed a chi-square test. Here we have proposed an extension of her method that allows for an arbitrary number of alleles per locus. Our extension, captured in the statistic *T*, allows tests to be made about hypothesized values of *θ*, including *θ* = 0.

We have not addressed the effects of linkage or linkage disequilibrium on the two estimates. When loci are regarded as being independent it is easy to show that increasing the number of loci decreases the expected variance of the maximum likelihood estimator, and this effect was seen numerically for both estimates. The independence assumption will often be adequate. Weir et al. (2005) did allow for linkage among loci. By assuming Haldane’s mapping function they were able to predict variances of the “actual” values of *θ* for sets of linked loci. In this article we have written as though there is a single value of *θ* that applies to all genes in the genome. The stochastic nature of evolutionary forces such as mutation and the differences in actual genealogies among loci, however, mean that there will be variation in actual values around the single theoretical value.

We should stress that our treatment of inference makes no assumption about the evolutionary forces that have been operating prior to the time populations were sampled to provide data. We have assumed that the mean frequency of an allele over populations is some parameter *p* and that the variance of these frequencies among populations is *θ p*(1 − *p*). For maximum likelihood estimation we assumed a normal distribution of allele frequencies. In other words, maximum likelihood estimation assumes a distributional form for allele frequencies whereas the moment estimators assume the form of only the first two moments of these distributions. Other than that, our procedures hold regardless of the nature of mutation, migration, selection, population size or mating pattern. The interpretation of a particular numerical value of an estimate, on the other hand, is very much dependent on which evolutionary forces are assumed to have been acting. One of the most common uses of the estimates is to regard them as genetic distances between populations (Reynolds et al., 1983) and reconstruct the phylogeny of the populations – this application requires a drift-only model without mutation and so applies only within species.

This work was supported in part by NIH grant GM 075091. The third author recalls with gratitude the contributions of Sam Karlin to population genetics and his leadership for this journal.

**Publisher's Disclaimer: **This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

- Balding DJ. Likelihood-based inference for genetic correlation coefficients. Theor Pop Biol. 2002;63:221–230.
- Balding DJ, Nichols RA. Significant genetic correlations among Caucasians at forensic DNA loci. Heredity. 1997;78:583–589.
- Cockerham CC. Variance of gene frequencies. Evolution. 1969;23:72–84.
- Dodds KG. PhD Thesis. North Carolina State University; Raleigh, NC: 1986. Resampling Methods in Genetics and the Effect of Family Structure in Genetic Data.
- Holsinger KE. Analysis of genetic diversity in geographically structured populations: a Bayesian perspective. Hereditas. 1999;130:245–255.
- Holsinger KE, Weir BS. Genetics in geographically structured populations: defining, estimating, and interpreting *F*_{ST}. Nature Reviews Genetics. 2009 (in press)
- International HapMap Consortium. A haplotype map of the human genome. Nature. 2005;426:789–796.
- Jiang C. PhD Thesis. North Carolina State University; Raleigh, NC: 1987. Estimation of F-statistics in subdivided populations.
- Lange K. Applications of the Dirichlet distribution to forensic match probabilities. Genetica. 1995;96:107–117.
- Li Y-J. PhD Thesis. North Carolina State University; Raleigh, NC: 1996. Characterizing the Structure of Genetic Populations.
- Mosimann JE. On the compound multinomial distribution, the multivariate *β*-distribution and correlations among proportions. Biometrika. 1962;49:65–82.
- Nei M. Molecular Evolutionary Genetics. Columbia University Press; New York: 1987.
- Neerchal NK, Morel JG. An improved method for the computation of maximum likelihood estimates for multinomial overdispersion models. Comp Stat Data Anal. 2005;49:33–43.
- Nicholson G, Smith AV, Jónsson F, Gustafsson Ó, Stefánsson K, Donnelly P. Assessing population differentiation and isolation from single nucleotide polymorphism data. Proc Roy Stat Soc B. 2002;64:695–715.
- Raymond M, Rousset F. An exact test for population differentiation. Evolution. 1995;49:1280–1283.
- Reynolds J, Weir BS, Cockerham CC. Estimation of the coancestry coefficient: basis for a short-term genetic distance. Genetics. 1983;105:767–779.
- Robertson A, Hill WG. Deviations from Hardy-Weinberg proportions: sampling variances and use in estimation of inbreeding coefficients. Genetics. 1984;107:703–718.
- Roff DA, Bentzen P. The statistical analysis of mitochondrial polymorphisms: χ^{2} and the problem of small sample sizes. Molecular Biology and Evolution. 1989;6:539–545.
- Samanta S. PhD Thesis. North Carolina State University; Raleigh, NC: 2006. A Statistical Characterization of the Genetic Structure of Populations.
- Serfling RJ. Approximation Theorems of Mathematical Statistics. Wiley; New York: 1980.
- Weir BS. Genetic Data Analysis II. Sinauer; Sunderland, MA: 1996.
- Weir BS. The rarity of DNA profiles. Annals of Applied Statistics. 2007;1:358–370.
- Weir BS, Cardon LR, Anderson AD, Nielsen DM, Hill WG. Measures of human population structure show heterogeneity among genomic regions. Genome Research. 2005;15:1468–1476.
- Weir BS, Cockerham CC. Estimating *F*-statistics for the analysis of population structure. Evolution. 1984;38:1358–1370.
- Weir BS, Hill WG. Estimating *F*-statistics. Ann Rev Genet. 2002;36:721–750.
- Wright S. The genetical structure of populations. Annals of Eugenics. 1931;15:323–354.
