HHS Public Access author manuscript; accepted for publication in a peer-reviewed journal.
 
Genet Epidemiol. Author manuscript; available in PMC 2011 December 1.
Published in final edited form as:
PMCID: PMC3087204
NIHMSID: NIHMS288418

Distribution of Model-based Multipoint Heterogeneity Lod Scores

Abstract

The distribution of two-point heterogeneity lod scores (HLOD) has been intensively investigated because the conventional $\chi^2$ approximation to the likelihood ratio test is not directly applicable. However, no study has investigated the distribution of the multipoint HLOD despite its wide application. Here we point out that, compared with the two-point HLOD, the multipoint HLOD essentially tests for homogeneity given linkage and follows a relatively simple limiting distribution, $\frac{1}{2}\chi^2_0 + \frac{1}{2}\chi^2_1$, which can be obtained by established statistical theory. We further examine the theoretical result by simulation studies.

Keywords: heterogeneity lod score, distribution, multipoint

Locus heterogeneity represents a form of genetic architecture of complex traits in which alleles at more than one locus lead to the same phenotype. It adversely affects the power of linkage analysis if the heterogeneous genetic background of the families is not taken into account. A natural way to model such heterogeneous data is by a mixture model, as first suggested by Smith [1963]. Under the mixture model framework one can use a likelihood ratio test either to test for homogeneity given linkage [Ott, 1983] or to test for linkage allowing for heterogeneity [Hodge et al., 1983]. The distribution of two-point heterogeneity lod scores (HLOD) has been intensively investigated [Abreu et al., 2002; Chernoff and Lander, 1995; Chiano and Yates, 1995; Faraway, 1993; Huang and Vieland, 2001; Lemdani and Pons, 1995; Liang and Rathouz, 1999] because the conventional $\chi^2$ approximation to the likelihood ratio test is not directly applicable [Davies, 1977, 1987]. However, to our surprise, no study has investigated the distribution of the multipoint HLOD despite its wide application. The multipoint HLOD is reported by popular software packages such as GENEHUNTER [Kruglyak et al., 1996] and MERLIN [Abecasis et al., 2002] without an accompanying P-value. Here we point out that, compared with the two-point HLOD, the multipoint HLOD essentially tests for homogeneity given linkage and follows a relatively simple limiting distribution, which can be obtained by established statistical theory. We further examine the theoretical result by simulation studies.

Denote by M the genotype data, by D the phenotype data, by α the admixture parameter, which indicates the proportion of families linked to the locus tested, and by x the map position of a putative disease locus. The likelihood for the ith family is $L_i(M, D; x, \alpha) = \alpha L_i(M, D; x) + (1 - \alpha) L_i(M, D; x = \infty)$, and the HLOD is defined as $\mathrm{HLOD} = \log_{10}[\sup L(\alpha, x) / L_0(\alpha = 1; x = \infty)]$, where the subscript 0 denotes the null hypothesis of no linkage. In two-point analysis, x is parameterized by a map function as the recombination fraction between the marker and disease loci, commonly denoted $\theta \in [0, 0.5]$. The two-point analysis tests the hypotheses $H_0: \theta = 0.5$ versus $H_1: \theta < 0.5$. The likelihood ratio test statistic is $T_{\text{two-point}} = 2 \ln[L(\hat{\alpha}, \hat{\theta}) / L(\theta = 0.5)]$. Note that when $\theta = 0.5$, α is unidentifiable; thus the conventional $\chi^2$ approximation to the likelihood ratio test is not directly applicable [Davies, 1977, 1987], which causes difficulties in determining the distribution of $T_{\text{two-point}}$. In multipoint analysis, the null hypothesis is $x = \infty$, which is equivalent to $\theta = 0.5$ in the two-point analysis by a map function; however, the alternative hypothesis is that the putative disease locus is at a specific location c, i.e., x is not a free parameter. The likelihood ratio test statistic is $T_{\text{multipoint}} = 2 \ln[L(\hat{\alpha}, x = c) / L(\alpha = 1; x = \infty)]$, which indicates that the multipoint analysis tests the hypotheses $H_0: \alpha = 0$ versus $H_1: \alpha > 0$. Thus, the multipoint HLOD essentially tests for homogeneity given linkage, in contrast to the two-point HLOD, which tests for linkage allowing for heterogeneity. The multipoint HLOD test statistic corresponds to case 5 of Self and Liang [1987], with one parameter on the boundary of the parameter space, and also to example 11 of Lindsay [1995], which tests $H_0: \pi = 0$ versus $H_1: \pi > 0$ in the mixture model $(1 - \pi) f + \pi g$, where f and g are both known.
Thus, following the theoretical arguments in both references, the asymptotic distribution of $T_{\text{multipoint}}$ is $\frac{1}{2}\chi^2_0 + \frac{1}{2}\chi^2_1$, where $\chi^2_0$ denotes a distribution degenerate at zero with probability one. It is of interest to note that, given linkage, the two-point homogeneity test statistic $2 \ln[L(\hat{\alpha}, \hat{\theta}) / L(\alpha = 1, \hat{\theta})]$ follows the same asymptotic distribution as $T_{\text{multipoint}}$ [Ott, 1999].
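The boundary argument can be checked on a toy version of the known-component mixture test described above. The sketch below is not a linkage analysis: it assumes, purely for illustration, that f is a standard normal density and g a unit-variance normal density with mean delta, and the function names and the bisection maximizer are hypothetical choices rather than anything used in the paper. Under the null hypothesis, about half of the simulated likelihood ratio statistics should be exactly zero, and the non-zero ones should have a mean near 1.

```python
import math
import random


def lrt_known_mixture(xs, delta=0.5):
    """Return 2*ln(LR) for H0: pi = 0 in (1 - pi)*f + pi*g, with f and g known.

    Illustrative assumption: f = N(0, 1) and g = N(delta, 1).
    """
    # r_i = g(x_i)/f(x_i) - 1; for these two normal densities g/f = exp(delta*x - delta^2/2).
    r = [math.exp(delta * x - 0.5 * delta * delta) - 1.0 for x in xs]

    def score(p):
        # Derivative of the log-likelihood ratio sum(log(1 + p*r_i)) with respect to p.
        return sum(ri / (1.0 + p * ri) for ri in r)

    if score(0.0) <= 0.0:
        return 0.0  # the maximum over [0, 1] sits on the boundary pi = 0
    if score(1.0) >= 0.0:
        p_hat = 1.0
    else:
        # The log-likelihood is concave in p, so the score has a single root in (0, 1).
        lo, hi = 0.0, 1.0
        for _ in range(40):
            mid = 0.5 * (lo + hi)
            if score(mid) > 0.0:
                lo = mid
            else:
                hi = mid
        p_hat = 0.5 * (lo + hi)
    return 2.0 * sum(math.log1p(p_hat * ri) for ri in r)


def simulate(n_obs, n_rep, seed):
    """Simulate the statistic under H0 (all data drawn from f)."""
    rng = random.Random(seed)
    stats = [
        lrt_known_mixture([rng.gauss(0.0, 1.0) for _ in range(n_obs)])
        for _ in range(n_rep)
    ]
    nonzero = [t for t in stats if t > 0.0]
    return (len(stats) - len(nonzero)) / len(stats), sum(nonzero) / len(nonzero)


prop_zero, mean_nonzero = simulate(n_obs=200, n_rep=1000, seed=1)
print(f"proportion of zeros: {prop_zero:.3f}")
print(f"mean of non-zero statistics: {mean_nonzero:.3f}")
```

The statistic is zero whenever the score at pi = 0 is non-positive, which is the source of the point mass at zero in the limiting distribution.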

Below we examine the theoretical result by simulation studies under varying simulation models, analysis models, and sample sizes. We simulated nuclear families consisting of two parents and four children, and two linked markers 10 cM apart with four alleles of equal frequency at each locus. Two randomly chosen children were set to be affected and the other two unaffected. The parental phenotypes were simulated under three models. In model I, one parent was set to be affected and the other unaffected, which mimicked a dominant trait; in model II, both parents were set to be unaffected, which mimicked a recessive trait; and in model III, both parents were set to be unknown, which mimicked a trait with an unclear mode of inheritance. We generated 5,000 replicate samples under each of four sample size scenarios: 100, 500, 1,000 and 5,000 families per sample dataset. Denote the penetrance by $f_i$, where $i \in \{0, 1, 2\}$ indicates the number of disease-predisposing alleles. Assuming throughout a sporadic rate of 0.01 and a disease-predisposing allele frequency of 0.01, we performed model-based multipoint linkage analysis on each dataset under 18 different genetic models: $(f_0, f_1, f_2)$ = (0.01,0.1,0.1), (0.01,0.2,0.2),…,(0.01,0.9,0.9), (0.01,0.01,0.1), (0.01,0.01,0.2),…,(0.01,0.01,0.9), denoted D1,D2,…,D9, R1,R2,…,R9, respectively, and recorded the multipoint HLOD midway between the two markers. The linkage analysis was performed using MERLIN [Abecasis et al., 2002], in which the mixture likelihood was maximized using Brent's method [Brent, 2002].
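For readers who want to reproduce the grid of analysis models, the 18 penetrance vectors listed above can be enumerated as in the following sketch. The function names are hypothetical, and the prevalence calculation assumes Hardy-Weinberg genotype proportions, which the text does not state explicitly.

```python
def analysis_models(sporadic=0.01):
    """Return the 18 penetrance vectors (f0, f1, f2) keyed by D1..D9 and R1..R9."""
    models = {}
    for k in range(1, 10):
        f = k / 10  # penetrance level 0.1, 0.2, ..., 0.9
        models[f"D{k}"] = (sporadic, f, f)          # dominant: one copy suffices
        models[f"R{k}"] = (sporadic, sporadic, f)   # recessive: two copies required
    return models


def prevalence(penetrances, q=0.01):
    """Population prevalence, assuming Hardy-Weinberg proportions (an assumption here)."""
    f0, f1, f2 = penetrances
    return f0 * (1 - q) ** 2 + f1 * 2 * q * (1 - q) + f2 * q ** 2


models = analysis_models()
for name in ("D2", "D5", "D8", "R2", "R5", "R8"):
    print(name, models[name], round(prevalence(models[name]), 6))
```

The six models printed are the ones the results are reported for (D2, D5, D8, R2, R5, and R8).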

The simulation results confirmed the limiting distribution of $T_{\text{multipoint}}$ (= 2 × ln 10 × multipoint HLOD) to be $\frac{1}{2}\chi^2_0 + \frac{1}{2}\chi^2_1$. In Table I we estimated some parameters of the empirical distribution under analysis models D2, D5, D8, R2, R5, and R8 with varying sample sizes in each simulation scenario. The proportion of $T_{\text{multipoint}}$ equal to zero was always close to one half, whatever the model. The mean and variance of a $\chi^2_1$ random variable are 1 and 2, respectively. We observed that the mean and variance of non-zero $T_{\text{multipoint}}$ were close to 1 and 2, respectively, under most dominant models and high-penetrance recessive models when the sample size was large. Under each analysis model the mean and variance approached their expectations under the theoretical distribution as the sample size increased. However, both mean and variance were smaller than their expectations under low-penetrance recessive models even when the sample size was 5,000. In contrast, the maximum likelihood estimate of the mean was close to 1 at a sample size of 500 under any analysis model. Given the sample size and analysis model, the estimated mean of $T_{\text{multipoint}}$ was closer to 1 for data generated under simulation model I than for data generated under models II and III, which was most clearly illustrated when the data were analyzed under low-penetrance recessive models. We calculated the empirical type I error rate of the multipoint HLOD under analysis models D2, D5, D8, R2, R5, and R8 with varying sample sizes in each simulation scenario, assuming that $T_{\text{multipoint}}$ follows a $\frac{1}{2}\chi^2_0 + \frac{1}{2}\chi^2_1$ distribution (Table II). The results were consistent with the empirical distribution as summarized in Table I. The dominant analysis models gave proper type I error rates. The recessive models were conservative except for high-penetrance models with large sample sizes.
Figure 1 compares the empirical distribution of the multipoint HLOD under analysis models D5 and R5 with the theoretical distribution.
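As a rough illustration of how an empirical type I error rate can be obtained under the assumed 50:50 mixture of a point mass at zero and a one-degree-of-freedom chi-square, the sketch below derives the critical value from the standard normal quantile and counts exceedances. The function names are hypothetical, and the list of null HLODs passed to `empirical_type1` would in practice come from simulated replicates such as those described above.

```python
import math
from statistics import NormalDist

LN10 = math.log(10.0)


def critical_t(alpha):
    """Critical value t solving 0.5 * P(chi2_1 > t) = alpha, for 0 < alpha < 0.5."""
    if not 0.0 < alpha < 0.5:
        raise ValueError("alpha must lie strictly between 0 and 0.5")
    # P(chi2_1 > t) = 2 * (1 - Phi(sqrt(t))), so sqrt(t) is the (1 - alpha) normal quantile.
    z = NormalDist().inv_cdf(1.0 - alpha)
    return z * z


def critical_hlod(alpha):
    """Critical value on the HLOD scale, using T = 2 * ln(10) * HLOD."""
    return critical_t(alpha) / (2.0 * LN10)


def empirical_type1(null_hlods, alpha):
    """Fraction of null-replicate HLODs that exceed the nominal critical value."""
    threshold = critical_hlod(alpha)
    return sum(h > threshold for h in null_hlods) / len(null_hlods)


for a in (0.05, 0.01, 0.001):
    print(f"alpha={a}: T > {critical_t(a):.3f}, HLOD > {critical_hlod(a):.3f}")
```

An empirical rate below the nominal alpha corresponds to the conservative behavior reported for the low-penetrance recessive models.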

Fig. 1
Empirical distribution of multipoint HLOD compared with the theoretical distribution under the null hypothesis of no linkage. (A) and (C) correspond to the probability density functions of HLOD analyzed under models D5 and R5, respectively; (B) and (D) ...
TABLE I
Proportion of zeros among $T_{\text{multipoint}}$ and empirical distribution of non-zero $T_{\text{multipoint}}$
TABLE II
Empirical type I error rates for multipoint HLOD

The multipoint HLOD method remains powerful for detecting linkage even when the assumed heterogeneity model is incorrect [Greenberg and Abreu, 2001; Hodge et al., 2002]. However, the distribution of the multipoint HLOD has remained mysterious, possibly because it has been obscured by the complexity of the two-point HLOD. Similarly, the model-based multipoint lod score, which is best suited to an evidential paradigm [Hodge et al., 2008], displays some asymptotic complexity and does not have a limiting distribution [Xing and Elston, 2006]. We have shown in this paper that, in contrast with the two-point HLOD and the multipoint lod score, the multipoint HLOD test statistic follows a relatively simple asymptotic distribution. That is, 2 × ln 10 × multipoint HLOD asymptotically follows a $\frac{1}{2}\chi^2_0 + \frac{1}{2}\chi^2_1$ distribution. This not only makes the significance level easy to evaluate, but also facilitates further inferences such as multiple testing correction. The rate of convergence to the asymptotic distribution depends on the informativeness of both the markers and the trait. Given the data, the pre-specified analysis model defines the informativeness of the trait. As is generally true of model-based linkage statistics, the multipoint HLOD under low-penetrance recessive models is conservative because the phenotype contributes little information under such trait models, which is reflected in a high proportion of zeros, a low mean, and a small variance (Table I). Similarly, data simulated under model I contain more trait information than data simulated under models II and III; therefore, the multipoint HLOD under simulation models II and III is more conservative than that under simulation model I. In multipoint analysis, the marker information is relatively constant across a map; thus, the behavior of $T_{\text{multipoint}}$ should also be relatively stable across the map given an analysis model.
We note that the proportion of HLODs equal to zero is always greater than, though close to, one half; thus, a nominal P-value is presumably conservative. In this study we did not investigate the performance of the multipoint HLOD test statistic in the extreme tails of its null distribution, which will be crucial in determining genome-wide significance. Because the efficiency of the test depends on multiple factors, such as the true yet unknown disease model, the analysis model employed, and the sample size, when a large multipoint HLOD is observed in practice it would be more appropriate to perform Monte Carlo simulations to evaluate the significance level [Lin and Zou, 2004].
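A nominal pointwise P-value under the asymptotic mixture can be sketched as follows. The function name is hypothetical, the calculation relies only on the limiting distribution stated above, and, as just noted, it is not a substitute for Monte Carlo evaluation in the extreme tails.

```python
import math


def hlod_p_value(hlod):
    """Nominal P-value for a multipoint HLOD under the 50:50 mixture of a point
    mass at zero and a chi-square with one degree of freedom."""
    if hlod < 0.0:
        raise ValueError("an HLOD cannot be negative")
    if hlod == 0.0:
        return 1.0  # every outcome is at least as large as zero
    t = 2.0 * math.log(10.0) * hlod          # T = 2 * ln(10) * HLOD
    # P(chi2_1 > t) = erfc(sqrt(t / 2)); the mixture halves this tail probability.
    return 0.5 * math.erfc(math.sqrt(t / 2.0))


for h in (0.5, 1.0, 2.0, 3.0):
    print(f"HLOD = {h}: nominal P = {hlod_p_value(h):.2e}")
```

Only the Python standard library is used; the one-degree-of-freedom chi-square tail is written through the complementary error function to avoid an external dependency.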

ACKNOWLEDGMENTS

We thank Dr. Robert Elston for critically reading the manuscript and helpful discussions, and thank Dr. Gonçalo Abecasis for clarification and advice on the maximization procedure in MERLIN. C.X. was partially supported by a Pilot Award from UL1RR024982 from the National Center for Research Resources.

REFERENCES

  • Abecasis GR, Cherny SS, Cookson WO, Cardon LR. Merlin—rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet. 2002;30:97–101.
  • Abreu PC, Hodge SE, Greenberg DA. Quantification of type I error probabilities for heterogeneity LOD scores. Genet Epidemiol. 2002;22:156–169.
  • Brent RP. Algorithms for Minimization Without Derivatives. New York: Dover Publications; 2002.
  • Chernoff H, Lander E. Asymptotic distribution of the likelihood ratio test that a mixture of two binomials is a single binomial. J Statist Plann Inference. 1995;43:19–40.
  • Chiano MN, Yates JR. Linkage detection under heterogeneity and the mixture problem. Ann Hum Genet. 1995;59:83–95.
  • Davies RB. Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika. 1977;64:247–254.
  • Davies RB. Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika. 1987;74:33–43.
  • Faraway JJ. Distribution of the admixture test for the detection of linkage under heterogeneity. Genet Epidemiol. 1993;10:75–83.
  • Greenberg DA, Abreu PC. Determining trait locus position from multipoint analysis: accuracy and power of three different statistics. Genet Epidemiol. 2001;21:299–314.
  • Hodge SE, Anderson CE, Neiswanger K, Sparkes RS, Rimoin DL. The search for heterogeneity in insulin-dependent diabetes mellitus (IDDM): linkage studies, two-locus models, and genetic heterogeneity. Am J Hum Genet. 1983;35:1139–1155.
  • Hodge SE, Vieland VJ, Greenberg DA. HLODs remain powerful tools for detection of linkage in the presence of genetic heterogeneity. Am J Hum Genet. 2002;70:556–559.
  • Hodge SE, Rodriguez-Murillo L, Strug LJ, Greenberg DA. Multipoint lods provide reliable linkage evidence despite unknown limiting distribution: type I error probabilities decrease with sample size for multipoint lods and mods. Genet Epidemiol. 2008;32:800–815.
  • Huang J, Vieland VJ. The null distribution of the heterogeneity lod score does depend on the assumed genetic model for the trait. Hum Hered. 2001;52:217–222.
  • Kruglyak L, Daly MJ, Reeve-Daly MP, Lander ES. Parametric and nonparametric linkage analysis: a unified multipoint approach. Am J Hum Genet. 1996;58:1347–1363.
  • Lemdani M, Pons O. Tests for genetic linkage and homogeneity. Biometrics. 1995;51:1033–1041.
  • Liang KY, Rathouz PJ. Hypothesis testing under mixture models: application to genetic linkage analysis. Biometrics. 1999;55:65–74.
  • Lin DY, Zou F. Assessing genomewide statistical significance in linkage studies. Genet Epidemiol. 2004;27:202–214.
  • Lindsay BG. Testing for Latent Structure. Mixture Models: Theory, Geometry and Applications. Hayward: Institute of Mathematical Statistics; 1995.
  • Ott J. Linkage analysis and family classification under heterogeneity. Ann Hum Genet. 1983;47:311–320.
  • Ott J. Analysis of Human Genetic Linkage, 3rd edition. Baltimore, MD: The Johns Hopkins University Press; 1999.
  • Self SG, Liang KY. Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J Am Statist Assoc. 1987;82:605–610.
  • Smith CAB. Testing for heterogeneity of recombination fraction values in human genetics. Ann Hum Genet. 1963;27:175–182.
  • Xing C, Elston RC. Distribution and magnitude of type I error of model-based multipoint lod scores: implications for multipoint mod scores. Genet Epidemiol. 2006;30:447–458.