Linkage analysis refers to a group of statistical methods that are used to map a gene to the region of the chromosome in which it is located. These methods take advantage of the fact that many more genes exist than chromosomes, and thus many genes are transmitted together from parents to offspring during meiosis. Linkage is the tendency of two or more genetic loci to be transmitted together during meiosis because they are physically close together on a chromosome. As such, linkage represents a violation of Mendel's law of independent assortment.
The concept that chromosomal segregation could explain the physical basis of Mendelian inheritance was first put forward by Sutton [20, 21] in the early 1900s. Most early linkage studies were performed in plants and experimental animals. Correns [22] reported the first linkage analysis in plants, and Bateson and Punnett [23] observed recombination between syntenic loci (i.e. genetic loci on the same chromosome). During the first meiotic prophase, pairing of the duplicated homologous chromosomes (synapsis) occurs. At this stage, a physical exchange of chromosomal material occurs between homologues. These exchanges are called chiasmata and lead to a ‘crossover’ of the DNA between the two homologues. Chiasmata occur frequently, but the presence of one chiasma at a specific chromosomal location decreases the chance that other chiasmata will form nearby (chiasma interference) [24]. Thus, the probability that a crossover will occur between two syntenic loci depends on the distance between the loci [25, 26], but the probability of double crossovers is disproportionately low between very close loci due to chiasma interference [24]. Phase is a term that refers to which alleles at two syntenic loci are physically located together on the same homologue. Consider two syntenic loci, A and B, each with two alleles, A₁ and A₂, and B₁ and B₂, respectively. A person with genotypes A₁/A₂ and B₁/B₂ is a double heterozygote. There are two possible phases: (1) the A₁ and B₁ alleles reside together on one member of the chromosome pair and the A₂ and B₂ alleles on the other, or (2) the A₁ and B₂ alleles reside together on one homologue and the A₂ and B₁ alleles on the other. Only odd numbers of crossovers between the two loci can be detected by examining the genotypes of the parents and offspring because an even number will result in the original alleles at the two loci being transmitted together, maintaining the parental phase with respect to these two loci. When an odd number of crossovers occurs between two syntenic loci, the alleles at these loci are recombined, i.e. transmitted to the offspring in a new combination or new phase. Syntenic loci that are very far apart have a high probability of recombination in any meiosis: they experience recombination about 50% of the time, and thus appear to assort independently, just as loci on different chromosomes do.
The recombination fraction measures the proportion of recombinations observed between two loci in a group of offspring. Linkage occurs when two loci are physically close enough so that alleles on the same homologous chromosome tend to be transmitted together, and no or very few recombinations are observed among the offspring. The recombination fraction, often represented as θ, is estimated by counting the number of offspring that show recombination for a given pair of loci, divided by the total number of offspring (the number of recombinants plus the number of non-recombinants). If two loci are physically next to one another, there is very little chance that a crossover will occur between them and the recombination fraction is close to zero. When the loci are on separate chromosomes or are far apart on the same chromosome, the recombination fraction is 1/2, with values between these two extremes indicating some degree of linkage.
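As a minimal sketch of the counting estimate described above, the following assumes the ideal case in which the parental phase is known and every offspring haplotype can be scored unambiguously as recombinant or non-recombinant. The function name and the example counts are hypothetical.

```python
def recombination_fraction(parental_haplotypes, offspring_haplotypes):
    """Estimate theta as recombinants / (recombinants + non-recombinants).

    Assumes phase is known, so any transmitted haplotype that is not one of
    the two parental haplotypes is counted as a recombinant.
    """
    parental = set(parental_haplotypes)
    n_total = len(offspring_haplotypes)
    if n_total == 0:
        raise ValueError("at least one scored offspring is required")
    n_recombinant = sum(1 for hap in offspring_haplotypes if hap not in parental)
    return n_recombinant / n_total


# Hypothetical example: 8 parental-type and 2 recombinant haplotypes.
parental = [("A1", "B1"), ("A2", "B2")]
offspring = (
    [("A1", "B1")] * 4
    + [("A2", "B2")] * 4
    + [("A1", "B2"), ("A2", "B1")]
)
theta_hat = recombination_fraction(parental, offspring)
print(theta_hat)
```

In human pedigrees phase is often unknown, so this direct count is rarely available and likelihood-based methods are used instead.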
Linkage analysis in humans is more difficult than in experimental organisms because of limitations in family size, the inability to do test crosses, the long generation time and the lack of knowledge of phase in parents who are heterozygous at both loci being studied. Many approaches have been used over the years that aim to test, directly or indirectly, whether fewer recombinations are observed between two loci than expected under independent assortment. These statistical approaches are of two basic types, often termed ‘parametric’ and ‘non-parametric’ linkage analysis.
Parametric or model-based or model-dependent linkage analysis (often called LOD score linkage analysis) assumes that the genetic models underlying both the trait and marker loci are known. Thus, assumed values (parameters) for qualitative traits that must be specified for use in the analysis include the allele frequencies at the trait and marker loci, dominance relationships among the alleles, and relationships between genotypes and phenotypes at both the trait and marker loci (penetrance). For quantitative traits, the parameters that must be specified include allele frequencies at the trait and marker loci, the means and variances of the phenotype for each genotype, and the relative frequencies of the genotypes. The main difference between parametric linkage analysis for qualitative and quantitative traits is that definitive recombinants can be identified for qualitative trait linkage analysis but not for the linkage analysis of quantitative traits. This is due to the nature of the models underlying each type of trait. Because normal probability densities are used to model the genotypic distributions in quantitative linkage analysis, and these densities asymptotically approach, but never reach, zero in both tails, every individual has a non-zero probability for having each genotype. This is problematic when trying to identify recombination events that help to localize candidate regions, but methods have been developed to classify individuals based on their most probable genotype [27].
Non-parametric or model-free (or model-independent or weakly parametric) linkage methods make fewer assumptions about the underlying trait genetic model, although these methods still assume that the marker locus model(s) is known. Both types of analysis were first developed in the 1930s, with Fisher's [28] publication of maximum-likelihood scoring procedures called u-scores (parametric) and Penrose's [29] development of the sib-pair method (non-parametric). Fisher's u-scores and Finney's [30, 31, 32, 33, 34, 35] extensions assumed specific models for the mating types at a trait locus and further assumed that the resulting score was normally distributed. Haldane and Smith [36] developed an ‘inverse probability’ ratio test, now known as a likelihood ratio test, that is the basis of modern parametric likelihood ratio tests for linkage. In this test, given a particular set of data, the likelihood of a hypothesis of linkage with some specific recombination fraction (θ < 1/2) is compared to a hypothesis of no linkage, i.e. the independent assortment of the alleles at the two loci (θ = 1/2). Smith [37] proposed taking the log of this test, and in 1955, Morton applied Wald's [38] sequential probability ratio test to combine results from a series of families and to determine appropriate significance levels for this sequential test [39]. Morton [39] coined the term LOD score, although the term ‘LODs’ was originally defined by Bernard [40] as the logarithm of the backward odds (the likelihood ratio). The two-point LOD score between a trait and a single marker locus is typically calculated over several recombination fractions between 0 and 1/2, and the recombination fraction that maximizes the likelihood (the maximum LOD score) is considered to be the best estimate of the recombination fraction. Traditionally, when the maximum LOD score is greater than 3 (a backward odds ratio of 1,000:1), the null hypothesis of independent assortment is rejected and linkage between the trait and the marker locus is assumed. Conversely, for those recombination fractions where the LOD score is less than −2, the null hypothesis of independent assortment is not rejected and linkage at those recombination fractions is considered to be excluded. LOD scores can be converted to p values; a LOD score of 3 corresponds to a large-sample significance level of 0.0001 [39, 41, 42] and a reliability of 0.991 [43]. Morton subsequently extended the test to nuclear families, multiple allelic loci, sex linkage and genetic heterogeneity [44, 45, 46].
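The two-point LOD score described above can be sketched for the simplest possible case. The example assumes phase-known, fully informative meioses, so that the likelihood is binomial in the number of recombinants; real pedigree likelihoods are considerably more involved. The function names and the grid of recombination fractions are illustrative choices.

```python
import math


def lod_score(theta, n_recombinant, n_total):
    """log10 of L(theta) / L(1/2) for phase-known, fully informative meioses."""
    n_non = n_total - n_recombinant
    if theta == 0.0:
        # theta = 0 is only compatible with data containing no recombinants.
        return n_total * math.log10(2) if n_recombinant == 0 else float("-inf")
    log_l_theta = n_recombinant * math.log10(theta) + n_non * math.log10(1 - theta)
    log_l_null = n_total * math.log10(0.5)
    return log_l_theta - log_l_null


def max_lod(n_recombinant, n_total):
    """Evaluate the LOD on a grid of theta from 0 to 1/2; return (max LOD, theta)."""
    grid = [i / 100 for i in range(51)]
    return max((lod_score(t, n_recombinant, n_total), t) for t in grid)


# Hypothetical counts: 2 recombinants in 10 meioses, then 0 in 10.
print(max_lod(2, 10))
print(max_lod(0, 10))
```

Under these assumptions, ten informative meioses with no recombinants give a maximum LOD of 10 × log10(2), just above the traditional threshold of 3, while the LOD at θ = 1/2 is always zero.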
Elston and Stewart [47] developed a method (commonly called the Elston-Stewart algorithm) to compute the likelihood of a simple extended pedigree recursively and incorporated a general trait model that allowed for decreased penetrance and quantitative traits. Many types of trait models can be used with this algorithm. These are outside the scope of this overview, but comprehensive reviews are available in several articles and texts [27, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64]. Ott [65] implemented the Elston-Stewart algorithm to calculate the likelihood ratio test for linkage in human families of arbitrary size in LIPED, the first widely available computer program for this purpose. Many additional extensions to these methods have been published, including multipoint linkage analysis that uses information from multiple genetic markers, incorporation of variable age at onset and genetic heterogeneity, and methods that can analyze pedigrees with marriage or inbreeding loops [49, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75]. However, the computation time for multipoint linkage using the Elston-Stewart algorithm is prohibitive. Computation time for this algorithm scales linearly with the number of meioses but exponentially with the number of marker loci. Another major development was the Lander-Green algorithm for rapidly performing maximum-likelihood multilocus linkage computations [67, 76, 77]. The computation time for this algorithm scales linearly with the number of markers; however, it is only suitable for small pedigrees since the amount of computer memory required becomes prohibitive in pedigrees with a large number of meioses. Algorithms that calculate approximations to the likelihood of a pedigree for multipoint linkage, such as SIMWALK2 [78], offer a middle ground between these two options. Excellent treatments of these subjects are found in several reviews and texts [79, 80, 81, 82, 83, 84, 85, 86, 87]. With the advent of dense maps of marker loci and multipoint linkage analysis (where the hypothesis of no linkage is tested assuming a recombination fraction of zero at thousands of locations along the chromosomal map), Lander and Kruglyak [88] proposed alternative significance thresholds based on an ‘infinitely dense’ map of marker loci to control the genome-wide probability of observing a false-positive linkage at 5%. Their proposed ‘genome-wide significant’ threshold of a LOD of 3.3 (p = 4.9 × 10⁻⁵) for parametric maximum-likelihood multipoint linkage analysis generated substantial controversy and methods development [41, 89, 90, 91, 92, 93, 94, 95] but has become a fairly standard guideline, as have their suggested significance thresholds for non-parametric allele-sharing linkage analyses (e.g. 2.2 × 10⁻⁵ in sibling pairs). Other factors that affect significance levels in linkage analyses are testing multiple parametric models [96, 97, 98, 99, 100, 101], utilizing heterogeneity LOD scores [102, 103, 104, 105], and the presence of intermarker linkage disequilibrium when using a linkage method that assumes linkage equilibrium [106, 107, 108].
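The correspondence between LOD scores and pointwise p values quoted in this section can be checked with a short sketch. It assumes the usual large-sample approximation in which 2 ln(10) × LOD follows a chi-square distribution with one degree of freedom under no linkage, with the p value halved because the alternative (θ < 1/2) is one-sided. The function name is illustrative, and the result is a pointwise value, not a genome-wide one.

```python
import math


def lod_to_pointwise_p(lod):
    """Approximate one-sided, large-sample p value for a parametric LOD score.

    Uses z = sqrt(2 * ln(10) * LOD) and the upper tail of the standard normal,
    which equals half the chi-square (1 df) tail probability.
    """
    if lod < 0:
        raise ValueError("this approximation is only defined for LOD >= 0")
    # 0.5 * erfc(z / sqrt(2)) with z**2 = 2 * ln(10) * LOD
    return 0.5 * math.erfc(math.sqrt(math.log(10) * lod))


for lod in (3.0, 3.3):
    print(lod, lod_to_pointwise_p(lod))
```

Under this approximation a LOD of 3 gives a p value of about 1 × 10⁻⁴ and a LOD of 3.3 gives about 4.9 × 10⁻⁵, consistent with the values cited in the text.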
Non-parametric or model-free linkage methods do not require the specification of parameters for the mode of inheritance for the trait being linked to marker loci. These methods are based on testing whether relatives with similar trait phenotypes are also more similar than expected at a specific marker locus, implying low recombination rates between the unobserved trait locus and the specific marker locus. Non-parametric methods have also undergone substantial development since Penrose's introduction of the sib-pair test for qualitative and quantitative traits [29, 109]. These early tests were based on the proportion of alleles that a sib pair shared identical-by-state (IBS), which is also sometimes called identical-in-state (IIS). The number or proportion of alleles at a locus that are shared IBS by a pair of individuals is based solely on sharing the same allele(s) at the marker locus. More recent methods of model-free linkage are usually based on identity-by-descent (IBD) sharing among relatives, that is, the number or estimated proportion of alleles at a locus that are shared by a pair of relatives because they are copies of the same ancestral allele (inherited from a common, recent ancestor). Haseman and Elston [110] developed a model-free sib-pair linkage test based on estimates of IBD sharing among the sibling pairs for quantitative traits, and Suarez et al. [111] developed a similar IBD-based sib-pair linkage test for a qualitative trait. Amos et al. [112] extended these methods to other relative pairs in addition to sibs. Multipoint estimates of IBD sharing in sibling pairs at any genomic location were developed by Kruglyak et al. [113] and Kruglyak and Lander [114] based on the Lander-Green algorithm and later extended to additional types of relative pairs [77].
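The distinction between IBS and IBD sharing can be made concrete with a small sketch. It assumes an idealized situation in which the parental chromosome copy that each sibling inherited is known; the labels "p1", "p2", "m1" and "m2" for the two paternal and two maternal copies, and the function names, are hypothetical.

```python
def ibs_count(genotype1, genotype2):
    """Number of alleles (0, 1 or 2) two unordered genotypes share by state."""
    (a, b), (c, d) = genotype1, genotype2
    # Try both ways of pairing the alleles and keep the better match.
    return max((a == c) + (b == d), (a == d) + (b == c))


def ibd_count(sib1_origin, sib2_origin):
    """Number of alleles shared by descent, given (paternal copy, maternal copy)."""
    return sum(1 for x, y in zip(sib1_origin, sib2_origin) if x == y)


# Both parents are 1/2 heterozygotes; map each parental copy to its allele.
allele_of = {"p1": 1, "p2": 2, "m1": 1, "m2": 2}
sib1_origin = ("p1", "m2")  # paternal allele 1, maternal allele 2
sib2_origin = ("p2", "m1")  # paternal allele 2, maternal allele 1

sib1_genotype = tuple(allele_of[c] for c in sib1_origin)
sib2_genotype = tuple(allele_of[c] for c in sib2_origin)

print(ibs_count(sib1_genotype, sib2_genotype))  # both sibs are 1/2
print(ibd_count(sib1_origin, sib2_origin))      # but no copy is shared
```

In this example the siblings share two alleles IBS yet none IBD, which illustrates why IBS-based tests can overstate sharing when marker alleles are common.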
These IBD estimates are utilized somewhat differently in model-free tests of linkage for quantitative and qualitative traits. For quantitative traits, Haseman and Elston [110] proposed regressing the square of the difference of the trait values in the sibling pair on the estimated proportion of alleles shared IBD at a single marker locus, with an extension to several loci without epistatic interaction. Amos and Elston [115] extended this to the squared trait difference for various other types of relative pairs. The slope of this regression line is expected to be zero under the null hypothesis of no linkage, implying that the estimated proportion of alleles shared IBD has no effect on the trait difference. In the presence of linkage, the slope is negative, because pairs that share more alleles IBD tend to have more similar trait values, so a one-sided t test for a negative slope is the test of interest. Further extensions were also made to allow for dominance variance and epistatic interactions [116, 117, 118]. Variance components analysis has also been used for linkage for quantitative traits [119, 120] by partitioning the variance of the quantitative trait into components due to a causal gene linked to a specific location on the marker map and residual polygenic and environmental components. These methods have been extended to allow for analyses of large pedigrees [121, 122]. Elston et al. [123] introduced a revised Haseman-Elston regression method that has similar power to variance components methods. Several reviews of these methods exist [124, 125, 126, 127].
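The regression idea behind the original Haseman-Elston test can be sketched with ordinary least squares. This is a minimal illustration, not the published procedure: it assumes the estimated IBD proportions are already available, ignores refinements such as the revised method, and uses made-up numbers purely as input. In this formulation, linkage is indicated by a negative slope.

```python
import math


def haseman_elston(pi_hat, trait1, trait2):
    """Regress squared sib-pair trait differences on estimated IBD proportions.

    Returns (slope, t statistic); evidence for linkage is a negative slope.
    """
    n = len(pi_hat)
    y = [(a - b) ** 2 for a, b in zip(trait1, trait2)]
    x_bar = sum(pi_hat) / n
    y_bar = sum(y) / n
    sxx = sum((x - x_bar) ** 2 for x in pi_hat)
    sxy = sum((x - x_bar) * (yi - y_bar) for x, yi in zip(pi_hat, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual variance with n - 2 degrees of freedom gives the slope's SE.
    rss = sum((yi - intercept - slope * x) ** 2 for x, yi in zip(pi_hat, y))
    se_slope = math.sqrt(rss / (n - 2) / sxx)
    return slope, slope / se_slope


# Made-up illustrative data for eight sibling pairs (not real measurements).
pi_hat = [0.0, 0.5, 1.0, 0.0, 0.5, 1.0, 0.5, 0.5]
trait1 = [10.0, 11.0, 9.5, 12.0, 10.0, 8.0, 11.0, 9.0]
trait2 = [12.0, 12.5, 10.0, 9.5, 11.0, 9.0, 9.0, 10.0]

slope, t_stat = haseman_elston(pi_hat, trait1, trait2)
print(slope, t_stat)
```

The t statistic would be compared with the lower tail of a t distribution with n − 2 degrees of freedom, reflecting the one-sided nature of the test.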
For qualitative or dichotomous traits, one can utilize the methods for quantitative traits by simply coding affected individuals as ‘1’ and unaffected individuals as ‘0’ to create a quantitative phenotype and testing the difference between the means of the two groups. However, other approaches are often taken for qualitative traits, where the IBD sharing at marker loci is studied conditional on affection status. These methods include the ‘affected pairs’ methods. In 1953, Penrose [128] introduced an affected sib-pair linkage test that tests whether the proportion of alleles shared IBD at a marker is larger than expected, and many other methods building on this concept have been proposed [111, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139]. Tests for linkage when the trait is caused by multiple loci have also been developed [140, 141, 142, 143, 144]. Tests have also been developed that allow all affected pairs in a pedigree to be tested for excess IBD sharing together [135, 145, 146, 147, 148].
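One simple way to test for excess IBD sharing in affected sibling pairs is sketched below. It uses the commonly described ‘mean test’ form, which is not necessarily the statistic of any particular paper cited above, and it assumes that the IBD status of every pair is known without ambiguity. Under no linkage a sibling pair shares 0, 1 or 2 alleles IBD with probabilities 1/4, 1/2 and 1/4, so the proportion shared has mean 1/2 and variance 1/8. The function name and the counts are hypothetical.

```python
import math


def asp_mean_test(n_ibd0, n_ibd1, n_ibd2):
    """Mean IBD-sharing test for affected sib pairs.

    Returns (mean proportion of alleles shared IBD, z statistic); large
    positive z values indicate excess sharing among affected pairs.
    """
    n = n_ibd0 + n_ibd1 + n_ibd2
    if n == 0:
        raise ValueError("at least one affected sib pair is required")
    mean_sharing = (0.5 * n_ibd1 + 1.0 * n_ibd2) / n
    # Null variance of the mean: (1/8) / n.
    null_sd = math.sqrt(0.125 / n)
    z = (mean_sharing - 0.5) / null_sd
    return mean_sharing, z


# Hypothetical counts of affected sib pairs sharing 0, 1 and 2 alleles IBD.
mean_sharing, z = asp_mean_test(10, 50, 40)
print(mean_sharing, z)
```

The z statistic is referred to the upper tail of the standard normal distribution, since only sharing above 1/2 is evidence for linkage.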
Parametric and non-parametric methods have different strengths and weaknesses [149]. Parametric linkage analysis is more powerful than non-parametric linkage methods if the genetic models for both the trait and marker loci are correctly specified; however, for complex traits where such correct model specification is difficult, non-parametric methods may be more powerful.