Linkage analysis was developed to detect excess co-segregation of the putative alleles underlying a phenotype with the alleles at a marker locus in family data. Many different variations of this analysis and corresponding study design have been developed to detect this co-segregation. Linkage studies have been shown to have high power to detect loci that have alleles (or variants) with a large effect size, i.e. alleles that make large contributions to the risk of a disease or to the variation of a quantitative trait. However, alleles with a large effect size tend to be rare in the population. In contrast, association studies are designed to have high power to detect common alleles which tend to have a small effect size for most diseases or traits. Although genome-wide association studies have been successful in detecting many new loci with common alleles of small effect for many complex traits, these common variants often do not explain a large proportion of disease risk or variation of the trait. In the past, linkage studies were successful in detecting regions of the genome that were likely to harbor rare variants with large effect for many simple Mendelian diseases and for many complex traits. However, identifying the actual sequence variant(s) responsible for these linkage signals was challenging because of difficulties in sequencing the large regions implicated by each linkage peak. Current ‘next-generation’ DNA sequencing techniques have made it economically feasible to sequence all exons or the whole genomes of a reasonably large number of individuals. Studies have shown that rare variants are quite common in the general population, and it is now possible to combine these new DNA sequencing methods with linkage studies to identify rare causal variants with a large effect size. A brief review of linkage methods is presented here with examples of their relevance and usefulness for the interpretation of whole-exome and whole-genome sequence data.
The identification of genes that contribute to the risk of a disease or the variation of a quantitative trait has long been one of the goals of human and medical genetics. Historically, both linkage and association methods have been used to narrow the location of such genes to small regions of the genome with the hope that the gene, and eventually the causal variant, may be identified and characterized. Recently, the focus in genetic epidemiology has been on genome-wide association studies (GWAS) because of the availability and affordability of dense sets of polymorphic genetic markers that can be genotyped on large numbers of individuals. GWAS examine common alleles for association with disease or trait phenotypes and have identified many regions of the genome that show such associations for a large number of important traits (http://www.genome.gov/26525384). It is important to note that these associations do not identify specific causal variants per se, but rather relatively short candidate regions that include the associated allele/variant and all of the variants that are highly correlated with it in a linkage disequilibrium block. Both theoretical and applied studies have shown that most of these associated loci have small effects on the disease or trait in question. Linkage studies, on the other hand, had been the study design of choice for many years, mostly because they were feasible with much less dense sets of genetic markers, and because linkage methods had the power to detect co-segregation over a much larger range. Linkage methods are particularly powerful for the detection of variants with a large effect size, which often are rare in the population. Power to detect such loci using linkage methods can be enhanced by ascertaining families with aggregation of the trait of interest (‘loaded families’). Like tests of association, linkage methods are also able to identify candidate regions, but the regions are much larger, sometimes spanning 40 Mb. 
Interest in these methods is undergoing a renaissance due to the availability of ‘next-generation’ DNA sequencing and its promise to allow identification of the rare variants underlying linkage signals.
Linkage studies have led to the identification of genes that cause or substantially increase the risk of many diseases and birth defects. For example, linkage analyses led to the identification of the genes that cause many Mendelian disorders such as cystic fibrosis (CFTR) [1, 2] and Huntington disease (HTT) [3, 4]. Linkage studies of families selected because of very strong aggregation of specific complex diseases have also led to the identification of rare, high-penetrance risk alleles in certain genes that cause large increases in susceptibility to complex diseases, for example the BRCA1 and BRCA2 genes and breast cancer. This study design has also led to the identification of genes with rare risk alleles that cause moderate increases in the risk for complex diseases, such as the NOD2 gene and inflammatory bowel disease [5, 6, 7, 8]. Linkage methods have been successfully applied to quantitative traits as well. In a series of papers spanning almost 30 years, the locus underlying the specific activity of dopamine-beta hydroxylase, an enzyme that catalyzes the conversion of dopamine to norepinephrine, was localized to the chromosome 9q34 region, and specific variants that are responsible, at least in part, for the variation in this activity have been identified [9, 10, 11, 12, 13, 14]. It is important to note that in this case the phenotype, the specific activity of the enzyme, is functionally closely related to the underlying structural locus. Other linkage studies, where the phenotype is not closely related to the underlying genotype, have not been as straightforward or as successful.
However, for other Mendelian and complex disorders, linkage signals have been detected in family studies, but the causal genes and risk alleles have not yet been discovered. It is thought that this may be due to a variety of reasons, including (1) the previous high cost of DNA sequencing that precluded sequencing all genes under broad linkage peaks; (2) sequencing studies that included only exons of genes under linkage peaks, ignoring changes in regulatory regions; (3) clinical and genetic heterogeneity, and (4) false-positive evidence of linkage. While some significant linkage signals reported in the literature are almost certainly false-positive results, those that have been confirmed in independent sets of families are more likely to be true. Advances in our understanding of the complexities of the human genome have made it clear that sequencing only exons will not detect all DNA variants that contribute to disease risk or to variation in quantitative traits. New, cost-effective DNA sequencing methods have recently made it economically feasible to combine linkage information with whole-exome or whole-genome sequence data to identify the causal variants that contribute to the linkage signals. For example, Sobreira et al. combined linkage information for the Mendelian disease metachondromatosis (OMIM 156250) with whole-genome DNA sequence in a single proband to identify an 11-bp deletion in exon 4 of PTPN11 that alters the reading frame, results in premature translation termination, and co-segregates with the phenotype. They confirmed this result by finding a different nonsense mutation in exon 4 of this gene that segregates with disease in another family.
Bowden et al. have used a similar strategy to identify a gene (ADIPOQ) contributing to variation of a quantitative trait, serum adiponectin level, by (1) identifying families responsible for a linkage signal to plasma adiponectin levels; (2) performing whole-exome sequencing of 3 individuals in the two most strongly linked families with attention targeted to the region of the linkage peak, and (3) performing sequencing of the candidate gene ADIPOQ in additional samples from these two families and from unrelated individuals, showing that the risk allele is rare in the population (less than 2%) but that it accounts for the linkage signal in the two most strongly linked families. Several groups are now using similar approaches involving linkage in conjunction with whole-exome sequencing, whole-genome sequencing or targeted next-generation sequencing, leading to a resurgence of interest in linkage analysis methods. Therefore, we briefly review classic linkage analysis methods here.
Linkage analysis refers to a group of statistical methods that are used to map a gene to the region of the chromosome in which it is located. These methods take advantage of the fact that many more genes exist than chromosomes, and thus many genes are transmitted together from parents to offspring during meiosis. Linkage is the tendency of two or more genetic loci to be transmitted together during meiosis because they are physically close together on a chromosome. As such, linkage represents a violation of Mendel's law of independent assortment.
The concept that chromosomal segregation could explain the physical basis of Mendelian inheritance was first put forward by Sutton [20, 21] in the early 1900's. Most early linkage studies were performed in plants and experimental animals. Correns  reported the first linkage analysis in plants, with Bateson and Punnett  observing the presence of recombinations between syntenic loci (i.e. genetic loci on the same chromosome). During the first meiotic prophase, pairing of the duplicated homologous chromosomes (synapsis) occurs. At this stage, a physical exchange of chromosomal material occurs between homologues. These exchanges are called chiasmata and lead to a ‘crossover’ of the DNA between the two homologues. These chiasmata occur frequently, but it is well known that the presence of one chiasma at a specific chromosomal location will decrease the chances that other chiasmata will form nearby (chiasma interference) . Thus, the probability that crossovers will occur between two syntenic loci is dependent on the distance between the loci [25, 26], but the probability of double crossovers is disproportionately low between very close loci due to chiasma interference . Phase is a term that refers to which alleles at two syntenic loci are physically located together on the same homologue. Consider two syntenic loci, A and B, each with two alleles, A1 and A2, and B1 and B2, respectively. A person with genotypes A1/A2 and B1/B2 is a double heterozygote. There are two possible phases: (1) the A1 and B1 alleles reside together on one member of the chromosome pair and the A2 and B2 alleles on the other, or (2) the A1 and B2 alleles reside together on one homologue and the A2 and B1 alleles on the other. 
Only odd numbers of crossovers between the two loci can be detected by examining the genotypes of the parents and offspring, because an even number will result in the original alleles at the two loci being transmitted together, maintaining the parental phase with respect to these two loci. When an odd number of crossovers occurs between two syntenic loci, the alleles at these loci are recombined, i.e. transmitted to the offspring in a new combination or new phase. Syntenic loci that are very far apart have a high probability of recombination in any meiosis: they recombine about 50% of the time and thus appear to assort independently, just as loci on different chromosomes do.
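The parity argument can be made concrete with a short Python sketch. It is an illustration only, not part of the methods reviewed here; the function names and the tuple representation of a homologue are assumptions chosen for readability.

```python
def transmitted_haplotype(start_homologue, other_homologue, n_crossovers):
    """Return the (allele at A, allele at B) pair carried by a gamete.

    Hypothetical helper: each homologue is a tuple (allele at A, allele at B),
    and the gamete is read starting on `start_homologue` at locus A.
    """
    a_allele = start_homologue[0]
    # Every crossover between A and B switches homologues, so an even count
    # ends on the starting homologue and an odd count ends on the other one.
    b_source = start_homologue if n_crossovers % 2 == 0 else other_homologue
    return (a_allele, b_source[1])


def is_recombinant(n_crossovers):
    """Only an odd number of crossovers yields a detectable recombinant."""
    return n_crossovers % 2 == 1


# Double heterozygote with A1-B1 on one homologue and A2-B2 on the other.
homologue_1, homologue_2 = ("A1", "B1"), ("A2", "B2")
for k in range(4):
    print(k, transmitted_haplotype(homologue_1, homologue_2, k), is_recombinant(k))
```

With zero or two crossovers the gamete carries the parental combination A1-B1; with one or three it carries the new combination A1-B2.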
The recombination fraction measures the proportion of recombinations observed between two loci in a group of offspring. Linkage occurs when two loci are physically close enough so that alleles on the same homologous chromosome tend to be transmitted together, and no or very few recombinations are observed among the offspring. The recombination fraction, often represented as θ, is estimated by counting the number of offspring that show recombination for a given pair of loci, divided by the total number of offspring (the number of recombinants plus the number of non-recombinants). If two loci are physically next to one another, there is very little chance that a crossover will occur between them and the recombination fraction is close to zero. When the loci are on separate chromosomes or are far apart on the same chromosome, the recombination fraction is 1/2, with values between these two extremes indicating some degree of linkage.
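As a minimal sketch of the counting estimate described above, assuming every offspring can be scored unambiguously as recombinant or non-recombinant (which in practice requires known parental phase), θ can be computed as follows. The function name is hypothetical.

```python
def recombination_fraction(n_recombinant, n_nonrecombinant):
    """Estimate theta as recombinants / (recombinants + non-recombinants)."""
    total = n_recombinant + n_nonrecombinant
    if total == 0:
        raise ValueError("at least one scored offspring is required")
    return n_recombinant / total


# Example: 2 recombinants among 20 scored offspring.
theta_hat = recombination_fraction(2, 18)
print(theta_hat)
```

An estimate well below 1/2 suggests linkage, but whether it differs significantly from 1/2 has to be judged with a formal test.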
Linkage analysis in humans is more difficult than in experimental organisms because of limitations in family size, the inability to do test crosses, the long generation time and lack of knowledge of phase in parents who are heterozygous at both loci being studied. Many approaches have been used over the years that aim to test, directly or indirectly, for lower than expected observed recombinations between two loci. These statistical approaches are of two basic types, often termed ‘parametric’ and ‘non-parametric’ linkage analysis.
Parametric or model-based or model-dependent linkage analysis (often called LOD score linkage analysis) assumes that the genetic models underlying both the trait and marker loci are known. Thus, assumed values (parameters) for qualitative traits that must be specified for use in the analysis include the allele frequencies at the trait and marker loci, dominance relationships among the alleles, and relationships between genotypes and phenotypes at both the trait and marker loci (penetrance). For quantitative traits, the parameters that must be specified include allele frequencies at the trait and marker loci, the means and variances of the phenotype for each genotype, and the relative frequencies of the genotypes. The main difference between parametric linkage analysis for qualitative and quantitative traits is that definitive recombinants can be identified for qualitative trait linkage analysis but not for the linkage analysis of quantitative traits. This is due to the nature of the models underlying each type of trait. Because normal probability densities are used to model the genotypic distributions in quantitative linkage analysis, and these densities asymptotically approach, but never reach, zero in both tails, every individual has a non-zero probability for having each genotype. This is problematic when trying to identify recombination events that help to localize candidate regions, but methods have been developed to classify individuals based on their most probable genotype .
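The point about normal densities can be illustrated with a small Python sketch that applies Bayes' rule to a single quantitative phenotype. All numerical values (allele frequency, genotype means, standard deviations) are illustrative assumptions, as are the function names; Hardy-Weinberg genotype frequencies are assumed for convenience.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def genotype_posteriors(phenotype, genotype_freqs, means, sds):
    """Posterior probability of each trait genotype given one phenotype value."""
    weighted = {
        g: genotype_freqs[g] * normal_pdf(phenotype, means[g], sds[g])
        for g in genotype_freqs
    }
    total = sum(weighted.values())
    return {g: w / total for g, w in weighted.items()}


# Illustrative parameters (assumptions, not estimates from any study).
q = 0.1  # frequency of the trait-raising allele
freqs = {"AA": (1 - q) ** 2, "Aa": 2 * q * (1 - q), "aa": q ** 2}
means = {"AA": 0.0, "Aa": 2.0, "aa": 4.0}
sds = {"AA": 1.0, "Aa": 1.0, "aa": 1.0}

posteriors = genotype_posteriors(3.0, freqs, means, sds)
most_probable = max(posteriors, key=posteriors.get)
print(posteriors, most_probable)
```

Every genotype receives a strictly positive posterior probability, so no individual can be declared a definite recombinant; picking `most_probable` corresponds to the classification by most probable genotype mentioned above.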
Non-parametric or model-free (or model-independent or weakly parametric) linkage methods make fewer assumptions about the underlying trait genetic model, although these methods still assume that the marker locus model(s) is known. Both types of analysis were first developed in the 1930s, with Fisher's publication of maximum-likelihood scoring procedures called u-scores (parametric) and Penrose's development of the sib-pair method (non-parametric). Fisher's u-scores and Finney's [30, 31, 32, 33, 34, 35] extensions assumed specific models for the mating types at a trait locus and further assumed that the resulting score was normally distributed. Haldane and Smith developed an ‘inverse probability’ ratio test, now known as a likelihood ratio test, that is the basis of modern parametric likelihood ratio tests for linkage. In this test, given a particular set of data, the likelihood of a hypothesis of linkage with some specific recombination fraction (θ < 1/2) is compared to the likelihood of a hypothesis of no linkage, i.e. the independent assortment of the alleles at the two loci (θ = 1/2). Smith proposed taking the log of this ratio, and in 1955, Morton applied Wald's sequential probability ratio test to combine results from a series of families and to determine appropriate significance levels for this sequential test. Morton coined the term LOD score, although the term ‘LODs’ was originally defined by Barnard as the logarithm of the backward odds (the likelihood ratio). The two-point LOD score between a trait and a single marker locus is typically calculated over several recombination fractions between 0 and 1/2, and the recombination fraction at which the LOD score is maximized is considered to be the best estimate of the recombination fraction.
Traditionally, when the maximum LOD score is greater than 3 (a backward odds ratio of 1,000:1), the null hypothesis of independent assortment is rejected and linkage between the trait and the marker locus is assumed. Conversely, for those recombination fractions where the LOD score is less than −2, the null hypothesis of independent assortment is not rejected and linkage is assumed to be excluded. LOD scores can be converted to p values; a LOD score of 3 corresponds to a large-sample significance level of 0.0001 [39, 41, 42] and a reliability of 0.991 . Morton subsequently extended the test to nuclear families, multiple allelic loci, sex linkage and genetic heterogeneity [44, 45, 46].
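The two-point LOD score calculation can be sketched for the simplest possible case. The Python below assumes phase-known, fully informative meioses, so that the likelihood is binomial in the number of recombinants; real pedigree data require the likelihood algorithms discussed in the next paragraph. Function names and the grid spacing are assumptions made for illustration.

```python
import math


def lod_score(theta, n_recombinant, n_nonrecombinant):
    """log10 of L(theta) / L(1/2) for phase-known, fully informative meioses."""
    n = n_recombinant + n_nonrecombinant
    if theta == 0.0 and n_recombinant > 0:
        return float("-inf")  # theta = 0 is impossible once a recombinant is seen
    log_l = 0.0
    if n_recombinant > 0:
        log_l += n_recombinant * math.log10(theta)
    if n_nonrecombinant > 0:
        log_l += n_nonrecombinant * math.log10(1.0 - theta)
    return log_l - n * math.log10(0.5)


def max_lod(n_recombinant, n_nonrecombinant, step=0.01):
    """Scan theta over [0, 1/2] and return (best theta, maximum LOD)."""
    grid = [i * step for i in range(int(round(0.5 / step)) + 1)]
    best_theta = max(grid, key=lambda t: lod_score(t, n_recombinant, n_nonrecombinant))
    return best_theta, lod_score(best_theta, n_recombinant, n_nonrecombinant)


# Example: 2 recombinants among 20 meioses.
best_theta, best_lod = max_lod(2, 18)
print(best_theta, best_lod)
print("linkage accepted" if best_lod > 3 else "threshold of 3 not reached")
```

In this example the scan peaks at the counting estimate of θ, and the maximum LOD score exceeds the traditional threshold of 3.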
Elston and Stewart  developed a method (commonly called the Elston-Stewart algorithm) to compute the likelihood of a simple extended pedigree recursively and incorporated a general trait model that allowed for decreased penetrance and quantitative traits. Many types of trait models can be used with this algorithm. These are outside the scope of this overview, but comprehensive reviews are available in several articles and texts [27, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64]. Ott  implemented the Elston-Stewart algorithm to calculate the likelihood ratio test for linkage in human families of arbitrary size in LIPED, the first widely available computer program for this purpose. Many additional extensions to these methods have been published, including multipoint linkage analysis that uses information from multiple genetic markers, incorporation of variable age at onset and genetic heterogeneity, and methods that can analyze pedigrees with marriage or inbreeding loops [49, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75]. However, the computation time for multipoint linkage using the Elston-Stewart algorithm is prohibitive. Computation time for this algorithm scales linearly with the number of meioses but exponentially with the number of marker loci. Another major development was the Lander-Green algorithm for rapidly performing maximum-likelihood multilocus linkage computations [67, 76, 77]. The computation time for this algorithm scales linearly with the number of markers; however, it is only suitable for small pedigrees since the amount of computer memory required becomes prohibitive in pedigrees with a large number of meioses. Algorithms that calculate approximations to the likelihood of a pedigree for multipoint linkage, such as SIMWALK2 , offer a middle ground between these two options. Excellent treatments of these subjects are found in several reviews and texts [79, 80, 81, 82, 83, 84, 85, 86, 87]. 
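To give a feel for why memory limits the Lander-Green approach to small pedigrees, the sketch below counts the hidden states the algorithm must track at each marker position. It assumes the usual formulation in which each meiosis contributes one bit to an ‘inheritance vector’, giving 2^m states for m meioses; actual implementations reduce this number using symmetries, so the figures are upper bounds. The function names and the 8 bytes per state are assumptions.

```python
def inheritance_vector_count(n_meioses):
    """Number of possible inheritance vectors: one binary outcome per meiosis."""
    return 2 ** n_meioses


def state_memory_bytes(n_meioses, bytes_per_state=8):
    """Memory for one probability per state at a single map position."""
    return inheritance_vector_count(n_meioses) * bytes_per_state


# Two meioses per non-founder: a sib pair has 4, a pedigree with 20 non-founders has 40.
for n_nonfounders in (2, 10, 20):
    m = 2 * n_nonfounders
    print(n_nonfounders, inheritance_vector_count(m), state_memory_bytes(m))
```

Doubling the number of non-founders squares the number of states, which is why approximate methods are attractive for large pedigrees with many markers.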
With the advent of dense maps of marker loci and multipoint linkage analysis (where linkage is tested at thousands of locations along the chromosomal map, assuming a recombination fraction of zero between the trait locus and each location), Lander and Kruglyak proposed alternative significance thresholds based on an ‘infinitely dense’ map of marker loci to control the genome-wide probability of observing a false-positive linkage at 5%. Their proposed ‘genome-wide significant’ threshold of a LOD of 3.3 (p = 4.9 × 10⁻⁵) for parametric maximum-likelihood multipoint linkage analysis generated substantial controversy and methods development [41, 89, 90, 91, 92, 93, 94, 95] but has become a fairly standard guideline, as have their suggested significance thresholds for non-parametric allele-sharing linkage analyses (e.g. 2.2 × 10⁻⁵ in sibling pairs). Other factors that affect significance levels in linkage analyses are testing multiple parametric models [96, 97, 98, 99, 100, 101], utilizing heterogeneity LOD scores [102, 103, 104, 105], and the presence of intermarker linkage disequilibrium when using a linkage method that assumes linkage equilibrium [106, 107, 108].
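The correspondence between LOD scores and pointwise p values quoted above can be checked with a short calculation. The sketch assumes the standard large-sample result that 2 ln(10) × LOD follows a chi-square distribution with 1 degree of freedom under the null hypothesis, with the p value halved because the alternative (θ < 1/2) is one-sided. The function name is hypothetical.

```python
import math


def lod_to_pointwise_p(lod):
    """Large-sample, one-sided pointwise p value for a two-point LOD score."""
    chi_square = 2.0 * math.log(10.0) * lod
    # Upper tail of a chi-square with 1 df equals erfc(sqrt(x / 2)).
    two_sided = math.erfc(math.sqrt(chi_square / 2.0))
    return 0.5 * two_sided


for lod in (3.0, 3.3):
    print(lod, lod_to_pointwise_p(lod))
```

A LOD of 3 gives a p value close to 0.0001 and a LOD of 3.3 gives a value close to 4.9 × 10⁻⁵. These are pointwise values; the genome-wide false-positive rate additionally depends on the number and density of locations tested.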
Non-parametric or model-free linkage methods do not require the specification of parameters for the mode of inheritance for the trait being linked to marker loci. These methods are based on testing whether relatives with similar trait phenotypes are also more similar than expected at a specific marker locus, implying low recombination rates between the unobserved trait locus and the specific marker locus. Non-parametric methods have also undergone substantial development since Penrose's introduction of the sib-pair test for qualitative and quantitative traits [29, 109]. These early tests were based on the proportion of alleles that a sib pair shared identical-by-state (IBS), which is also sometimes called identical-in-state (IIS). The number or proportion of alleles at a locus that are shared IBS by a pair of individuals is based solely on sharing the same allele(s) at the marker locus. More recent methods of model-free linkage are usually based on identity-by-descent (IBD) sharing among relatives, that is, the number or estimated proportion of alleles at a locus that are shared by a pair of relatives because they are copies of the same ancestral allele (inherited from a common, recent ancestor). Haseman and Elston  developed a model-free sib-pair linkage test based on estimates of IBD sharing among the sibling pairs for quantitative traits, and Suarez et al.  developed a similar IBD-based sib-pair linkage test for a qualitative trait. Amos et al.  extended these methods to other relative pairs in addition to sibs. Multipoint estimates of IBD sharing in sibling pairs at any genomic location were developed by Kruglyak et al.  and Kruglyak and Lander  based on the Lander-Green algorithm and later extended to additional types of relative pairs .
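IBS sharing depends only on the observed genotypes, so it is simple to compute. The following Python sketch counts the alleles two individuals share IBS at one marker; the function name and the representation of a genotype as a pair of allele labels are assumptions.

```python
from collections import Counter


def ibs_sharing(genotype_1, genotype_2):
    """Number of alleles (0, 1 or 2) shared identical-by-state at one marker."""
    # Multiset intersection keeps each allele as many times as both carry it.
    shared = Counter(genotype_1) & Counter(genotype_2)
    return sum(shared.values())


print(ibs_sharing((1, 2), (1, 2)))
print(ibs_sharing((1, 1), (1, 2)))
print(ibs_sharing((1, 2), (3, 4)))
```

IBD sharing cannot be read off in the same way. Two sibs who are both 1/2, with both parents also 1/2, share two alleles IBS but may share zero alleles IBD if each received allele 1 from a different parent, which is why IBD-based methods estimate sharing from the genotypes of relatives and from neighboring markers.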
These IBD estimates are utilized somewhat differently in model-free tests of linkage for quantitative and qualitative traits. For quantitative traits, Haseman and Elston proposed regressing the square of the difference of the trait values in the sibling pair on the estimated proportion of alleles shared IBD at a single marker locus, with an extension to several loci without epistatic interaction. Amos and Elston extended this to the squared trait difference for various other types of relative pairs. The slope of this regression line is expected to be zero under the null hypothesis of no linkage, implying that the estimated proportion of alleles shared IBD has no effect on the trait difference. In the presence of linkage, the slope is negative, because pairs that share more alleles IBD near the trait locus tend to have more similar trait values, so a one-sided t test for a negative slope is the test of interest. Further extensions were also made to allow for dominance variance and epistatic interactions [116, 117, 118]. Variance components analysis has also been used for linkage analysis of quantitative traits [119, 120] by partitioning the variance of the quantitative trait into components due to a causal gene linked to a specific location on the marker map and residual polygenic and environmental components. These methods have been extended to allow for analyses of large pedigrees [121, 122]. Elston et al. introduced a revised Haseman-Elston regression method that has similar power to variance components methods. Several reviews of these methods exist [124, 125, 126, 127].
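The original Haseman-Elston regression can be sketched with ordinary least squares in plain Python. The data below are invented purely for illustration, the function name is hypothetical, and the sketch covers only the single-marker sib-pair case with squared trait differences; under linkage the slope is expected to be negative.

```python
import math


def haseman_elston(trait_pairs, ibd_proportions):
    """Regress squared sib-pair trait differences on the proportion of alleles IBD.

    Returns (slope, t statistic); a clearly negative t suggests linkage.
    """
    y = [(a - b) ** 2 for a, b in trait_pairs]
    x = list(ibd_proportions)
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    slope_se = math.sqrt(residual_ss / (n - 2) / sxx)
    return slope, slope / slope_se


# Invented toy data: six sib pairs with trait values and estimated IBD proportions.
trait_pairs = [(10.0, 14.0), (9.0, 12.5), (11.0, 13.0),
               (10.0, 11.5), (12.0, 12.5), (10.0, 10.2)]
ibd = [0.0, 0.0, 0.5, 0.5, 1.0, 1.0]
slope, t_stat = haseman_elston(trait_pairs, ibd)
print(slope, t_stat)
```

The t statistic would be compared with a t distribution on n − 2 degrees of freedom, using the lower tail only. Real analyses use many more pairs than this toy example.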
For qualitative or dichotomous traits, one can utilize the methods for quantitative traits by simply coding affected individuals as ‘1’ and unaffected individuals as ‘0’ to create a quantitative phenotype and testing the difference between the means of the two groups. However, other approaches are often taken for qualitative traits, where the IBD sharing at marker loci is studied conditional on affection status. These methods include the ‘affected pairs’ methods. In 1953, Penrose  introduced an affected sib-pair linkage test that tests whether the proportion of alleles IBD at a marker was larger than expected, and many other methods building on this concept have been proposed [111, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139]. Tests for linkage when the trait is caused by multiple loci have also been developed [140, 141, 142, 143, 144]. Tests have also been developed that allow all affected pairs in a pedigree to be tested for excess IBD sharing together [135, 145, 146, 147, 148].
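One simple member of the affected sib-pair family is a ‘mean test’, sketched below in Python. It is offered as an illustration under stated assumptions rather than as any specific published procedure: IBD sharing is assumed to be known without error for every pair, so that under no linkage the proportion shared takes the values 0, 1/2 and 1 with probabilities 1/4, 1/2 and 1/4 (mean 1/2, variance 1/8), and a normal approximation is used. The function name and data are hypothetical.

```python
import math


def asp_mean_test(ibd_proportions):
    """One-sided test for excess IBD sharing among affected sib pairs.

    Returns (mean sharing, z statistic, one-sided p value).
    """
    n = len(ibd_proportions)
    mean_sharing = sum(ibd_proportions) / n
    # Null mean is 1/2 and null variance of the mean is 1 / (8 n).
    z = (mean_sharing - 0.5) / math.sqrt(1.0 / (8.0 * n))
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2.0))
    return mean_sharing, z, p_one_sided


# Invented example: 40 affected pairs, of which 20 share 2 alleles IBD,
# 16 share 1 and 4 share 0.
sharing = [1.0] * 20 + [0.5] * 16 + [0.0] * 4
mean_sharing, z, p_value = asp_mean_test(sharing)
print(mean_sharing, z, p_value)
```

When IBD sharing has to be estimated from partially informative markers, the null variance is smaller than 1/8 and the statistic must be adjusted accordingly.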
Parametric and non-parametric methods have different strengths and weaknesses. Parametric linkage analysis is more powerful than non-parametric linkage analysis if the genetic models for both the trait and marker loci are correctly specified; however, for complex traits, where such correct model specification is difficult, non-parametric methods may be more powerful.
Linkage studies have been successful in leading to the discovery of genetic loci that contribute to the risk of diseases or variation of quantitative traits. For example, the BRCA1 gene, with hundreds of rare, high-penetrance risk alleles that cause major increases in the risk of breast and ovarian cancer, was first discovered after its location was identified via a linkage study. Since then, more genes have been identified with inherited mutations that predispose to breast or ovarian cancer, but most risk alleles in these loci are individually rare in the population. Walsh et al. recently showed that genomic capture and massively parallel sequencing of these genes can detect a wide array of known mutations in 21 of these breast-ovarian cancer risk loci.
However, for many traits, existing genome-wide significant  and replicated linkage results have not resulted in the identification of the genes responsible for the linkage signals. There are several possible reasons for this phenomenon. First, as with any statistical test, false-positive results (type I errors) are to be expected. The more linkage studies that are carried out for a specific disease or trait, the higher the chance that a highly significant false-positive linkage will be observed . While genome-wide significant linkage results have high reliability [41, 43, 100], linkage results with ‘suggestive’ significance levels are not as reliable and have a higher probability of being false-positive results . However, performing linkage analyses appropriately (without violation of the assumptions of the analysis methods), calculating significance levels appropriately for the specific analysis methodology [100, 105], requiring stringent significance levels to declare ‘genome-wide significance’  and also requiring replication of a significant linkage in an adequately powered [88, 153] independent dataset (a practice that is also common in GWAS) before considering a linkage result to have strong support, will help to control this . However, such stringent control of false-positive rates will decrease power to detect linkage. A second reason that significant linkage results have not resulted in the identification of the genes responsible for the linkage signals is that the correct gene has been identified, but its function and the effect of mutations on this function are not yet well enough understood for researchers to realize that it is the causal locus. In addition, gene-gene or gene-environment interactions may cause inconsistent associations when mutations discovered in linked families are subsequently evaluated in population-based association studies. 
However, for many linkage findings, the reason that a causal locus has not been found may be that adequate DNA sequencing has not yet been performed. In the past, when only Sanger sequencing methods were available, DNA sequencing of the entire region under a linkage peak was prohibitively expensive because these linked candidate regions can often cover 100–200 megabases. Sequencing in these regions has often been limited to only a few exons in a few candidate genes. As the Human Genome Project has progressed, our understanding of DNA structure and function has grown, such that we now realize that we must sequence not just exons but also promoters, splice sites, 3′ UTRs, microRNAs, long non-coding RNAs and other non-coding regulatory elements. For many candidate linkage regions, the failure to identify the causal disease gene may simply mean we have not yet sequenced enough DNA in the region on a large enough sample of people. Next-generation DNA sequencing holds the promise to allow us to eliminate the last two possibilities by making it economically feasible to thoroughly sequence the DNA of an adequate number of affected individuals for many diseases. However, we must recognize that these methods are not a panacea, and that complex diseases and traits are indeed complex.
As large samples of whole-exome and whole-genome sequence data have been accumulated, certain issues have become clear. First, rare variants are individually rare, but each person will have thousands of such rare variants across their genome. It can be difficult to determine whether a novel variant is a sequencing artifact or whether it is a true variant when only a single individual in a sample exhibits this variant. However, one expects that even rare variants should segregate within a family. Thus, family studies of DNA sequence data can be useful for determining which rare variants are likely to be real variants and also which variants segregate with a disease or trait within the family, and analogously whether copy number variants are likely to be inherited or novel, although identifying variants based on repeated sequences is still somewhat problematic at this point in time. This method of measuring the co-segregation of any sequence variant with disease is simply linkage analysis. Linkage analysis results can be used to identify families that are most likely to segregate genetic variants and to guide interpretation of whole-exome and whole-genome sequencing results or to choose regions for targeted DNA sequencing. Results from the recent Genetic Analysis Workshop 17 suggested that analyses of rare variants in whole-exome sequence data would require much larger sample sizes in studies of unrelated individuals than in family studies, since family studies allow amplification of effect of the rare variants because many family members carry the same rare variant . Combining linkage studies with sequencing can allow the identification of important genes and gene pathways, which can then become candidates for sequencing in much larger samples of individuals with the pertinent disease or trait.