Human quantitative trait locus (QTL) linkage mapping, although based on classical statistical genetic methods that have been around for many years, has been employed for genome-wide screening for only the last 10-15 years. In this time, there have been many success stories, ranging from QTLs that have been replicated in independent studies to those for which one or more genes underlying the linkage peak have been identified to a few with specific functional variants that have been confirmed in in vitro laboratory assays. Despite these successes, there is a general perception that linkage approaches do not work for complex traits, possibly because many human QTL linkage studies have been limited in sample size and have not employed the family configurations that maximize the power to detect linkage. We predict that human QTL linkage studies will continue to be productive for the next several years, particularly in combination with RNA expression level traits that are showing evidence of regulatory QTLs of large effect sizes and in combination with high-density genome-wide SNP panels. These SNP panels are being used to identify QTLs previously localized by linkage and linkage results are being used to place informative priors on genome-wide association studies.
The basic statistical genetic methods that are used today for human quantitative trait locus (QTL) linkage mapping are extensions of classic models that were available twenty years ago at the second International Conference on Quantitative Genetics (ICQG2). The methods currently used for human QTL linkage mapping derive from classical variance component methods(1) and from the sibling pair regression methods proposed by Haseman and Elston(2). These methods are based on the simple intuition that relatives who are more alike phenotypically should be more likely to share alleles at the QTL (and at nearby linked markers) than relatives who are phenotypically discordant. Identity by descent (IBD) allele sharing is used to quantify the proportion of alleles a relative pair shares at a particular location in the genome that are copies of the same ancestral chromosome. In the simplest case of a sample of independent sibling pairs, the original Haseman-Elston linkage method proposed that the squared difference in the siblings' phenotypes can be regressed on IBD with evidence of linkage coming from a negative regression slope indicating that siblings who are relatively more discordant share fewer alleles at a marker than siblings who are relatively more phenotypically concordant. The variance component (1,3) and the newer revised Haseman-Elston methods (4,5) are somewhat more complex, allowing for the non-independence of multiple relative pairs from the same pedigree and incorporating polygenic background effects, effects of measured environmental covariates, and gene-gene and gene-environment interactions. The variance component and revised Haseman-Elston linkage approaches have been shown to be asymptotically equivalent, with the maximum likelihood-based variance component approaches being slightly more powerful and the regression-based Haseman-Elston being more robust to non-normality in the trait distribution(4). 
Newer incarnations of the same basic model use Markov Chain Monte Carlo (MCMC) methods for parameter estimation and incorporate the number of QTLs as an additional parameter(6).
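The intuition behind the original Haseman-Elston regression can be illustrated with a small simulation. The sketch below is not taken from any of the cited programs. It assumes independent sibling pairs, a fully informative marker at the QTL itself (so IBD is known exactly), purely additive allelic effects, and unshared residuals; all function names and parameter values are hypothetical. Under these assumptions the expected squared sibling difference falls by twice the QTL variance as the IBD proportion goes from 0 to 1, so the fitted slope should be negative when a QTL is present.

```python
import random


def simulate_sib_pair(rng, qtl_var, resid_var):
    """Return (IBD proportion, squared trait difference) for one sib pair."""
    # Each parent carries two distinct alleles with independent additive effects.
    allele_sd = (qtl_var / 2) ** 0.5
    father = [rng.gauss(0, allele_sd), rng.gauss(0, allele_sd)]
    mother = [rng.gauss(0, allele_sd), rng.gauss(0, allele_sd)]
    # Index of the paternal and maternal allele transmitted to each sibling.
    pat = [rng.randrange(2), rng.randrange(2)]
    mat = [rng.randrange(2), rng.randrange(2)]
    ibd_count = int(pat[0] == pat[1]) + int(mat[0] == mat[1])
    traits = [
        father[pat[i]] + mother[mat[i]] + rng.gauss(0, resid_var ** 0.5)
        for i in range(2)
    ]
    return ibd_count / 2, (traits[0] - traits[1]) ** 2


def ols_slope(xs, ys):
    """Ordinary least squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def haseman_elston_slope(n_pairs=5000, qtl_var=0.5, resid_var=0.5, seed=1):
    """Regress squared sib-pair trait differences on IBD proportion."""
    rng = random.Random(seed)
    pairs = [simulate_sib_pair(rng, qtl_var, resid_var) for _ in range(n_pairs)]
    return ols_slope([p[0] for p in pairs], [p[1] for p in pairs])


if __name__ == "__main__":
    print("slope with a QTL:   ", haseman_elston_slope(qtl_var=0.5))
    print("slope without a QTL:", haseman_elston_slope(qtl_var=0.0))
```

With real data the IBD proportions are estimates rather than known values, and the variance component and revised Haseman-Elston methods replace this simple regression with models that handle correlated relative pairs.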
Despite the fact that the basic statistical genetic methods for human QTL linkage were in place at the time of ICQG2, linkage studies for loci influencing quantitative traits in humans were still in their infancy and were generally limited to a few markers, usually in candidate genes. Complete genome-wide screens for QTLs in humans did not begin to appear until the mid-1990s(7-9). The post-ICQG2 advances that have made whole genome QTL linkage mapping possible have been primarily technological. In 1987, a genetic map of human chromosome 7 was published with 63 two-allele restriction fragment length polymorphism (RFLP) markers spaced roughly every 3 cM and this was one of only a few chromosomes with complete coverage in a linkage map(10). Today, we have dense maps of highly polymorphic short tandem repeat (STR) markers and even more dense maps of less polymorphic single nucleotide polymorphisms (SNPs) with known order and map position. The UCSC Genome Browser (http://genome.ucsc.edu) currently shows > 2,200 known STRs on chromosome 7 and > 670,000 SNPs. Additionally, the methods for genotyping these markers have become increasingly efficient, both in terms of cost per marker and in terms of the time to genotype each marker, with high-throughput genotyping technologies contributing to both of these factors. Computing, a rate limiting resource in the past, has also become faster and cheaper.
Although the basic statistical methods for assessing evidence of linkage in human pedigree data were in place 20 years ago, there have been some developments in statistical methods relating to estimation of IBD allele sharing using multiple markers in large pedigrees. The Elston-Stewart algorithm, used in the LINKAGE package, and the Lander-Green algorithm, used in programs such as Merlin, Allegro, and GeneHunter, each provide analytical formulae for estimating IBD allele sharing using multiple genotyped markers and properly weighting over all possible inheritance vectors. However, each of these algorithms has a limitation. The computing time for the Elston-Stewart approach is exponential with the number of markers but linear in the number of individuals in a pedigree, making it suitable for very large families but limiting the number of markers that can be used, an unsatisfactory state of affairs given the dense sets of STRs and SNPs available today. The Lander-Green algorithm, on the other hand, is linear in computing time with the number of markers but exponential with increasing pedigree size. It is thus suitable and widely used for studies with smaller families, but because larger pedigrees increase the power to detect linkage, there was a need for programs that would facilitate the analysis of a dense map of markers in pedigrees of unlimited size and complexity.
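The exponential growth of the Lander-Green approach with pedigree size can be made concrete by counting inheritance vectors. The sketch below assumes the usual formulation in which each non-founder contributes two meioses (one bit each) and in which exploiting founder phase symmetry removes one bit per founder; the function name is hypothetical and the count ignores any further reductions that particular programs may apply.

```python
def lander_green_state_count(n_nonfounders, n_founders=0):
    """Number of inheritance vectors per marker position.

    Each non-founder contributes two meioses (two bits). Passing the
    number of founders applies the founder-symmetry reduction of one bit
    per founder; leave it at 0 for the unreduced count.
    """
    bits = max(2 * n_nonfounders - n_founders, 0)
    return 2 ** bits


if __name__ == "__main__":
    # Nuclear families with two founders (the parents) and k children.
    for k in (2, 4, 8, 16):
        print(k, "children:", lander_green_state_count(k, n_founders=2))
```

Each additional non-founder multiplies the state space by four, which is why exact multipoint calculations of this kind are restricted to small and moderately sized families.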
Two main approaches have emerged for dealing with IBD estimation with an unlimited number of markers in pedigrees of unlimited size. The first of these involved a linear function of the IBDs at individual markers with the function depending on the probability of recombination between the markers, given their distance from each other and the type of relative pair involved. This was first proposed by Fulker et al. for sibling pairs(11) and later expanded by Almasy and Blangero for larger pedigrees(12). Although this approach provided improved power and more accurate localization than sequential analyses of single markers, it was limited by the information content provided by each of the markers individually (i.e. better for relatively informative STRs than for less heterozygous markers such as SNPs) and has been for the most part abandoned in favor of more computer intensive, but more accurate, MCMC-based IBD estimation implemented in programs such as SimWalk2(13) and Loki(14). MCMC approaches to IBD estimation with multiple markers are based on sampling the space of possible inheritance vectors that are consistent with the observed genotypes in a family and estimating IBD allele sharing for each relative pair by weighting the IBD value for each inheritance vector by the likelihood of that inheritance vector given the observed genotypes.
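The weighting step described above amounts to a likelihood-weighted average. The following sketch shows only that final calculation for a single relative pair, with made-up IBD values and likelihoods; it is not how SimWalk2 or Loki are implemented, and the function name is hypothetical. In a sampler that visits inheritance vectors in proportion to their probability, the same quantity would be obtained as a plain average over the visited vectors.

```python
def expected_ibd(weighted_vectors):
    """Likelihood-weighted mean IBD proportion for one relative pair.

    weighted_vectors: iterable of (ibd_proportion, likelihood) tuples, one
    per sampled inheritance vector consistent with the observed genotypes.
    """
    weighted_vectors = list(weighted_vectors)
    total = sum(likelihood for _, likelihood in weighted_vectors)
    if total <= 0:
        raise ValueError("likelihoods must sum to a positive value")
    return sum(ibd * likelihood for ibd, likelihood in weighted_vectors) / total


if __name__ == "__main__":
    # Hypothetical sibling pair: three sampled vectors implying 0, 1 or 2
    # alleles shared IBD, with illustrative (unnormalized) likelihoods.
    samples = [(0.0, 0.1), (0.5, 0.6), (1.0, 0.3)]
    print(round(expected_ibd(samples), 3))
```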
There are numerous examples in recent years of successful human QTL linkage mapping studies. A number of these findings have been replicated in independent samples, a handful have been followed-up to the point of identifying a specific gene within the QTL linkage region, some of these have implicated specific variants within the gene as being potentially functional, and a few of those variants have been confirmed by in vitro functional assays.
One of the first widely replicated human QTL linkages is a peak for obesity-related traits on chromosome 2p. It was initially reported by Comuzzie et al. in 1997 as a LOD score of 4.95 for levels of leptin, an adipocyte-derived hormone, with weaker evidence for linkage of a QTL influencing fat mass to the same location in a sample of Mexican American families(15). Linkage of a QTL influencing leptin levels to chromosome 2p markers has since been replicated in French(16) and African American populations(17) and linkage to this region has also been observed for the related phenotype of body mass index(18,19). Despite the strong linkage signals and the replication of this peak in multiple independent populations, identification of the gene(s) underlying this QTL has proved elusive. An obvious positional candidate under the linkage peaks, the pro-opiomelanocortin (POMC) gene, was resequenced and two rare coding changes with minor allele frequencies of < 0.01 were identified. There was also good evidence for a third potentially functional variant, an insertion/deletion polymorphism in an intron. However, these variants only partially explain the observed linkage signals, suggesting that a second as yet unidentified gene may also contribute to this QTL(20,21).
Some replicated QTL linkages have been followed up to the point of identification of specific associated genes under the linkage peak, some of which are now being replicated in genome-wide association screens. Duggirala and colleagues reported linkage of a QTL influencing age at onset of type 2 diabetes and diabetes itself, analyzed using a variance component-based liability threshold model, to chromosome 10q(22). Another group found confirmatory linkage evidence in the same chromosomal region(23) and eventually identified the TCF7L2 gene under their linkage peak as being strongly associated with type 2 diabetes(24). Association with TCF7L2 has now been verified in the Mexican American sample in which the original linkage was reported(25). TCF7L2 has since been picked up by multiple genome-wide association screens(26,27).
A number of human QTL linkages have not only genes, but also specific functional variants that have been identified and confirmed through in vitro functional studies. One of these is Factor VII (FVII), a protein involved in blood clotting that has been implicated in risk of venous and arterial sclerosis. A genome-wide linkage study identified a QTL for FVII on chromosome 13 with a LOD score of 3.18(28). The structural gene for FVII, F7, was directly under this linkage peak and was an obvious positional candidate. Association analyses of the 48 polymorphisms identified in the F7 gene by sequencing in individuals from the linkage sample identified four variants with a very high posterior probability of effect and three additional variants with high support for functionality. Several of these variants were in putative promoter regions and their functionality has now been confirmed in in vitro expression assays(29).
Other well replicated human QTL linkages include loci influencing variation in triglyceride levels on chromosome 15 (30-33), body mass index on chromosome 3 (34-36), and reading disability on chromosome 6p (7,37-39) to name only a few. Another example of a replicated QTL linkage with a specific gene identified under the linkage peak by follow-up association studies is a chromosome 4 QTL influencing alcohol dependence and related endophenotypes that show association with GABRA2(40-42). Further cases of specific functional variants that have been identified in genes originally localized through QTL linkages include three coding variants in a member of the TAS2R bitter taste receptor family that influence ability to taste phenylthiocarbamide (PTC)(43,44) and a promoter variant in the SEPS1 gene, in a region of chromosome 15 linked to a variety of inflammatory disorders, that affects inflammatory response(45).
Although relatively few human QTL linkages have been followed up to the point of identifying the underlying gene(s) and functional variants, we may still be able to draw some conclusions about the spectrum of human QTLs from these examples. Given the sample size, these observations are necessarily anecdotal, but they provide some idea of what is possible within the range of human QTLs, if not a quantitative measure of how likely these possibilities are.
We may not be able to rely on picking an obvious positional candidate under a linkage peak. In the days of Mendelian penetrance model-based linkage, an initial linkage finding was generally followed up by fine mapping and genotyping of additional markers to more precisely define the recombinations within families and the minimal chromosomal area in which the gene of interest resided. Today, this fine mapping stage has largely been abandoned as the results of the Human Genome Project have made it easy to examine the catalog of genes in the region under a linkage peak to select one or a few obvious candidates to test as potential sources of a QTL linkage. Some of the known human QTL examples suggest that the obvious positional candidate gene may not be the correct one or that the obvious candidate may not be sufficient to explain the linkage in the case of multiple QTLs under a peak. As SNP genotyping becomes faster and cheaper, our reliance on choosing positional candidates is decreasing as it is more and more feasible to comprehensively screen all genes in the region under a linkage peak for association with the trait of interest. Currently, opinions differ on whether this association screening should be gene-centric, including known coding variants in the region, or should consist of markers of a uniform density selected to cover the whole region in a systematic fashion. This is likely to be a short-lived debate as sequencing technology becomes cheap enough to feasibly sequence the entire region under a linkage peak.
A linkage peak may represent the cumulative effects of more than one gene. For example, although there is strong evidence that the coding variants in the POMC gene affect leptin levels, these variants cannot account for the linkage peak and there is also strong association with additional variants, not in linkage disequilibrium with the POMC markers, in another nearby gene.
There may be multiple functional variants, multiple quantitative trait nucleotides (QTNs), in each QTL, as was found with clotting Factor VII and the F7 gene, with ability to taste PTC and the TAS2R bitter taste receptor gene, and with leptin and the POMC gene. This allelic heterogeneity has implications for association studies. The power to find these genes by linkage involves the total variance in the trait explained by the QTL, which is the summed variance of the QTNs. The power to detect these genes by association, however, is a function of the individual variance attributable to a QTN and the strength of the linkage disequilibrium between that QTN and a genotyped marker. In the specific case of F7, the total effect size of the QTL was roughly 35%, but a single one of the QTNs accounted for roughly 14% of the trait variance and may have been easily detectable by association, particularly as this variant was common (minor allele frequency of 16.6%) and in strong linkage disequilibrium with other markers in the gene. However, in other cases, it may be easier to detect the QTL, the collective effect of multiple QTNs, by linkage than to detect any of the individual QTNs by association. For example, the insertion/deletion polymorphism in POMC had a minor allele frequency of 0.06 and both of the coding variants in this gene had frequencies of < 0.01 and low linkage disequilibrium with surrounding markers. However, collectively these three variants account for approximately 11% of the variation in leptin levels.
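The contrast between the summed QTL variance seen by linkage and the per-variant variance seen by association can be sketched numerically. The example below assumes a purely additive model in which a biallelic QTN with minor allele frequency p and allelic effect a contributes 2p(1 - p)a² to the trait variance, and assumes the QTNs are not in linkage disequilibrium with each other so that their variances add. The frequencies and effect sizes are hypothetical and are not the POMC or F7 estimates.

```python
def qtn_variance(maf, effect):
    """Additive variance contributed by one biallelic QTN: 2p(1-p)a^2."""
    return 2 * maf * (1 - maf) * effect ** 2


def qtl_variance(qtns):
    """Summed variance of independent QTNs, given (maf, effect) pairs."""
    return sum(qtn_variance(maf, effect) for maf, effect in qtns)


if __name__ == "__main__":
    # Hypothetical QTL: one low-frequency variant of moderate effect and two
    # rare variants of large effect (effects in trait standard deviations).
    qtns = [(0.06, 0.6), (0.01, 1.5), (0.008, 1.5)]
    for maf, effect in qtns:
        print(f"maf={maf:<6} variance={qtn_variance(maf, effect):.4f}")
    print(f"total QTL variance={qtl_variance(qtns):.4f}")
```

In this illustration no single variant explains as much trait variance as the three together, which is the situation in which linkage to the region can outperform association with any one marker.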
For the few human QTLs for which specific functional variants have been identified, it appears that the QTNs behind human QTLs consist of both rare and common variants and both coding and non-coding variants. In POMC and in F7 there were rare SNPs, some present in only a single family in the sample, that produced a large effect on each individual who carried them through changes in the amino acid structures of the proteins. In F7, there were also putative regulatory variants, some now confirmed through in vitro functional studies, with allele frequencies > 20%.
Despite the success of many genome-wide linkage scans for quantitative traits in humans, there is a common perception that although linkage analyses were excellent for localizing genes influencing simple Mendelian traits, they do not work for complex traits, a category that is usually defined as traits influenced by multiple genes and their interactions with each other and with the environment. And indeed, there are many examples of unsuccessful genome-wide quantitative trait linkage screens for a variety of complex phenotypes in humans.
The power of quantitative trait linkage methods depends on the proportion of variance due to the QTL being sought, the total heritability of the trait, the density and heterozygosity of the genotyped markers, the sample size, and the configuration of these individuals in families(46). The informativeness of the genotyped markers is generally a minor issue as there are well-established linkage maps in humans with a high density of microsatellites with very high heterozygosity or an even higher density of SNPs providing equivalent or better information about segregation within families. Additionally, the multipoint IBD methods discussed above have made it possible to estimate IBD allele sharing with a high degree of accuracy. Of the other factors relating to power, it is possible, or even likely, that some heritable quantitative traits of interest in humans represent the aggregate effects of many QTLs, none of which has an individual effect size large enough to be detected in linkage studies. However, for many human linkage studies, the limiting factor has been sample size or the family configurations available for study.
For a fixed sample size, the power to detect linkage for a quantitative trait is maximized when those individuals are concentrated into as few families as possible so that the number of relative pairs in the sample is maximized. For example, if I can afford a sample size of six individuals in my study, I could collect three families with two siblings each for a total of three relative pairs, two families with three siblings each for a total of six relative pairs, or one family of six siblings for a total of 15 relative pairs. Concentrating the individuals into fewer families dramatically increases the number of comparisons and contrasts that can be made among relatives in the study. Collecting large families can be challenging and the goal of six siblings per family may not be possible in many western populations. However, large sibship sizes are still the norm in some population groups. Alternatively, the same number of relative pairs can be obtained by collecting our six individuals in three sibling pairs, provided these three sibling pairs are cousins of each other. In this case, the result is still a total of 15 relative pairs in a single family.
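The pair counts in this example follow from the number of ways to choose two individuals from each family, n(n - 1)/2 for a family of n phenotyped relatives. A minimal sketch, with a hypothetical function name:

```python
from math import comb


def relative_pairs(family_sizes):
    """Total number of within-family relative pairs for the given family sizes."""
    return sum(comb(n, 2) for n in family_sizes)


if __name__ == "__main__":
    print(relative_pairs([2, 2, 2]))  # three sibling pairs -> 3
    print(relative_pairs([3, 3]))     # two sibships of three -> 6
    print(relative_pairs([6]))        # six relatives in one family -> 15
```

The count treats every pair within a family as informative, which holds for the single sibship and for the three related sibling pairs described above, although different classes of relatives contribute different amounts of linkage information.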
Blangero et al.(47) expressed this more formally, comparing the relative efficiency of quantitative trait linkage studies in families of different configurations. A relative efficiency of 1 was assigned to a linkage study of a population isolate in which all of the individuals studied coalesce into a single pedigree of 2000 individuals, an idealized situation but an attainable one given that it was based on the actual pedigree structure of an existing study. Compared to this ideal, the relative power per person studied is only 0.04 for sibling pairs, 0.11 for nuclear families with three siblings examined, or 0.17 for nuclear families with four siblings examined. Nuclear families, even large nuclear families, provide relatively poor power for quantitative trait linkage. Studies of American or European families of moderate size, an average of 19 or 31 individuals per multigenerational pedigree, provided much better power, with relative efficiencies in the range of 0.35 – 0.59.
The smaller the family configuration, the larger the sample size will need to be to maintain an equivalent level of power in a human QTL linkage study. Williams and Blangero(46) provide analytical formulae for estimating the power to detect linkage for a quantitative trait in a variance component-based analysis as a function of the QTL-specific heritability, the total additive genetic heritability, the sample size, and the family configuration. For a QTL-specific heritability of 0.20, to achieve 80% power to detect linkage at a LOD score of 3 requires approximately 1000 individuals in extended pedigrees of size 48, > 2000 individuals in sibships of size 4, or > 6000 individuals in sibling pairs. Alternatively, for a sample size of 1000 – 2000 individuals, one has 80% power to detect linkage at a LOD score of 3 for QTL-specific heritabilities in the range of 0.14 – 0.20 if the sample is composed of pedigrees of 48 individuals each or 0.35 – 0.5 if the sample consists of sibling pairs.
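To put the LOD score threshold of 3 used above in more familiar terms, a LOD score can be converted to a likelihood ratio chi-square statistic and a pointwise p-value. The sketch below does not reproduce the power formulae of Williams and Blangero. It assumes the standard asymptotic result for testing a single variance component on its boundary, in which the test statistic is distributed as a 50:50 mixture of a point mass at zero and a chi-square with one degree of freedom; the function names are hypothetical.

```python
import math


def lod_to_chi2(lod):
    """Convert a LOD score (log10 likelihood ratio) to 2 * ln(likelihood ratio)."""
    return 2 * math.log(10) * lod


def lod_to_pvalue(lod):
    """Pointwise p-value under a 50:50 mixture of chi2(0) and chi2(1)."""
    if lod <= 0:
        return 1.0
    chi2 = lod_to_chi2(lod)
    # Upper tail of a 1-df chi-square is erfc(sqrt(x / 2)); halve it for the mixture.
    return 0.5 * math.erfc(math.sqrt(chi2 / 2))


if __name__ == "__main__":
    for lod in (1.0, 2.0, 3.0):
        print(f"LOD {lod}: chi2 = {lod_to_chi2(lod):.2f}, p = {lod_to_pvalue(lod):.2e}")
```

Under these assumptions a LOD score of 3 corresponds to a chi-square of about 13.8 and a pointwise p-value on the order of 0.0001, before any consideration of genome-wide multiple testing.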
A QTL-specific heritability of 0.14 is still quite large, even if we consider that it may be spread over multiple QTNs within the QTL, and it is likely that many traits of interest will not have any single QTL that contributes so much to the trait variance. However, the examples above show that some traits will have QTLs with effect sizes in this range and that linkage will work in some cases. It is impossible to know beforehand which traits these will be, just as it is impossible to know which traits will be influenced by common variants likely to be captured by the SNP panels used in genome-wide association studies. Given this, it seems likely that the complementary approaches of genome-wide linkage, optimized to find QTLs of large effect that may represent the joint effects of multiple QTNs in a region, and genome-wide association, optimized to find common QTNs of small effect, will both be useful for gene localization for complex traits, providing potentially different and complementary results.
One way to try to increase the chances that a trait is influenced by fewer QTLs of larger single effects may be to try to select phenotypes that are likely to be less complex and closer to the level of gene action. This is the rationale behind many studies of quantitative traits as risk factors for complex diseases - that intermediate traits or endophenotypes may be less complex and closer to the level of gene action than disease endpoints(48-51). A new class of quantitative traits, RNA expression levels, is showing great promise in this respect(52,53). These traits are measurements of the amount of transcribed RNA for a particular gene, reflecting levels of gene expression. It is possible to assay expression levels for tens of thousands of human genes in a single experiment using chips that have been designed for this purpose. Göring et al.(53) reported that of the 20,413 transcripts with detectable expression in lymphocytes, approximately 85% had significant heritabilities and over a thousand of these traits had QTLs that were genome-wide significant at an experiment-wide false discovery rate of 0.05. For QTL linkages at the position of the structural gene coding for the RNA whose transcript levels were measured, the median QTL-specific heritability for the QTLs that met that 0.05 false discovery rate cut-off was 24.6%. Although these QTL-specific effect sizes are likely to be overestimates(54,55), they still suggest that there are many human QTLs with substantial effect sizes that can be detected by linkage. Göring et al. used these linkages of RNA expression traits to their structural loci as a starting point for identifying QTNs influencing traits of biomedical interest. 
They identified transcript levels with linkages to their own structural gene that were correlated with HDL cholesterol levels, sequenced the promoter regions of these genes to identify the QTNs responsible for the variation in gene expression level, and tested whether these QTNs were also associated with variation in HDL cholesterol levels. Essentially, they were able to identify QTLs with relatively modest effects on HDL cholesterol levels by first localizing them through linkage analyses of correlated traits that were closer to gene action, i.e. RNA expression levels.
Other recent advances in molecular technology are also changing the uses and trajectory of human QTL linkage studies. It is now technologically and economically feasible, if not quite routine, to genotype a million or more SNPs genome-wide in a large sample of individuals. Initially the markers in these SNP panels were selected to provide good coverage of common variation in genome-wide association studies. That is, the panels were composed of markers with high allele frequencies chosen because they represented a group of markers that were in linkage disequilibrium with each other and with the ‘tag’ SNP selected for inclusion in the panel. Newer panels also include rarer variants, particularly known coding SNPs that change the amino acid sequence of a protein, and markers selected to identify polymorphic copy number variations in humans. Whole genome association studies with these panels are now rivaling linkage as a method for gene localization.
As discussed above, although they are often viewed as being in competition, linkage and association methods are complementary and are optimized to find different sorts of QTLs. Linkage has an advantage when the minor allele of a QTL is rare in the population but has a large effect in each individual who carries the variant allele or when there are multiple QTNs in a region that add up to a substantial total QTL effect size. Association has the advantage when a QTL is due to one or a few QTNs that are among the genotyped markers on a SNP panel or are common but of modest effect in each individual who carries the variant allele. Additionally, the two approaches can be used jointly to localize and identify QTLs. For a large study with dozens of quantitative phenotypes, association analyses with markers from a genome-wide SNP panel may be an economical way to follow up the multiple linkage signals in multiple chromosomal regions for the different phenotypes, helping to move from a general region of linkage to a specific gene or genes under the linkage peak. Conversely, linkage results, even ones that are not significant on a genome-wide level, may help to inform genome-wide association studies. Roeder et al. proposed that linkage results could be used to adjust p-values in genome-wide association studies, up-weighting the association results in regions with prior evidence for linkage with the weighting factor depending on the strength of the linkage signal(56). Sun et al. proposed a more categorical approach to the same issue, suggesting that false discovery rate could be controlled in a stratified manner, with SNPs being grouped by factors such as whether they were in a region with prior evidence of linkage(57).
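The idea of up-weighting association results in linked regions can be sketched with a weighted Bonferroni rule. This is a simplified illustration and not the published procedure: it assumes each SNP has a prior linkage score, converts the scores to weights with an exponential function chosen here for illustration, rescales the weights to average one so that the overall error rate is preserved, and then compares each p-value to alpha times its weight divided by the number of tests. The function names, the `strength` parameter, and all numbers are hypothetical.

```python
import math


def linkage_weights(linkage_scores, strength=1.0):
    """Turn per-SNP linkage scores into weights that average exactly one."""
    raw = [math.exp(strength * score) for score in linkage_scores]
    mean_raw = sum(raw) / len(raw)
    return [value / mean_raw for value in raw]


def weighted_bonferroni(pvalues, weights, alpha=0.05):
    """Reject SNP i when p_i <= alpha * w_i / m, where m is the number of tests."""
    m = len(pvalues)
    return [p <= alpha * w / m for p, w in zip(pvalues, weights)]


if __name__ == "__main__":
    pvalues = [0.03, 0.03]
    linkage_scores = [2.0, 0.0]  # the first SNP lies under a linkage signal
    weights = linkage_weights(linkage_scores)
    print("weights:   ", [round(w, 3) for w in weights])
    print("unweighted:", weighted_bonferroni(pvalues, [1.0, 1.0]))
    print("weighted:  ", weighted_bonferroni(pvalues, weights))
```

In this toy case neither SNP passes the unweighted threshold, while the SNP in the linked region passes once weights are applied. The cost is a stricter threshold for SNPs outside linked regions, which is why the weights must average one.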
The next logical step in the progression of genotyping technology for human gene mapping is to move to complete resequencing for purposes of genome scanning. At present, this is infeasible due to the cost and speed of available sequencing platforms. However, the X Prize Foundation has issued a challenge with a reward of 10 million dollars to the team that can sequence the complete genomes of 100 individuals in no more than 10 days at a cost of no more than $10,000 per sample(58). If this challenge is met, it is a short step from that point to a situation where it becomes economically feasible to resequence 1000 or more individuals for a QTL localization study. When we have complete sequence data, we will be able to guarantee that any functional variants, any QTNs, that are present in a sample are among the genotyped markers and we will be able to test each variant in a fixed effects model that will provide more power than either association studies that depend on linkage disequilibrium between markers or linkage studies using random effects models. This is the juncture at which we believe linkage studies will become obsolete and a whole new class of quantitative genetic methods will be required to sift through the vast amounts of sequence data in ways that maintain power and control false positives in the face of massive multiple testing, take into account the structure within the genome and the functional grouping of DNA segments into genes and regulatory elements, and take advantage of biological knowledge regarding related genes, traits, and pathways.
This work was supported in part by NIH grants R01 MH59490, R01 GM31575, U10 AA08403, R01 HL45522, R01 MH78111, and R01 HL070751. Thanks also to the NIH and NSF for travel support to attend ICQG3.