In tests of association in the presence of linkage (TALs) (aka, tests of linkage in the presence of association; joint tests of linkage and association), we wish to identify situations in which (A) genotypes at a marker locus (either directly or indirectly through intermediary phenotypes) cause variations in a phenotype; or (B) the marker locus is in linkage disequilibrium with another locus at which genotypes cause variations in the phenotype; and to distinguish those situations from (C) situations in which genotypic variation at the marker locus is correlated with (but not linked to) some other inherited factor that causes variation in the phenotype. Although less commonly discussed, we also wish to (D) identify marker loci that are both linked to loci that (either directly or indirectly through intermediary phenotypes) cause variations in a phenotype and, when certain other variables are conditioned on, also associated with loci that (either directly or indirectly through intermediary phenotypes) cause variations in a phenotype; even in situations where (E) genotypic variation at the marker locus is not associated with variations in the phenotype in the absence of conditioning on those other variables (i.e., when the association is
masked [160] or suppressed [181]).
Everything in the preceding paragraph is just another way of saying that we need to control for potential confounding so that we may infer a causal influence on the phenotype of the marker locus itself or of something in linkage disequilibrium with it. The ultimate source of the potential confounding that we wish to control for in TDT-type tests is non-linkage disequilibrium (NLD), i.e., correlation or disequilibrium among unlinked loci. NLD can result from many sources, including selection [182], assortative mating [145], and the admixture process [16].
There is a rich literature on detecting causal effects in scientific research. To the extent that causation can ever be determined, most methodologists concur that we can have no stronger basis than a randomized experiment [183]. This is because the act of randomization assures that, in the hypothetical population to which we wish to make inferences (not the specific sample in hand), there can be no association between the independent variable to which we assign subjects and any variable that existed prior to randomization. Therefore, randomization is the only method that controls for both known and unknown sources of confounding. Thus, in an ideal world, we would randomly assign individuals to genotypes at marker loci and then do our tests without any concern of confounding. Of course, in reality, this is not possible. So what is the next best thing?
We can find the root of the next best thing in Gregor Mendel's second law of genetics, the law of independent assortment. Mendel [184] wrote ‘All constant combinations which in peas are possible by the combination of the said 7 differentiating characters were actually obtained by repeated crossing. Their number is given by 2^7 = 128. Thereby is simultaneously given the practical proof that the constant characters which appear in the several varieties of a group of plants may be obtained in all the associations which are possible according to the laws of combination, by means of repeated artificial fertilization.’ A more formal statement of the law of independent assortment is ‘
When gametes are formed the alleles for one trait segregate independently of the alleles of a gene for another trait.’ In other words, Mendel believed that genes for different traits segregate independently. We now know that this is only true for genes at unlinked loci. Nevertheless, Mendel's second law implies that every act of meiosis is an act of randomization in which parents randomly assign alleles to the gametes they form from their available alleles. This further implies that, conditional upon parents’ genotypes, all individuals have equal probability of inheriting (i.e., being assigned to) any particular genotype. Thus, conditional upon parents’ genotypes, individuals are essentially randomized to genotypes. The only caveat (which often works in our favor in genetic research) is that the genotypes to which individuals are randomly assigned at one locus will be correlated with the genotypes to which they are assigned at other loci, but only when the loci in question are physically linked. Hence, conditioning on parents’ genotypes offers us a natural randomized experiment that eliminates the possibility of confounding by NLD. It does not eliminate potential confounding by LD, but this ‘confounding’ by LD is actually just what we are counting on to help us identify genes in many association studies (especially genome-wide association studies).
How can we condition on parents’ genotypes? There are several ways in which this can be achieved. The first and most straightforward way would be to begin with two individuals of opposite sex, each homozygous at every locus but differing from the other at many loci. If such individuals produce a large number of offspring, these offspring will all be genetically identical and heterozygous at every locus at which the two parents differed. These offspring can then be intermated to produce another generation. In the second generation that descends from the original set of parents (conventionally denoted the F2 generation), every individual would have the same probability of being assigned to each genotype as every other individual. Thus, individuals are essentially randomized to genotypes and we have the equivalent of a true experiment with randomization. This is essentially a description of an F2 cross among inbred lines, which is classically used to map genes for complex traits in animals such as mice and flies. It is noteworthy that the individuals comprising the F2 population are admixed, and yet there is no concern about confounding due to admixture because all individuals have the same ancestry. As pointed out by Redden et al. [16], this indicates that it is variation in ancestry, and not admixture per se, that can cause confounding by NLD. Thus, the F2 cross among inbred lines can be seen as the geneticist's experiment in which meiosis is used to enact the process of randomization. It can also be seen as a precursor to the TDT.
Of course, we cannot set up inbred lines and do controlled breeding in humans. How then can we achieve similar objectives? We can do so by recognizing that in order for individuals to be assigned essentially at random (i.e., with equal probability across individuals) to genotypes at the marker locus, it is only necessary that their parents have the same genotypes at the marker locus, not that their parents are genetically identical with every other set of parents at all loci. Hence, we should select only individuals whose parents all had a common pair of genotypes at the marker locus. For example, if we had a locus that was di-allelic with alleles A and a, we could select only individuals in which one parent was AA and the other parent was Aa. In the offspring, we could then assess whether individuals who ‘randomly’ receive an ‘a’ allele from one of their parents tend to be phenotypically different from individuals who receive no ‘a’ alleles from their parents. Such a design would, at the locus in question, essentially recapitulate a backcross in an experimental population such as mice, in which F1 heterozygotes are backcrossed to one of the parental strains. A design in which we only selected individuals whose parents had the genotypes Aa and Aa would essentially recapitulate an F2 cross at that locus.
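To illustrate the two mating types just described, the following minimal Python sketch enumerates the offspring genotype distribution implied by Mendelian segregation at a single di-allelic locus. The function name `offspring_distribution` and the string encoding of genotypes are assumptions made here for illustration; they are not taken from any of the cited methods.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def offspring_distribution(mother, father):
    """Offspring genotype probabilities for one locus, given parental genotypes.

    Genotypes are two-character strings such as "AA", "Aa", or "aa"
    (an encoding assumed for this sketch). Each parent transmits one of
    its two alleles with probability 1/2, independently of the other parent.
    """
    # Enumerate the four equally likely (maternal allele, paternal allele) pairs
    # and write each resulting genotype in a canonical order ("Aa", never "aA").
    counts = Counter("".join(sorted(m + f)) for m, f in product(mother, father))
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


# AA x Aa mating type: recapitulates a backcross at this locus.
backcross = offspring_distribution("AA", "Aa")
# Aa x Aa mating type: recapitulates an F2 cross at this locus.
f2_like = offspring_distribution("Aa", "Aa")
```

Under these assumptions, every offspring of a given mating type faces the same genotype probabilities regardless of ancestry, which is the sense in which the assignment is ‘essentially random’.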
In practice, of course, the approach described in the preceding paragraph would be infeasible. Instead, rather than selecting only individuals whose parents have particular genotypes, we can statistically control for (i.e., condition on) the two parental genotypes (a combination we often denote the mating type). This yields equivalent control because, conditional upon the parental genotypes, the assignment to offspring genotypes is essentially random. Thus, our second way of achieving the benefits of randomization, namely allowing strong causal inferences and eliminating confounding by NLD, is to statistically control for parental genotypes by directly observing them and including them in the statistical models. This is the basis of several TDTs [e.g., Allison [37]]. Using a similar argument, Tiwari et al. [26] and Beasley et al. [185] apply the rules of randomization by conditioning on parental genotypes.
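The effect of conditioning on parental genotypes can be seen in a toy simulation. The sketch below is an illustration only and is not the procedure of any cited TDT. It assumes two hypothetical subpopulations that differ in both allele frequency and mean phenotype, with no causal effect of the marker at all. The allele frequencies, phenotype means, sample size, seed, and all function names are arbitrary choices made for this example. A naive regression of phenotype on offspring genotype picks up the stratification, whereas the deviation of the offspring genotype from its expectation given the parents does not.

```python
import random


def draw_genotype(rng, p):
    """Number of A alleles (0, 1, or 2) under Hardy-Weinberg with frequency p."""
    return (rng.random() < p) + (rng.random() < p)


def transmit(rng, g):
    """Allele (1 = A, 0 = a) passed on by a parent carrying g copies of A."""
    if g == 1:
        return int(rng.random() < 0.5)  # heterozygote: meiosis is a fair coin flip
    return g // 2  # homozygote: always transmits the allele it carries


def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate(n_per_pop=4000, seed=1):
    rng = random.Random(seed)
    child, deviation, phenotype = [], [], []
    # (allele frequency, phenotype mean) per subpopulation: hypothetical values.
    for p, mean in ((0.8, 1.0), (0.2, 0.0)):
        for _ in range(n_per_pop):
            mother, father = draw_genotype(rng, p), draw_genotype(rng, p)
            g = transmit(rng, mother) + transmit(rng, father)
            child.append(g)
            # Expected offspring genotype given the parents is (mother + father) / 2.
            deviation.append(g - (mother + father) / 2)
            # Phenotype depends on subpopulation only, never on genotype.
            phenotype.append(rng.gauss(mean, 1.0))
    return slope(child, phenotype), slope(deviation, phenotype)


naive, conditional = simulate()
```

In this setup `naive` is clearly positive even though the marker has no effect, while `conditional` stays near zero, because the parental-genotype-adjusted deviation reflects only the randomization at meiosis.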
Of course, one may not be able to or wish to observe the genotypes of the parents themselves. One can then recognize that full siblings (by definition) share the same parents. Therefore, if one controls for sibship using studies of multiple siblings, one has effectively controlled for parents’ genotypes, because all siblings in a sibship have the same parents and hence the same parental genotypes. This offers yet another way to effectively condition upon parental genotypes and enjoy the inferential strength that randomization offers.
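One simple way to see how controlling for sibship works is to centre genotypes and phenotypes on their sibship means, so that anything shared by all siblings drops out. The sketch below uses a tiny made-up data set in which each sibship carries a different family-level shift in the phenotype and the within-family genotype effect is fixed at 0.5. The data, the effect size, and the helper names are hypothetical, and sibship-mean centring is used here as one illustrative form of control, not as the specific method of any cited paper.

```python
def ols_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def centre_within(groups):
    """Subtract each group's mean from its members and flatten the result."""
    centred = []
    for values in groups:
        mean = sum(values) / len(values)
        centred.extend(v - mean for v in values)
    return centred


BETA = 0.5  # hypothetical within-family effect of each A allele
# Genotypes (count of A alleles) for three sibships, plus a family-level
# phenotype shift that stands in for confounding shared by siblings.
sib_genotypes = [[0, 1], [1, 2], [2, 2, 1]]
family_shift = [0.0, 5.0, 10.0]
sib_phenotypes = [
    [shift + BETA * g for g in sibs]
    for sibs, shift in zip(sib_genotypes, family_shift)
]

# Pooled analysis ignores sibship and absorbs the family-level shifts.
flat_g = [g for sibs in sib_genotypes for g in sibs]
flat_y = [y for sibs in sib_phenotypes for y in sibs]
naive_slope = ols_slope(flat_g, flat_y)

# Within-sibship analysis: the shared shift cancels when centring on sibship means.
within_slope = ols_slope(centre_within(sib_genotypes), centre_within(sib_phenotypes))
```

Here `within_slope` recovers the within-family effect, while `naive_slope` is inflated several-fold by the between-family shifts.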
As this discussion indicates, there are multiple variables one could control for that may yield valid inference in this context, allowing the randomization by meiosis to eliminate confounding by NLD. Rabinowitz and colleagues [39, 72, 84, 186, 187] have extended this idea to talk about conditioning on sufficient statistics. They seek to identify statistics that are ‘sufficient’ in the sense that, if conditioned upon, they would eliminate confounding by NLD. At root, this is still the same concept but expressed in a different form. This different expression of the concept is the basis for several other TDT-type approaches [37, 69, 78]. Horvath et al. [188] express succinctly the importance of conditioning on sufficient statistics: ‘The general principle is to evaluate the distribution of test statistics using the conditional distribution of offspring genotypes under the null hypothesis, where the conditioning is on the sufficient statistics for any nuisance parameters in the model [72]. The potential nuisance parameters for nuclear families include the distribution of the phenotypes, the parental allele frequencies, and the model for ascertainment. By conditioning the offspring genotype distribution on the phenotypes, one eliminates sensitivity of the tests to misspecification of the phenotype distribution and to ascertainment conditions that depend on the phenotypes. Conditioning on the parental genotypes eliminates sensitivity to population admixture, when parents’ genotypes are unknown.’ The procedures in Allison's TDTs [37], George et al. [41], Allison et al. [69], and FBAT [73, 78] all correct for association by conditioning on the parental genotype or transmission status of the individual.
Finally, a new class of tests known as structured association tests [16] attempts to use the rest of the genome to derive, via various machinations, a variable that, if conditioned upon, would control for or eliminate NLD as a confounder. In the original formulations of such approaches, the variable one sought to control for was an index of genetic admixture under some assumption of a particular population dynamic including population admixture [2, 8, 11, 12, 13, 19, 169, 172]. More recently, these approaches are being extended to allow for other background genetic factors [10, 15]. It is important to note that, unlike family-based TDT-type approaches, which strictly eliminate confounding by NLD, approaches such as structured association testing only do so to the extent that one has effectively captured the important background covariates for inclusion in the model and modeled them successfully [16]. Expressed in this way, one can see structured association testing as essentially trying to achieve the same goals that propensity score analysis attempts to achieve in more general epidemiologic studies [189, 190]. Indeed, our group is currently working on formalizing the propensity score analysis approach to structured association testing. Note that the genomic control method [191] achieves valid inference by correcting for the variance inflation factor rather than by conditioning on sufficient statistics.
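For readers unfamiliar with the variance inflation correction, the sketch below shows one commonly used formulation: the inflation factor is estimated as the median of the observed 1-df chi-square association statistics divided by the median of a 1-df chi-square distribution (about 0.455), and each statistic is then divided by that factor. This is a hedged illustration, not a transcription of the method in [191]. The function name, the example statistics, and the convention of not letting the factor fall below 1 are assumptions made here.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control(chi2_stats):
    """Return (inflation factor, deflated statistics) for 1-df chi-square tests.

    The factor is floored at 1.0 so that statistics are never inflated;
    this floor is a common convention assumed for this sketch.
    """
    inflation = max(1.0, statistics.median(chi2_stats) / CHI2_1DF_MEDIAN)
    return inflation, [s / inflation for s in chi2_stats]


# Hypothetical statistics from three markers, chosen so the median is
# exactly twice the null median.
inflation, adjusted = genomic_control([0.2, 0.9098728, 3.0])
```

With these made-up inputs the estimated inflation factor is 2, so each statistic is halved before being compared with the chi-square reference distribution.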