A number of phenomena must be considered in linkage and association analyses, and they often require either very strict assumptions or appropriate accommodation in the relevant statistical models. We briefly describe several of these with reference to the example families shown in the accompanying figure.
Multifactorial, Polygenic Disease Basis
Most neuropsychiatric diseases, such as schizophrenia, are complex and multifactorial in that many genes and environmental factors contribute to their expression. This complicates statistical analysis because the effects of any one gene may be obscured by the effects of other genes and environmental stimuli (for detailed reviews of the genetic analysis of schizophrenia and related diseases, see Gershon and Badner,16 Owen et al,17 Riley et al,18 and Riley and Kendler20). Given that genes work in tandem or in combination through networks, hypotheses have been put forward as to the genetic basis of monogenic diseases (ie, those influenced primarily by perturbations in a single gene) and of complex, multifactorial, and/or polygenic diseases that are influenced by many genes.4
Figure 3 provides a simple graphical representation of a network of genes. It is assumed that the 5 genes on the periphery of the network govern some biochemical and physiologic process that, when perturbed, causes disease. In the figure on the left, the central or “nodal” gene is perturbed and its effect influences the functioning of the 5 peripheral genes. This scenario is consistent with monogenic disease. In the figure on the right, the central gene has not been perturbed, so that each (or many) of the genes on the periphery of the network must be perturbed in order for the biochemical or physiologic process to fail. This scenario is consistent with polygenic disease.
Fig. 3. A simplistic depiction of the possible origins of simple, monogenic, overtly Mendelian diseases and complex, polygenic diseases, considering the fact that genes work in networks. Arrows connect genes that influence each other and may reflect redundancy, ...
Given that most neuropsychiatric conditions are complex and polygenic, it is unlikely that every individual carrying a particular DNA sequence variation will manifest schizophrenia. That is, perturbations in these genes may not be sufficient, nor even necessary given heterogeneity (see below), to cause the expression of a disease such as schizophrenia. The term “incomplete penetrance” is used to describe the phenomenon in which the mere presence of a specific disease allele is not enough to cause the disease.13 The leftmost offspring in family 4 and the mother in family 5 both carry the T allele at the disease-causing (bottom) locus but do not have the disease, reflecting the incomplete penetrance of the T allele.
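The idea of incomplete penetrance can be made concrete with a small calculation. The following is a minimal sketch, not a method from the text: it assumes a single biallelic locus in Hardy-Weinberg proportions, and the allele frequency and penetrance values are hypothetical numbers chosen only for illustration.

```python
def genotype_frequencies(q):
    """Hardy-Weinberg genotype frequencies, keyed by the number of
    copies (0, 1, 2) of a risk allele with population frequency q.
    Hardy-Weinberg proportions are an assumption of this sketch."""
    p = 1.0 - q
    return {0: p * p, 1: 2 * p * q, 2: q * q}


def prevalence(q, penetrance):
    """Population prevalence: sum over genotypes of
    P(genotype) * P(disease | genotype)."""
    freqs = genotype_frequencies(q)
    return sum(freqs[g] * penetrance[g] for g in freqs)


def unaffected_carrier_fraction(q, penetrance):
    """Fraction of risk-allele carriers who do NOT express the disease."""
    freqs = genotype_frequencies(q)
    carriers = freqs[1] + freqs[2]
    unaffected = sum(freqs[g] * (1.0 - penetrance[g]) for g in (1, 2))
    return unaffected / carriers


# Hypothetical values: risk-allele frequency 0.1, and a dominant allele
# that causes disease in only half of the individuals who carry it.
q = 0.1
incomplete = {0: 0.0, 1: 0.5, 2: 0.5}

print(prevalence(q, incomplete))
print(unaffected_carrier_fraction(q, incomplete))
```

With these illustrative numbers, half of all carriers are unaffected, which is the situation of the unaffected T-allele carriers in families 4 and 5. Setting the penetrance of carrier genotypes to 1.0 recovers the fully penetrant case, in which no carrier is unaffected.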
Many diseases and traits are not “either/or” or binary conditions but rather show quantitative variation in the population. In fact, most traits are like this. Consider schizophrenia, depression, and anxiety: they are usually measured in degrees reflecting severity. Modeling quantitative trait expression requires sophisticated constructs in statistical genetic models.
The network nature of gene activity also provides a mechanism for a single gene to influence multiple observable phenotypes. The phenomenon whereby perturbations in a single gene influence multiple clinical or observable phenotypes is termed “pleiotropy” and is likely to be one of the reasons that, eg, schizophrenia and bipolar disorder are often seen in the same families and may have common genetic determinants (for an explicit genetic analysis of pleiotropy, see Zhang et al23).
Many complex diseases may be expressed as a result of different combinations of genetic variations that work independently of other combinations. Thus, it may be the case that none or few of a set of schizophrenia-causing genes are necessary for the expression of the disease phenotype. The father in family 3 manifests the disease but does not carry the T allele at the disease locus, possibly due to heterogeneity (ie, he has the disease because he carries a different disease-causing variation than the T allele at the bottom locus). Locus heterogeneity arises when different genes influence a disease independently. Allelic heterogeneity arises when different variations within the same gene influence disease susceptibility.
Individuals who have been diagnosed with a disease but do not carry a known disease-causing genetic variation may reflect the imprecision of the diagnostic instrument used (eg, the DSM-IV) and thereby complicate genetic analyses. Such individuals are termed “phenocopies.” It is difficult to differentiate phenocopies that arise from an imprecise diagnostic or phenotyping instrument from individuals who manifest a disease without a particular genetic variation because of heterogeneity. The father in family 3 may be a phenocopy because he has been diagnosed with the disease but does not carry the T allele at the disease-causing locus.
When both parents in a family possess a disease-causing variation that can be (or has been) transmitted to their offspring, the family is termed “bilineal” (eg, family 6). Bilineality can cause problems for statistical genetic analyses because one cannot easily trace the inheritance of potential disease variations through a single line of descent.24 For this reason, eg, the ascertainment scheme used by the Consortium on the Genetics of Schizophrenia excludes families with evidence of bilineal transmission of schizophrenia.9
One of the most vexing problems in the analysis of case-control–based genetic association studies concerns situations in which the cases are sampled (knowingly or unknowingly) from one population (eg, Australia) and controls are sampled from another (eg, Japan). Because the 2 populations are likely to have very different origins and gene pools, one might observe many different genetic variations providing evidence for association with the disease-bearing individuals (ie, greater frequency in cases), not because of a causal relationship between those variations and the disease but rather because those variations are simply more frequent in the population from which the cases were sampled.4,25 Although it will rarely be the case that sampling of cases and controls is pursued (consciously) from populations as different as, eg, Australia and Japan, more subtle differences can occur if there is any population “substructure” within the geographic locations from which the individuals have been sampled. The “stratification” problem, as it is known, can be overcome through the use of TDT analysis or through statistical analysis strategies that assess and control for stratification in an association analysis.26,27 Ultimately, stratification does not have to occur as overt genome-wide allele frequency differences between cases and controls but can be more cryptic, in the sense that many, but not all, cases are sampled from one population, as are the controls, creating subgroups among the cases and controls that could lead to false-positive (and false-negative) results.28,29
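The stratification problem can be illustrated numerically. The sketch below is an assumption-laden toy example, not an analysis from the text: it uses two hypothetical populations (`pop_A` and `pop_B`) with made-up allele frequencies, assumes the allele has no effect on disease within either population, and computes the expected allele-count table and the ordinary Pearson chi-square statistic for a 2 × 2 table.

```python
def expected_allele_table(cases_by_pop, controls_by_pop, allele_freq_by_pop):
    """Expected 2x2 allele-count table
        [[case T, case non-T], [control T, control non-T]]
    when the T allele is unrelated to disease within every population.
    Each sampled individual contributes 2 alleles."""
    def counts(n_by_pop):
        t = sum(2 * n * allele_freq_by_pop[pop] for pop, n in n_by_pop.items())
        total = sum(2 * n for n in n_by_pop.values())
        return [t, total - t]
    return [counts(cases_by_pop), counts(controls_by_pop)]


def chi_square_2x2(table):
    """Pearson chi-square statistic (1 degree of freedom, no continuity
    correction) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical T-allele frequencies in two populations.
freq = {"pop_A": 0.6, "pop_B": 0.2}

# Cryptic stratification: most cases come from pop_A, most controls from pop_B.
mismatched = expected_allele_table(
    {"pop_A": 400, "pop_B": 100}, {"pop_A": 100, "pop_B": 400}, freq)

# Cases and controls drawn from the two populations in equal proportions.
matched = expected_allele_table(
    {"pop_A": 250, "pop_B": 250}, {"pop_A": 250, "pop_B": 250}, freq)

print(chi_square_2x2(mismatched))  # large, although T has no causal role
print(chi_square_2x2(matched))     # no evidence of association
```

In the mismatched design the statistic far exceeds 3.84, the conventional 5% critical value for 1 degree of freedom, even though the allele is causally irrelevant by construction. When cases and controls share the same population mix, the spurious signal disappears.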
A special form of stratification or genetic background differences can be exploited in combined linkage and association analyses. Admixed individuals have parents, grandparents, etc, from different racial or population subgroups that are known to differ in allele frequencies at many loci as well as in disease rates. Some of these admixed individuals will have schizophrenia because they have inherited a genetic variation that is more likely to have come from a parent, grandparent, etc, of a particular subgroup. These individuals can be compared with unaffected individuals to see which regions of the genome or alleles the diseased individuals have in common that are more frequent in the population with the higher disease rate. The idea is that those shared genomic regions and alleles are likely to reflect the variations that contribute to the higher disease rate in the one population and hence are responsible for the disease in the affected subjects.30,31
Epistasis and Gene × Environment Interactions
Modeling and testing gene × gene and gene × environment interactions, if such interactions contribute to disease susceptibility, can be daunting, given the number of potential combinations that can be tested. Despite this fact, recent articles have shown that, in certain instances, such testing can be quite powerful and informative.32–34 It has also been shown that, despite the large number of tests that would be performed, the analyses of 2- or 3-locus interactions can yield statistically significant results.32
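To give a sense of the scale involved, the number of 2- and 3-locus combinations can be counted directly. This is a rough sketch under simplifying assumptions: one test per combination of loci (ignoring the multiple genotype classes within each combination), a hypothetical panel of 1000 loci, and a Bonferroni correction, which is a simple and conservative choice that the text itself does not name.

```python
from math import comb


def interaction_test_count(n_loci, order):
    """Number of distinct locus combinations of the given order
    (order=2 for pairwise, order=3 for three-way interactions).
    Assumes exactly one test per combination."""
    return comb(n_loci, order)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise
    error rate at alpha (Bonferroni; an assumption of this sketch)."""
    return alpha / n_tests


n_loci = 1000  # hypothetical panel size
for order in (1, 2, 3):
    n_tests = interaction_test_count(n_loci, order)
    print(order, n_tests, bonferroni_threshold(0.05, n_tests))
```

For 1000 loci there are 499500 pairs and 166167000 triples, so the per-test threshold shrinks by several orders of magnitude as the interaction order grows. An interaction must therefore produce a very strong signal to remain significant after correction.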
Parametric vs Nonparametric Tests
Geneticists often make assumptions about the mode of inheritance of a trait or disease (eg, it is caused by a dominant allele that is fully penetrant) and then incorporate these assumptions into appropriate statistical models. This type of analysis assumes some “parametric” form (ie, the values of certain parameters, such as penetrance and allele frequency, are assumed). Nonparametric statistical genetic analyses do not require as many assumptions. For example, the classic affected sibling pair design in linkage analysis settings merely assesses the degree to which affected siblings (eg, families 1, 2, 5, and 6) share alleles in a manner that cannot be attributed to chance. Nonparametric tests are notoriously “underpowered” (ie, they require huge sample sizes in order to detect an effect). Parametric analyses, on the other hand, assume that one has incorporated the correct values of certain parameters in the model, and these values can be hard to know a priori. The distinction between parametric and nonparametric models is most pronounced in linkage analysis settings, as opposed to association analysis settings, because linkage analysis modeling of the relationship between allele sharing and phenotypic similarity is more complex and subtle than association analysis modeling of the relationship between particular variations and a phenotype.
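One simple form of the affected sibling pair idea is the "mean test" of allele sharing, sketched below. This is an illustrative simplification rather than the procedure of any particular study: it assumes that the number of alleles each pair shares identical by descent (IBD) is known without ambiguity, whereas in practice it is estimated from marker data. Under the null hypothesis of no linkage, a sibling pair shares 0, 1, or 2 alleles IBD with probabilities 1/4, 1/2, and 1/4, so the proportion shared has mean 1/2 and variance 1/8. The counts in the example are made up.

```python
from math import erfc, sqrt


def asp_mean_test(ibd_counts):
    """Mean test for excess allele sharing in affected sibling pairs.

    ibd_counts: one entry per pair, the number of alleles (0, 1, or 2)
    shared IBD at the marker (assumed known exactly in this sketch).
    Returns (mean proportion shared, z statistic, one-sided P value),
    using a normal approximation to the null distribution."""
    n = len(ibd_counts)
    if n == 0:
        raise ValueError("at least one sibling pair is required")
    mean_prop = sum(ibd_counts) / (2 * n)
    # Null: proportion shared has mean 1/2 and variance 1/8 per pair.
    z = (mean_prop - 0.5) / sqrt(0.125 / n)
    p_one_sided = 0.5 * erfc(z / sqrt(2))  # upper-tail normal probability
    return mean_prop, z, p_one_sided


# Hypothetical data: 100 affected pairs sharing 0, 1, or 2 alleles IBD.
pairs = [0] * 15 + [1] * 45 + [2] * 40
print(asp_mean_test(pairs))
```

No penetrance or allele frequency enters the calculation, which is what makes the test nonparametric. The price is that modest excess sharing needs many pairs to reach significance, in line with the point about power made above.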
Multiple Comparisons and False-Positive Results
When there is no a priori reason to believe that variations in a particular gene contribute to disease susceptibility, researchers are forced to sequentially test hundreds to millions of variations for association or linkage with a trait. Multiple testing of this sort creates enormous potential for false positives if very stringent criteria for declaring statistical significance are not used. Although many guidelines and methods for assessing statistical significance have been proposed for both linkage and association studies,35,36 more work is needed in this area, especially in the context of assessing the biological significance of a potential association. One particularly useful strategy for accommodating multiple comparisons involves the notion of the “false discovery rate” (FDR).37–39 The FDR is used to assess the probability that a large number of statistical tests have produced some test statistics or P values that are not likely to have occurred by chance given the number of tests performed.
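As an illustration of how an FDR criterion is applied in practice, the sketch below implements the Benjamini-Hochberg step-up procedure, one widely used FDR-controlling method. The text does not say which FDR procedure the cited work uses, so this choice is an assumption, and the P values in the example are invented. The procedure sorts the m P values, finds the largest rank k for which the k-th smallest P value is at most k × q / m, and declares the k smallest P values significant.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, in the original order of p_values,
    marking which tests are declared significant while controlling
    the FDR at level q (valid for independent tests)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    # Find the largest rank whose P value falls under its threshold.
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Invented P values from 10 hypothetical association tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
            0.060, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_values, q=0.05))
```

With these numbers the two smallest P values are declared significant. A Bonferroni threshold of 0.05 / 10 = 0.005 would retain only the first, which shows why FDR-based criteria are often preferred when very many variations are tested.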