Several genome-wide association studies have recently identified novel susceptibility loci for medical conditions such as diabetes mellitus and schizophrenia (Barret et al. 2009; Stefansson et al. 2009). This has increased the need to investigate the phenotypic differences that are conferred by such quantitative trait loci. However, because single loci make only small to modest contributions to most complex traits, such phenotypic differences are hard to detect. The problem of small effect sizes of specific alleles or haplotypes is compounded by the complexity of many of the traits of interest. Most candidate genes have been identified in subjects with a disorder that incorporates a broad variety of symptoms. The co-occurrence of several symptoms can be the result of pleiotropic effects of a single variant, but may also be due to the underlying abnormality. Many disorders coincide with physical, emotional, and social abnormalities; for example, depression is associated with cognitive problems, cardiovascular risks, and social problems, among others. These concomitant phenomena are likely to influence the expression of the original trait, and may well obscure the initial relationship between a candidate gene and a trait. As a consequence, the particular symptoms or abnormalities associated with these genes remain unclear.
We therefore propose reversing the process: instead of selecting a trait and examining its relationship with the underlying genes, we select genetic variation and examine the accompanying trait. Testing the influence of a particular gene on phenotypes is a common approach in both animal and molecular research, where the influence of genetic variation is often studied by inbreeding the genetic variant, by creating a knockout mouse, or by transposing the variant of interest into cell cultures or organisms by means of a vector. The statistical power of an association test for a candidate gene depends on the distribution of genotypes in the test population. Maximum statistical power for a given number of phenotyped individuals is obtained when the test population consists of equal numbers of alternative homozygotes at the candidate gene (as is the case when the two alleles at the candidate gene are of equal frequency). However, since allele frequencies at the candidate gene locus are generally far from equal, the distribution of informative alleles in the population as a whole is generally far from this optimal distribution. Thus, depending on the cost of determining phenotypes relative to the cost of determining genotypes, it may be more effective to genotype a large sample population and then choose a set of individuals with an optimal distribution of genotypes for further phenotyping. Here we provide information on the statistical power under different genotype sampling strategies, as a function of explained variance, dominance, and allele frequency at the candidate gene, and on phenotype/genotype cost ratios.
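The dependence of power on the genotype distribution can be illustrated with a minimal sketch. It is not the calculation used in this study: it assumes a purely additive per-allele effect (no dominance), a known residual standard deviation, Hardy-Weinberg proportions in the random sample, and a normal approximation to the two-sided test of the regression slope of phenotype on allele dosage. All function and parameter names are hypothetical.

```python
from statistics import NormalDist


def dosage_variance_random(p):
    """Variance of allele dosage (0, 1, 2) in a random sample,
    assuming Hardy-Weinberg proportions with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def dosage_variance_homozygotes():
    """Variance of dosage when half the sample has dosage 0 and half
    has dosage 2 (equal numbers of the two homozygotes)."""
    return 1.0


def approx_power(n, effect, dosage_var, resid_sd=1.0, alpha=0.05):
    """Normal approximation to the power of a two-sided slope test.

    n          : number of phenotyped individuals
    effect     : assumed additive effect per allele, in phenotype units
    dosage_var : variance of allele dosage in the phenotyped sample
    resid_sd   : assumed residual standard deviation of the phenotype
    """
    z = NormalDist()
    # Expected value of the test statistic under the alternative.
    shift = abs(effect) * (n * dosage_var) ** 0.5 / resid_sd
    crit = z.inv_cdf(1.0 - alpha / 2.0)
    return z.cdf(shift - crit) + z.cdf(-shift - crit)


if __name__ == "__main__":
    n, effect = 200, 0.2  # illustrative values only
    for p in (0.1, 0.3, 0.5):
        random_power = approx_power(n, effect, dosage_variance_random(p))
        selected_power = approx_power(n, effect, dosage_variance_homozygotes())
        print(f"p={p:.1f}  random={random_power:.3f}  homozygotes={selected_power:.3f}")
```

Under these assumptions, the noncentrality of the test scales with the dosage variance, so a sample of equal numbers of homozygotes is 1 / (2p(1 - p)) times as informative per phenotyped individual as a random sample; the advantage grows as the allele frequency p moves away from 0.5.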
Selecting subjects from the general population based on homozygosity for a candidate gene, instead of selecting subjects with an apparent disorder, has two major advantages. (1) The investigation of the relationship with genotype is not biased by selection for severity of disease, and we therefore avoid bias resulting from secondary symptoms. (2) This approach facilitates the estimation of the effects of single variants in relative isolation, because the selection is not based on phenotype. As a consequence, there is no selection for the presence of additional risk variants for that particular phenotype, although the selection will shift the distribution of the phenotype.
The value of such a “forward genetics” approach lies in its increased statistical power and its cost-effectiveness. As already pointed out, the increase in power is due to the selection of the most informative subjects. Other strategies, such as the extreme discordant and concordant design (Risch and Zhang 1995), in essence do the same by selecting extreme phenotypes. With the ever-decreasing cost of genotyping, our strategy only has merit if the cost of obtaining phenotype information is high. In studies of complex and quantitative phenotypes, such as those that apply costly neuroimaging, this approach can be particularly advantageous.
We investigated the sample size requirement of this approach and the cost-effectiveness under different scenarios.
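The trade-off between genotyping and phenotyping costs can be sketched in the same spirit. The example below is an illustration under stated assumptions, not the cost model used in this study: it assumes Hardy-Weinberg proportions, an additive effect, expected genotype counts with no sampling variability, and a random-sample design sized to match the information content of the selected design (dosage variance 2p(1 - p) versus 1). The function names and the cost units are hypothetical.

```python
import math


def screening_size(n_per_group, maf):
    """Expected number of people to genotype in order to find n_per_group
    carriers of the rarer homozygous genotype (frequency maf**2 under
    Hardy-Weinberg proportions)."""
    return math.ceil(n_per_group / (maf * maf))


def cost_selected(n_per_group, maf, cost_geno, cost_pheno):
    """Genotype a large screening sample, then phenotype only
    n_per_group individuals of each homozygous genotype."""
    genotyping = screening_size(n_per_group, maf) * cost_geno
    phenotyping = 2 * n_per_group * cost_pheno
    return genotyping + phenotyping


def cost_random(n_per_group, maf, cost_geno, cost_pheno):
    """Genotype and phenotype a random sample large enough to carry the
    same information as 2 * n_per_group selected homozygotes."""
    n_random = math.ceil(2 * n_per_group / (2.0 * maf * (1.0 - maf)))
    return n_random * (cost_geno + cost_pheno)


if __name__ == "__main__":
    # Illustrative values only: 50 subjects per homozygous group,
    # minor allele frequency 0.2, genotyping cost fixed at 1 unit.
    for cost_pheno in (1, 10, 100):
        selected = cost_selected(50, 0.2, 1, cost_pheno)
        random_design = cost_random(50, 0.2, 1, cost_pheno)
        print(f"pheno cost={cost_pheno:>3}  selected={selected}  random={random_design}")
```

In this sketch the selected design pays for a large screening sample up front, so it only becomes cheaper than the random design once phenotyping is sufficiently expensive relative to genotyping.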