PLoS One. 2010; 5(7): e11876.
Published online 2010 July 29. doi:  10.1371/journal.pone.0011876
PMCID: PMC2912380

The Impact of Phenocopy on the Genetic Analysis of Complex Traits

Klaus F. X. Mayer, Editor


A substantial debate is ongoing on genome-wide association studies (GWAs). A key point is the capability to identify low-penetrance variations across the human genome. Among the phenomena that reduce the power of these analyses, phenocopy level (PE) seriously hampers the investigation of complex diseases, as is well known in neurological disorders and cancer, and it is likely of primary importance in human ageing. PE seems to be the norm rather than the exception, especially when considering the role of epigenetics and environmental factors in shaping the phenotype. Despite some attempts, no recognized solution has been proposed, particularly for estimating the effects of phenocopies on study planning or analysis design. We present a simulation in which we attempt to define more precisely how phenocopy affects different analytical methods under different scenarios. With our approach the critical role of phenocopy emerges: the more the PE level increases, the more the initial difficulty in detecting gene-gene interactions is amplified. In particular, our results show that, if the study is sufficiently powered, strong main effects remain detectable in the presence of an increasing amount of phenocopy in the study sample, although the significance of the association is progressively reduced. Conversely, when purely epistatic effects are simulated, the capability of identifying the association depends on several parameters, such as the strength of the interaction between the polymorphic variants, the penetrance of the polymorphism, the alleles (minor or major) that produce the combined effect, and their frequency in the population. We conclude that neglecting the possible presence of phenocopies in complex traits heavily affects the analysis of their genetic data.


High-throughput genetic analysis represents the present and the future in identifying the genetic determinants of complex diseases [1], [2], [3], [4], [5], [6]. A substantial debate is ongoing on the best approaches to overcome the major issues inherent to genome-wide association (GWA) study designs [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18].

The most widely used statistical tests are single-point statistics (chi-square or Cochran-Armitage trend test) along the genome; these tests can be integrated with haplotype (or multi-marker) analysis once the linkage disequilibrium (LD) structure has been drawn and haplotype blocks have been identified.
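As an illustration of a single-point statistic, the following is a minimal sketch of the Cochran-Armitage trend test on one 2×3 genotype table. The function name, the additive weights (0, 1, 2) and the example counts are assumptions made for illustration; they are not taken from the study or from PLINK.

```python
import math


def cochran_armitage_trend(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend test for a 2x3 genotype table.

    cases, controls: genotype counts ordered as (AA, Aa, aa).
    Returns the chi-square statistic (1 df) and its p-value.
    """
    n_case = sum(cases)
    n_ctrl = sum(controls)
    n = n_case + n_ctrl
    totals = [r + s for r, s in zip(cases, controls)]  # per-genotype totals

    # Score: weighted difference between case and control genotype counts.
    score = sum(w * (r * n_ctrl - s * n_case)
                for w, r, s in zip(weights, cases, controls))

    # Variance of the score under the null hypothesis of no association.
    sum_w2 = sum(w * w * c for w, c in zip(weights, totals))
    sum_w = sum(w * c for w, c in zip(weights, totals))
    variance = n_case * n_ctrl * (n * sum_w2 - sum_w ** 2) / n

    chi2 = score ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail, chi-square 1 df
    return chi2, p_value


# Hypothetical counts, for illustration only.
chi2, p_value = cochran_armitage_trend(cases=(20, 50, 30), controls=(40, 45, 15))
print(f"chi2 = {chi2:.2f}, -log10(p) = {-math.log10(p_value):.2f}")
```

The additive weights test for a linear trend in risk with the number of minor alleles; other weight choices would correspond to dominant or recessive models.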

All these tests can be performed under different assumptions and with slightly different approaches, and multivariate analyses are generally performed.

Two main obstacles can be envisaged:

  1. the false positive rates, and consequently the efficacy of the corrections adopted;
  2. the capability to identify low-penetrance variations across the human genome.

As for false positives, many different approaches have been proposed and, provided the sample collection is large enough, a multi-stage design has been shown to be very effective in detecting key leads in the genome, often replicated in other populations. It is not the purpose of this paper to address this area [7], [19].

As for the identification of low-penetrance polymorphisms, this area deserves major consideration when disentangling the picture of any complex trait. Indeed, it is quite realistic for complex phenotypes to be determined by a combination of many different polymorphic loci, each accounting for a minor part of the total variance [20], and hence very difficult to detect when genome-wide genotyping is performed and GWA significance thresholds are applied [20].

Despite the key importance of this issue, most of the papers reporting GWA studies applied single-point statistics, multi-marker and haplotype analyses, performed LD mapping, and adopted different false-positive rate corrections [21], [22], [23], [24], [25]. Few of them actually included interaction analysis or similar approaches able to capture the effect of interactions and across-genome combinations, rather than the main effect of single markers or (although more importantly) the major contribution of a specific haplotype in a locus [26], [27], [28].

Among the phenomena reducing the power of these analyses, phenocopy seriously hampers the investigation of complex diseases, a well-known issue in neurological disorders [29], [30] and cancer [31], and likely of primary importance in the study of human ageing [32]. However, the concept of phenocopy is quite old in genetics and has assumed different meanings for different authors: for the purpose of this paper, we mainly refer to a definition adopted in linkage studies, where “phenocopy” indicates affected individuals who acquired the disease by different means than the ones segregating in the rest of the family [33]. Moreover, the term needs to be even more focused here, due to the characteristics of the simulation algorithm adopted in this study to generate the disease model and subsequently the datasets: overall, we consider a “phenocopy” to be an individual marked as affected, but in whom the underlying genetic markers associated with the disease are different from those of the other cases in the dataset. We also acknowledge that the classical definition of phenocopy takes on a smoother and wider perspective when we consider the most important complex traits: in this scenario its importance appears to be even higher, due to its intrinsic presence when the interplay of multiple genetic loci determines a disease. Phenocopy (indicated as PE, “phenocopy error”, from the terminology of the genomeSIMLA software) seems to be the norm rather than the exception, especially when considering the role that epigenetics and environmental factors exert on the phenotype [34].

Considering the scenario we are dealing with, additional terminology needs to be clarified. As previously mentioned, one of the hot topics geneticists are currently debating is whether the so called “missing heritability” issue would find an answer in very rare and highly penetrant mutations (detectable with exome sequencing or whole genome next generation sequencing only [35]), or in a multitude of polymorphisms with no effect when considered alone (main effect) but with a more significant effect when their statistical interaction is considered [36], [37].

As far as this latter point is concerned, several models have been proposed over many years [38] which define “epistasis” (again another term used with different meanings in genetics) as the interaction between different loci, and call “purely epistatic” those interactions between loci that do not display any single-locus main effect [37], [38], [39]. This model has been proposed and largely debated [34], [40], [41]: some authors consider the widely used additive model sufficient to incorporate these effects [42], or argue that such a scenario has little impact, but few papers specifically address this topic [43], [44].

Despite some attempts [45], [46], [47], no widely recognized solution has therefore been proposed, particularly to estimate the effects that phenocopies could exert either on study planning or on analysis design. At present, most analysis strategies do not take into account the intrinsic presence of phenocopy in complex traits.

We present a simulation [48], [49], [50], [51], where we attempt to define more precisely how phenocopy impacts on different analytical methods under different scenarios.


Simulation of the datasets

Two disease models have been simulated.

In the first model, i.e. “model ME”, standing for “Main Effect”, the marker RL0-855 was simulated with a main effect and an OR = 2.225. Three additional SNPs (Table 1) were simulated with a very small marginal effect and an interaction associated with the disease, according to the mixed model offered by the logistic function of genomeSIMLA.

Table 1
The table summarizes the characteristics of the genetic model implemented in the ME model, where one SNP with main effect has been simulated.

In the second model, i.e. “model EPI”, standing for “purely epistatic”, three markers (RL0-75, RL0-153 and RL0-272, Table 2) were simulated so as not to display any main effect and to be associated with the disease through a purely epistatic penetrance table, with target OR = 4.

Table 2
The table summarizes the SNPs modelled in the purely epistatic model generation, whose penetrance function target odds ratio was set to 4.

For each disease model, the following datasets have been extracted from the population: a) 6 different case-control datasets with increasing phenocopy level generated with the method implemented within the software (PM1); b) 6 different case-control datasets with increasing phenocopy level generated with an alternative method (PM2) developed in our lab, as described in Materials and Methods; c) 6 pedigree datasets with increasing phenocopy level generated as implemented in genomeSIMLA.

Main effect model

As far as the model ME is concerned, the results show that strong main effects are not hampered by higher levels of PE, despite a progressive reduction of the significance (figure 1).

Figure 1
Case/control dataset - main effect model.

In the case-control dataset with the PM1 method, RL0-855 was highly significant at each phenocopy level up to 45%, with −log10(p) = 62.54 at 0% PE and −log10(p) = 25.84 at 45% PE. The analysis of the datasets obtained with the PM2 phenocopy algorithm produced similar results (see Supplementary Figure S1): RL0-855 was significant in the 0% phenocopy dataset with −log10(p) = 67.5, and with −log10(p) = 31.2 in the 45% dataset.
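The loss of significance with increasing PE can be read as a dilution of the risk-allele frequency among cases. The sketch below is a simplified allelic calculation, not the simulation itself: it assumes that a fraction of cases is replaced by individuals with the control allele frequency, and it uses a hypothetical control frequency of 0.30 together with the target OR of 2.225.

```python
def attenuated_odds_ratio(target_or, control_freq, phenocopy_fraction):
    """Expected allelic OR when a fraction of cases has the control frequency.

    target_or: odds ratio among genuine cases.
    control_freq: risk-allele frequency in controls (hypothetical value).
    phenocopy_fraction: share of cases replaced by control-like individuals.
    """
    odds_ctrl = control_freq / (1.0 - control_freq)
    odds_case = target_or * odds_ctrl
    case_freq = odds_case / (1.0 + odds_case)

    # Cases become a mixture of genuine cases and phenocopies.
    mixed_freq = ((1.0 - phenocopy_fraction) * case_freq
                  + phenocopy_fraction * control_freq)
    return (mixed_freq / (1.0 - mixed_freq)) / odds_ctrl


for pe in (0.0, 0.05, 0.10, 0.20, 0.30, 0.45):
    print(f"PE = {pe:.0%}: expected OR = {attenuated_odds_ratio(2.225, 0.30, pe):.3f}")
```

Under these assumptions the expected OR shrinks towards 1 as PE grows but stays above 1 at 45% PE, which is consistent with a strong main effect remaining detectable in a large sample.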

A very similar behaviour appears in the pedigree datasets with TDT analysis, even if the overall significance level is somewhat lower (−log10(p) = 40 at 0% PE and −log10(p) = 8.63, see Supplementary Figure S2).

Among the other markers for which only an interaction was simulated, only RL0-245 appeared among the ten most significant markers at 0% PE (−log10(p) = 11.47), but it was no longer among the top ten when the phenocopy level reached 10%. The same happened in the TDT analysis.
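For readers less familiar with the TDT, the statistic can be sketched as a McNemar-type test on transmissions from heterozygous parents to affected offspring. The function name and the counts below are hypothetical; the analyses in this paper were run with PLINK.

```python
import math


def tdt(transmitted, untransmitted):
    """Transmission disequilibrium test for one biallelic marker.

    transmitted: times heterozygous parents passed the tested allele
                 to an affected child.
    untransmitted: times they passed the other allele instead.
    Returns the chi-square statistic (1 df) and its p-value.
    """
    b, c = transmitted, untransmitted
    chi2 = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail, chi-square 1 df
    return chi2, p_value


# Hypothetical transmission counts, for illustration only.
chi2, p_value = tdt(transmitted=60, untransmitted=40)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

A phenocopy marked as affected contributes transmissions that are unrelated to the disease allele, which pulls the two counts towards equality and lowers the statistic.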

Purely epistatic model

When we analyzed the EPI model on the case-control dataset, none of the three markers ranked among the most significant markers. Moreover, after correction for multiple testing, none of the markers would reach the 0.05 level of significance at either 0% or 45% PE.

Despite some fluctuations in the data, mainly due to sampling and data extraction, a positive but non-significant trend in the number of falsely significant markers could be observed as the phenocopy error percentage increased (figure 2). The same pattern was observable when analyzing the case-control dataset generated with the PM2 phenocopy method (see Supplementary Figure S3).
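To give a sense of scale for the multiple-testing correction, the sketch below assumes a Bonferroni adjustment (the text does not name the correction used) over the 401 markers of the EPI chromosome described in Materials and Methods.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


threshold = bonferroni_threshold(alpha=0.05, n_tests=401)
min_neg_log10_p = -math.log10(threshold)
print(f"per-marker threshold = {threshold:.2e}")
print(f"required -log10(p)   = {min_neg_log10_p:.2f}")
```

Under this assumption a marker would need a −log10(p) of roughly 3.9 to remain significant at the 0.05 level.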

Figure 2
Case/control dataset - purely epistatic model.

When applying PM2 we observed the appearance of a single progressively significant marker (RL0-255), which was borderline for Hardy-Weinberg equilibrium in the main dataset and therefore became unbalanced when affected individuals from different datasets subject to the same simulation phenomenon were added. This SNP can be considered a false positive, as it was not simulated in association with the disease in any of the additional datasets.

A similar behaviour of the markers with a purely epistatic effect was observable in the pedigree dataset with a TDT analysis: again none of them ranked as significant (Supplementary Figure S4).

In order to check the correctness of the model we generated, we performed a logistic regression on the interaction term between the three markers simulated to be associated with a purely epistatic effect. The p value of the logistic regression was highly significant both at 0% PE (p = 7.8 × 10^−21) and at 45% PE (p = 4.17 × 10^−6).

Therefore we decided to analyze the data by using a logic regression approach. Logic regression is an adaptive regression methodology mainly developed to explore high-order interactions in genomic data; its goal is to find predictors that are Boolean (logical) combinations of the original predictors. By applying this methodology the analysis was able to identify, in most cases, two of the three interacting SNPs among the top-ranking interactions (figure 3).
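To make the idea of a Boolean predictor concrete, the sketch below scores a single hand-written logic expression over three SNPs. It is not the logicFS search, which explores many candidate expressions; the expression, the dominant coding and the toy samples are all hypothetical.

```python
def boolean_predictor(g1, g2, g3):
    """One candidate logic expression over three SNP genotypes (0, 1 or 2).

    Dominant coding is assumed: a SNP is 'on' when at least one minor
    allele is carried. The expression itself is purely illustrative.
    """
    a, b, c = g1 > 0, g2 > 0, g3 > 0
    return (a and not b) or c


def odds_ratio(samples, predictor):
    """Odds ratio of being a case when the Boolean predictor is true.

    samples: iterable of (g1, g2, g3, is_case).
    """
    counts = [[0, 0], [0, 0]]  # counts[predictor value][is_case]
    for g1, g2, g3, is_case in samples:
        counts[int(predictor(g1, g2, g3))][int(is_case)] += 1

    # Haldane-Anscombe correction keeps the ratio finite for empty cells.
    a, b = counts[1][1] + 0.5, counts[1][0] + 0.5
    c, d = counts[0][1] + 0.5, counts[0][0] + 0.5
    return (a * d) / (b * c)


# Toy data, for illustration only.
toy = [(1, 0, 0, True), (2, 0, 0, True), (0, 0, 1, True),
       (0, 0, 0, False), (1, 1, 0, False), (0, 2, 0, False)]
print(odds_ratio(toy, boolean_predictor))
```

A search procedure would generate and score many such expressions and rank the SNPs by how often they appear in well-performing ones.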

Figure 3
Logic regression on a purely epistatic model.

The more the phenocopy error increased, the lower these interactions ranked, although at least one of the three markers (RL0-153) was always present among the top five.

As a purely epistatic model is a challenge for the analysis in itself, we adopted a further analysis method, i.e. the multifactor dimensionality reduction (MDR)[44], [52]. MDR analysis was performed on the EPI model with PM2 phenocopy levels.

Comparably with the logic regression analysis, the MDR method, performed with random non-exhaustive explorations, was unable to efficiently capture all the interactions, and this became more evident with increasing PE levels (Supplementary Table S1). When testing the interacting SNPs directly, the efficiency and the OR of the MDR outcome were very close to the modelled ones, but these values progressively decreased as the PE level increased: at 0% PE the predicted OR was 3.80 (compared to a target OR of the model = 4.0) and at 45% PE the predicted OR decreased to 2.39 (table 3, Supplementary Figure S5 and Supplementary References S1).
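The core step of MDR can be sketched as follows: each multi-locus genotype cell is labelled high or low risk from its case/control ratio, and the resulting one-dimensional attribute is then scored. This is a simplified illustration that omits the cross-validation and model search of the MDR software; the function names and toy data are hypothetical.

```python
from collections import Counter


def mdr_high_risk_cells(samples, threshold=1.0):
    """Label multi-locus genotype cells as high risk.

    samples: iterable of (genotype_tuple, is_case).
    threshold: case/control ratio above which a cell is high risk
               (1.0 suits a balanced case-control sample).
    """
    cases, controls = Counter(), Counter()
    for genotype, is_case in samples:
        (cases if is_case else controls)[genotype] += 1
    cells = set(cases) | set(controls)
    return {g for g in cells if cases[g] > threshold * controls[g]}


def mdr_odds_ratio(samples, high_risk):
    """Odds ratio of the collapsed high/low-risk attribute."""
    tp = fp = fn = tn = 0
    for genotype, is_case in samples:
        predicted = genotype in high_risk
        if predicted and is_case:
            tp += 1
        elif predicted:
            fp += 1
        elif is_case:
            fn += 1
        else:
            tn += 1
    # A 0.5 continuity correction keeps the ratio finite for empty cells.
    return ((tp + 0.5) * (tn + 0.5)) / ((fp + 0.5) * (fn + 0.5))


# Toy two-locus data, for illustration only.
toy = ([((0, 1), True)] * 3 + [((0, 1), False)] * 1
       + [((1, 1), True)] * 1 + [((1, 1), False)] * 3)
high_risk = mdr_high_risk_cells(toy)
print(high_risk, mdr_odds_ratio(toy, high_risk))
```

Phenocopies add cases to cells that the true model marks as low risk, so cell ratios move towards the threshold and the OR of the collapsed attribute falls.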

Table 3
MDR test on purely epistatic model interactions.


Investigating the genetic determinants of complex traits confronts researchers with obstacles that are not yet completely resolved. We can argue that the genetic scenario of the most important complex traits is not explainable in black and white, i.e. only by the presence of very rare variants yet to be discovered with sequencing or by the presence of purely epistatic effects. Complex traits are likely determined by varying contributions of both causes, with proportions that can differ from one phenotype to another. In this paper we chose to address the second aspect, which deserves specific attention.

The characterization of the phenotypes is of extreme importance in this regard, and in our work we focused the simulations of genetic data on the effect that phenocopy levels could have on the capability of different methodologies to identify the genetic determinants of a disease.

We would like to stress that the concept of “phenocopy” can be interpreted in several ways, as we pointed out in the introduction, and that the classical definitions of phenocopies should be largely revisited in the context of complex traits, where multilocus genotypes could play a decisive role. Yet this aspect plays a major role in the discovery of genetic determinants: if to a certain extent complex traits could be considered phenocopies by definition, and if purely epistatic interactions play an important role in the missing heritability (perhaps alongside undiscovered rare variants), then future analysis methods have to take this scenario into account and model not only interactions, but also phenocopy within their statistical model.

In our simulation we decided to verify the impact of phenocopy level by testing two methods for the generation of phenocopies. The PM2 method, which we developed, specifically produces phenocopies by introducing affected individuals in which different genetic determinants have been simulated. The PM2 method thus allowed us to test a scenario where different combinations of loci could produce the same phenotype.

Our results show that strong main effects are not hampered by the presence of an increasing amount of phenocopy in the study sample, despite progressively reducing the significance of the association, if the study is sufficiently powered.

Conversely, when purely epistatic effects are simulated, the capability of identifying the association depends on several parameters, such as the strength of the interaction between the polymorphic variants, the penetrance of the polymorphism, the alleles (minor or major) that produce the combined effect, and their frequency in the population. The influence of these parameters has been partially discussed for 0% PE datasets in the literature. In our simulation the critical role of phenocopy emerges: the more the PE level increases, the more the initial difficulty in detecting these gene-gene interactions is amplified, even with methodologies better suited to the discovery of epistatic models.

Classical analytical methodologies are very sensitive to this error, and new statistical methods have to be developed that address SNP-SNP interactions in a less computing-intensive way and that account for, or adjust their results on, estimates of the phenocopy error.

Since the presence of phenocopy can be a characteristic intrinsic to the phenotyping of complex traits, we conclude that the neglect of the possible presence of phenocopies in these scenarios heavily affects the analysis of their genetic data.

Materials and Methods


We performed simulations by using the software genomeSIMLA [50], which simulates large-scale genomic data both in population-based case-control samples and in families. It is a forward-time population simulation algorithm that allows the user to specify many evolutionary parameters, to control evolutionary processes, and to specify varying levels of both linkage and LD among and between markers and disease loci [48], [49], [53]. Particular SNPs may be chosen to represent disease loci according to desired location, correlation with nearby SNPs, and allele frequency. Up to six loci may be selected for main effects and all possible 2- and 3-way interactions. Disease-susceptibility effects of multiple genetic variables can be modeled using either the SIMLA logistic function [49], [53] or a purely epistatic multi-locus penetrance function [41] found using a genetic algorithm to assign affected status (for program configuration files see Supplementary Model S1).

Disease models

We generated two different disease models.

In the first one (referred to as “model ME”, standing for “Main Effect”) a single SNP (RL0-855, figure 4) was simulated to have a main effect on disease, with an OR = 2.225; the disease model also included three other SNPs (RL0-75, RL0-245, RL0-457) with no main effect and an interaction associated with the affection status. We simulated this model on a single chromosome with 1,362 markers.

Figure 4
LD plot from main effect model dataset.

In the second model (referred to as “model EPI”, standing for “purely epistatic”), we performed a simulation on a smaller chromosome (401 markers), where no main effect was present and three SNPs (RL0-75, RL0-153, RL0-272) affected disease only through a purely epistatic disease model, generated by using SIMPEN [49]. The penetrance table was generated with a target OR = 4.
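What “no main effect” means for a penetrance table can be checked numerically. The sketch below uses a hypothetical two-locus table with allele frequencies of 0.5 at both loci (genotype frequencies 0.25, 0.50, 0.25 under Hardy-Weinberg equilibrium). It is not the three-locus table produced by SIMPEN for this study; it only shows that the marginal penetrance can be identical for every single-locus genotype even though the joint table is not flat.

```python
# Hardy-Weinberg genotype frequencies for an allele frequency of 0.5
# (hypothetical choice, made so that the arithmetic is easy to follow).
GENOTYPE_FREQ = (0.25, 0.50, 0.25)

# Hypothetical two-locus penetrance table: rows are genotypes at locus A,
# columns are genotypes at locus B.
PENETRANCE = (
    (0.0, 0.1, 0.0),
    (0.1, 0.0, 0.1),
    (0.0, 0.1, 0.0),
)


def marginal_penetrance(table, freq):
    """Penetrance of each single-locus genotype, averaged over the other locus."""
    by_locus_a = [sum(f * cell for f, cell in zip(freq, row)) for row in table]
    by_locus_b = [sum(f * row[j] for f, row in zip(freq, table))
                  for j in range(len(freq))]
    return by_locus_a, by_locus_b


locus_a, locus_b = marginal_penetrance(PENETRANCE, GENOTYPE_FREQ)
print("locus A marginals:", locus_a)
print("locus B marginals:", locus_b)
```

Because every marginal value is the same, a single-marker test has nothing to detect at either locus, and the association is visible only when the genotypes are considered jointly.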

In both simulations the SNPs chosen to be associated with the disease had a MAF > 0.30, in order to simulate the so-called “common variant, common disease” condition [54], [55], [56]. Table 1 and Table 2 provide information on the associated markers and their target OR. Supplementary Figure S6 gives additional details on the disease model generation.

For each of the two models, case-control data and pedigree data were generated. In each case, six different large pooled datasets were extracted, with an increasing level of phenocopy error (i.e. 0%, 5%, 10%, 20%, 30% and 45%). In order to avoid biases due to data extraction and fluctuation, each dataset was obtained by sampling and then pooling 50 different datasets at each PE level.

The case/control simulation included datasets of 200 cases and 200 controls each, i.e. 20,000 individuals in total for each PE-level dataset.

Each family simulation included 25 families with 1 affected sib and 2 unaffected, 25 families with 3 affected and 1 unaffected, and 25 families with 2 affected sibs, 2 unaffected sibs and 3 random extra sibs: the total number of individuals for each dataset of different PE level was 25,000. Supplementary Figure S7 gives additional details on the dataset generation.
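The pooled sample sizes can be checked with a few lines of arithmetic. The sketch assumes that each pedigree also includes two parents, which is not stated explicitly in the text but reproduces the reported family total.

```python
N_POOLED = 50  # datasets sampled and pooled at each PE level

# Case-control design: 200 cases and 200 controls per sampled dataset.
case_control_total = N_POOLED * (200 + 200)

# Family design: three sibship types, 25 families of each type.
SIBSHIP_SIZES = (1 + 2, 3 + 1, 2 + 2 + 3)
FAMILIES_PER_TYPE = 25
PARENTS_PER_FAMILY = 2  # assumption, not stated in the text

individuals_per_family_dataset = FAMILIES_PER_TYPE * sum(
    sibs + PARENTS_PER_FAMILY for sibs in SIBSHIP_SIZES
)
family_total = N_POOLED * individuals_per_family_dataset

print(case_control_total)  # individuals per PE level, case-control design
print(family_total)        # individuals per PE level, family design
```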

Generation of the phenocopies

The genomeSIMLA software version used (1.0.7w32) implements a method for generating phenocopies, designed as follows.

The software generates cases and controls using the penetrance function and the marker specified by the user. Then, in case-control datasets, it removes a user-specified percentage of cases, replaces them with individuals sampled from the control individuals in the full population, and assigns them the affected status. In family datasets, the software determines the total number of affected individuals to modify as phenocopies, identifies the pedigrees to be modified, and redraws the family according to the new requirements. Pedigrees with the required number of affected and unaffected individuals are selected, and then the unaffected phenocopies are marked as affected, according to the initial design specified by the user (personal communication).

This method is referred to as “phenocopy method one” (PM1).
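The case-control part of this procedure can be sketched as follows. This is an illustrative reading of the description above, not genomeSIMLA code: the function name, the representation of individuals and the rounding of the number of replaced cases are assumptions.

```python
import random


def apply_pm1(cases, population_controls, phenocopy_fraction, rng=None):
    """Replace a fraction of cases with control-like individuals.

    cases: list of individuals generated as affected.
    population_controls: unaffected individuals from the full population.
    phenocopy_fraction: share of cases to replace (e.g. 0.45).
    Every individual in the returned list is treated as affected.
    """
    rng = rng or random.Random()
    n_replace = round(len(cases) * phenocopy_fraction)

    # Keep a random subset of the genuine cases ...
    kept = rng.sample(cases, len(cases) - n_replace)
    # ... and fill the gap with unaffected individuals relabelled as affected.
    phenocopies = rng.sample(population_controls, n_replace)
    return kept + phenocopies


# Toy individuals, tagged by origin for illustration only.
toy_cases = [("case", i) for i in range(200)]
toy_controls = [("control", i) for i in range(1000)]
mixed = apply_pm1(toy_cases, toy_controls, 0.45, rng=random.Random(1))
print(sum(1 for origin, _ in mixed if origin == "control"), "phenocopies")
```

With this scheme the phenocopies carry genotypes drawn from the unaffected population, so they dilute every case-control difference rather than introducing a competing genetic signal.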

In order to verify the correspondence of this phenocopy generation method with what we defined as “phenocopy” (see introduction), we also developed another methodology to be applied to the case-control datasets only. According to this second algorithm (referred to in the article as “phenocopy method two”, PM2), five additional datasets were generated, with different markers associated with the affected status. In order to generate the required phenocopy level, a uniform random sampling of affected individuals from the five additional datasets was performed, and these individuals were substituted for affected individuals randomly picked from the original dataset. This method generates five datasets with the same phenocopy percentages as PM1. Supplementary Figure S8 provides a more detailed explanation and Supplementary Box S1 reports the R code used to generate these datasets. Table 2 provides information about the markers associated with the affection status in the additional datasets and the target OR used.
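A minimal sketch of the PM2 substitution is given below. It is written in Python for illustration and is not the R code of Box S1; pooling the five additional datasets before sampling is one possible reading of “uniform random sampling”, and all names are hypothetical.

```python
import random


def apply_pm2(original_cases, alternative_case_sets, phenocopy_fraction, rng=None):
    """Substitute original cases with cases generated under other disease models.

    original_cases: affected individuals from the main dataset.
    alternative_case_sets: lists of affected individuals from the additional
        datasets, where different markers drive the disease.
    phenocopy_fraction: share of original cases to substitute.
    """
    rng = rng or random.Random()
    n_replace = round(len(original_cases) * phenocopy_fraction)

    # Pool the additional datasets and sample uniformly (assumed reading).
    pool = [ind for dataset in alternative_case_sets for ind in dataset]
    phenocopies = rng.sample(pool, n_replace)

    # Pick at random which original cases are overwritten.
    replaced_positions = rng.sample(range(len(original_cases)), n_replace)
    result = list(original_cases)
    for position, newcomer in zip(replaced_positions, phenocopies):
        result[position] = newcomer
    return result


# Toy individuals, tagged by source dataset for illustration only.
main_cases = [("main", i) for i in range(200)]
extra_sets = [[(f"alt{k}", i) for i in range(200)] for k in range(5)]
mixed = apply_pm2(main_cases, extra_sets, 0.10, rng=random.Random(1))
print(sum(1 for source, _ in mixed if source != "main"), "phenocopies")
```

Unlike PM1, the substituted individuals are genuine cases of a different genetic model, so they can carry their own association signal into the pooled sample.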

Statistical analysis

The analyses were conducted using the R software and PLINK. In particular, whole-chromosome case-control analysis and TDT analysis were performed with PLINK and visualized with R. The calculation of genetic contrasts, the logistic regression on single markers, and the markers' interaction analysis with logistic regression were performed according to Clayton, as developed in the “DGCgenetics” package. Interaction analysis using a logic regression approach was performed with the R package “logicFS” by Schwender, according to the developer's specifications [27].

The MDR analysis was conducted using the MDR Java package [57], performing 5,000 random explorations in the model discovery of attributes ranging from 2- to 4-way interactions, as implemented in the software.

Supporting Information

Model S1

Model Configuration files.

(0.01 MB ZIP)

Box S1

R code used to generate the alternative phenocopy method datasets.

(0.03 MB DOC)

References S1

References cited in the Supplementary Information.

(0.03 MB DOC)

Table S1

The table summarizes the 10 best models for each phenocopy level identified during the MDR analysis. It has to be stressed that the MDR analysis was conducted by performing 5,000 evaluations of possible interactions. An exhaustive analysis as implemented in the software would be computationally very intensive, as pointed out by the authors in a recent paper (see Pattin K. A. et al. [4]). The correct SNPs, as modelled in the purely epistatic penetrance function, are shown in bold.

(0.08 MB DOC)

Figure S1

For the case-control dataset generated with the main effect disease model (see SF6), an alternative method of producing phenocopies has been applied (see SF8). The method displays the same performance as the internally implemented one, with the sole exception of a few markers which progressively fall outside Hardy-Weinberg equilibrium, thus resulting in a false-positive association (indicated by the arrow). The red circle indicates the marker associated with the disease in the main dataset.

(1.29 MB EPS)

Figure S2

The figure summarizes the significance level for each marker in the pedigree datasets simulated with a main effect disease model at each phenocopy level. The red circle indicates the marker associated with a main effect to the disease in the model. The PM1 phenocopy generation method was applied.

(1.31 MB EPS)

Figure S3

For the case-control dataset generated with the purely epistatic disease model (see SF6), an alternative method of producing phenocopies has been applied (see SF8). The method displays the same performance as the internally implemented one, with the sole exception of one marker which progressively falls outside Hardy-Weinberg equilibrium, thus resulting in a false-positive association (indicated by the arrow).

(1.22 MB EPS)

Figure S4

The figure summarizes the significance level of the markers in pedigree datasets, at each phenocopy level. The red circles indicate the position of the markers associated in the model, which is the same in the other plots.

(1.54 MB EPS)

Figure S5

MDR attribute construction. The figure illustrates the distribution of cases (left bars) and controls (right bars) when the three associated SNPs are considered jointly.

(2.07 MB EPS)

Figure S6

Two disease models have been applied. In the first model a single SNP displays a main effect (target OR = 2.225) and three additional SNPs do not have a main effect and interact with each other with a modest effect; this model is implemented as part of the SIMLA logistic function [1]. In the second model, instead, three SNPs have been simulated as having no main effect and a purely epistatic effect on the disease (with a target OR = 4); this model has been implemented in genomeSIMLA and it has been proposed by Culverhouse [2] and discussed by Moore [2], [3].

(1.07 MB EPS)

Figure S7

For each disease model, two groups of datasets have been generated: a case-control dataset and a family based dataset. In order to reduce the fluctuations due to the sampling, in each case 50 different smaller datasets have been independently sampled from the population and then merged together in order to obtain a large pooled dataset. The figure explains the process step by step.

(1.37 MB EPS)

Figure S8

The method has been developed by using the R software (code provided) in order to perform a random sampling from five additional datasets where different SNPs have been associated in the disease model with the affected individuals. A uniform and random sampling, followed by a random substitution of the individuals in the original dataset, produced different levels of phenocopies in the sample, thus generating six datasets with increasing phenocopy percentage. This method ensures the effective substitution of individuals generated as affected but with completely different causative markers. The method has been developed as a further analysis of the possible effects generated by the “phenocopying” method implemented in the genomeSIMLA software.

(1.67 MB EPS)


Competing Interests: The authors have declared that no competing interests exist.

Funding: The study has been supported under running costs funding of the university. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.


1. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. [PMC free article] [PubMed]
2. Butcher LM, Plomin R. The Nature of Nurture: A Genomewide Association Scan for Family Chaos. Behav Genet 2008 [PMC free article] [PubMed]
3. Florez JC, Manning AK, Dupuis J, McAteer J, Irenze K, et al. A 100K genome-wide association scan for diabetes and related traits in the Framingham Heart Study: replication and integration with other genome-wide datasets. Diabetes. 2007;56:3063–3074. [PubMed]
4. Ionita-Laza I, McQueen MB, Laird NM, Lange C. Genomewide weighted hypothesis testing in family-based association studies, with an application to a 100K scan. Am J Hum Genet. 2007;81:607–614. [PubMed]
5. Wilk JB, Walter RE, Laramie JM, Gottlieb DJ, O'Connor GT. Framingham Heart Study genome-wide association: results for pulmonary function measures. BMC Med Genet. 2007;8(Suppl 1):S8. [PMC free article] [PubMed]
6. Zeggini E, Scott LJ, Saxena R, Voight BF, Marchini JL, et al. Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes. Nat Genet 2008 [PMC free article] [PubMed]
7. Clarke GM, Carter KW, Palmer LJ, Morris AP, Cardon LR. Fine mapping versus replication in whole-genome association studies. Am J Hum Genet. 2007;81:995–1005. [PubMed]
8. Curtis D. Allelic association studies of genome wide association data can reveal errors in marker position assignments. BMC Genet. 2007;8:30. [PMC free article] [PubMed]
9. Dong C, Qian Z, Jia P, Wang Y, Huang W, et al. Gene-centric characteristics of genome-wide association studies. PLoS ONE. 2007;2:e1262. [PMC free article] [PubMed]
10. Ioannidis JP. Non-replication and inconsistency in the genome-wide association setting. Hum Hered. 2007;64:203–213. [PubMed]
11. Ioannidis JP, Patsopoulos NA, Evangelou E. Heterogeneity in meta-analyses of genome-wide association investigations. PLoS ONE. 2007;2:e841. [PMC free article] [PubMed]
12. Kingsmore SF, Lindquist IE, Mudge J, Gessler DD, Beavis WD. Genome-wide association studies: progress and potential for drug discovery and development. Nat Rev Drug Discov. 2008;7:221–230. [PMC free article] [PubMed]
13. Li C, Li M, Long JR, Cai Q, Zheng W. Evaluating cost efficiency of SNP chips in genome-wide association studies. Genet Epidemiol 2008 [PMC free article] [PubMed]
14. Li M, Li C, Guan W. Evaluation of coverage variation of SNP chips for genome-wide association studies. Eur J Hum Genet 2008 [PubMed]
15. Li Q, Yu K. Improved correction for population stratification in genome-wide association studies by identifying hidden population structures. Genet Epidemiol. 2008;32:215–226. [PubMed]
16. Macgregor S, Zhao ZZ, Henders A, Nicholas MG, Montgomery GW, et al. Highly cost-efficient genome-wide association studies using DNA pools and dense SNP arrays. Nucleic Acids Res. 2008;36:e35. [PMC free article] [PubMed]
17. Pearson TA, Manolio TA. How to interpret a genome-wide association study. JAMA. 2008;299:1335–1344. [PubMed]
18. Rao DC. An overview of the genetic dissection of complex traits. Adv Genet. 2008;60:3–34. [PubMed]
19. Skol AD, Scott LJ, Abecasis GR, Boehnke M. Optimal designs for two-stage genome-wide association studies. Genet Epidemiol. 2007;31:776–788. [PubMed]
20. Tomlinson I, Webb E, Carvajal-Carmona L, Broderick P, Kemp Z, et al. A genome-wide association scan of tag SNPs identifies a susceptibility variant for colorectal cancer at 8q24.21. Nat Genet. 2007;39:984–988. [PubMed]
21. Hakonarson H, Qu HQ, Bradfield JP, Marchand L, Kim CE, et al. A novel susceptibility locus for type 1 diabetes on Chr12q13 identified by a genome-wide association study. Diabetes. 2008;57:1143–1146. [PubMed]
22. Hinney A, Nguyen TT, Scherag A, Friedel S, Bronner G, et al. Genome Wide Association (GWA) Study for Early Onset Extreme Obesity Supports the Role of Fat Mass and Obesity Associated Gene (FTO) Variants. PLoS ONE. 2007;2:e1361. [PMC free article] [PubMed]
23. Kayser M, Liu F, Janssens AC, Rivadeneira F, Lao O, et al. Three genome-wide association studies and a linkage analysis identify HERC2 as a human iris color gene. Am J Hum Genet. 2008;82:411–423.
24. Raelson JV, Little RD, Ruether A, Fournier H, Paquin B, et al. Genome-wide association study for Crohn's disease in the Quebec Founder Population identifies multiple validated disease loci. Proc Natl Acad Sci U S A. 2007;104:14747–14752.
25. Todd JA, Walker NM, Cooper JD, Smyth DJ, Downes K, et al. Robust associations of four new chromosome regions from genome-wide analyses of type 1 diabetes. Nat Genet. 2007;39:857–864.
26. Kooperberg C, Leblanc M. Increasing the power of identifying gene x gene interactions in genome-wide association studies. Genet Epidemiol. 2008;32:255–263.
27. Kooperberg C, Ruczinski I, LeBlanc ML, Hsu L. Sequence analysis using logic regression. Genet Epidemiol. 2001;21(Suppl 1):S626–631.
28. Schwender H, Ickstadt K. Identification of SNP interactions using logic regression. Biostatistics. 2008;9:187–198.
29. Wider C, Melquist S, Hauf M, Solida A, Cobb SA, et al. Study of a Swiss dopa-responsive dystonia family with a deletion in GCH1: redefining DYT14 as DYT5. Neurology. 2008;70:1377–1383.
30. Singh SM, McDonald P, Murphy B, O'Reilly R. Incidental neurodevelopmental episodes in the etiology of schizophrenia: an expanded model involving epigenetics and development. Clin Genet. 2004;65:435–440.
31. Xu J, Meyers D, Freije D, Isaacs S, Wiley K, et al. Evidence for a prostate cancer susceptibility locus on the X chromosome. Nat Genet. 1998;20:175–179.
32. De Benedictis G, Franceschi C. The unusual genetics of human longevity. Sci Aging Knowledge Environ. 2006;2006:pe20.
33. Rannala B, Reeve JP. High-resolution multipoint linkage-disequilibrium mapping in the context of a human genome sequence. Am J Hum Genet. 2001;69:159–178.
34. Moore JH, Barney N, Tsai CT, Chiang FT, Gui J, et al. Symbolic modeling of epistasis. Hum Hered. 2007;63:120–133.
35. Choi M, Scholl UI, Ji W, Liu T, Tikhonova IR, et al. Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proc Natl Acad Sci U S A. 2009;106:19096–19101.
36. Cordell HJ. Detecting gene-gene interactions that underlie human diseases. Nat Rev Genet. 2009;10:392–404.
37. Phillips PC. Epistasis—the essential role of gene interactions in the structure and evolution of genetic systems. Nat Rev Genet. 2008;9:855–867.
38. Moore JH, Williams SM. Epistasis and its implications for personal genetics. Am J Hum Genet. 2009;85:309–320.
39. Culverhouse R, Suarez BK, Lin J, Reich T. A perspective on epistasis: limits of models displaying no main effect. Am J Hum Genet. 2002;70:461–471.
40. Wongseree W, Assawamakin A, Piroonratana T, Sinsomros S, Limwongse C, et al. Detecting purely epistatic multi-locus interactions by an omnibus permutation test on ensembles of two-locus analyses. BMC Bioinformatics. 2009;10:294.
41. Moore JH, Hahn LW, Ritchie MD, Thornton TA, White BC. Routine discovery of complex genetic models using genetic algorithms. Applied Soft Computing. 2004;4:79–86.
42. Clayton DG. Prediction and interaction in complex disease genetics: experience in type 1 diabetes. PLoS Genet. 2009;5:e1000540.
43. Zubenko GS, Hughes HB 3rd, Stiffler JS. D10S1423 identifies a susceptibility locus for Alzheimer's disease in a prospective, longitudinal, double-blind study of asymptomatic individuals. Mol Psychiatry. 2001;6:413–419.
44. Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, et al. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet. 2001;69:138–147.
45. Hahn LW, Ritchie MD, Moore JH. Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics. 2003;19:376–382.
46. Motsinger-Reif AA, Fanelli TJ, Davis AC, Ritchie MD. Power of grammatical evolution neural networks to detect gene-gene interactions in the presence of error. BMC Res Notes. 2008;1:65.
47. Ritchie MD, Hahn LW, Moore JH. Power of multifactor dimensionality reduction for detecting gene-gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genet Epidemiol. 2003;24:150–157.
48. Dudek SM, Motsinger AA, Velez DR, Williams SM, Ritchie MD. Data simulation software for whole-genome association and other studies in human genetics. Pac Symp Biocomput. 2006:499–510.
49. Schmidt M, Hauser ER, Martin ER, Schmidt S. Extension of the SIMLA package for generating pedigrees with complex inheritance patterns: environmental covariates, gene-gene and gene-environment interaction. Stat Appl Genet Mol Biol. 2005;4:Article15.
50. Edwards TL, Bush WS, Turner SD, Torstenson ES, Dudek SM, et al. genomeSIMLA: a data simulation package to explore the human genome. 2007 Annual Meeting of the American Society of Human Genetics. San Diego, California.
51. Edwards TL, Bush WS, Turner SD, Dudek SM, Torstenson ES, et al. Generating Linkage Disequilibrium Patterns in Data Simulations using genomeSIMLA. Lect Notes Comput Sci. 2008;4973:24–35.
52. Pattin KA, White BC, Barney N, Gui J, Nelson HH, et al. A computationally efficient hypothesis testing method for epistasis analysis using multifactor dimensionality reduction. Genet Epidemiol. 2009;33:87–94.
53. Bass MP, Martin ER, Hauser ER. Pedigree generation for analysis of genetic linkage and association. Pac Symp Biocomput. 2004:93–103.
54. Guthery SL, Salisbury BA, Pungliya MS, Stephens JC, Bamshad M. The structure of common genetic variation in United States populations. Am J Hum Genet. 2007;81:1221–1231.
55. Pritchard J. Are rare variants responsible for susceptibility to complex diseases? Am J Hum Genet. 2001;69:124–137.
56. Reich DE, Lander ES. On the allelic spectrum of human disease. Trends Genet. 2001;17:502.
57. Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, et al. A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J Theor Biol. 2006;241:252–261.

Articles from PLoS ONE are provided here courtesy of Public Library of Science