Although schizophrenia is generally considered to occur as a consequence of multiple genes that interact with one another, very few methods have been developed to model epistasis. Phenotype definition has also been a major challenge for research on the genetics of schizophrenia. In this report we use novel statistical techniques to address the high dimensionality of genomic data, and we apply a refinement in phenotype definition by basing it on the occurrence of brain changes during the early course of the illness, as measured by repeated MR scans (i.e., an “intermediate phenotype”). The method combines a machine learning algorithm, the ensemble method using stochastic gradient boosting, with traditional general linear model statistics. We began with fourteen genes that are relevant to schizophrenia based on association studies or their role in neurodevelopment and then used statistical techniques to reduce them to five genes and 17 SNPs that had a significant statistical interaction: 5 for PDE4B, 4 for RELN, 4 for ERBB4, 3 for DISC1, and one for NRG1. Five of the SNPs involved in these interactions replicate previous research, in that these five SNPs have previously been identified as schizophrenia vulnerability markers or implicate cognitive processes relevant to schizophrenia. This ability to replicate previous work suggests that our method has potential for detecting meaningful epistatic relationships among the genes that influence brain abnormalities in schizophrenia.
Schizophrenia is conceptualized as a genetically complex disorder that is probably the result of multiple alleles of small effect in many different genes, at least some of which are likely to interact with one another.1 Despite the widely-recognized plausibility of this conceptualization, progress in delineating the genetic and neural mechanisms of schizophrenia has been limited by the difficulty of developing adequate models that can account for genetic complexity, particularly epistasis. Epistasis has two connotations: one in which a gene suppresses the effect of another, and one in which a number of genes interact to influence the expression of the phenotype. Here, we use epistasis in the latter sense.
Phenotype definition has also been a major challenge for research on the genetics of schizophrenia. Schizophrenia is characterized by a variable mixture of psychotic, disorganized, and negative symptoms; it also has a broad variety of clinical outcomes, ranging from prolonged periods of remission to a disabling treatment-refractory course that is sometimes accompanied by cognitive decline.2 Most current genetic studies do not take the complexity of the phenotype into account. Genome-wide association studies (GWAS) have been disappointing because of poor replicability, a problem likely to be due in part to weak or imprecise phenotype definition.3–15 GWAS only require that subjects meet specified diagnostic criteria, such as those put forth in DSM IV or ICD 10. This limited approach to phenotype definition runs the risk of grouping clinically (and probably genetically) diverse patients under a single rubric, thereby reducing the likelihood that replicable and accurate associations will be found. In this report we describe several approaches that we have implemented to address the problems of examining epistasis and improving phenotype definition.
A central problem in the effort to link improved phenotype definition with complex genetic concepts such as epistasis is the difficulty of detecting meaningful signals within the noisy vastness of the human genome. Various approaches have been developed to reduce the high dimensionality of genomic data, such as the use of gene ontology or pathway analysis.16 Another approach for addressing high dimensionality is the use of statistical techniques such as machine learning algorithms (MLA). The utility of these approaches has been demonstrated recently in studies that have used techniques such as recursive partitioning to model epistatic relationships.17 We have developed a strategy for using similar algorithms to identify potentially meaningful relationships between schizophrenia vulnerability genes. Our approach employs an MLA, stochastic gradient boosting using regression trees,18 and performs a series of steps to winnow through a large genomics data set and to identify genes and SNPs that may interact with one another to produce brain tissue loss in schizophrenia.
In this report we also address the problem of phenotype definition by basing it on the organ that produces the clinical picture of the illness: the brain. Numerous well-replicated studies have documented that patients with schizophrenia have measurable brain tissue reduction when compared with healthy volunteers.19–33 The tissue loss affects multiple brain regions, particularly overall cerebral volume and frontal and temporal lobe gray matter (GM) and white matter (WM); tissue loss in these regions is accompanied by a corresponding increase in cerebrospinal fluid (CSF). These brain volume reductions are present at the onset of the illness, but they also continue to progress over time, with the greatest severity occurring during the first few years after onset.33 These tissue changes are considered to be due to an aberration in a neurodevelopmental process for two reasons: the typical onset of illness during adolescence, when the brain is undergoing major developmental changes; and the absence of a neuropathologic signature suggesting a degenerative process such as neuronal loss or gliosis.
Because we have conducted a prospective longitudinal study of patients with schizophrenia ascertained at the time of onset ("first episode patients") and followed them with repeated Magnetic Resonance (MR) scans for a time period up to eighteen years, we have been able to develop a refinement in phenotype definition by basing it on the occurrence of brain changes during the early course of the illness, as measured by repeated MR scans.33
Therefore, in this report we describe a data analytic strategy for evaluating epistatic relationships between multiple genes and their alleles with a biologically-based phenotype of schizophrenia. This strategy is summarized in Figure 1 and consists of seven steps:
Subjects for this study were recruited through the Iowa Longitudinal Study (ILS). The ILS was initiated nineteen years ago and includes a total cohort of 542 first episode schizophrenia patients. These patients were recruited from consecutive admissions to the University of Iowa Psychiatry inpatient service at a rate of 25–30/year; recruitment ended in 2007. They have been followed at six month intervals after initial intake, with assessment of clinical symptoms, psychosocial function, and treatment received. Intensive assessments (sMR and cognitive testing) are done at intake and at two, five, nine, twelve, fifteen, and eighteen years.33 In this report we focus on those subjects for whom we have both genomic data and adequate sMR data to provide a relatively definitive determination of whether progressive brain change occurs over a time interval that is up to 12 years after intake. These comprise a total of 144 patients for whom we have a minimum of 2 scans and a maximum of six. (See Table 1 in Supplementary Materials for demographics and clinical characteristics.)
To genotype study participants, DNA was prepared by high-salt extraction from whole blood and assayed using Illumina Infinium II array BeadChips which were designed, manufactured and completed by Illumina (San Diego, CA). The locus success rate and genotype call rate on the 26,122-SNP BeadChips were 97.2% and 99.8% respectively. We used a customized Illumina chip provided by Ortho-McNeil Janssen that contains SNPs for 1204 genes; the SNPs are primarily tag SNPs. The genes were selected for inclusion on the chip based on their association with major mental illnesses, their role in neurotransmission, their role in regulating major metabolic or developmental pathways, and their role in drug metabolism. For this study we selected fourteen genes for examination of statistical epistasis in relation to brain change measures. Because we were interested in the specific examination of the neurodevelopmental mechanisms influencing brain change in schizophrenia, we selected the genes based on the following criteria: established association with schizophrenia; influence on neurodevelopmental pathways; neurotransmitters implicated in schizophrenia. The rationale for the selection of these fourteen genes (810 SNPs) is summarized in Table 1.
To create a measure of percent change for use in our analyses, the MRI volumes were divided by the intracranial volume to correct for variability in head size. We have previously found that the greatest brain changes occur early in the course of the illness.33 Therefore, our measures are a percent change score based on the earliest available scan and its subsequent one. The value for the first scan was subtracted from the value for the second scan; that value was then divided by the number of years between the scans; the result was then divided by the value for the first scan, and finally that number was multiplied by 100 to provide an index of percent change.
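The calculation described above can be written out as a short sketch. The function names and the numeric values are hypothetical and chosen only for illustration; they are not study data, and this is not the authors' actual code.

```python
def icv_corrected(volume, intracranial_volume):
    """Express a regional volume as a fraction of intracranial volume."""
    if intracranial_volume <= 0:
        raise ValueError("intracranial_volume must be positive")
    return volume / intracranial_volume


def annualized_percent_change(first_value, second_value, years_between):
    """Percent change per year between two scans, relative to the first scan.

    Follows the order of operations in the text: subtract the first value
    from the second, divide by the years between scans, divide by the first
    value, then multiply by 100.
    """
    if years_between <= 0:
        raise ValueError("years_between must be positive")
    if first_value == 0:
        raise ValueError("first_value must be non-zero")
    return (second_value - first_value) / years_between / first_value * 100.0


# Hypothetical volumes (arbitrary units) for a single region and subject.
first = icv_corrected(1200.0, 1500.0)
second = icv_corrected(1170.0, 1500.0)
change = annualized_percent_change(first, second, years_between=2.0)
print(round(change, 3))
```

A negative value indicates tissue loss per year; a positive value (as would be expected for CSF measures) indicates an increase.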
The goal of our statistical approach was to identify a logical and empirically-driven method for detecting the signal (potential epistatic relationships between genes that may affect brain volume changes) within our vast array of genomic data. (See Figure 1)
Our first step was designed to determine which of the fourteen candidate genes had the strongest relationship with brain tissue change over time. We used regression trees to identify the relative importance of genes (SNPs) for predicting the amount of change in the size of each of the brain regions. This analysis employed an ensemble approach with stochastic gradient boosting (SGB) (TreeNet).18 Variable importance (VI) scores, generated by the SGB analysis, were used to identify the SNPs and genes that were retained during the initial steps of variable reduction. Variable importance scores show the relative contribution of each of the variables (SNPs) to predicting the outcome (a measure of brain volume change).
VI scores are relative; the largest value is assigned a value of 100 and the remaining variables are scaled accordingly. The variable importance measure for each gene, as a predictor of each brain volume measure, is calculated using the variable importance of the top 20 SNPs for predicting each brain volume measure. It is a percent of variable importance, calculated by summing the VI scores for the top 20 SNPs and dividing each SNP VI score by the total of the 20. A genewise VI score is then calculated by adding up the VIs for each SNP to get a gene summary.
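As a minimal sketch of this genewise summary, assuming the per-SNP VI scores for one brain measure are already available as a dictionary, the percent-of-importance calculation might look like the following. The function name, SNP labels, and gene labels are hypothetical, and the small `top_n` in the example is used only to keep the illustration short.

```python
from collections import defaultdict


def genewise_importance(snp_scores, snp_to_gene, top_n=20):
    """Sum each gene's share of variable importance among the top-N SNPs.

    snp_scores:  dict mapping SNP id -> relative VI score (largest = 100)
    snp_to_gene: dict mapping SNP id -> gene name
    Returns a dict mapping gene -> percent of the top-N total importance.
    """
    # Keep the top-N SNPs by VI score.
    top = sorted(snp_scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    total = sum(score for _, score in top)
    if total == 0:
        raise ValueError("total variable importance is zero")
    gene_share = defaultdict(float)
    for snp, score in top:
        # Each SNP's VI as a percent of the top-N total, accumulated per gene.
        gene_share[snp_to_gene[snp]] += 100.0 * score / total
    return dict(gene_share)


# Hypothetical scores for one brain measure.
scores = {"snp_a1": 100.0, "snp_b1": 60.0, "snp_a2": 40.0, "snp_b2": 10.0}
genes = {"snp_a1": "GENE_A", "snp_a2": "GENE_A",
         "snp_b1": "GENE_B", "snp_b2": "GENE_B"}
result = genewise_importance(scores, genes, top_n=3)
print(result)
```

With `top_n=3`, the lowest-scoring SNP is excluded before the percentages are computed, so the gene shares sum to 100.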
The steps involved are summarized as follows.
From SNPs identified in the second SGB, we used general linear models (GLM) to further assess SNP*SNP interaction effects on brain volume changes over time. Analyses were performed using SAS (version 9.2; SAS, Cary, North Carolina). Percent brain volume change measure was entered as the dependent variable. Genotype of SNP pairs and SNP*SNP interaction terms were entered as independent measures in the statistical models. For all analyses, intracranial volume at initial MR scan, gender, imaging protocol (MR5 versus MR6), age at initial MR scan, and amount of neuroleptic exposure measured using dose-years were included as covariates. Because our technique is exploratory, we did not correct for multiple comparisons; we report only findings with P<.02.
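One way to write the model described above is sketched below. The notation is ours, and it assumes, for simplicity, a single numeric code per genotype; the text does not state how genotypes were coded, and a categorical coding would replace each genotype term with a set of indicator terms.

```latex
% y_i      : percent brain volume change for subject i
% g_{1i}, g_{2i} : genotypes of the two SNPs in the pair (assumed numeric coding)
% c_i      : covariate vector (intracranial volume at initial scan, gender,
%            imaging protocol, age at initial scan, neuroleptic dose-years)
\[
  y_i = \beta_0 + \beta_1 g_{1i} + \beta_2 g_{2i}
        + \beta_3 \,(g_{1i} \times g_{2i})
        + \boldsymbol{\gamma}^{\top} \mathbf{c}_i + \varepsilon_i
\]
```

Under this sketch, the test of statistical epistasis is the test of whether the interaction coefficient \(\beta_3\) differs from zero.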
Table 2 and Figure 2 show the VI of the fourteen genes for predicting severity of brain volume changes, broken down into fourteen specific brain regions. Three genes stand out by virtue of having a high variable importance for all the brain regions that we examined: DISC1, ERBB4, and RELN. Two others, NRG1 and PDE4B, had high variable importance for twelve regions. The remaining genes had lesser (e.g., BDNF, COMT, and GRM3) or very little importance (e.g., AKT1, DAOA, GDNF, DTNBP1, KCNH2, and RGS4). We therefore retained the top five genes for further examination.
We then repeated SGB using these five genes and their associated SNPs (N=735). We again identified the top 20 SNPs for each brain region based on VI. The analysis further reduced the “candidate SNPs” to a total of 267 (Table 2, Supplementary Materials). A prominent role continued to be played by the SNPs belonging to DISC1 (58 discrete SNPs), ERBB4 (87 discrete SNPs), and RELN (61 discrete SNPs). NRG1 and PDE4B contribute fewer (23 and 38 respectively). We then reduced these data by identifying groups of SNPs that may have an epistatic relationship because they share an association with a given specific brain region. These potential interactions were then tested for statistical significance by being placed in a general linear model, using relevant covariates, and determining whether there was a significant SNP*SNP interaction. Eighteen SNP pairs were found to pass this filter.
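The grouping step that precedes the GLM test could be sketched as follows, assuming the top SNPs for each brain region are available as lists. The function name and the region and SNP labels are hypothetical, and pairing every SNP with every other SNP in a region's list is our assumption, since the text does not specify exactly how groups were formed.

```python
from itertools import combinations


def candidate_pairs(top_snps_by_region):
    """Map each unordered SNP pair to the regions whose top list holds both.

    top_snps_by_region: dict mapping region name -> list of SNP ids
    Returns a dict mapping (snp_a, snp_b) -> sorted list of region names.
    """
    pairs = {}
    for region, snps in top_snps_by_region.items():
        # Sorting makes each pair's ordering canonical across regions.
        for a, b in combinations(sorted(set(snps)), 2):
            pairs.setdefault((a, b), []).append(region)
    return {pair: sorted(regions) for pair, regions in pairs.items()}


# Hypothetical top-SNP lists for two regions.
top_lists = {
    "region_1": ["snp_1", "snp_2", "snp_3"],
    "region_2": ["snp_2", "snp_3"],
}
pairs = candidate_pairs(top_lists)
for pair, regions in sorted(pairs.items()):
    print(pair, regions)
```

Each candidate pair produced this way would then be tested for a SNP*SNP interaction in a general linear model for the regions it is associated with.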
The final filter in the search consisted of examining these pairs in relation to the brain regions with which they were associated. We limited this final group to those SNP pairs that had a significant relationship with more than one brain region and that were biologically plausible (e.g., reduced cerebral volume accompanied by increased CSF, as an indicator of a generalized tissue loss; reduced WM in two regions as indicator of tissue specificity). Eleven interactions were selected for analysis using these criteria. They are shown in Table 3. They include all five genes retained up to this point and a total of 17 SNPs: 5 for PDE4B, 4 for RELN, 4 for ERBB4, 3 for DISC1, and one for NRG1.
Five of these SNPs have been previously identified as increasing schizophrenia risk or are in linkage disequilibrium with known schizophrenia risk alleles or alleles affecting a cognitive function relevant to schizophrenia.45–52 Many of them involve RELN. Two of the interactions are between RELN and PDE4B alleles.
One is an interaction between RELN rs2229860 and PDE4B rs54402. This interaction is associated with a decrease in cerebral tissue and an increase in surface CSF. Rs2229860 is an identified schizophrenia risk allele on the RELN gene; it is a rare A/G mutation that results in a proline to arginine change at position 1703.46 A second interaction is between RELN rs580884 and DISC1 rs3738401. This interaction is associated with decreases in cerebral tissue and an increase in VBR. Rs3738401 is an A/G polymorphism that is a missense mutation; the presence of the minor A allele leads to an arginine to glycine change at position 844. The significant epistatic relationship appears to derive from an AA/TG combination, which occurred in 8 subjects. When these two genotypes co-occur, they have the largest amount of cerebral tissue loss (−.769%) and the largest amount of VBR increase (14.65%). Both are statistically significant (p<.01). A third interaction is between RELN rs499953 and PDE4B rs11576970. This interaction is associated with decreases in frontal and cerebral white matter. The PDE4B SNP is located in or near block 5, 2.8 kilobases away from rs7412571, with which it is in linkage disequilibrium; rs7412571 is significantly associated with schizophrenia.53 The epistatic relationship derives from a TC/GG combination, resulting in a significant decrease in cerebral (−6.10%, p<.01) and frontal (−6.62%, p<.03) WM.
A fourth interaction is between DISC1 rs11578905 and PDE4B rs2455012. This interaction is associated with a decrease in surface and parietal GM. The PDE4B SNP is in LD with two identified alleles, rs1354064 and rs2503222, both of which increase risk for schizophrenia.48,53 A final interaction involving a previously identified allele is between NRG1 and RELN. This interaction is associated with an increase in the volume of the caudate and putamen. The RELN SNP, rs2237628, is in LD with rs2711870, which has been implicated in problems in shifting response set on a card sorting task, an impairment in executive function that has repeatedly been shown to be abnormal in schizophrenia.54 The interaction of the RELN and the NRG1 heterozygotes leads to an increase of 2.22% in the caudate and 3.90% in the putamen.
We have described a method for identifying potential epistatic relationships between genes that may contribute to schizophrenia and affect disease progression in the brain in patients suffering from schizophrenia. The method combines a machine learning algorithm, the ensemble method using stochastic gradient boosting, with traditional general linear model statistics. The method was used to identify genes/SNPs that were interacting with one another and predicting a continuous outcome measure that is a biologically meaningful phenotype (“intermediate phenotype”) for schizophrenia: changes in brain structure occurring after the onset of the illness. The method began with fourteen genes and 810 SNPs and reduced them to five genes and seventeen SNPs, identifying eleven interactions. Five of the SNPs involved in these interactions replicate previous research, in that these five SNPs have previously been identified as schizophrenia vulnerability markers or implicate cognitive processes relevant to schizophrenia. This ability to replicate previous work suggests that our method has potential for detecting a meaningful signal within the human genome.
RELN was the gene that was found to have the highest number of interactions. It had a total of six, involving PDE4B, DISC1, and NRG1. RELN’s relationship with schizophrenia is well-supported by linkage or association studies,45,46 and it has also been shown to be associated with treatment resistance.57 RELN expression is markedly reduced (by 50%) in mRNA and protein in post mortem brain tissue from schizophrenia patients.58 Epigenetic abnormalities have also been found; hypermethylation in the promoter region may be an explanation for the reduced expression of RELN in the brain.55,56 The known functions of RELN are highly consistent with current views that schizophrenia may be due to a disruption in neurodevelopmental processes. RELN encodes a large glycoprotein that affects both early and late neurodevelopment. It is responsible for neuronal migration, cell-cell interactions, and positioning of proliferating neurons early in development, and it also modulates neuronal and synaptic plasticity throughout life.57
DISC1 was also found to have significant epistatic relationships with other genes, including PDE4B and ERBB4 in addition to RELN. DISC1 is a well-established schizophrenia vulnerability gene.59–62 It has multiple SNPs that have been extensively studied, including one identified in the present study (rs3738401).63–69 The prevailing model for the role of DISC1 in schizophrenia emphasizes haploinsufficiency. DISC1 has decreased expression in mRNA and protein in both post mortem tissue and lymphoblastoid cell lines from patients with schizophrenia.66 DISC1 functions are consistent with both the neurodevelopmental hypothesis and current thinking about the mechanisms of schizophrenia. It is a hub protein whose interactors modulate expression of neurodevelopmental, synaptogenic, and sensory perception genes.61–62 Its established interactome includes PDE4B, PDE4D, NDE1, and NDEL1; however, pathway analysis suggests that it may in fact be much larger.70 The DISC1 SNP identified in this analysis, rs3738401, is among those identified as a psychoactive drug target for the treatment of psychosis and/or schizophrenia through an Ingenuity pathway analysis.70
PDE4B was identified as having an epistatic relationship with DISC1 in our analyses, as well as with RELN and ERBB4. PDE4B has also been identified as a risk factor for schizophrenia.48–49,53 It is a large gene (580 kb) with 17 exons; at least five isoforms occur as a result of alternative splicing. PDE4B2 and PDE4B4 isoforms have been found to be reduced in post mortem tissue from schizophrenia patients.48 It is orthologous to Drosophila dunce, a gene involved in learning and memory; dunce mutants have altered axonal growth cone motility, synaptic plasticity, and neuronal function.47 It also influences myelination, consistent with the findings in this study of significant WM loss associated with PDE4B genotypes. Its interaction with DISC1 occurs when the N-terminus of DISC1 binds to the UCR2 regulatory domain of PDE4B.71 This interactome plays a critical role in CNS signaling and homeostasis. DISC1 sequesters PDE4B in an inactive state and releases it when cyclic AMP is elevated; it thereby regulates intracellular signaling by inactivating cAMP.
NRG1 was found to have an epistatic relationship with RELN in our study. It has been implicated in multiple studies as a potential susceptibility gene for schizophrenia.72–76 NRG1 is a large gene (1200 kb) that encodes at least 15 isoforms through alternative splicing.75 It is responsible for a host of functions that affect brain development and cell-cell communications and that are relevant to schizophrenia; they include axon guidance, myelination, glial differentiation, synaptogenesis, synaptic plasticity, and neurotransmission.75 mRNA expression studies implicate variation in isoform expression, due to alternative splicing, as the most likely molecular mechanism for the genetic association of NRG1 with schizophrenia.77
ERBB4 would be predicted to have an epistatic relationship with NRG1, based on the fact that ERBB4 encodes a receptor for NRG1,75,78 but none was found in our analyses. It did have relationships with PDE4B and DISC1, which are consistent with the hub protein nature of DISC1. ERBB4 is part of a family of genes for protein tyrosine kinases.75 We found an interaction with a known RELN SNP that is related to cognitive function. This relationship is limited exclusively to two basal ganglia structures, the caudate and putamen. Unlike other components of brain parenchyma, these two structures have increased in size. The epistatic relationship may provide some clue as to the mechanisms of antipsychotic action and, perhaps, toxicity.
This study suggests that exploratory analyses using MLAs may be a fruitful approach for examining the relationships between multiple genes that may be implicated in schizophrenia. It is noteworthy that five of the SNPs identified through this approach have been previously identified as risk alleles, in LD with risk alleles, or relevant to functions that are disrupted in schizophrenia. This replication of prior findings is of particular interest because the prior studies have generally been based on case-control comparisons, while the current study is a case only study, and brain measures rather than diagnosis were the outcome variable. This study also suggests that examination of genomic factors in schizophrenia may be enhanced by using a novel phenotypic outcome measure: brain change over time as measured by MR imaging.
This study has several limitations. It is a case-only sample; an examination of the genetic associations with brain change in a case-control setting would strengthen these results. It is based on a relatively small sample and is primarily useful for hypothesis generation. Because it is an exploratory study, we did not correct for multiple comparisons. The findings need replication in an independent sample. Finally, from a genetic perspective, we know that copy number variations (CNVs) and epigenetic factors such as methylation play an important role in conferring liability to disease. We have not examined any of these factors in this study.
This paper was written with support from the following grants: MHCRC: Neurobiology and Phenomenology of the Major Psychoses (MH43271); Phenomenology and the Classification of Schizophrenia (5R01MH031593); MR Imaging in the Major Psychoses (5R01MH040856); and BRAINS Morphology and Image Analysis (5R01NS050568); Investigator Initiated Research grant from Ortho-McNeil Janssen Scientific Affairs: Genetic Predictors of Long-term Course and Outcome in Schizophrenia.
Drs. Andreasen, Wassink, and Ho report receiving research support from Ortho-McNeil Janssen Scientific Affairs. Marsha Wilcox is an employee of Ortho-McNeil Janssen.
The other authors have no conflicts to report.
Supplementary information is available at Molecular Psychiatry’s website.