The underlying genetic etiology of late onset Alzheimer’s disease (LOAD) remains largely unknown, likely due to its polygenic architecture and a lack of sophisticated analytic methods to evaluate complex genotype-phenotype models. The aim of the current study was to overcome these limitations in a bi-multivariate fashion by linking intermediate magnetic resonance imaging (MRI) phenotypes with a genome-wide sample of common single nucleotide polymorphism (SNP) variants. We compared associations between 94 different brain regions of interest derived from structural MRI scans and 533,872 genome-wide SNPs using a novel multivariate statistical procedure, parallel-independent component analysis, in a large, national multi-center subject cohort. The study included 209 elderly healthy controls, 367 subjects with amnestic mild cognitive impairment and 181 with mild, early-stage LOAD, Caucasian adults, from the Alzheimer’s Disease Neuroimaging Initiative cohort. Imaging was performed on comparable 1.5T scanners at over 50 sites in the USA/Canada. Four primary “genetic components” were associated significantly with a single structural network including all regions involved neuropathologically in LOAD. Pathway analysis suggested that each component included several genes already known to contribute to LOAD risk (e.g. APOE4) or involved in pathologic processes contributing to the disorder, including inflammation, diabetes, obesity and cardiovascular disease. In addition significant novel genes identified included ZNF673, VPS13, SLC9A7, ATP5G2 and SHROOM2. Unlike conventional analyses, this multivariate approach identified distinct groups of genes that are plausibly linked in physiologic pathways, perhaps epistatically. Further, the study exemplifies the value of this novel approach to explore large-scale data sets involving high-dimensional gene and endophenotype data.
Late onset Alzheimer’s disease (LOAD), the commonest cause of late-life dementia (Bekris et al., 2010), has high heritability (Gatz et al., 2006a; Gatz et al., 2006b). However, its etiopathology, pathogenesis and major risk genes are only partly known, mainly due to its genetic complexity and heterogeneity. The “amyloid hypothesis” seems insufficient to fully explain LOAD etiology and alternative hypotheses continue to be advanced (Pimplikar et al., 2010).
To date only one gene of major effect, apolipoprotein E ε4 (APOE4), replicates as significantly influencing LOAD risk (Strittmatter et al., 1993), but it does not account for all genetic variability, suggesting the interplay of multiple, mostly unidentified susceptibility loci of smaller effect size acting multiplicatively under a common disease variant model (Eccles and Tapper, 2010) and/or with environmental factors (Traynor and Singleton, 2010). Recent high-throughput genome-wide association studies (GWAS) (van Es and van den Berg, 2009; Grupe et al., 2007; Harold et al., 2009; Seshadri et al., 2010) have identified and replicated, in addition to APOE4, other genes such as BIN1, CLU, ABCA7, CR1, PICALM, MS4A6A, CD33, MS4A4E and CD2AP, all of which (apart from APOE) have modest effect sizes and cumulatively account for only 35% of the population attributable risk (Ku et al., 2011; Naj et al., 2011). However, if LOAD risk is mediated in part by common polymorphisms that individually confer low disease risk but act in concert, typical univariate GWAS might not have enough power to detect these effects consistently unless they use very large samples, which are usually difficult to obtain. More importantly, univariate studies do not take into account the joint effect of multiple genes. This matters because major LOAD risk factors include obesity, cerebrovascular disease and diabetes, all disorders with significant genetic underpinnings (Profenno et al., 2010), suggesting causative genes might belong to common biological pathways shared by these conditions. To circumvent some of these issues, multivariate analyses have been suggested as an approach to identify important genetic factors in LOAD (Gandhi and Wood, 2010).
MRI captures robust phenotypic neuroanatomical LOAD biomarkers, most consistently implicating posterior cingulate and entorhinal cortices, hippocampus and other medial temporal structures (Jack et al., 2010a; Jack et al., 2010b; Smith, 2010; Villain et al., 2010) corresponding to sites of early, severe LOAD-related neuropathology. Imaging genetics attempts to bridge genetic variations with phenotypic trait markers, relating genotypic variations to underlying biological disease etiologies and increasing statistical power, thereby requiring smaller sample sizes (Potkin et al., 2009). However, such strategies require tools to simultaneously accommodate thousands of data points per feature set (e.g. ~10^5 voxels from imaging data and up to 10^6 SNPs from genetic data), posing a major statistical challenge. Often, large scale studies are performed in a univariate fashion that significantly limits either one or both feature sets. However, these techniques can curtail the usefulness of multidimensional data to identify potentially informative relationships. Conventional voxel-wise analyses are computationally time consuming on a genome-wide scale and ineffectively capture cumulative effects spread over multiple genes. Prior analyses (Biffi et al., 2010; Potkin et al., 2009; Shen et al., 2010) on the multi-site MRI/genetic ADNI dataset used massively univariate approaches: GWAS, that confirmed the risk status of APOE4 and identified TOMM40 (Shen et al., 2010) and hypothesis-driven analyses using pre-selected known affected brain regions plus GWAS, that reinforced the status of promising individual genes of interest (Biffi et al., 2010). However, no analyses have evaluated the premise that genetic determinants are not randomly distributed among relevant biological pathways but instead grouped together among specific biological processes, nor have they detected predicted groups of common, interactive risk polymorphisms.
Parallel independent component analysis (Para-ICA) a novel multivariate data-driven, hypothesis-free statistical technique, extends ICA to analyze multiple modalities simultaneously (Calhoun et al., 2009). Para-ICA identifies simultaneously clusters of associated, likely interacting genes related to either: (a) functional brain networks, (b) related structural brain regions, or (c) physiologic processes e.g. EEG patterns or other potential endophenotypes and shows their relationships (Calhoun et al., 2009). Beginning with two modalities (here, SNP’s and MRI-derived regional brain volume/thickness), we sought to discover underlying factors from both modalities and their connections. Similar to conventional ICA analyses, extracted structural MRI components are maximally independent within modality and loading coefficients represent variation among individuals. Networks or components extracted from genetic data are groups of interacting SNP loci, contributing with varying degrees to a genetic process affecting a downstream biological function, i.e. linear SNP combinations highly associated with related phenotypes. To date, this technique has been used mainly in schizophrenia and healthy controls to find genes responsible for brain structure and function using MRI and EEG patterns (Jagannathan et al., 2010; Liu et al., 2008; Meda et al., 2010). However, subject and SNP numbers in those studies were typically small.
Genetic and structural MRI data from the ADNI sample provide an ideal test bed to explore LOAD and to validate application of Para-ICA to larger datasets. The subject number (>800) and large genotypic dataset (>600,000 SNPs) allow for examination of feasibility of scaling up this technique where some valid results are published in this dataset from conventional, hypothesis-driven analyses (Biffi et al., 2010). Because many LOAD risk genes remain to be discovered, the technique can simultaneously be used to identify novel risk genes, as it identifies clusters of related, interacting SNPs.
We had the following goals: 1) to evaluate whether Para-ICA could be scaled up to deal with larger populations and many more SNPs than previously analyzed; 2) to identify new risk genes for LOAD and their corresponding endophenotypes and 3) to explore the different LOAD-mediating biological interactive pathways in which the identified risk genes might participate. We hypothesized that the method might identify previously unknown LOAD risk genes, as well as known candidate genes. We hypothesized that identified genes would group into LOAD-associated physiologic pathways and processes.
We evaluated associations between two data modalities, structural MRI (sMRI), (regional brain volumes and cortical thicknesses), and genome-wide genotypic data (SNPs), to reveal multivariate relationships between structural brain regions and SNP’s that differed between healthy controls, MCI and AD subjects.
Data used in the preparation of this article were obtained from the ADNI database (adni.loni.ucla.edu). ADNI results from efforts of many co-investigators from a broad range of academic institutions and private corporations, with subjects recruited from over 50 sites across the U.S. and Canada. For up-to-date information, see www.adni-info.org. The Principal Investigator of this initiative is Michael W. Weiner, MD, VA Medical Center and University of California – San Francisco.
Data derived from the ADNI database on 818 subjects included baseline 1.5 T MRI scans, Illumina SNP genotyping data, APOE genotype status and demographic information. We limited analyses to European-American ADNI subjects (classified initially into respective ethnic groups based on self-report and validated subsequently using genetic markers) to prevent confluence of population stratification effects on data, yielding a total of 209 HC (Mean/SD Age = 76.05/4.94; 113 Males) with no past history of neurological or psychiatric disorder, 367 subjects with MCI (Mean/SD Age = 74.95/7.37; 239 Males) and 181 subjects with clinically-assessed AD (Mean/SD Age = 75.57/7.48; 100 Males) for analysis.
Sample collection and Single nucleotide polymorphism (SNP) genotyping for more than 620,000 target SNPs across the whole genome was completed on all ADNI participants as described in (Saykin et al., 2010; Shen et al., 2010).
Prior to Para-ICA, genotyped SNP’s underwent two pre-processing stages. First, quality control parameters were employed to discard data unsuitable for further analysis. Samples (both subjects and SNP’s) were checked for missing data and those with missing call rates >5% were excluded. Remaining samples were imputed for missing values (<1%) by replacing data with the corresponding major genotype. Following this, all uninformative SNP’s (constant variance) were excluded. SNP’s were then checked for minor allele frequency (MAF); rare SNP variants with MAFs <0.01 were excluded. Highly correlated SNP’s (r>0.95) (in block sizes of 100kb) were removed. Finally, SNPs (in controls only) were checked for Hardy-Weinberg equilibrium set at a threshold of p<1E-7. QC SNP’s (N=533,872) were then carried over to the next processing stage. The above analyses were performed using custom scripts in Matlab 7.0 (www.mathworks.com).
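The per-SNP part of the quality control steps above can be sketched in a few lines. This is a minimal illustration in Python rather than the authors' Matlab scripts, and it assumes genotypes are coded as 0/1/2 allele counts with `None` for a missing call. The function names are hypothetical, and the correlation-based pruning and Hardy-Weinberg test are deliberately left out.

```python
from collections import Counter

def impute_major_genotype(genotypes):
    """Replace missing calls (None) with the most frequent observed genotype."""
    called = [g for g in genotypes if g is not None]
    major = Counter(called).most_common(1)[0][0]
    return [major if g is None else g for g in genotypes]

def minor_allele_frequency(genotypes):
    """MAF for complete genotypes coded as 0/1/2 copies of one allele."""
    freq = sum(genotypes) / (2 * len(genotypes))
    return min(freq, 1 - freq)

def qc_snp(genotypes, max_missing=0.05, min_maf=0.01):
    """Return the imputed genotype vector if the SNP passes QC, else None."""
    # Step 1: drop SNPs with a missing call rate above 5%.
    missing_rate = sum(g is None for g in genotypes) / len(genotypes)
    if missing_rate > max_missing:
        return None
    # Step 2: fill the remaining gaps with the major genotype.
    imputed = impute_major_genotype(genotypes)
    # Step 3: drop uninformative SNPs (every subject has the same genotype).
    if len(set(imputed)) == 1:
        return None
    # Step 4: drop rare variants (MAF below 0.01).
    if minor_allele_frequency(imputed) < min_maf:
        return None
    return imputed
```

The thresholds default to the values quoted in the text (5% missingness, MAF of 0.01); everything else about the data layout is an assumption made for the example.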
The above pre-processed SNP’s were subjected to a univariate GWAS type case-control association analysis to identify those differing significantly across the three diagnostic groups. This 1) effectively restricted the core analysis to disease-related genetic data in the current sample, 2) reduced potential noise from interacting genes with little or no relationship to the disease model, providing hypothesis-free data-driven “enrichment,” and 3) improved accuracy and the linking coefficient of the Para-ICA algorithm (determined based on previous simulation results (Liu et al., 2008)). All SNP’s (using an additive model) were entered into a mass univariate ANOVA design with diagnostic group as the independent factor, using Matlab 7.0. SNP’s surviving a liberal p<0.05 uncorrected threshold were then advanced to the Para-ICA multivariate association analyses. As noted, at this stage, no multiple comparison correction was performed. All significant SNP’s with a p<0.05 uncorrected threshold (N=27,150) were carried forward to Para-ICA to determine genetic associations (including weak effects spread across multiple SNPs) with brain structures.
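The screening statistic for a single SNP can be illustrated as follows. This is a sketch only, assuming additive 0/1/2 genotype coding and one list of genotype values per diagnostic group (HC, MCI, AD); the function name is hypothetical and the code is not taken from the authors' Matlab implementation. It returns the one-way ANOVA F statistic, and the p-value used for the p<0.05 cut would then be read from an F distribution with the degrees of freedom noted in the comments.

```python
def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA; `groups` is a list of lists of values.

    Assumes at least two groups and some within-group variation
    (otherwise the within-group mean square is zero).
    """
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual values around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    df_between = len(groups) - 1               # k - 1
    df_within = len(all_values) - len(groups)  # N - k
    return (ss_between / df_between) / (ss_within / df_within)
```

With three diagnostic groups the statistic has (2, N-3) degrees of freedom. Applying the same function to every SNP and keeping those below the uncorrected threshold mirrors the enrichment step described above.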
All subjects underwent a high-resolution 3D structural MRI scan (MPRAGE) as detailed in http://www.adni-info.org. We utilized recently published ADNI imaging data, analyzed in Freesurfer V4.1.0 (http://surfer.nmr.mgh.harvard.edu/fswiki) (Shen et al., 2010), thus, brain structure preparation and analysis methods are described only briefly. An automated Freesurfer Bayesian segmentation and parcellation routine extracted and labeled cortical and subcortical tissue classes (Shen et al., 2010), yielding target region volumes, cortical thicknesses and total intracranial volumes for pre-defined brain structure regions-of-interest (ROIs). Freesurfer values for two independently collected MPRAGE scans per subject were averaged to yield a single volume/cortical thickness value. Table 1 lists all imaging phenotypes (N=94; bilateral volumes of interest and cortical thickness values). All values were normalized by Z-score transformation before entry into Para-ICA.
Para-ICA was implemented using the Fusion ICA Toolbox v2.0a; http://icatb.sourceforge.net in Matlab 7.0 to compute independent genetic/imaging networks and simultaneously identify and quantify association between the two modalities/features. This variant of ICA was designed for multimodality processing that extracts components using an entropy term based on information theory to maximize independence and enhances the interconnection by maximizing the linkage function in a joint estimation process (Calhoun et al., 2009; Liu et al., 2008). In addition, Para-ICA estimated loading parameters expressing the weight of the overall component for each subject. Overall correlation values between loading coefficients of the two sets of imaging and genetic component(s) were calculated component-wise for the aggregate sample to identify significantly associated feature sets. Comprehensive mathematical details of the algorithm and methodology are provided in (Liu et al., 2008). Data values from all three diagnostic groups were organized as a matrix of subjects by SNP’s/imaging-ROI values. These genotype and phenotype data matrices were input to the Para-ICA algorithm (see diagram in Figure 1). The number of independent estimated components for both SNP (12 components) and imaging data (8 components) was separately estimated using Akaike information criteria (AIC) (Calhoun et al., 2001). Resulting correlation values between the Para-ICA feature sets were appropriately corrected for multiple comparisons at this stage and a Bonferroni correction was applied based on 12×8=96 comparisons yielding a corrected p value threshold of 0.05/96 = 0.0005. Once significant feature set associations were identified, all contributing SNPs/imaging ROIs across each significant feature/network/component were thresholded at a supra level |Z|>2.0 to specifically identify dominant loadings for each individual network. 
SNP’s or regions surpassing this threshold were deemed to be contributing significantly to the overall signal of the corresponding component/network. Subsequently, loading coefficients of significantly associated components were tested in a case-control fashion to test if they differed significantly among diagnostic groups.
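The post-hoc steps described above (correlating loading coefficients, correcting for 12×8 comparisons and thresholding component weights at |Z|>2) can be sketched as below. This is an illustrative Python sketch, not the Fusion ICA Toolbox code; the function names are hypothetical, and standardizing weights with the sample standard deviation is an assumption. Note that 0.05/96 is approximately 0.00052, which the text rounds to 0.0005.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length loading vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def bonferroni_threshold(alpha, n_genetic, n_imaging):
    """Per-test p threshold when every genetic/imaging component pair is tested."""
    return alpha / (n_genetic * n_imaging)

def dominant_features(weights, z_cut=2.0):
    """Indices of SNPs/ROIs whose standardized component weight exceeds z_cut."""
    n = len(weights)
    mean = sum(weights) / n
    sd = math.sqrt(sum((w - mean) ** 2 for w in weights) / (n - 1))
    return [i for i, w in enumerate(weights) if abs((w - mean) / sd) > z_cut]
```

For example, `bonferroni_threshold(0.05, 12, 8)` gives the corrected threshold for the 96 component pairs, and `dominant_features` applied to one component's weight vector returns the positions of the SNPs or regions that would be carried into the annotation step.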
Significant SNP’s from each component were then batch queried against the dbSNP database http://www.ncbi.nlm.nih.gov/projects/SNP/ to extract corresponding known gene information; genes from this query (derived for each component) were entered into the functional annotation tool, DAVID (http://david.abcc.ncifcrf.gov/) to identify enriched biological themes and visualize these genes on functional pathways e.g. KEGG and/or BioCarta. The Ingenuity pathway analysis tool (IPA; http://www.ingenuity.com/products/pathways_analysis.html) modeled and analyzed the complex biology and genetic interactions as canonical pathway models within the identified significant genetic network(s). SNPs associated with known genes were mapped to the Ingenuity Pathways Knowledgebase to delineate biological networks. Genes were also input to Funcassociate v2.0 (http://llama.mshri.on.ca/funcassociate/) to reveal significantly enriched functional attributes in each component compared against a gene ontology database. Finally, we performed standard chi-square association analyses on the top 10 Z-score-ranked genes from each Para-ICA-derived genetic network, to determine their relative association with the disease model.
Initial data pre-processing with a univariate “GWAS like” analysis (p<0.05 uncorrected) revealed N=27,150 SNPs that differed significantly across groups. It confirmed SNPs from APOE (ε4; p=6.6E-16; ε3; p=3.6E-09) and TOMM40 (p=7.25E-08) as the top three candidate genes, whose genotypes differed significantly across diagnostic groups, as identified in prior GWAS of the same parent dataset (Potkin et al., 2009).
From the 12 SNP and 8 anatomic component networks identified by AIC, Para-ICA identified four different independent genetic networks significantly associated with a single structural network (following Bonferroni correction). These four networks were component/networks 1 (G1; consisting of 169 significant genes/332 SNPs), 2 (G2; 182 genes/377 SNPs), 3 (G3; 267 genes/482 SNPs) and 4 (G4; 169 genes/332 SNPs). G1 and G3 had significant loadings (Z>2) from APOE (ε4). All four networks were significantly associated with only one structural brain network (S1) that encompassed 40 different unilateral regions (a combination of both volumes and cortical thickness surpassing a Z=2 threshold). Key structural regions loading heavily in this component were entorhinal cortex and middle temporal cortex thicknesses and amygdala and hippocampus volumes. A complete list of significant regions is highlighted with double asterisks in Table 1 and illustrated in Figure 2. Significant genotype-phenotype correlations (between loading parameters of each modality or feature) were as follows: 1) G1-S1 (r=−0.53; p<0.0001); 2) G2-S1 (r=0.32; p<0.0001); 3) G4-S1 (r=0.24; p<0.0001); and 4) G3-S1 (r=−0.14; p=0.0001). Figure 2 summarizes these data with the top 10 representative genes from each genetic network along with their corresponding biological functions. Supplementary figure 1 (Fig S1) shows correlation (scatter) plots for these associations. Testing loading coefficients of the above networks for between-group differences revealed that, in addition to having significant associations, they also significantly discriminated groups by baseline diagnosis (AD, MCI, or healthy control). Mean loading coefficients across each genetic/structural network are shown in supplementary Figure 2 (Fig S2).
Ingenuity Pathway Analysis (IPA) software (http://www.ingenuity.com/) was used to detect, visualize, and explore relevant biological networks associated with each genetic component. Top networks for G1 were cellular assembly/organization, cell morphology and development. G2 was enriched with genes related to cardiovascular disease, neurologic disease and cardiac arteriopathy. Primary networks for G4 were cell cycle, cell death and inflammatory response. G3 had an over-representation of genes related to neurological and psychological disorders. All the above networks contained known Alzheimer’s-related proteins in their pathway interactions. Top dynamic networks from each genetic component are illustrated in Figure 3. Based on known gene functions, the top five IPA canonical pathways for each gene network (sorted in terms of genotype-phenotype linkage significance) were as follows:
G1: cAMP mediated signaling, sulfur metabolism, calcium signaling, vascular NO signaling and regulation of IL-2 expression in T lymphocytes. G2: Neuro-protective role of THOP1 in Alzheimer’s, NOS endothelial effects, Type 2 diabetes signaling, tyrosine metabolism, CYP450. G4: cAMP-mediated signaling, cardiac beta-adrenergic signaling, synaptic long-term potentiation, molecular cancer mechanisms, NOS endothelial effects. Significantly associated non-Neurologic disorders were type 2 diabetes (N=80), coronary artery disease (N=71), Crohn’s/inflammatory bowel disease (N=65). G3: Protein kinase A signaling, cardiac beta-adrenergic signaling, cAMP-mediated signaling, amino sugars metabolism, glycosaminoglycan degradation. Significantly associated non-neurologic disorders were Type 2 Diabetes (N=70), Crohn’s disease/IBD (N=57), coronary artery disease (N=57).
The DAVID functional annotation tool revealed that the significant genes from all four genetic networks are involved in multiple biological pathways including Alzheimer’s disease, adherens junction, arrhythmogenic right ventricular and dilated cardiomyopathy, axon guidance, calcium signaling, cell adhesion, ECM receptor interaction, focal adhesion and tight junction and smooth muscle contraction. Figure 4 illustrates (marked with red stars) the significant genes in our study directly related to Alzheimer’s disease on a KEGG pathway map derived from the above tool. Genes analyzed using the Funcassociate v2.0 toolkit revealed several (N=60) significantly over-represented attributes compared against the gene ontology database. The results presented in Table 2 are rank ordered based on adjusted p value along with the number of genes in the query, number of genes in the overall attribute and their odds ratios.
Association analysis (to illustrate their relative disease association) of allelic frequencies for the top ten genes (based on ranked Z-scores) from the four genetic components revealed that the genes SLC9A7, SHROOM2, ZNF673, APOE (ε3, ε4), VPS13C and ATP5G2 showed stronger effects of disease associations compared to other genes within the component. Table 3 details the allelic frequency and the associated statistical value for each of the top 10 genes from all four genetic networks; due to partial overlap (1 gene could appear in more than 1 component) totaling 32 unique genes. A weighting score was derived by normalizing the chi-square value of each gene to the chi-square of the gene most associated with clinical disease status within each component.
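The weighting score described above is a simple normalization, sketched here under stated assumptions. The Pearson chi-square helper takes a contingency table of allele counts (rows are diagnostic groups, columns are alleles), and the gene labels in the usage note are placeholders, not values from Table 3. Function names are hypothetical.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

def weighting_scores(chi_square_by_gene):
    """Scale each gene's chi-square by the largest chi-square in its component."""
    top = max(chi_square_by_gene.values())
    return {gene: value / top for gene, value in chi_square_by_gene.items()}
```

Calling `weighting_scores({"geneA": 8.0, "geneB": 4.0})` assigns the most strongly associated gene a score of 1.0 and expresses the others as fractions of it, which is how the relative weights within a component can be compared.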
As hypothesized, we validated a scaled-up Para-ICA approach to reveal novel interactive genes and pathways for LOAD, thus highlighting one of the primary advantages of Para-ICA which is the use of modest sample sizes compared to conventional GWAS analyses to effectively capture genotype-phenotype relationships. Dominant loading coefficients were contained in all major regions affected by LOAD pathology in the single structural component significantly associated with four different SNP/genetic networks. Seven other structural components were unassociated with other gene components. The genetic components identified included SNPs from APOE4 plus multiple other risk genes (and putatively protective SNPs, e.g. APOE2) either previously identified in LOAD risk (e.g. ATF7 in G3 (Lin et al., 2006)) or involved in one or more biological processes thought to contribute to LOAD pathology. Figure 5 summarizes the involvement of these four genetic risk networks on a LOAD physiologic pathway diagram.
The most significant association was between G1 and S1. This genetic network had significant loading contributions from a total of 169 different genes (332 SNPs) and correlated negatively with brain network S1, implying increased genetic load is related to decreased brain volume/thickness within the network. This association was notable as G1 had high loadings from APOE4 (in the top 10 gene loadings) and S1 included regions known to be affected early and severely in LOAD, including entorhinal, middle temporal and prefrontal cortices and hippocampus. More importantly, this genetic network had high loadings from several other genes (SLC9A7/NHE7, ZNF673, SHROOM2) previously unidentified in LOAD pathology. Given that they were part of the same independent genetic component as APOE, this finding both confirms APOE’s established role as an important LOAD risk gene and suggests that these additional SNPs may interact with APOE to influence disease risk, supporting APOE’s role as a LOAD risk factor rather than a direct cause (Guerreiro et al., 2010). The protein encoded by SLC9A7 mediates Na+/H+ exchange across cell surface plasma membranes (Kagami et al., 2008), cycling between the cell surface and intracellular trans-Golgi network and recycling endosomes, which are vital to APP processing (Marks and Berg, 2010). SLC9A7 co-localizes with actin, implicated in tau formation (Gallo, 2007). LOAD lymphoblasts show abnormalities modulated by sodium/hydrogen exchanger blockers (Urcelay et al., 2001). Overall, this genetic network was enriched with genes dominant in cell signaling pathways. Other strongly contributing genes are involved in lipid transport and tau formation (via actin/myosin binding). ZNF673 is associated with X-linked mental retardation (Lugtenberg et al., 2006; Ramaswamy et al., 2010) and is close (~0.2 Mb) to SLC9A7 on Xp11.3.
The second most significant association in genotype/phenotype correlation was G2-S1. This correlation was positive. G2 comprised 182 unique genes (377 SNPs). Some top-ranked genes from this component overlapped with those from G1, including ZNF673 and SLC9A7. These genes had a significant differential distribution among diagnostic groups, suggesting a role of actin localization and transcriptional regulation in LOAD. Other top genes from this network involved in important AD-related processes included the complement system, involved in amyloid-beta formation and inflammatory damage (van Es and van den Berg, 2009).
Network G4 correlated positively with S1. G4 contained several genes associated with risk for non-neurologic disorders, including diabetes and cardiovascular disease, both LOAD risk factors (Profenno et al., 2010). Top genes from this network, previously unidentified in the context of LOAD, belonged to the complement factor/inhibition pathway related to amyloid-beta clearance (35) or are associated with major histocompatibility class III. Additionally, AKAP9, a top 10 gene in this network, maintains neuronal Golgi integrity and is involved in LOAD pathogenesis (Stieber et al., 1996). Regarding association analysis, no top 10 gene from this network was significantly differentially distributed in the disease groups, suggesting that G4 comprises multiple SNPs of low effect acting together through diverse biological risk pathways, especially inflammation, (see Eikelenboom et al. 2006) to significantly affect LOAD-related neuropathology.
The final genotype-phenotype association was a negative correlation between G3 and S1. G3 included ATP5G2, a subunit of mitochondrial ATP-synthase, which was over-represented in the disease group. Mitochondrial ATP-synthase in entorhinal cortex is a target of oxidative stress in LOAD (Terni et al., 2010) and part of LOAD apoptosis pathways. Several other G3 genes included dominant signaling from CNTN5, recently associated with multiple AD MRI characteristics (Biffi et al., 2010), CEP57, a microtubular/centrosomal localizer (Meunier et al., 2009), MTMR2, an endosomal regulator (Lee et al., 2010), and ATF7, associated with LOAD in Lin et al. (Lin et al., 2006). The loading of previously identified LOAD genes and associated pathobiological pathways further supports the relevance of this genetic network.
Analyzing significant genes from all four components using DAVID and visualizing related processes on KEGG pathways revealed that genes grouped in multiple LOAD-relevant biological processes (see Figure 4). Additional prominent processes not shown in figure included cellular communication, cardiovascular diseases, signal transduction, calcium signaling, cell adhesion and neuronal developmental processes (e.g. axon guidance). Many such processes are implicated in LOAD pathology (e.g. neuronal calcium signaling (Kostiuk et al., 2010; LaFerla, 2002; Mattson and Chan, 2003)). Semaphorin 3A, an axon-guiding membrane protein, accumulates in hippocampus in AD (Koncina et al., 2007).
Major themes deriving from the top 32 Z score-defined genes in the 4 SNP components suggest several major pathophysiological LOAD pathways, especially when such genes co-occurred within a component. From G1, APOE may relate to LOAD risk through pathways not directly linked to amyloid-beta, including actin-related mechanisms. Actin cytoskeletal changes as a path to tau formation (Gallo, 2007) are implicated across all components by SLC9A7/NHE7 (Kagami et al., 2008; Ohgaki et al., 2008), SHROOM2 and COBL (Dominguez, 2009) and microtubule-related genes including MTMR2, CEP57 and CTNND2 (Bamburg and Bloom, 2009; Meunier et al., 2009). Three such genes were present in component 1. Immune function, especially the complement system, related to amyloid-beta clearance (Guerreiro et al., 2010; Kolev et al., 2009) and expressed in cerebrovascular smooth muscle (Walker et al., 2008), is suggested by ATF7, CFB, C2, SKIV2L, C6orf10 and C6orf15 (Li et al., 2006; Veerhuis, 2010). These genes support the known role of the complement system in LOAD pathogenesis (van Es and van den Berg, 2009), while adding new gene candidates, e.g. C2. Complement is present in dystrophic LOAD neurites, involved in immune response and linked to synaptic pruning (Hollingworth et al., 2010). Five immune related/complement genes are present in G4.
CTNND2/Delta Catenin/NPRAP is associated with GSK3-beta, hence BAP and tau (Bareiss et al., 2010). CNTN5 encodes contactin5; other contactins participate in LOAD risk (Biffi et al., 2010; Osterfield et al., 2008). The prominence of SLC9A7/NHE7 (and MTMR2) suggests the importance of the trans-Golgi network and recycling endosome (Lee et al., 2010). Endosomal processing of APP involving SorLA is of importance in LOAD (Lin et al., 2005; Marks and Berg, 2010; Ohgaki et al., 2008). VPS proteins are related to this process (He et al., 2005; Marks and Berg, 2010), although VPS13C has yet to be implicated. VPS13C is associated with maintenance of plasma glucose levels (Saxena et al., 2010); the related VPS26 is linked with BACE/memapsin2 (He et al., 2005). SLC44A4 is involved in choline uptake (Jurgensen and Ferreira, 2010). MTMR2 has relevance to excitatory synapses (Lee et al., 2010). ZKSCAN3/ZNF263 is associated with vascular endothelial growth factor (Yang et al., 2008).
The above data suggest involvement of multiple genes influencing varied, complex pathways that might interact mutually to contribute to LOAD. Output from Para-ICA lends itself readily to functional pathway analysis and ultimately systems biology. We also identified novel putative LOAD risk genes, confirmed via testing allelic frequency distributions among diagnostic groups in standard case-control association analyses. It is notable that while none of these genes survived a standard GWAS, they have high impact when their effect is evaluated in the context of other SNPs. In addition, the SNP components detected several genes previously unknown in the context of LOAD risk, having high Z scores, exceeding those for APOE. Several of these (e.g. SLC9A7, ZNF673, VPS13): (a) were identified by multiple (up to 17) SNPs, (b) mediate processes plausibly associated with LOAD risk from pathway analyses and prior publications, (c) had SNPs differentially distributed among diagnostic groups and (d) are prominently expressed in brain. These results suggest validity of these novel loci as candidate LOAD risk genes.
Examining loading coefficients of the gene and structural networks revealed a stepped response pattern (see Fig S2), with MCI values falling between those of healthy controls and AD, except in G3, where values were elevated in MCI compared to AD. This suggests that this gene component may act either to protect against or to hasten regional brain deterioration in MCI, thereby influencing the rate of progression to AD.
Our study has limitations. Although the Para-ICA method is data driven, we restricted the genetic dataset to a disease-related subset. This focused analysis might fail to uncover genes affecting LOAD pathology via other interactive pathways that may not straightforwardly show group differences. However, since we employed a liberal statistical threshold to limit the genetic dataset to disease-related genes, we were able to include numerous SNPs discarded by conventional univariate studies. Our AD/MCI-focused gene set analysis may not have detected other genetic associations to brain structure. The analysis was carried out only in European-Americans, by far the most numerous ethnicity in the dataset. Future studies could include larger mixed populations. Also, since Para-ICA identified multivariate relationships at the gene network level (each network being a linear combination of SNPs), the directionality and effect magnitude of individual SNPs are not immediately transparent. Our supplementary association analysis to derive the top SNPs might be slightly biased, as these SNPs had already been pre-selected at a liberal cut-off for inclusion in the multivariate analysis. Given these limitations and the novelty of our study, our results require further validation and replication in larger and more diverse independent datasets.
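To make the point about linear combinations concrete, the sketch below shows how a subject's loading on a genetic component can be viewed as a weighted sum over SNP genotypes, and how the dominant SNPs of a component might be picked out by Z-scoring the component weights. This is an illustration under stated assumptions, not the Para-ICA implementation: the function names, the 0/1/2 genotype coding, the weights and the |Z| threshold of 2.0 are all hypothetical.

```python
from statistics import mean, pstdev

def component_loading(genotypes, weights):
    """Subject-level loading as a weighted sum of SNP genotypes.

    genotypes: minor-allele counts (0, 1 or 2) for one subject (assumed coding).
    weights:   component weight for each SNP, in the same order.
    """
    return sum(g * w for g, w in zip(genotypes, weights))

def top_snps(snp_ids, weights, z_threshold=2.0):
    """Return SNP ids whose Z-scored weight magnitude meets the threshold,
    ordered from largest to smallest |Z|. The threshold is an assumption."""
    mu = mean(weights)
    sigma = pstdev(weights)
    z = {s: (w - mu) / sigma for s, w in zip(snp_ids, weights)}
    selected = [s for s in z if abs(z[s]) >= z_threshold]
    return sorted(selected, key=lambda s: -abs(z[s]))

# Hypothetical component with one dominant SNP among ten
ids = [f"snp{i}" for i in range(1, 11)]
w = [0.1, -0.1, 0.05, -0.05, 0.0, 0.0, 0.02, -0.02, 3.0, 0.0]
print(top_snps(ids, w))
print(component_loading([0, 1, 2], [0.5, -1.0, 0.25]))
```

Because the loading sums over many SNPs, a large weight identifies a SNP as influential within the component but does not by itself give the sign or size of that SNP's effect on brain structure, which is the limitation noted above.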
In conclusion, we met our major study goals. 1) We confirmed the feasibility of a hypothesis-blind, multivariate approach to corroborate LOAD genes associated with known pathologic mechanisms and to discover new putative disease-relevant genes that interact but fail individually to reach genome-wide significance. These data thus extend existing GWAS and hypothesis-driven analyses on the same ADNI data set (Biffi et al., 2010; Saykin et al., 2010; Shen et al., 2010). 2) The Para-ICA approach identifies genes in relatively modest-sized samples that are plausibly linked collectively in known physiologic pathways, perhaps epistatically, and suggests itself as a novel method for exploring other large-scale data sets involving gene and endophenotype information, such as BSNIP or COGS (Calkins et al., 2007; Thaker, 2008) in psychotic disorders, where the neuropathology and genetic basis are less well-defined than in LOAD. Finally, 3) we identified plausible new biological pathways associated with AD neuropathology. Possible therapies resulting from our findings might include agents targeted to the complement and/or immune systems.
The study was supported by grants and research support to Dr. Andrew Saykin from Eli Lilly and Company, Siemens AG, Welch Allyn Inc., the NIH (R01 CA101318 [PI], R01 AG19771 [PI], RC2 AG036535 [Core Leader], P30 AG10133-18S1 [Core Leader], and U01 AG032984 [Site PI and Chair, Genetics Working Group]), the Indiana Economic Development Corporation (IEDC #87884), and the Foundation for the NIH, and to Dr. Vince Calhoun from the NIH (R01 EB005846). We would also like to thank Mrs. Joanna Mounce for assistance with the biological pathways and network analyses used in these studies.
Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: Abbott; Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Amorfix Life Sciences Ltd.; AstraZeneca; Bayer HealthCare; BioClinica, Inc.; Biogen Idec Inc.; Bristol-Myers Squibb Company; Eisai Inc.; Elan Pharmaceuticals Inc.; Eli Lilly and Company; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; GE Healthcare; Innogenetics, N.V.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Medpace, Inc.; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Servier; Synarc Inc.; and Takeda Pharmaceutical Company. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Disease Cooperative Study at the University of California, San Diego. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of California, Los Angeles. This research was also supported by NIH grants P30 AG010129, K01 AG030514, and the Dana Foundation.
Conflict of Interest:
Dr. Andrew Saykin receives research support from Eli Lilly and Company, Siemens AG, Welch Allyn Inc.