Penalized or sparse regression methods are gaining increasing attention in imaging genomics, as they can select optimal regressors from a large set of predictors whose individual effects are small or mostly zero. We applied a multivariate approach, based on L1-L2-regularized regression (elastic net), to predict a magnetic resonance imaging (MRI) tensor-based morphometry-derived measure of temporal lobe volume from a genome-wide scan in 740 Alzheimer’s Disease Neuroimaging Initiative (ADNI) subjects. We tuned the elastic net model’s parameters using internal cross-validation and evaluated the model on independent test sets. Compared to 100,000 permutations performed with randomized imaging measures, the predictions were statistically significant (p ~ 0.001). The rs9933137 variant in the RBFOX1 gene was a highly contributory genotype, along with rs10845840 in GRIN2B and rs2456930, both discovered previously in a univariate genome-wide search.
Many early studies in imaging genetics explored univariate associations between genotypes and imaging measures, assuming each gene acted independently. One disadvantage of such studies is their limited statistical power to detect gene effects on the brain. Meta-analyses such as the Enhancing Neuro Imaging Genetics through Meta-Analysis (ENIGMA) project have boosted statistical power by analyzing MRI and genome-wide genotype data from over 20,000 subjects. Multivariate approaches, which simultaneously consider entire sets of genotypes, sets of voxels in an image, or both, have also become more popular, as they can handle common problems in high-dimensional data, such as highly correlated predictors, almost all of which have no detectable effect.
We previously reviewed several recent multivariate imaging genetics studies that applied principal component regression, sparse reduced rank regression, or independent components analysis to discover genetic influences on the brain that would have been missed by univariate techniques alone. Regularized, sparse regression methods, in particular, use penalty terms to tackle the problems of high dimensionality (e.g., having more predictors than samples), multiple highly correlated measures, and multiple comparisons across an image, the genome, or both. The “elastic net” combines L1- and L2-norm regularization and benefits from the advantages of both when handling high-dimensional, highly correlated data. The algorithm takes advantage of the sparsity properties of L1 regularization (the Least Absolute Shrinkage and Selection Operator, or LASSO), along with the stability of L2 (ridge) regression. Here, we introduce an elastic net approach to predict an imaging measure from top genotypes. We aim to incorporate top genetic variants (i.e., single nucleotide polymorphisms or SNPs), screened with a univariate genome-wide search (as in a genome-wide association study or GWAS), into an elastic net model to predict temporal lobe volume on MRI. Recently, the elastic net has been applied to genomics [7,8], to consider genetic polymorphisms jointly, as well as to imaging, to integrate large numbers of imaging and clinical predictors. More recently, the algorithm has also been used to detect multi-SNP associations with hippocampal surface morphometry, and to integrate imaging and proteomic data in Alzheimer’s disease.
We hypothesized that this doubly regularized, multivariate regression method would allow us to make significant predictions of MRI-derived temporal lobe volume from genotypes. We propose that this predictive approach may have implications for early, personalized risk assessment of brain disorders such as Alzheimer’s disease, where the temporal lobes undergo significant atrophy.
ADNI subjects were scanned with a standard MRI protocol optimized for reproducibility and consistency across 58 sites in North America. Temporal lobe volumes were derived from an anatomically defined region-of-interest (ROI) on three-dimensional maps of relative volumes generated with tensor-based morphometry (TBM), a well-established method to map volumetric differences in the brain. Temporal lobe volume is particularly interesting, as this structure is prone to atrophy in Alzheimer’s disease (AD). There is interest in discovering genes that may promote or resist the atrophy, or contribute to normal variations in its volume. A total of 740 subjects with both imaging and genotype data were included (173 with AD, 361 with mild cognitive impairment or MCI, and 206 cognitively healthy controls; 438 men and 302 women; mean ± SD age: 75.55 ± 6.79 years).
Genotyping procedures for ADNI have been described previously. SNPs with minor allele frequencies less than 0.01 or Hardy-Weinberg equilibrium p-values below 5.7 × 10⁻⁷ were excluded. Genotypes were imputed to infer missing information.
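The two exclusion criteria can be illustrated with a minimal sketch. It assumes genotypes coded as allele counts (0, 1 or 2) and uses a simple 1-degree-of-freedom chi-square test for Hardy-Weinberg equilibrium; the test actually used for ADNI quality control is not stated here, and the function names are hypothetical.

```python
import math

def minor_allele_frequency(genotypes):
    """MAF for one SNP; genotypes are allele counts coded 0, 1 or 2."""
    freq = sum(genotypes) / (2 * len(genotypes))
    return min(freq, 1 - freq)

def hwe_chi2_pvalue(genotypes):
    """Chi-square (1 df) test of Hardy-Weinberg equilibrium (assumed test)."""
    n = len(genotypes)
    observed = [genotypes.count(g) for g in (0, 1, 2)]
    p = (observed[1] + 2 * observed[2]) / (2 * n)  # frequency of the counted allele
    q = 1 - p
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    if min(expected) == 0:
        return 1.0  # monomorphic SNP: test undefined, left to the MAF filter
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom
    return math.erfc(math.sqrt(chi2 / 2))

def passes_qc(genotypes, maf_min=0.01, hwe_min=5.7e-7):
    """Keep a SNP only if it is common enough and not far from HWE."""
    return (minor_allele_frequency(genotypes) >= maf_min
            and hwe_chi2_pvalue(genotypes) >= hwe_min)
```

A SNP with genotype counts close to their Hardy-Weinberg expectations passes, whereas a SNP with only heterozygous calls, or one with a single minor allele among 100 subjects, is excluded.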
The elastic net is a form of penalized regression in which both L1 and L2 regularization terms are added to the standard multiple linear regression model, as formulated below for n subjects and p predictors:
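A sketch of criterion (1), assuming the standard unscaled ("naive") elastic net form with no extra factor such as 1/2 or 1/n on the loss term, and using the symbols defined in the next paragraph:

```latex
\beta^{*} = \arg\min_{\beta \in \mathbb{R}^{p}}
  \left\{ \lVert y - X\beta \rVert_2^2
        + \lambda_1 \lVert \beta \rVert_1
        + \lambda_2 \lVert \beta \rVert_2^2 \right\}
  \qquad (1)
```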
Here, y is the vector whose n components are the imaging measures for each subject, after adjusting for sex and age (i.e., the residuals of a regression on these covariates). X is the n × p matrix of genotypes for the top genetic variants across the genome. β* is the vector of fitted regression coefficients, one for each SNP’s effect on the imaging measure. λ1 is a positive weighting parameter on the L1 penalty, which promotes sparsity by setting many of the fitted regression coefficients to exactly zero. λ2 is a positive weighting parameter on the L2 penalty, which promotes stability in the regularization path and removes the limit on how many variables can be selected (the strict LASSO can select at most n variables when p > n).
In ten separate experiments (Figure 1), we randomly split the data into training sets with 3n/4 subjects and testing sets with n/4 subjects. Standard univariate association tests with the imaging measure were performed for all ~500,000 genotyped variants, using the training set only, and the top 4,000 SNPs were then fed into the elastic net algorithm. This is a common pre-screening step that has been used in similar contexts. Leave-one-out cross-validation was performed within the training sets to determine the optimal penalty parameters with the mean squared error criterion. Both λ1 and α were optimized with a grid search, where α = λ2 / (λ1 + λ2), such that the penalty term of (1), P, is restated as below:
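Assuming the penalty in (1) is the sum of the λ1-weighted L1 norm and the λ2-weighted squared L2 norm, the mixing parameter defined above gives the following identity. This is one consistent way to write the restated penalty; the exact scaling used in the original display is an assumption.

```latex
P = \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2
  = (\lambda_1 + \lambda_2)
    \left[ (1 - \alpha) \lVert \beta \rVert_1
         + \alpha \lVert \beta \rVert_2^2 \right],
\qquad \alpha = \frac{\lambda_2}{\lambda_1 + \lambda_2}
```

With this convention, α close to 0 approaches the LASSO and α close to 1 approaches ridge regression.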
Mean squared error is commonly minimized for parameter tuning with cross-validation, as in previous studies in this context [10,11]. To avoid bias, cross-validation for selecting hyperparameters was done separately from evaluation of the model. Models trained with the optimal penalty parameters were then applied to the test sets to obtain mean squared errors for predicting the imaging measure from genotypes. For our analyses, we used the ‘glmnet’ package implemented in R (http://cran.r-project.org), which fits the model with an efficient coordinate descent algorithm.
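The tuning procedure can be sketched in a few lines of plain Python. This is an illustrative re-implementation, not the glmnet code: it assumes the unscaled objective ‖y − Xβ‖² + λ1‖β‖1 + λ2‖β‖2², centered data with no intercept, and hypothetical function names. Note that glmnet's own `alpha` argument weights the L1 term, so it is not the same quantity as the α defined above.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def elastic_net(X, y, lam1, lam2, n_iter=200):
    """Coordinate descent for ||y - Xb||^2 + lam1*||b||_1 + lam2*||b||_2^2.

    Assumes lam2 > 0 (or no all-zero columns) so the denominator is nonzero.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # y - X @ beta, with beta starting at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Inner product of column j with the residual that excludes j
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam1 / 2) / (col_sq[j] + lam2)
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta

def mse(X, y, beta):
    """Mean squared prediction error of a fitted coefficient vector."""
    preds = [sum(x * b for x, b in zip(row, beta)) for row in X]
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)

def loocv_error(X, y, lam1, lam2):
    """Leave-one-out cross-validated mean squared error."""
    total = 0.0
    for i in range(len(y)):
        beta = elastic_net(X[:i] + X[i + 1:], y[:i] + y[i + 1:], lam1, lam2)
        total += mse([X[i]], [y[i]], beta)
    return total / len(y)

def tune(X, y, lam1_grid, alpha_grid):
    """Grid search over (lam1, alpha), with alpha = lam2 / (lam1 + lam2)."""
    best = None
    for lam1 in lam1_grid:
        for alpha in alpha_grid:
            lam2 = alpha * lam1 / (1 - alpha)  # requires 0 < alpha < 1
            err = loocv_error(X, y, lam1, lam2)
            if best is None or err < best[0]:
                best = (err, lam1, lam2)
    return best  # (cross-validated MSE, lam1, lam2)
```

In the workflow described here, `tune` would be called on a training split only, and the selected pair (λ1, λ2) would then be used to fit a model that is scored once with `mse` on the held-out test split.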
A similar procedure was repeated 100,000 times on permuted data. To reduce computational time, and unlike the actual experiments, only the optimal penalty parameters were used and a fixed set of top 4,000 SNPs from a univariate genome-wide search was incorporated into the models. Imaging measures were randomly reassigned across subjects, after which the data were randomly split into training and testing sets as above. Mean squared errors for prediction of test set temporal lobe volumes were then obtained for each permutation.
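A minimal sketch of this permutation scheme follows. The callback `fit_and_score` is hypothetical: it stands for whatever routine splits the data, fits the model with the fixed penalty parameters, and returns a test-set mean squared error. The "+1" correction in the p-value is a common convention and an assumption here, not a detail taken from the study.

```python
import random

def permuted_test_errors(y, fit_and_score, n_perm, seed=0):
    """Collect test errors after shuffling the imaging measure across subjects."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_perm):
        y_perm = list(y)
        rng.shuffle(y_perm)  # breaks the genotype-phenotype link
        errors.append(fit_and_score(y_perm))
    return errors

def permutation_pvalue(observed_error, permuted_errors):
    """Share of permutations that predict at least as well as the real data."""
    at_least_as_good = sum(1 for e in permuted_errors if e <= observed_error)
    return (at_least_as_good + 1) / (len(permuted_errors) + 1)
```

Smaller errors mean better prediction, so the p-value counts permutations whose error is at or below the error obtained with the true imaging measures.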
Standard multiple regression cannot be used in our scenario: a multivariate analysis of all top SNPs would fail (i.e., the model fitting equation would be ill-conditioned) because there are many more variants than subjects (the p ≫ n problem).
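The failure can be made explicit with the ordinary least squares normal equations, written here in standard textbook notation with the symbols X, y, n, p and λ2 used elsewhere in this section:

```latex
\hat{\beta}_{\mathrm{OLS}} = (X^{\top} X)^{-1} X^{\top} y,
\qquad
\operatorname{rank}(X^{\top} X) = \operatorname{rank}(X) \le n < p
```

Because the p × p matrix XᵀX has rank at most n, it is singular when p > n and the inverse does not exist. Adding an L2 penalty replaces it with XᵀX + λ2 I, which is positive definite for any λ2 > 0.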
To perform post-hoc, exploratory tests on our top SNPs, we created voxelwise statistical maps to reveal the spatial profile of associations with regional brain volumes. We fitted linear associations at each voxel, adjusted for covariates (sex and age). To correct for multiple spatial comparisons, we used a regional False Discovery Rate (FDR) method, which is now fairly standard in neuroimaging.
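As an illustration of FDR control over voxelwise p-values, the sketch below implements the classic Benjamini-Hochberg step-up procedure. This is a simpler, swapped-in technique and not the regional FDR method used in the study; the function name and the threshold q = 0.05 are assumptions.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return one boolean per test: True where it is declared significant."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: remember the largest rank whose p-value is under its bound
        if pvalues[i] <= q * rank / m:
            cutoff_rank = rank
    significant = [False] * m
    for i in order[:cutoff_rank]:
        significant[i] = True
    return significant
```

Every test ranked at or below the largest passing rank is declared significant, even if its own p-value exceeds its individual bound.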
We averaged the mean squared errors of the optimized predictive models on the test sets. An average mean squared error of 3,147 was obtained with the elastic net predictor in independent sets of test subjects. The average mean squared error in the 100,000 permutations was 4,257, with a standard deviation of 397. Compared to the distribution of the errors across the permutations (Figure 2), the p-value was close to 0.001.
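As a descriptive check on these figures, the observed error can be expressed in units of the permutation standard deviation. This is only a summary of the distance from the null mean; the reported p-value comes from the empirical permutation distribution, not from a normal approximation.

```latex
z = \frac{3147 - 4257}{397} = \frac{-1110}{397} \approx -2.80
```

The observed error therefore lies about 2.8 standard deviations below the mean of the permuted errors.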
To investigate which genetic variants contributed most to the predictions, we examined the average absolute values of the coefficients for each fitted predictor. Of the 4,000 variants incorporated into the elastic net models in each of the ten trials, 105 passed the screening in all ten trials. We investigated the coefficients obtained for these SNPs; the top ten are shown in Table 1. To ensure that the findings were robust, we also counted the number of times each variant received a nonzero coefficient across the ten runs (Table 1). With permutations, each SNP obtained a nonzero coefficient only about 2.0 ± 0.5 times (mean ± SD).
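The two summary statistics described above can be computed as in the following sketch. It assumes each trial's fitted model is available as a dictionary from SNP identifier to coefficient; this data layout and the function name are hypothetical.

```python
def summarize_coefficients(runs):
    """Rank SNPs present in every run by mean |coefficient|.

    runs: list of dicts mapping SNP id -> fitted coefficient for one trial.
    Returns (snp, mean_abs_coefficient, n_nonzero) tuples, largest first.
    """
    shared = set(runs[0]).intersection(*runs[1:])  # SNPs screened in all trials
    summary = []
    for snp in shared:
        coefs = [run[snp] for run in runs]
        mean_abs = sum(abs(c) for c in coefs) / len(coefs)
        n_nonzero = sum(1 for c in coefs if c != 0.0)
        summary.append((snp, mean_abs, n_nonzero))
    return sorted(summary, key=lambda row: row[1], reverse=True)
```

The nonzero count complements the mean absolute coefficient: a SNP selected in most runs is less likely to owe its ranking to a single favorable split.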
We noted that rs10845840 in the GRIN2B gene and the intergenic rs2456930, which were the top findings in a univariate genome-wide search, also appeared in our top list, which is a reassuring validation. Interestingly, rs9933137 in the RBFOX1 gene also obtained a very high mean |β| and outperformed the top univariate SNP in GRIN2B. To explore the profile of effects of the RBFOX1 SNP on the temporal lobes in more detail, we performed an exploratory, post-hoc voxelwise test, shown in Figure 3.
We proposed a multivariate model to predict an imaging measure from genotypes, using L1-L2 regularized regression, also known as the elastic net. We split 740 ADNI subjects into training and test sets in ten separate trials. We optimized elastic net parameters in the training set using leave-one-out cross-validation, and predictions were made on the independent test sets. This is a rigorous predictive framework, as it avoids the overfitting that can arise if training data are used for testing. We also compared the performance of our predictor with that of 100,000 permutations, where MRI measures were randomly assigned to the subjects. Our predictions were significantly better than those made by random models. Although the main goal of our study was prediction rather than discovery, we also looked for the variants that most strongly contributed to the predictions. Using average elastic net coefficients as a metric, we found a single nucleotide polymorphism in the RBFOX1 gene to be most contributory to the predictive models, which also showed significant 3D effects on the temporal lobes. This gene, also known as A2BP1, has been previously characterized as an autism risk gene, and regulates neuronal excitation in the brain. Interestingly, it has also been discovered in another sparse regression imaging genetics study as a highly significant gene. Future studies are needed to compare the performance of this predictor with other multivariate techniques. Prescreening of genetic variants, which was done to reduce dimensionality as in previous studies, may be a limitation, as it might lead to missing potential effects from contributory genes. Furthermore, applying multi-voxel methods [4,5,19] and incorporating biological pathway information may yield more statistically powerful predictions.
ADNI data collection was supported by federal and private funds including NIH grants U01 AG024904, P30 AG010129, K01 AG030514, and the Dana Foundation. The ADNI Genetics Core, led by Andrew Saykin, performed the ADNI genotyping. OK was partially supported by the UCLA Medical Scientist Training Program. Algorithm development was supported by AG016570, EB01651, RR019771 (to PT).