We set out to discover and replicate gene effects on brain structure using a gene-centric LASSO regression approach. The goal of the method was to sift through the vast amount of genomic data and arrive at a more efficient set of variants for association testing. LASSO allowed us to select sparse subsets of SNPs among all correlated SNPs within each gene and to test them jointly, using partial F-tests, for association with an MRI-derived temporal lobe volume measure. Using this approach, we identified over twenty genes with significant effects on temporal lobe structure in N = 729 elderly subjects from the ADNI cohort. This is considerably more genes than are identified by a univariate approach, which tests the association of single SNPs one by one. In all 22 genes identified, multi-SNP p-values (from partial F-tests), using SNPs selected by LASSO, were “more significant” (i.e., lower p-values, greater effect sizes) than those of the top genotyped SNP within each corresponding gene, as computed using standard univariate GWAS. GRIN2B and NRXN3, which were identified with univariate GWAS (Stein et al., 2010a), were boosted from p-values on the order of 10⁻⁷ and 10⁻⁶ to 10⁻⁹ and 10⁻⁸, respectively. In addition, new genes with more significant p-values were discovered, whose SNPs’ individual p-values were too weak to pass genome-wide significance in a more standard univariate GWAS experimental design. Furthermore, post hoc analysis revealed widespread and significant voxelwise influences for the top genes on TBM maps of the temporal lobes. We also replicated, at least in part, the spatial effects of our most significant finding, in the MACROD2 gene, in an independent cohort of healthy young adults.
Penalized regression techniques such as LASSO (Tibshirani, 1996), ridge regression (Hoerl, 1962), and the elastic net (Zou and Hastie, 2005) have recently been highly effective when used in GWAS. They all deal with (1) multicollinearity due to LD, (2) the large dimensionality of the genome, and (3) the problem of multiple comparisons (Malo et al., 2008; Cho et al., 2009, 2010; Lin et al., 2009; Shi et al., 2011). LASSO’s emphasis on sparsity is particularly useful in our study, as it helps point to a small set of independent variants in a given gene, which we can then incorporate into a multiple regression framework. This is similar to the approach taken by Chen et al. (2011), in the context of jointly considering rare and common variants. Here, we tested this algorithm in the context of finding genetic influences on an imaging-derived measure of temporal lobe volume. This allowed us to discover and replicate a great number of genes relative to our earlier imaging GWAS study (Stein et al., 2010a). We were also able to implicate several genes with previously identified relevance to brain disorders (see below).
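The two-stage procedure summarized above (sparse SNP selection within a gene, then a joint partial F-test of the selected SNPs) can be illustrated with a minimal Python sketch. It is not the implementation used in this study. The simulated genotypes, the single age covariate, the penalty value lam = 0.1, and the helper names lasso_cd and partial_f are all assumptions introduced for illustration, and the solver is a generic coordinate-descent LASSO.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent LASSO: minimise (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()  # equals y - X @ beta while beta is all zeros
    for _ in range(n_iter):
        for j in range(p):
            # Covariance of SNP j with the residual that excludes SNP j's own fit
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]  # soft threshold
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

def partial_f(y, X_reduced, X_full):
    """F statistic comparing two nested least-squares models."""
    def rss(X):
        coef = np.linalg.lstsq(X, y, rcond=None)[0]
        r = y - X @ coef
        return float(r @ r)
    df1 = X_full.shape[1] - X_reduced.shape[1]
    df2 = len(y) - X_full.shape[1]
    rss_r, rss_f = rss(X_reduced), rss(X_full)
    return ((rss_r - rss_f) / df1) / (rss_f / df2), df1, df2

# Simulated "gene" with 20 SNPs; every value here is hypothetical.
rng = np.random.default_rng(0)
n, p = 400, 20
G = rng.binomial(2, 0.3, size=(n, p)).astype(float)      # additive 0/1/2 coding
G[:, 4] = np.where(rng.random(n) < 0.9, G[:, 3], G[:, 4])  # SNP 4 in LD with SNP 3
age = rng.normal(75, 7, size=n)
y = 1.0 * G[:, 3] + 1.0 * G[:, 7] + 0.02 * age + rng.normal(size=n)

# Stage 1: LASSO on standardized genotypes keeps a sparse subset of SNPs.
Gs = (G - G.mean(axis=0)) / G.std(axis=0)
beta = lasso_cd(Gs, y - y.mean(), lam=0.1)
selected = np.flatnonzero(beta)

# Stage 2: partial F-test of the selected SNPs, jointly, over the covariates.
covars = np.column_stack([np.ones(n), age])
f_stat, df1, df2 = partial_f(y, covars, np.column_stack([covars, G[:, selected]]))
print(selected, round(f_stat, 1), df1, df2)
```

In this toy setting the two causal SNPs (indices 3 and 7) survive the penalty, while most null SNPs are dropped. Converting the F statistic to a p-value would require the F distribution with (df1, df2) degrees of freedom, which is omitted here to keep the sketch dependent on NumPy only.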
In a recent study by our group, Hibar et al. (2011a) also considered gene-based associations with brain images, using a new method based on principal component regression, which associates genes with images by capturing most of the variation among intragenic SNPs. Our approach complements this method, as it instead emphasizes sparsity of the model based on the available SNP data for each gene. Principal components regression takes a rather different line of analysis, in which the covariance in a set of N genotyped SNPs is analyzed to produce a reduced set of k (< N) predictors that encode some of the genetic variance but are more efficient than the original set. That method can also produce an overall p-value for a specific gene to quantify its effect on brain structure. However, the results of principal components regression are less readily ascribed to any specific sets of SNPs on the genome.
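For contrast, the principal components regression idea described above can be sketched as follows. This is a generic illustration, not the procedure of Hibar et al. (2011a). The simulated genotypes, the 95% variance cutoff used to choose k, and all variable names are assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 300, 15  # subjects, SNPs in one hypothetical gene

# A shared latent factor induces correlation between SNPs, loosely mimicking LD.
latent = rng.normal(size=(n, 1))
liability = 0.8 * latent + 0.6 * rng.normal(size=(n, p))
G = (liability > -0.5).astype(float) + (liability > 0.5).astype(float)  # 0, 1, or 2
y = 0.5 * G[:, 0] + rng.normal(size=n)

# Principal components of the standardized genotype matrix via SVD.
Gs = (G - G.mean(axis=0)) / G.std(axis=0)
U, s, Vt = np.linalg.svd(Gs, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)

# Keep the smallest k whose components explain at least 95% of the variance
# (the cutoff is an assumption of this sketch).
k = int(np.searchsorted(np.cumsum(explained), 0.95)) + 1
scores = U[:, :k] * s[:k]  # n x k matrix of uncorrelated predictors

def rss(X, y):
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ coef
    return float(r @ r)

# Overall gene-level F statistic: intercept-only model versus intercept + k components.
X0 = np.ones((n, 1))
X1 = np.column_stack([X0, scores])
f_stat = ((rss(X0, y) - rss(X1, y)) / k) / (rss(X1, y) / (n - k - 1))
print(k, round(f_stat, 2))
```

Each retained component is a weighted combination of all SNPs in the gene (the rows of Vt), which is why a significant result is harder to attribute to specific SNPs than one obtained from a sparse LASSO selection.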
Although the p-values we obtained for the two genes discovered in univariate GWAS were more significant than the univariate p-values of their top SNPs, this need not be the case for every gene. As also discussed in Hibar et al. (2011a), there are cases where a univariate test for the top SNP in a gene offers more power to detect an effect than a multivariate F-test for the whole gene. ADAMTS2, for instance, is a gene that contains a SNP with the lowest univariate p-value (rs12513486, p = 2.23 × 10⁻⁵) in our dataset just below the significance threshold considered in Stein et al. (2010a). With our F-test approach following LASSO regression, the gene actually obtained a weaker association (p = 4.83 × 10⁻⁵). Thus, our approach complements univariate GWAS, but does not always boost detection power by including multiple loci.
Several of the genes we identified have been well studied in the context of psychiatric and neurological disorders, including Alzheimer’s disease. Our most significant gene, MACROD2, which we also replicated in a new cohort, was recently discovered in the context of autism spectrum disorder (ASD), as the gene containing the top SNP (p < 5 × 10⁻⁸) in a GWAS of 1,558 families of whom some members had been diagnosed with ASD (Anney et al., 2010). The investigators of that study reported that although the precise function of this gene is mostly unknown, it is involved in several biological functions and the region comprising their top SNP may regulate PLD2, a gene coding for a member of a protein family with significant implications for ASD. MACROD2 has also been associated with schizophrenia, as the gene corresponding to a rare, copy number variant in a linkage analysis (Xu et al., 2009). The same gene has also been associated with MRI-defined brain infarcts, as the gene comprising the top SNP (p < 5 × 10⁻⁷) in a meta-analysis GWAS of >9,000 mostly white, European subjects from the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) consortium with an average age of 69.7 years (Debette et al., 2010). Interestingly, this study also found a suggestive association for a SNP in GALNTL4, another top gene in our list.
We additionally found boosted associations for two other genes: GRIN2B, coding for a subunit of an N-methyl-d-aspartate (NMDA)-type glutamate receptor, and NRXN3, coding for a neurexin important for synaptic function. Both were previously identified in the same dataset with standard GWAS (Stein et al., 2010a).
We also found significant associations for SORCS2 and MAGI2, which are in the AlzGene database of genes that show promising associations with the risk for developing AD based on the literature (Rogaeva et al., 2007; Potkin et al., 2009). Additionally, NPAS3 has been linked to schizophrenia and bipolar disorder (Pickard et al., 2009), CLSTN2 has been associated with memory performance (Papassotiropoulos et al., 2006) and with Alzheimer’s disease (Liu et al., 2007), and RBFOX1 (A2BP1) has been very recently discovered as a splicing regulator of neuronal excitation and calcium homeostasis in the brain (Gehman et al., 2011). RBFOX1 has been associated with autism, among other brain disorders (Martin et al., 2007), and an RBFOX1 variant has interestingly been detected in another sparse regression study in ADNI (Vounou et al., 2012).
The discovery and replication populations in this study are quite different: the ADNI cohort consists of elderly subjects within the spectrum of Alzheimer’s disease and the Brisbane cohort consists of healthy young adults. Though the cohorts share a Caucasian background, which minimizes genetic heterogeneity, the age and health differences between them imply that their brain structure may be influenced by genetic risks and mechanisms that partially overlap but also partially differ. By choosing cohorts that differ in age, we can find polymorphisms that are of enduring relevance over the lifespan, but may be less able to confirm gene effects that only matter in old age. Unfortunately, another elderly cohort with GWAS and MRI scans was not available to us at this time; we requested published GWAS data from other elderly cohorts whose MRI scans we have already analyzed, but our request was declined. However, as discussed in Stein et al. (2011), who discovered and replicated effects of genetic variants on MRI-derived caudate volume in the same two cohorts, there may be genes with persistent effects over the human lifespan, so any such replication may be even stronger than one observed between more similar populations. In addition, though the genes we identified here may not all be AD genes, the effect of a well-known AD risk-conferring polymorphism in the clusterin gene (CLU), and that of GAB2, another AD gene, have been replicated as showing associations with brain structure in the same young adult cohort (Braskie et al., 2011; Hibar et al., 2012). In other words, we knew in advance that genes associated with brain structure in the elderly may also exert detectable effects in scans from younger people. This was also the case for some SNP effects that were reliably replicated by two very large GWAS consortia analyzing brain scans from cohorts across the lifespan (Stein et al., 2012) or in elderly cohorts (Bis et al., 2012). In this study, we were similarly interested to see if our top gene’s effects on temporal lobe structure would replicate in the young adult cohort, suggesting a more lasting influence on brain structure across a person’s lifetime.
In our study, SNPs were coded additively, i.e., using a value of 0, 1, or 2 for the number of minor alleles. This coding makes the assumption that all SNPs considered in the analysis exert their effects in an additive fashion, as opposed to alternative models such as recessive or dominant. This is certainly not the case for all SNPs, and the assumption likely affects the statistical power of our results, since greatest power is obtained when the true model of a causal allele is implemented (Lettre et al., 2007). This is a potential limitation of our study. We chose the additive model as previous genome-wide analyses of the same dataset relevant to our work implement the same allelic coding and were successful in finding genetic associations that were later replicated (Stein et al., 2010a; Hibar et al., 2011a). Furthermore, the additive model is the most commonly used association model. It is the model assumed in heritability calculations and has been argued to be closest to actual risk models for complex traits, such as our quantitative imaging-based measure (Balding, 2006).
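To make the coding schemes concrete, the short Python sketch below encodes the same genotypes under the additive model used here and under the dominant and recessive alternatives mentioned above. The function names, the two-letter genotype strings, and the choice of G as the minor allele are hypothetical and serve only as an illustration.

```python
def minor_allele_count(genotype, minor):
    """Number of copies of the minor allele in a two-letter genotype string."""
    return sum(1 for allele in genotype if allele == minor)

def encode(genotype, minor, model="additive"):
    count = minor_allele_count(genotype, minor)
    if model == "additive":   # 0, 1, or 2 copies: each copy adds the same effect
        return count
    if model == "dominant":   # one copy is enough for the full effect
        return int(count >= 1)
    if model == "recessive":  # two copies are required for any effect
        return int(count == 2)
    raise ValueError(f"unknown model: {model}")

genotypes = ["AA", "AG", "GG", "GA"]  # hypothetical SNP with minor allele G
coded = {
    model: [encode(g, "G", model) for g in genotypes]
    for model in ("additive", "dominant", "recessive")
}
print(coded["additive"])   # [0, 1, 2, 1]
print(coded["dominant"])   # [0, 1, 1, 1]
print(coded["recessive"])  # [0, 0, 1, 0]
```

Under the additive coding, heterozygotes sit midway between the two homozygotes, so a truly dominant or recessive effect is only partly captured by the regression slope. This is the source of the power loss noted above.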
A possible limitation of our results is that we do not implement a nested cross-validation approach, in which SNPs selected from LASSO regression would have been included in F-tests in non-overlapping subjects. Our implementation of LASSO here, however, fits into a filtering rather than predictive framework, and similar data-adaptive filtering followed by F-tests in the same dataset has been done in previous work (Chen et al., 2011; Hibar et al., 2011a). This approach is potentially unfair, as fitting in LASSO is followed by another fitting with multiple linear regression (F-tests) in the same dataset, whereas fitting is only performed once in a univariate scheme. As GWAS are sensitive to sample size, a nested cross-validation scheme, though more robust, would most likely yield no significant results. We observed this power limitation, as we attempted nested approaches with varying numbers of folds, and were unable to obtain boosted gene-based associations. This may change in the future, as larger datasets become more widely available. Our use of a replication cohort, however, does add credibility to the top result, as the same set of SNPs selected in the discovery sample shows significant spatial effects on brain scans from a completely independent (non-overlapping) group of subjects scanned with a different scanner on a different continent. Another limitation of our approach is that we focus on genes, but exclude promoter and intergenic SNPs. This has the drawback of missing potentially important regulatory elements in the genome.
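The held-out scheme discussed above, in which SNPs are selected in one group of subjects and tested in non-overlapping subjects, might look roughly like the following Python sketch. It is an illustration under stated assumptions, not the nested analysis attempted in this study. The data are simulated, the fold count and the number of SNPs kept are arbitrary, and a simple marginal-correlation filter stands in for LASSO to keep the example short.

```python
import numpy as np

def partial_f(y, X_reduced, X_full):
    """F statistic comparing two nested least-squares models."""
    def rss(X):
        coef = np.linalg.lstsq(X, y, rcond=None)[0]
        r = y - X @ coef
        return float(r @ r)
    df1 = X_full.shape[1] - X_reduced.shape[1]
    df2 = len(y) - X_full.shape[1]
    rss_r, rss_f = rss(X_reduced), rss(X_full)
    return ((rss_r - rss_f) / df1) / (rss_f / df2), df1, df2

# Simulated gene; every value here is hypothetical.
rng = np.random.default_rng(2)
n, p, n_folds, n_keep = 400, 20, 4, 3
G = rng.binomial(2, 0.3, size=(n, p)).astype(float)
y = 0.6 * G[:, 5] + rng.normal(size=n)

folds = np.array_split(rng.permutation(n), n_folds)
results = []
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])

    # Selection step, training subjects only. A marginal-correlation filter is
    # used here as a stand-in for LASSO (an assumption of this sketch).
    r = np.array([abs(np.corrcoef(G[train_idx, j], y[train_idx])[0, 1])
                  for j in range(p)])
    selected = np.sort(np.argsort(r)[-n_keep:])

    # Testing step, held-out subjects only, so selection cannot inflate the test.
    ones = np.ones((len(test_idx), 1))
    full = np.column_stack([ones, G[np.ix_(test_idx, selected)]])
    f_stat, df1, df2 = partial_f(y[test_idx], ones, full)
    results.append({
        "selected": selected.tolist(),
        "f_stat": f_stat,
        "df2": df2,
        "overlap": len(set(train_idx.tolist()) & set(test_idx.tolist())),
    })

for res in results:
    print(res["selected"], round(res["f_stat"], 2), res["df2"])
```

Each test here uses only a quarter of the subjects, so the denominator degrees of freedom drop sharply. That illustrates why such a scheme costs power at the sample sizes typical of imaging genetics.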
Our work has several possible future directions, biologically and methodologically. Further investigation is needed to clarify the roles of the genes we identified. We did create voxelwise maps for the top genes, but one could also use a more computationally demanding imaging GWAS approach by re-running the gene-centric LASSO at each voxel in the brain (Stein et al., 2010b; Vounou et al., 2010; Hibar et al., 2011a), instead of running it on summary measures derived from the images. Sparse coding, used here to reduce the dimensionality of the genomic data, could also be used to zero in on the most promising voxels in the images, leading to a set of phenotypes in the images that show greatest association. Vounou et al. (2010, 2012), in particular, have proposed a general “reduced rank” method that distils a set of genes and brain measures from regions of interest into a more manageable set for assessing associations. Other approaches for dimension reduction, within both the image and the genome, involve variants of independent components analysis (Liu et al., 2009). In a recent advance, Chiang et al. (2011b) proposed to use genetic correlations to identify pairs of voxels in an image with common genetic determination, rather than simply phenotypic correlation. This could be more promising in principle than using phenotypic covariance, as it seems voxel sets are influenced by common (partially overlapping) sets of genes. By clustering these voxels into regions of interest, Chiang et al. (2011b) were able to boost power to detect genome-wide associations in a large DTI study. Clearly, the promise of multivariate methods for imaging genomics is high. Several variants of linear regression, penalized regression, and machine learning are now being adapted to handle images, with the main goal of boosting power and reducing the very large samples typically considered necessary for replicable findings in genetics (The ENIGMA Consortium, 2011).
In addition to GWAS, an alternative, more hypothesis-driven approach is to use candidate gene studies to examine the influence of genetic variants on brain structure. These have recently been successful in implicating genes as associated with brain white matter integrity measures derived from diffusion tensor imaging (e.g., CLU, Braskie et al., 2011; BDNF, Chiang et al., 2011a; HFE, Jahanshad et al., 2012). Furthermore, it will be interesting to study interactions between genes discovered through GWAS by considering the overall pathways or regulatory networks in which they act (Inkster et al., 2010; Potkin et al., 2010). Another type of genetic information, not considered here, is rare variants on the genome (Schork et al., 2009), or copy number variants, which may also be relevant in the determination of brain structure. Ongoing imaging studies are beginning to include proteomic and gene expression data, as well. Such studies may begin to integrate genetic information from different sources to probe the mechanisms of brain pathology and identify means to intervene and resist it.