Imaging genetics is an emerging transdisciplinary research field in which the associations between genetic variations and imaging measures, treated as quantitative traits (QTs) or continuous phenotypes, are evaluated. Compared to case–control status, QTs offer increased statistical power and are closer to the underlying biological etiology of the disease, making it easier to identify the underlying genes (
Braskie et al., 2011;
Potkin et al., 2009;
Shen et al., 2010;
Stein et al., 2010;
Yip and Lange, 2011;
Zhan et al., 2011). Genome-wide association studies (GWAS) have been increasingly performed to correlate high-throughput single nucleotide polymorphism (SNP) data to large-scale image data. While many studies employed a hypothesis-driven approach by substantially reducing one or both data types (
Glahn et al., 2007), some recent studies examined these associations at the whole-genome, entire-brain level (
Shen et al., 2010;
Stein et al., 2010). Pairwise univariate analysis was typically used in traditional association studies to quickly provide important association information between SNPs and QTs. However, it treated the SNPs and the QTs as independent and isolated units, and therefore the underlying interacting relationships between the units might be lost. Multivariate methods to examine the joint effect of multi-locus genotypes on a single phenotype were studied in general genetic association studies (
Ballard et al., 2010;
Wu et al., 2010) as well as several recent imaging genetic studies (
Bralten et al., 2011;
Hibar et al., 2011). This paradigm did not consider the relationships between interlinked brain phenotypes and thus still had limited power in revealing complex imaging genetic associations. In this work, taking into account the interrelated structure within and between SNPs and QTs, we propose a new framework for effectively identifying quantitative trait loci, which addresses the following challenges in imaging genetics association studies.
First, traditional association studies consider all the SNPs evenly distributed and assess each SNP individually. However, certain SNPs are naturally connected via different pathways. Multiple SNPs from one gene often jointly carry out genetic functionalities. Moreover, linkage disequilibrium (LD) (
Barrett et al., 2005) describes the non-random association between alleles at different loci, through which the SNPs in high LD are linked together in meiosis. Thus, instead of treating SNPs in an isolated manner, it would be beneficial to exploit the group structure among SNPs.
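As an illustration of how such group structure might be derived from LD, the sketch below estimates pairwise LD as the squared Pearson correlation of genotype dosages (coded 0/1/2), a common proxy for r² on unphased genotypes, and then greedily groups adjacent SNPs. The function names, the adjacency-based grouping rule, and the 0.8 threshold are assumptions made for this example; they are not the grouping procedure used in this work.

```python
import math

def pearson_r2(a, b):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return (cov * cov) / (var_a * var_b)

def ld_groups(genotypes, threshold=0.8):
    """Greedy grouping (hypothetical rule): a SNP joins the current group
    when its r^2 with the preceding SNP reaches the threshold."""
    if not genotypes:
        return []
    groups = [[0]]
    for j in range(1, len(genotypes)):
        if pearson_r2(genotypes[j - 1], genotypes[j]) >= threshold:
            groups[-1].append(j)
        else:
            groups.append([j])
    return groups

# Toy data: three SNPs genotyped in six subjects (dosage coding 0/1/2).
snps = [
    [0, 1, 2, 1, 0, 2],
    [0, 1, 2, 1, 0, 2],  # identical to SNP 0, so in perfect LD with it
    [2, 0, 1, 0, 2, 1],  # weakly correlated with SNP 1
]
groups = ld_groups(snps)
```

On this toy input the first two SNPs fall into one group and the third forms its own, giving the kind of SNP partition that a group-wise penalty can then act on.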
Second, because the functionality of the human brain typically involves more than one cerebral component, investigating each individual regional brain phenotype separately will inevitably lose the interacting relationships between them. For example, the brain's episodic memory network, including medial temporal lobe (MTL) structures, medial and lateral parietal, and prefrontal cortical areas, is normally engaged as a whole during episodic recall (
Walhovd et al., 2010). In addition, accurate prediction of disease status and progression typically involves multiple brain regions coupled with other biomarkers (
Hinrichs et al., 2011;
Zhang et al., 2011). Therefore, jointly analyzing all the imaging phenotypes via one single integral regression model is desirable to elucidate the shared mechanism that may be hidden otherwise.
By recognizing the interrelated nature of these genotypes and phenotypes, in this study, we propose a novel Group-Sparse Multitask Regression and Feature Selection (G-SMuRFS) method to identify quantitative trait loci in a mild cognitive impairment (MCI) and Alzheimer's disease (AD) study using a few important imaging QTs relevant to AD. We consider each SNP as a feature and each QT as a response variable (i.e. a learning task), and formulate a multitask regression framework including multiple features (SNPs) and multiple responses (QTs). Our goal is to reveal the relationships between these genetic features and imaging phenotypes.
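The multitask setup described above can be written compactly in matrix form. The symbols used here ($n$ subjects, $d$ SNPs, $c$ QTs, the matrices $X$, $Y$, $W$, and a generic regularizer $\Omega$) are notation assumed for illustration only and may differ from the notation used elsewhere in the paper.

```latex
X \in \mathbb{R}^{n \times d}, \qquad
Y \in \mathbb{R}^{n \times c}, \qquad
W \in \mathbb{R}^{d \times c}

\min_{W} \; \lVert X W - Y \rVert_F^2 + \Omega(W)
```

Under this convention, row $i$ of $W$ collects the coefficients of SNP $i$ across all QTs, and column $j$ of $W$ is the regression model for QT $j$. A regularizer that acts on whole rows, or on blocks of rows, is what ties the $c$ learning tasks together.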
The proposed model consists of three major components. First, it is built upon regression analysis because the imaging phenotypes are continuous responses. As a result, the regression coefficients assess the relationships between SNPs and QTs. Second, in order to address the group-wise association among SNPs, inspired by group Lasso (
Yuan and Lin, 2006), we propose a new form of regularization, called the group $\ell_{2,1}$-norm ($G_{2,1}$-norm), in which the coefficients of the SNPs within a pre-defined group, with respect to all the QTs, are penalized as a whole via the $\ell_2$-norm, while the $\ell_1$-norm is used to sum up the group-wise penalties to enforce sparsity between groups (
Tibshirani, 1996). The latter is important because in reality only a small fraction of genotypes are related to a specific phenotype. Moreover, with sparsity, outliers and irrelevant associations are inherently removed. Lastly, through enforcing $\ell_{2,1}$-norm regularization, feature selection becomes an integrated procedure across multiple learning tasks (
Argyriou et al., 2007;
Obozinski et al., 2006), such that the interrelationships among different imaging phenotypes are leveraged. Note that the proposed $G_{2,1}$-norm and the enforced $\ell_{2,1}$-norm couple a set of learning tasks together such that the regression analysis can be carried out jointly across all the QTs, whereas Lasso (
Tibshirani, 1996) and group Lasso (
Yuan and Lin, 2006) perform regression analysis separately, one task at a time.
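To make the two penalties concrete, the following minimal sketch evaluates them on a small coefficient matrix and combines them with a squared-error loss. It assumes a coefficient matrix `W` with one row per SNP and one column per QT, a data layout with one row per subject, and trade-off weights `gamma1` and `gamma2`; these names, the layout, and the toy values are illustrative assumptions. The sketch only evaluates the objective and does not include an optimization procedure.

```python
import math

def l21_norm(W):
    """l2,1-norm: l2-norm of each SNP's row across all QTs, summed over SNPs."""
    return sum(math.sqrt(sum(w * w for w in row)) for row in W)

def group_l21_norm(W, groups):
    """Group l2,1-norm: l2-norm over all coefficients of the SNPs in a group
    (across all QTs), summed over groups."""
    return sum(
        math.sqrt(sum(w * w for i in idx for w in W[i]))
        for idx in groups
    )

def squared_loss(W, X, Y):
    """Sum of squared residuals of X @ W against Y (X: n x d, Y: n x c)."""
    loss = 0.0
    for x, y in zip(X, Y):
        for j, y_j in enumerate(y):
            pred = sum(x_i * W[i][j] for i, x_i in enumerate(x))
            loss += (pred - y_j) ** 2
    return loss

def objective(W, X, Y, groups, gamma1, gamma2):
    """Loss plus both penalties; gamma1 and gamma2 are assumed trade-off weights."""
    return (squared_loss(W, X, Y)
            + gamma1 * group_l21_norm(W, groups)
            + gamma2 * l21_norm(W))

# Toy example: 4 SNPs in 2 hypothetical groups, 2 QTs.
W = [[3.0, 4.0],
     [0.0, 0.0],
     [0.0, 2.0],
     [1.0, 0.0]]
groups = [[0, 1], [2, 3]]

# Four subjects with identity genotypes and Y chosen so the fit is exact.
X = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
Y = [row[:] for row in W]

value = objective(W, X, Y, groups, gamma1=0.5, gamma2=0.1)
```

Because each penalty takes the $\ell_2$-norm across all QT columns before summing, a SNP (or a whole group of SNPs) is kept or discarded for all tasks at once, which is the coupling described above. Applying Lasso or group Lasso to each column separately would lack this coupling.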
We apply the proposed G-SMuRFS method to the Alzheimer's disease neuroimaging initiative (ADNI) cohort (
Weiner et al., 2010) for identifying quantitative trait loci (QTLs) in MCI and AD using a set of imaging phenotypes known to be relevant to AD. Our empirical results yield not only clearly improved prediction performance in all test cases, but also a compact set of SNP predictors relevant to the imaging phenotypes that are in accordance with prior studies.