|Home | About | Journals | Submit | Contact Us | Français|
In recent years, the advent of great technological advances has produced a wealth of very high-dimensional data, and combining high-dimensional information from multiple sources is becoming increasingly important in an extending range of scientific disciplines. Partial Least Squares Correlation (PLSC) is a frequently used method for multivariate multimodal data integration. It is, however, computationally expensive in applications involving large numbers of variables, as required, for example, in genetic neuroimaging. To handle high-dimensional problems, dimension reduction might be implemented as pre-processing step. We propose a new approach that incorporates Random Projection (RP) for dimensionality reduction into PLSC to efficiently solve high-dimensional multimodal problems like genotype-phenotype associations. We name our new method PLSC-RP. Using simulated and experimental data sets containing whole genome SNP measures as genotypes and whole brain neuroimaging measures as phenotypes, we demonstrate that PLSC-RP is drastically faster than traditional PLSC while providing statistically equivalent results. We also provide evidence that dimensionality reduction using RP is data type independent. Therefore, PLSC-RP opens up a wide range of possible applications. It can be used for any integrative analysis that combines information from multiple sources.
The majority of human neurological and psychiatric disorders are substantially heritable (Plomin et al., 1994; Meyer-Lindenberg and Weinberger, 2006; Bigos and Weinberger, 2010; Ge et al., 2013). Since these illnesses represent an actual problem of public health, it is vitally important to understand the underlying genetic mechanisms. Substantial progress has been achieved in recent years with the emergence of genome-wide association (GWA) studies (Haines et al., 2005). Typically, these studies focus on single-nucleotide polymorphisms (SNPs), the most common type of human genetic variation (Wang et al., 1998; Crawford and Nickerson, 2005). However, the identification of new SNPs in GWA studies does not necessarily reveal the variations that also contribute to human diseases. The discovery of the biological function that arises from these DNA sequence variations requires the investigation of the complex relationship between genotype and phenotype information (Pevsner, 2009; 1000 Genomes Project Consortium et al., 2010; Baker, 2012), which is known as the intermediate phenotype concept (Gottesman and Shields, 1967; Gottesman and Gould, 2003). For a number of human neurological and psychiatric disorders, in particular, alterations in brain anatomy, function and connectivity have been shown to be highly heritable and reliably correlated with the disease (Jansen et al., 2015). Consequently, measures derived from in-vivo anatomical or functional neuroimaging were increasingly introduced as intermediate phenotypes for genetic association analyses.
The statistical analysis of the relationship between the SNP and the neuroimaging measures requires the solution of high-dimensional association problems. Since both data sources naturally involve large numbers of variables, computationally efficient analysis frameworks are pivotal. Multivariate statistical techniques have commonly been used in this field, since they are able to combine the information from multiple markers and multiple sources simultaneously into the analysis. However, the higher the dimensionality, the more challenging is the analysis from a statistical and computational point of view. This relates to a phenomenon known as the curse of dimensionality, which states that obtaining a statistically reliable result requires the sample size to grow exponentially with the dimension (Bellman, 1957, 1960). In addition, computation times become excessively long with increasing data dimensionality, posing a serious practical problem for many applications.
One approach for mitigating high-dimensional data analysis is dimensionality reduction. Various dimensionality reduction techniques have been proposed for pre-processing in genetic neuroimaging. In a whole-genome setup including 85,772 SNPs and 34 brain locations of interest, Le Floch et al. (2012) used univariate filters with different thresholds in order to identify a subset of SNPs that were considerably correlated to the neuroimaging data. The reduced set of SNPs was further searched for association with the neuroimaging data using two multivariate strategies, penalized Partial Least Squares regression (Wold, 1975) and regularized Kernel Canonical Correlation Analysis (Hotelling, 1936). The authors showed that a relatively large number of SNPs was needed after filtering in order to comprise all true positives. However, to avoid over-fitting, irrelevant SNPs had to be filtered out although the authors did not define a clear threshold for the filters. In addition to univariate filtering, Le Floch et al. (2012) applied Principal component analysis (PCA) (Pearson, 1901) for dimensionality reduction. Specifically, they performed PCA on both the neuroimaging and the SNP data set and kept as many components as necessary to explain 99% of the variance in each modality. However, all methods based on PCA failed to identify generalizable associations.
PCA-based dimensionality reduction was also conducted in a study by Hibar et al. (2011a,b). The authors searched for associations between 448,293 genome-wide SNPs and 31,662 whole-brain voxels in a large sample of 731 subjects from the Alzheimer's disease neuroimaging initiative (ADNI). To reduce the total number of tests, SNPs were grouped into 18,044 genes based on gene membership. Principal component regression was then implemented to search for the combined effect of multiple SNPs on the brain. Hibar et al. (2011a,b) named their technique voxel-wise gene-wide association study (vGeneWAS). However, no genes identified were significant after correction for multiple testing.
Using the same data set of genome-wide SNPs and whole-brain neuroimaging voxels from the ADNI, in a recent study, Hua et al. (2015) performed dimensionality reduction by selecting 119 brain regions of interest based on an anatomical brain atlas. Distance covariance (Székely et al., 2007) was applied to infer the relationship between the single SNP predictors from the entire genome and the average voxel values at the 119 brain regions, utilized as multivariate response. In order to overcome the multiple testing problem, Hua et al. (2015) also introduced a local false discovery rate (FDR) modeling algorithm. The authors showed that using their method, they were able to find 23,128 significant SNPs at α-level 0.05, while simple linear regression yielded no significant SNPs.
As previous studies have shown, dimensionality reduction using univariate filters or PCA solves critical over-fitting issues in genetic neuroimaging applications (Le Floch et al., 2012). However, both strategies present major limitations. As discussed by Le Floch et al. (2012), univariate filters are not the optimal choice, since they can not account for interdependencies between variables. The genetic and the neuroimaging variants are, however, naturally highly collinear. PCA has two major limitations. First, PCA explicitly depends on the input data, since it uses an optimization criterion to transform the original variables into a set of orthogonal principal components (Goel et al., 2005; Sulić et al., 2010). Second and most importantly, it is computationally expensive, since its runtime is quadratic in the number of dimensions (Menon, 2007). Thus, in genetic neuroimaging, where the data sets may capture the whole brain and the whole genome, respectively, PCA is infeasible in case there is no powerful compute server available in the lab.
A dimensionality reduction technique that is computationally efficient is Random projection (RP) (Johnson and Lindenstrauss, 1984). RP uses a random matrix with unit Euclidean column norms to find a lower-dimensional subspace that approximately preserves the distances between all pairs of data points in the original space (Kaski, 1998; Dasgupta, 2000; Bingham and Mannila, 2001; Lin and Gunopulos, 2003; Vempala, 2004). In a number of data mining and biological studies (Papadimitriou et al., 1998; Bingham and Mannila, 2001; Goel et al., 2005; Sulić et al., 2010; Liu and Fieguth, 2012; Palmer et al., 2015), RP has been shown to provide good results. Several studies also compared RP and PCA and showed that their overall performance was comparably similar, while RP had much lower computational requirements (e.g., Bingham and Mannila, 2001; Goel et al., 2005).
In this study, we propose a new approach that uses RP for dimensionality reduction to efficiently assess the relationship between genetic variation and brain imaging measures as an example for high-dimensional genotype-phenotype association problems. In particular, we consider Partial Least Squares Correlation (PLSC) (McIntosh et al., 1996; Tucker, 1958), which has been shown to be most appropriate for combined analysis of SNP and neuroimaging data in a systematic comparison of multivariate techniques (Grellmann et al., 2015). A key limitation of current PLSC implementations is that they are computationally expensive in handling large numbers of variables. We therefore incorporate RP into PLSC, and name our new method PLSC-RP. Using both simulated and experimental data sets containing whole genome SNPs and whole brain neuroimaging measures, we demonstrate that PLSC-RP provides statistically equivalent results to traditional PLSC. However, using PLSC-RP, runtime is reduced from hours to seconds. We also provide evidence that dimensionality reduction using RP is data type independent and can thus be applied to both continuous and count data. To establish RP for multivariate analysis of high-dimensional genotype-phenotype-associations is, to our knowledge, a strategy never used before. Importantly, compared to previous applications of RP in data mining and biological studies (Papadimitriou et al., 1998; Bingham and Mannila, 2001; Goel et al., 2005; Sulić et al., 2010; Liu and Fieguth, 2012; Palmer et al., 2015), we had to address two additional problems. First, in order to be able to evaluate the contribution of a single genotypic or phenotypic variable, the multivariate associations detected by PLSC-RP must be interpretable in the original spaces after dimensionality reduction. In contrast, previous applications exploited RP such that learning, classification or clustering were carried out directly in the compressed domain (e.g., Liu and Fieguth, 2012; Palmer et al., 2015). Second, in genotype-phenotype association studies, both the genetic and the phenotypic data, are naturally high-dimensional. Thus, RP needs to be applied for dimensionality reduction in both domains. The PLSC-RP method we introduce in this study is favorable since it fulfills both requirements: the back-transformation of results after dimensionality reduction is straightforward, and it can be applied to reduce the number of both the genetic and phenotypic variables.
In this study, we explicitly applied PLSC-RP for efficient assessment of genome-wide and whole-brain relations as an example for high-dimensional association problems. However, the application of PLSC-RP is not limited to the combined analysis of brain imaging and genotype data. It might be considered for fusion of multimodal biological assays such as genomic, transcriptomic and proteomic data, for fusion of multimodal brain imaging techniques or, in epidemiological research, for fusion of environmental factors and measures characterizing health status. Since PLSC-RP depends on sample size only and is independent of the number of variables, it is especially attractive for large-scale multicenter association studies or other data sharing projects.
Random Projection (RP) is a dimensionality reduction technique, which uses a random matrix with unit Euclidean column norms to find a lower-dimensional subspace that approximately preserves the distances between all pairs of data points in the original space (Kaski, 1998; Dasgupta, 2000; Bingham and Mannila, 2001; Lin and Gunopulos, 2003; Vempala, 2004). The concept of RP is as follows: Given a data matrix X N×d, where N is the total number of points and d is the original dimension, RP transforms X to a lower dimensional space via the transformation:
where Rd×k is a random matrix with unit Euclidean column norms and is the low-dimensional subspace with desired lower dimension k. The basic idea for RP is derived from the Johnson-Lindenstrauss lemma (Johnson and Lindenstrauss, 1984), which states that a set of N points in a high-dimensional space can be mapped onto a dimensional subspace such that the distances between the points are approximately preserved, i.e., not distorted more than by a factor of 1±ϵ, for any ϵ>0. Note that the dimensionality reduction according to Johnson and Lindenstrauss (1984) depends on the number of points N only, since k is logarithmic in N and independent of the original dimension d. A proof of this lemma was provided by several authors (Frankl and Maehara, 1988; Indyk and Motwani, 1998; Dasgupta and Gupta, 1999).
PLSC (McIntosh et al., 1996), first introduced as Tucker Inter-battery Analysis (Tucker, 1958), is a correlation technique that analyzes the association between two sets of variables and . In our application, Matrix X1 collects in each column the phenotype measures, e.g., the brain activity at each voxel in the brain. Matrix X2 stores the genotype measures in each column, e.g., the number of minor alleles for a given SNP. The number of rows corresponds to the sample size.
To model the relationship between X1 and X2, which are both standardized column-wise, PLSC successively builds orthogonal linear combinations of the observed variables of each set (so-called latent variables), such that the covariance between the pair of latent variables is maximized,
where i = 1, …, p, p = min(d1, d2). The relationship between the columns of X1 and X2 is stored in a cross-product matrix . Singular Value Decomposition (SVD) is used to decompose A into three matrices,
where p = min(d1, d2). The coefficients of the PLSC, W1 = [w11, …, w1p] and W2 = [w21, …, w2p], equal the matrices of left and right singular vectors U and V. They describe the contribution of each variable in X1 and X2 to the construction of the linear combinations Z1 = X1W1 and Z2 = X2W2, so-called latent variables or scores. The singular values si, i = 1, …, p, in the SVD provide the covariance between the latent variables.
Assume that and are two data sets, where N is the sample size and d1 and d2 are the number of variables in X1 and X2, respectively. Both X1 and X2 are column-wise standardized. When the number of columns in X1 and X2 is high, PLSC will be computationally expensive. To address the association between X1 and X2 using a computationally more efficient approach, we apply RP for dimensionality reduction prior to PLSC. We denote the PLSC analysis after dimensionality reduction using RP as PLSC-RP analysis. RP can be used to reduce the number of variables in X1 (i.e., the phenotype measures), to reduce the number of variables in X2 (i.e., the genotype measures), or both. Therefore, we multiply the high-dimensional matrices X1 and/or X2 with orthonormal random matrices R1 and/or R2. We generate the elements of the random matrices according to the following algorithm (Dasgupta, 2000):
Orthogonalization is important to preserve distances between the original points in the low-dimensional space (Kaski, 1998). Unfortunately, enforcing the random matrices to be orthogonal requires the Gram–Schmidt algorithm, which is computationally expensive (Sulić et al., 2010). However, it has been shown that in high-dimensional spaces, there exists a much larger number of nearly orthogonal than truly orthogonal vectors (Hecht-Nielsen, 1994). Thus, high-dimensional random matrices might be sufficiently close to orthogonal (Lin and Gunopulos, 2003) and the orthogonalization step might be avoided without affecting the distance preserving properties of Random Projections. Note that two simpler algorithms for generating sparse random matrices have been proposed (Achlioptas, 2001). Multiplying the input matrices with sparse rather than dense random matrices increases multiplication efficiency. For proof of concept of PLSC-RP, however, we decided to generate the elements of the random matrices by a standard normal distribution (Dasgupta, 2000). This way, we avoided potential inaccuracies in our results due to sparsity, which we later cannot evaluate when comparing PLSC-RP to its state-of-the-art counterpart PLSC. We modified the algorithm by Dasgupta (2000) by omitting the orthogonalization using the Gram-Schmidt algorithm, since in our application random matrices were consistently high-dimensional.
Assume that X1 is high-dimensional. RP transforms X1 to a lower dimensional space via the following transformation:
where is a random matrix and is the low-dimensional subspace of X1 with desired lower dimension k. We determine the dimensionality of the low-dimensional subspace matrix X1RP according to the Menon theorem (Menon, 2007), which guarantees the preservation of pairwise distances with probability of at least 1−N−β if for
For our application, we selected ϵ = 1.0 and 1−N−β = 0.95 (yielding β = 0.6505). Note that for the accuracy of the distance preservation ϵ also values or are possible. However, for , the Menon bound adapts best to the lowest reduced dimension of the projection. Accordingly, the term PLSC-RP denotes the process of building latent variables z1RPi = X1RP·w1RPi and z2i = X2·w2i, such that the covariance between the pair of latent variables z1RPi and z2 is maximized,
where i = 1, …, p, p = min(k, d2).
According to Equation (6), we obtain the weights W2 for data set X2, which are approximations for the weights computed using traditional PLSC, since X1RP is only a compressed representation of X1. However, for the reduced data set, we get weights W1RP of the low-dimensional subspace X1RP. To evaluate the contribution of each single variable in X1, we transform the weights W1RP back to the original space, that is W1. Since the weights W2 obtained by SVD on the cross-product matrix are approximations for the weights obtained by performing SVD on , we obtain the original weights W1 by inserting the weights W2 into the equation for the SVD and rearranging it as follows:
where is the cross-product matrix, si is the singular value of component i, i = 1, …, p, p = min(k, d2), when performing SVD based on , and component i is the component explaining the largest proportion of summed squared cross-block correlations (Bookstein, 1994; McIntosh et al., 1996). The derivation of Equation (7) is embodied in the Supplementary Equations (1)–(3). Applying RP to reduce the number of variables in X2 follows the same logic.
PLSC-RP will be significantly faster than traditional PLSC, since it operates on a matrix with a much smaller number of columns ( rather than with kd1). Importantly, permutation testing (Edgington, 1980) can also be performed on the low-dimensional space.
Assume that both X1 and X2 are high-dimensional. RP transforms X1 and X2 to lower dimensional spaces via the following transformation:
We determine the dimensionality of X1RP and X2RP according to Menon (2007). Note that if X1 and X2 have the same number of variables, it is sufficient to generate a joint random matrix R for transformation. In general, however, the number of variables differs between X1 and X2.
Subsequently, we apply PLSC in order to successively build latent variables z1RPi = X1RP · w1RPi and z2RPi = X2RP · w2RPi, such that the covariance between the pair of latent variables z1RPi and z2RPi is maximized, i = 1, …, p, p = min(k1, k2). Since the dimensionality in X1 and X2 is reduced by RP, we obtain weights W1RP and W2RP for the low dimensional subspaces X1RP and X2RP. To evaluate the contribution of each single variable in X1 and X2, we transformed the weights W1RP and W2RP back to the original space, that is W1 and W2, as follows:
where si is the singular value of component i, i = 1, …, p, p = min(k1, k2), when performing SVD based on , and component i is the component explaining the largest proportion of summed squared cross-block correlations (Bookstein, 1994; McIntosh et al., 1996). The derivation of Equation (10) is again elucidated in the Supplementary Material (Supplementary Equations 4–6).
We evaluate the performance of PLSC-RP by comparing its weight profiles to the weight profiles of PLSC applied to the same original data set. To measure similarity, we consider three similarity measures: Pearson correlation (Anderberg, 1973), the cosine measure (Anderberg, 1973) and the extended Jaccard similarity (Strehl and Ghosh, 2000). In addition, we used ANOVA (Chambers et al., 1992) for comparison. The ANOVA model represents a perfect linear relationship between the weights of PLCS and PLSC-RP, in case the intercept equals 0 and the slope equals 1.
Permutation testing is used to assess the significance of the covariance between the pair of latent variables
respectively. For this purpose, observations, i.e., rows of input matrices, are randomly reassigned without replacement and PLSC (PLSC-RP, respectively) are recalculated. At each permutation, the statistic (i.e., the covariance of latent variables) is then compared to the statistic obtained on the original data with probability value equal to the number of times the statistic of permuted data exceeds the original value.
First we illustrate the PLSC-RP methodology for association analysis of two data sets, where one contains high-dimensional brain imaging variables as phenotypes. As an example, we used functional magnetic resonance imaging (fMRI) data, which has been shown to be highly heritable and reliably correlated with a number of human neurological and psychiatric diseases (Jansen et al., 2015). Functional MRI is a brain imaging technique for measuring neural activity based on changes in blood oxygenation and blood flow. The brain activity is accessed in volume pixel elements, called voxels, of the three-dimensional functional magnetic resonance image. Specifically, the activity in a voxel is defined as how closely the time-course of the signal from that voxel matches the expected time-course of activation, which is determined by the experimental design (Barad et al., 2009).
As simulation example, we generated fMRI data of increasing dimensionality (1000, 10,000, 20,000, 30,000, 40,000, 50,000, 70,000, and 90,000 voxels) using multivariate normal distribution with mean and covariance parameters estimated from experimental fMRI contrast images (Le Floch et al., 2012). We searched for associations with 50 candidate SNPs simulated using the gs algorithm (Li and Chen, 2008) based on phase III HapMap data (The HapMap Consortium, 2003). SNPs were recoded using the additive genetic model, counting the number of minor alleles per person. The sample size was chosen to be 100 for all data matrices. For each simulated dimensionality of the fMRI data set, we induced a linear relationship between one randomly selected voxel and three SNPs, such that the pairwise correlation between that voxel, voxels in collinearity with the selected voxel and the selected SNPs was on average 0.3. This is in line with association strengths reported in other studies (Filippini et al., 2009; Potkin et al., 2009; Ousdal et al., 2012).
Using our simulated data, we first applied PLSC to calculate voxel and SNP weights that maximize the covariance between the two data sets. We computed as many components as necessary to explain at least 80% of variance and defined the causal component, that is the component comprising the linear relationship between selected voxels and SNPs, by the out-of-sample covariance, estimated using 10-fold cross-validation (Le Floch et al., 2012). To assess the significance of the covariance and to quantify the reliability of weights, we used permutation testing (Edgington, 1980) and bootstrapping (Tibshirani and Efron, 1993), respectively.
Next, we applied PLSC-RP in order to optimize runtime. As we searched for linear relations between high-dimensional fMRI measures and a comparably small number of SNPs, we applied RP to reduce the number of voxels and kept the raw SNP matrix. To determine the dimensionality of the low-dimensional fMRI subspace matrix, we adopted the Menon theorem (Menon, 2007) and reduced the number of voxels to 100 dimensions (Menon lower bound equaled k = 97.6486) for all considered dimensionalities of the fMRI data set. After back-transformation, we compared the weight profiles of PLSC-RP to the weight profiles obtained using traditional PLSC on the same original data set.
In order to verify our findings on simulated data, we considered an experimental brain imaging genetics data set that has been published previously (Ousdal et al., 2012) to test the hypothesis that monoamines are important modulators of amygdala activity in the brain (LeDoux, 2007). Therefore, the authors combined whole-genome microarray SNPs with fMRI data collected during emotional face-matching task, in which the participants (healthy controls and patients with diagnosis of schizophrenia, bipolar disorder or other psychosis) were presented with two stimuli (human faces expressing anger of fear) that had to be matched to a target stimulus (Hariri et al., 2002; Carre et al., 2010). Genotyping was done using an array-based whole-genome assay and SNPs were recoded using an additive genetic model. The original analysis was performed using SPM2 (Friston et al., 2007) following standard pre-processing pipelines for fMRI data and controlling for diagnosis type. Gender and age were not significantly different across subject groups. To search for genome-wide SNPs modulating amygdala activity, individual contrast values for the right and left amygdala peak voxel were tested for association with each SNP separately. The authors reported a significant association between activation of the left amygdala peak voxel and three SNPs in high linkage disequilibrium, namely rs10014254 (P = 4.16 × 10−8), rs11722038 (P = 4.20 × 10−8) and rs17529323 (P = 4.66 × 10−8). A significant interaction between SNP and diagnosis type was not reported (P = 0.28). A more detailed description of recruitment, experimental task, fMRI data acquisition, genotyping, quality control and statistical analysis is provided in the original publication (Ousdal et al., 2012).
To demonstrate the PLSC-RP methodology, we used updated fMRI images that have been preprocessed using FSL software (Smith et al., 2004) and genotype information on five SNPs (rs10014254, rs11722038, rs17529323, rs382013, and rs437633). Instead of focusing on amygdala as region of interest as in the original study, we considered whole-brain measures in order to detect further brain regions involved in the face-matching task. After removing missing data for all SNPs, we were left with 208 subjects. Statistical analysis was performed as follows. At first we corrected for diagnosis type as in the original publication. Then, we applied PLSC considering whole-brain voxels. Finally, we used our new method PLSC-RP, whereby we reduced the number of voxels and retained raw SNPs. Specifically, we reduced the dimensionality of the fMRI matrix to 208 dimensions, similar to sample size, to match our simulation application above (dimensionality of the phenotype matrix was reduced to 100 dimensions according to Menon (2007), equal to N = 100). We compared the results of PLSC-RP to the results obtained using traditional PLSC. The findings from the original publication (Ousdal et al., 2012) served as a reference.
In Section 2.8.1, we showed that PLSC-RP is remarkably faster than PLSC when applied to high-dimensional phenotype measures. As phenotype measures, we considered brain imaging data, which is scaled metrically. In many applications, however, high-dimensional data is considered that is not continuous. To promote a wider application of PLSC-RP, we considered a second data set, containing genome-wide SNPs as genotypes together with candidate phenotype measures. SNP information statistically represents count data, since SNPs were recoded by counting the number of minor alleles per person.
To illustrate PLSC-RP for count data, we used a data set that has been published previously (Breitfeld et al., 2013). For the original study, participants were recruited from the Sorbs population. The phenotype inventory consisted, among others, of anthropometric data (weight, height, waist-to-hip ratio) and of serum vaspin (Silverman et al., 2001; Hida et al., 2005) measures extracted from blood. SNP genotyping was performed using an array-based whole-genome assay. GWA with serum vaspin was assessed by linear regression in PLINK (Purcell et al., 2007), correcting for age, gender, and BMI. The authors reported a significant association between serum vaspin concentration and six SNPs on chromosome 14, rs11160190 (P = 2.4 × 10−15), rs6575436 (P = 2.1 × 10−8), rs4905203 (P = 2.2 × 10−10), rs1956713 (P = 1.2 × 10−9), rs1956721 (P = 3.6 × 10−9) and rs11621467 (P = 9.2 × 10−10). A more detailed description of participants, phenotyping, genotyping and statistical analysis is provided in the original publication (Breitfeld et al., 2013).
In contrast to the data sets described in Section 2.8.1 (high-dimensional fMRI images), for the current data set, we considered only two experimental phenotypes (serum vaspin concentration and body height), while the genotype data contained whole-genome SNP information (359,845 SNPs for 865 individuals after quality control and removal of missing data). Hence, we applied RP to reduce the number of SNPs and kept the raw phenotype matrix. We adopted our analysis strategy from Section 2.8.1. For PLSC-RP, we reduced the dimensionality of the SNP matrix to 865 dimensions, similar to sample size. PLSC considering all 359,845 SNPs served as a reference.
In Section 2.8.1 and 2.8.2, we illustrated how PLSC-RP performs when either one of the two data sets searched for association is high-dimensional. However, PLSC-RP would be of universal application if it could address both at the same time. Therefore, we generated simulation data, containing high-dimensional fMRI measures as phenotypes and high-dimensional SNPs as genotypes, following our procedure of the previous sections. We generated fMRI and SNP data of six different dimensionality combinations.
The sample size was chosen to be 100 for both data matrices and all combinations of simulated data. Following our analysis strategy described in the previous sections, we performed PLSC on the raw data matrices, and we applied PLSC-RP to optimize runtime. We used RP to reduce the number of voxels and SNPs to 100 dimensions according to the Menon theorem (Menon, 2007) for all considered dimensionality combinations.
First we compared the results of traditional PLSC and PLSC-RP on simulated brain imaging data of increasing dimensionality and candidate SNPs. Specifically, we compared their results with respect to the causal component, the significance of the covariance between latent variables and the runtime (Table (Table1).1). For both methods and for all simulated dimensionalities, causal voxels, and SNPs were represented in the first component, which is associated with the highest covariance, and covariances between latent variables were significant. However, PLSC and PLSC-RP differed considerably in runtime. Importantly, the higher the number of simulated voxels, the more efficient was the dimensionality reduction to 100 dimensions using RP. At maximum, runtime was reduced from 4.2 h to 35.8 s using a standard computer with usual processing power and memory (physical memory: 192 Gi, physical CPUs: two Intel Xeon E5630 CPUs with frequency 2.53GHz and 4 cores).
We further compared PLSC and PLSC-RP with regard to how the individual voxels and SNPs were weighted. A sample illustration of voxel and SNP weights of PLSC for 90,000 simulated voxels and of PLSC-RP, when the dimensionality of the fMRI data set was reduced to 100 dimensions, is provided in Figure Figure1.1. Using both PLSC and PLSC-RP, causal voxels and SNPs, as highlighted in yellow and red, received the highest weights. Average weights for causal and non-causal voxels and SNPs are presented in Supplementary Table 1. It shows that voxel and SNP weights of PLSC and PLSC-RP were highly similar. This was confirmed by three similarity measures (Table (Table2).2). For all simulated dimensionalities of the fMRI data set, we observed a Pearson correlation of sP ≈ 1, a cosine measure of sC ≈ 1 and an extended Jaccard similarity of sJ ≈ 1. In addition, the intercepts of the ANOVA models were approximately equal to 0 and the slopes were approximately equal to 1 (Table (Table2).2). All P-values were smaller than 2 · 10−16.
To verify our findings on simulated data, we compared the results of traditional PLSC and PLSC-RP regarding experimental brain imaging and genetics data. For both PLSC and PLSC-RP, we considered only SNP and voxel weights of the first component, since it already explained a large proportion of variance (72.76 and 69.08% for PLSC and PLSC-RP, respectively). The covariance of latent variables was significant by permutation testing (P = 0.0248 and P = 0.0490 for PLSC and PLSC-RP, respectively). We found that using both, PLSC and PLSC-RP, exactly the same three SNPs (rs10014254, rs11722038 and rs17529323) as reported in the original publication (Ousdal et al., 2012) were reliable. An illustration of voxels related to these SNPs is provided in Figure Figure2.2. We were able to replicate an association with amygdala activity in both hemispheres (Ousdal et al., 2012). Importantly, we found additional brain regions including cerebellum, left hippocampus, left lingual gyrus, right putamen, and left lateral occipital cortex. Voxel weight profiles of PLSC and PLSC-RP were highly similar (Supplementary Table 2). This was confirmed by three similarity measures (Pearson correlation sP = 0.9991, cosine measure sC = 0.9992 and Jaccard similarity sJ = 0.9985) and ANOVA (intercept = 2.14 · 10−5, slope = 0.9948 and P < 2 · 10−16). PLSC-RP was, however, remarkably faster than PLSC, reducing runtime from 2.4 h to 13.4 s.
Next we illustrated the performance of PLSC-RP for association analysis of a data set containing genome-wide SNPs as genotypes together with candidate phenotype measures. In contrast to the brain imaging data in Section 3.1, which was scaled metrically, SNP information statistically represents count data, since SNPs were recoded by counting the number of minor alleles per person.
We observed that both PLSC and PLSC-RP revealed a two component solution. In the first component of the phenotype weight profile, serum vaspin level was highly weighted (|wvaspin| = 0.7068 for both PLSC and PLSC-RP). Body height was most contributing to the second component (|wheight| = 0.9994 for PLSC and |wheight| = 0.9996 for PLSC-RP). Both components were significant by permutation testing (Edgington, 1980) ( and Pheight = 0.02 for both PLSC and PLSC-RP). Since the out-of-sample covariance (Le Floch et al., 2012) was much higher for the first component (covvaspin = 6.69 and covheight = 0.89), we restricted our interpretation to associations with serum vaspin.The overall runtime for PLSC was 36.4 h. PLSC-RP reduced it to 4.8 min.
An illustration of the SNPs, which were associated with serum vaspin concentrations in the Sorbs, is provided in Figure Figure3.3. Using bootstrapping (Tibshirani and Efron, 1993), we showed that exactly the same SNPs on chromosome 14 that were reported in the original publication (Breitfeld et al., 2013) were reliable, including rs11160190, rs6575436, rs4905203, rs1956713, and rs11621467. We did not find an association for rs1956721 because we had to exclude this SNP due to its low call rate. The SNP weight profiles of PLSC and PLSC-RP were highly similar (Pearson correlation sP = 0.9999, cosine measure sC = 0.9999, Jaccard similarity sJ = 0.9999, ANOVA intercept = −1.95 · 10−8, slope = 0.9999993 and P < 2 · 10−16, Supplementary Table 3).
According to our results presented in Section 3.1 and 3.2, PLSC-RP provides statistically the same results as PLSC when we perform dimensionality reduction of either one of the two data sets searched for association. However, PLSC-RP would be of universal application if it could address both at the same time. Therefore, we compared the results of PLSC-RP to the results obtained using traditional PLSC for simulation data containing high-dimensional fMRI measures and high-dimensional SNPs.
As for our simulation results on high-dimensional neuroimaging data (Section 3.1), causal voxels and SNPs were represented in the first component for all dimensionality combinations and for both PLSC and PLSC-RP (Table (Table3).3). The covariances between latent variables were non-significant, since chance correlations have a considerable effect in multivariate models such as PLSC if variable numbers in both data sets are excessively high compared to sample size. In terms of runtime, PLSC-RP was remarkably faster than traditional PLSC. At maximum, runtime was reduced from ca. 3757 h to 1.2 min.
Sample illustrations of voxel and SNP weights for 1000 voxels and 50,000 SNPs (dimensionality combination 6) are provided in Figure Figure4.4. In general, using both PLSC and PLSC-RP, causal voxels and SNPs got the highest weights. Weight profiles of PLSC and PLSC-RP were comparably similar (Supplementary Table 4 and Table Table4).4). However, compared to our applications of RP for dimensionality reduction of high-dimensional continuous or high-dimensional count data, the degree of similarity was reduced, on average, from 0.99 to 0.94. The two approaches, PLSC and PLSC-RP, mainly differed in terms of weights provided for non-causal voxels and SNPs, whereas the weights for causal variables were approximately equal. Note that in Figure Figure44 the magnitude of PLSC and PLSC-RP voxel and SNP weights is comparably similar (especially for causal variables), whereas the direction of weights is reversed. However, the direction of weights is irrelevant, as long as the sign of both voxel and SNP weights is reversed for PLSC-RP compared to PLSC.
In addition to our causal SNPs highlighted in Figure Figure4,4, several other SNPs were provided with high weights by PLSC and PLSC-RP, including e.g., SNP 14,891, SNP 20,330, SNP 22,630, SNP 23,349, and SNP 49,279 in the case of dimensionality combination 6. All of these SNPs were actually linked to the causal voxel, as revealed by means of Pearson correlation (rSNP 14, 891 = −0.3857, rSNP 20, 330 = −0.3471, rSNP 22, 630 = −0.3230, rSNP 23, 349 = −0.3293 and rSNP 49, 279 = 0.3825).
For dimensionality reduction using RP, random matrices need to be orthogonalized in order to preserve distances between the original points in the low-dimensional space. Unfortunately, Gram–Schmidt orthogonalization is computationally expensive. However, it has been shown by Hecht-Nielsen (1994) that in high-dimensional spaces, there already exists a large number of nearly orthogonal vectors, such that high-dimensional random matrices are sufficiently close to orthogonal and orthogonalization using the Gram–Schmidt algorithm can be omitted. In order to quantify whether Gram–Schmidt orthogonalization is necessary in high-dimensional genetic neuroimaging data sets, we reran the analyses from Section 3.1 once using the Gram–Schmidt algorithm to orthogonalize random matrices and once without orthogonalization. Random matrices for the analyses with and without Gram–Schmidt orthogonalization were the same but differed from the random matrices in Section 3.1. We compared
using the similarity measures Pearson correlation, cosine measure and extended Jaccard similarity as well as ANOVA. Results are illustrated in Table Table5.5. It shows that the similarity between the weights of traditional PLSC and PLSC-RP using Gram–Schmidt algorithm for orthogonalization was only slightly higher as (for 1000, 10,000, and 40,000 voxels) or close to identical to (for 20,000, 30,000, 50,000, 70,000, and 90,000 voxels) the similarity between the weights of traditional PLSC and PLSC-RP when we omitted Gram–Schmidt orthogonalization. Furthermore, the similarity between the weights of PLSC-RP with and without Gram–Schmidt orthogonalization was higher than the similarity between the weights of either method and traditional PLSC. More specifically, the higher the dimensionality of the simulated fMRI data and therefore the higher the number of rows of the random matrices, the higher was the similarity between the weights of PLSC-RP with and without Gram–Schmidt orthogonalization (e.g., Pearson correlation for 1000 voxels and , Pearson correlation for 10,000 voxels and , Pearson correlation for 20,000 voxels or more and ).
Since for PLSC-RP dimensionality is reduced using RP, the results will vary slightly in every run of the algorithm due to the random draw of the random matrices R in Equation (4) or R1 and R2 in Equation (8). In order to quantify this variability, we reran the analyses from the previous sections 10 times each using different random matrices in every run of PLSC-RP. The results are illustrated in Table Table6.6. It shows that the variability of PLSC-RP weights was very small in general. For simulated fMRI data of 90,000 voxels and 50 candidate SNPs (Figure (Figure1),1), it averaged 7.93 · 10−8±5.58 · 10−8 for voxel weights and 7.77 · 10−4±3.84 · 10−4 for SNP weights. For simulated high-dimensional fMRI and high-dimensional SNP data (Figure (Figure4),4), it accounted for, on average, 5.05 · 10−5±3.27 · 10−5 for voxel weights and 1.13 · 10−6±9.19 · 10−7 for SNP weights. For experimental genotype-phenotype data, the variability of PLSC-RP results was even smaller. In the fMRI face-matching task (Figure (Figure2),2), it averaged 4.10 · 10−9±5.15 · 10−9 for voxel weights and 7.38 · 10−4±1.09 · 10−3 for SNP weights. However, the variability of the causal SNPs rs10014254, rs11722038 and rs17529323 was very low with 5.92 · 10−7±1.72 · 10−9 on average. The variability of non-causal SNPs was considerably higher and accounted for, on average, 2.41 · 10−3±8.03 · 10−4. For the SNPs associated with serum vaspin concentration in the Sorbs (Figure (Figure3),3), we determined similar variability results. For phenotype weights, we determined an average variance of 3.69 · 10−4±4.63 · 10−4 and for SNP weights 1.29 · 10−9±1.84 · 10−9. The variability of the causal SNPs rs11621467, rs4905203, rs1956713, rs6575436, and rs11160190 was even smaller with 1.02 · 10−9±6.73 · 10−10 on average.
Here we report a new method for efficiently performing multivariate analysis of high-dimensional genotype-phenotype association data, which we termed PLSC-RP. In a simulation series containing high-dimensional brain imaging measures of increasing voxel numbers as phenotypes and candidate SNPs as genotypes, we compared PLSC-RP to traditional PLSC and demonstrated that they provide statistically highly similar results, independent of the number of simulated voxels. Importantly, the higher the dimensionality, the more the processing time was reduced using PLSC-RP instead of PLSC.
PLSC-RP is independent of the original dimensionality, because dimensionality reduction is performed according to the Johnson-Lindenstrauss lemma (Johnson and Lindenstrauss, 1984). The Johnson-Lindenstrauss lemma states that if we pick a random subspace with reduced dimension k of an originally high-dimensional data, the pairwise distances between the original data points are preserved relative to an error ϵ, which we are willing to accept. Thereby, the reduced dimension k is logarithmic in the sample size N, which implies that random projections, and thus PLSC-RP, are independent of the original dimensionality. Since a reasonably high sample size N is sufficient to ensure the accuracy of results, regardless of the number of original variables, PLSC-RP is optimal in large scale settings. This includes genetic neuroimaging studies, where the high number of variants in the human genome and the high number of voxels in the brain complicate the identification of variations that are causally related to a particular disease (Schork et al., 2007).
In order to verify our simulation results, we applied traditional PLSC and PLSC-RP to an experimental genetic neuroimaging data set that has been published previously (Ousdal et al., 2012). We were able to replicate literature findings using both PLSC and PLSC-RP, with PLSC-RP being again significantly faster than PLSC. Specifically, we found three SNPs in high LD, rs10014254, rs11722038, and rs17529323, to be significantly associated to amygdala activity. These SNPs are located upstream of the Paired-like homeobox 2b (PHOX2B) gene, which is known to regulate the expression of enzymes necessary for the biosynthesis of monoamines (Brunet and Pattyn, 2002). Thus, we verified the authors hypothesis (Ousdal et al., 2012) that the monoaminergic signaling pathway plays a central role in the regulation of amygdala activity. In contrast to Ousdal et al. (2012), who only considered the amygdala peak voxels to search for influencing SNPs, we considered whole brain measures. Therefore, we also found some other brain regions to be involved during the emotional face-matching task, including cerebellum, left hippocampus, left lingual gyrus and right putamen. All these brain regions have been shown to be increasingly activated during processing of negative emotional faces (Fusar-Poli et al., 2009; Surguladze et al., 2010; Benedetti et al., 2011; Schraa-Tam et al., 2012; Demenescu et al., 2013). To date, it is not confirmed by the literature that individual differences in activation of these brain regions might be explained by variation of PHOX2B SNPs. However, we found evidence for an association of hippocampus with another gene influencing the monoaminergic signaling pathway, the gene that encodes the enzyme MAOA (Lee and Ham, 2008).
We compared PLSC-RP to traditional PLSC on a data set containing genome-wide SNPs as genotypes together with candidate phenotype measures (body height and serum vaspin concentration extracted from blood, specifically). In contrast to the brain imaging measures, which we selected as high-dimensional phenotypes in the first application and which were scaled metrically, SNP information statistically represents count data, since PLSC was performed under the assumption of an additive genetic model. We showed that PLSC-RP provides statistically equivalent results to PLSC, despite significant savings in runtime. Therefore, dimensionality reduction using RP is data type independent. Using both PLSC and PLSC-RP, we were further able to replicate literature findings (Breitfeld et al., 2013). In the original publication, six SNPs on chromosome 14, mapping between serpinA1 and serpinA4, were shown to be significantly associated to serum vaspin measures. With the exception of one SNP that we had to exclude due to its low call rate, we verified these findings. We did not find any additional SNPs, since the authors in the original publication (Breitfeld et al., 2013) already performed a genome-wide screening.
Finally, we applied PLSC-RP on a simulated data set containing both high-dimensional brain imaging and high-dimensional SNP measures of different dimensionality combinations. We showed that PLSC-RP was able to detect causal voxels and SNPs with high accuracy, despite slightly reduced similarity to the results provided by PLSC compared to the applications of PLSC-RP and PLSC for either high-dimensional neuroimaging measures or high-dimensional SNPs. Again, PLSC-RP was significantly faster than PLSC. At maximum, runtime was reduced from 22 weeks to 1.2 min. Thus, we strongly recommend the use of PLSC-RP even if both data sets are high-dimensional.
For dimensionality reduction using RP, random matrices need to be orthogonal. To quantify whether Gram–Schmidt orthogonalization is necessary or whether high-dimensional random matrices are sufficiently close to orthogonal without orthogonalization, we repeated our analyses from Section 3.1 once using the Gram–Schmidt algorithm to orthogonalize random matrices and once without orthogonalization. We showed that the weights of PLSC-RP using Gram–Schmidt algorithm and the weights of PLSC-RP when we omitted orthogonalization were close to identical for all simulated fMRI data sets. Thus, the quality of PLSC-RP results is not depended on the Gram-Schmidt algorithm in high-dimensional association analyses, such that a preceding orthogonalization of random matrices can be safely omitted. This is also suggested, since the Gram-Schmidt algorithm is computationally expensive. We further observed that the similarity was slightly higher, the higher the dimensionality of the simulated fMRI data. Therefore, an orthogonalization becomes less necessary, the more voxels and/or SNPs are considered. In genetic neuroimaging, where the data sets usually capture the whole brain or the whole genome, respectively, the Gram-Schmidt algorithm can be omitted. However, for smaller data sets, we recommend to orthogonalize random matrices.
Since the dimensionality reduction in PLSC-RP depends on the choice of random matrices, the algorithm produces slightly different results in every run. To quantify this variability, we repeated our analyses 10 times each, using different random matrices in every run of PLSC-RP. We showed that the variability of PLSC-RP results was very small in general. The highest variability we determined for a single genotypic or phenotypic variable was 0.0025. On average, however, variability was much lower. Thus, PLSC-RP is appropriate for both exploratory analyses, in order to detect causal SNPs and phenotypes maximizing the joint covariance in an association study, and replication analyses, when multiple runs of the algorithm are performed. Repeatability of PLSC-RP results is assured even if different random matrices are selected for dimensionality reduction in every run of the algorithm.
We further observed that variability of PLSC-RP results was smaller the higher the sample size of our data sets. For our simulation experiments, the sample size was chosen to be 100, and the variability of PLSC-RP weights including all voxels and SNPs was between 1.95 · 10−3 and 2.42 · 10−9. In contrast, when we used PLSC-RP to search for genome-wide SNPs associated with serum vaspin concentration, sample size was equal to 865, and the variability of PLSC-RP weights including the phenotypes serum vaspin concentration and body height as well as all SNPs was between 9.04 · 10−4 and 1.82 · 10−15. Dimensionality reduction in PLSC-RP depends on the sample size N, and, according to the Johnson-Lindenstrauss lemma, it assures to preserve the distances between the original data points when N is logarithmic in the reduced dimension k (Johnson and Lindenstrauss, 1984). We reduced the dimensionality of the fMRI and/or SNP matrices to sample size in all applications. Thus, the higher the sample size, the higher was also the accuracy of distance preservation for a given probability of success (Table (Table7).7). The higher the accuracy of the distance preservation, the higher the degree of similarity between the results of PLSC-RP and traditional PLSC, and likewise, the higher also the degree of similarity between the results of multiple runs of PLSC-RP using different random matrices.
With combining PLSC and RP, we made two major contributions to the analysis of genetic neuroimaging data. First, in genetic neuroimaging studies, the number of variables usually exceeds the number of observations, such that multivariate methods encounter critical over-fitting issues (Le Floch et al., 2012). Our new technique uses RP for dimensionality reduction in order to circumvent this problem. Previous studies (Hibar et al., 2011a,b; Le Floch et al., 2012; Hua et al., 2015) also contributed to this end by implementing univariate filters or PCA as pre-processing step. However, they performed dimensionality reduction on either the genetic or the neuroimaging data set. In contrast, to illustrate the PLSC-RP methodology, we systematically used a two-stepped approach. First, we applied PLSC-RP for multivariate analysis of data sets containing either high-dimensional neuroimaging measures or high-dimensional SNPs. Neuroimaging measures are scaled metrically, but SNPs are counts. Therefore, we demonstrated that PLSC-RP is data-type independent. Then, we considered data sets containing a combination of high-dimensional neuroimaging measures and high-dimensional SNPs, and performed dimensionality reduction on both domains. This has not been done before.
Our second and most important contribution is related to computational efficiency. Previous studies (Hibar et al., 2011a,b; Le Floch et al., 2012; Hua et al., 2015) implemented univariate filters or PCA as pre-processing step, which are computationally very expensive procedures. In contrast, PLSC-RP is able to dramatically reduce runtime and enables researchers to analyze truly high-dimensional data sets, even if there is no powerful compute server available in the lab. Hence, our study is the first of its kind that implements dimensionality reduction both to overcome critical over-fitting issues and to reduce runtime.
In this study, we applied PLSC-RP to efficiently assess genome-wide and whole-brain associations. Combining neuroimaging data with genetic information is a rapidly growing research approach, enabling the integration of information from two of the major methodological advances introduced in the past 30 years, namely sequencing of the entire human genome and fMRI in humans. However, the application of PLSC-RP is not limited to the combined analysis of genotypes and brain imaging phenotypes. It opens up a wide range of possible uses far beyond imaging genetics. PLSC-RP might be considered for fusion of several brain imaging techniques, such as MRI, positron emission tomography (PET), diffusion tensor imaging (DTI), or magnetoencephalography (MEG) in order to profit from the benefits of each modality (Sui et al., 2012). It is suitable for the integrated analysis of disease status and multiple types of “omics” data, such as genomics, epigenomics, and transcriptomics, aiming to understand signs of malfunction that cause diseases. Furthermore, it can be applied to investigate how concentrations of biomolecules in different tissues or different species, such as mice and humans, are associated to each other. To summarize, PLSC-RP is appropriate for any integrative analysis which combines information from multiple sources and has therefore a multitude of potential applications. Since PLSC-RP depends on sample size only and is independent of the number of variables, it is especially attractive for large-scale multicenter association studies or other data sharing projects.
CG, SB, and AH designed the theoretical approach. CG implemented the approach. CG designed and performed the simulation experiment. AT, PK, LW, and OA collected the experimental data. CG analyzed the data. PK, AT, LW, OA, MS, and AV contributed materials. AH, JN, and SB advised on statistics. CG wrote the manuscript. JN, SB, PK, AT, LW, OA, MS, AV, and AH edited the manuscript and approved its final version.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The authors would like to thank Tobias Wohland for his excellent support during validation of the statistical approach. This work was supported by the German Federal Ministry of Education and Research (IFB AdiposityDiseases, FKZ: 01EO1001) to CG, JN, MS, AV, and AH, the German Research Foundation (CRC 1052 Obesity mechanisms) to JN, PK, AT, MS, AV, and AH, the Research Council of Norway (#213837, #223273) to LW and OA, the South-East Norway regional health authority to LW, and KG Jebsen Foundation to OA.
The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fgene.2016.00102
Software code, sample input data and complete documentation are available at http://www.o-brain-project.de/en/methoden/.