Existing methods to ascertain small sets of markers for the identification of human population structure require prior knowledge of individual ancestry. Based on Principal Components Analysis (PCA), and recent results in theoretical computer science, we present a novel algorithm that, applied to genome-wide data, selects small subsets of SNPs (PCA-correlated SNPs) to reproduce the structure found by PCA on the complete dataset, without use of ancestry information. Evaluating our method on a previously described dataset (10,805 SNPs, 11 populations), we demonstrate that a very small set of PCA-correlated SNPs can be effectively employed to assign individuals to particular continents or populations, using a simple clustering algorithm. We validate our methods on the HapMap populations and achieve perfect intercontinental differentiation with 14 PCA-correlated SNPs. The Chinese and Japanese populations can be easily differentiated using fewer than 100 PCA-correlated SNPs ascertained after evaluating 1.7 million SNPs from HapMap. We show that, in general, structure informative SNPs are not portable across geographic regions. However, we manage to identify a general set of 50 PCA-correlated SNPs that effectively assigns individuals to one of nine different populations. Compared to analysis with the measure of informativeness, our methods, although unsupervised, achieved similar results. We proceed to demonstrate that our algorithm can be effectively used for the analysis of admixed populations without having to trace the origin of individuals. Analyzing a Puerto Rican dataset (192 individuals, 7,257 SNPs), we show that PCA-correlated SNPs can be used to successfully predict structure and ancestry proportions. We subsequently validate these SNPs for structure identification in an independent Puerto Rican dataset.
The algorithm that we introduce runs in seconds and can be easily applied on large genome-wide datasets, facilitating the identification of population substructure, stratification assessment in multi-stage whole-genome association studies, and the study of demographic history in human populations.
Genetic markers can be used to infer population structure, a task that remains a central challenge in many areas of genetics such as population genetics, and the search for susceptibility genes for common disorders. In such settings, it is often desirable to reduce the number of markers needed for structure identification. Existing methods to identify structure informative markers demand prior knowledge of the membership of the studied individuals to predefined populations. In this paper, based on the properties of a powerful dimensionality reduction technique (Principal Components Analysis), we develop a novel algorithm that does not depend on any prior assumptions and can be used to identify a small set of structure informative markers. Our method is very fast even when applied to datasets of hundreds of individuals and millions of markers. We evaluate this method on a large dataset of 11 populations from around the world, as well as data from the HapMap project. We show that, in most cases, we can achieve 99% genotyping savings while at the same time recovering the structure of the studied populations. Finally, we show that our algorithm can also be successfully applied for the identification of structure informative markers when studying populations of complex ancestry.
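The selection step can be illustrated with a minimal sketch. It assumes, as one plausible reading of the approach, that each SNP is scored by the sum of its squared coefficients in the top k right singular vectors of the centered genotype matrix, and that the highest-scoring SNPs are kept. The function names, the toy data, and the choice of k are hypothetical and are not taken from the original implementation.

```python
import numpy as np

def pca_correlated_scores(genotypes, k):
    """Score each SNP (column) by its weight in the top-k principal components."""
    # Center each SNP so that the SVD corresponds to PCA on the individuals.
    centered = genotypes - genotypes.mean(axis=0)
    # Rows of vt are right singular vectors, with one coefficient per SNP.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # Score of SNP j = sum of its squared coefficients over the top-k vectors.
    return np.sum(vt[:k] ** 2, axis=0)

def select_snps(genotypes, k, n_select):
    """Return the indices of the n_select highest-scoring SNPs."""
    scores = pca_correlated_scores(genotypes, k)
    return np.argsort(scores)[::-1][:n_select]

# Toy data (hypothetical): two populations, 300 SNPs, of which the first 30
# have very different allele frequencies between the two groups.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
freqs = np.full((2, 300), 0.5)
freqs[0, :30] = 0.05
freqs[1, :30] = 0.95
genotypes = rng.binomial(2, freqs[labels]).astype(float)

selected = select_snps(genotypes, k=1, n_select=30)
```

No ancestry labels enter the scoring. In this toy example `labels` is used only to simulate the genotypes, which mirrors the unsupervised setting described above.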
A key question when analyzing high throughput data is whether the information provided by the measured biological entities (gene, metabolite expression for example) is related to the experimental conditions, or, rather, to some interfering signals, such as experimental bias or artefacts. Visualization tools are therefore useful to better understand the underlying structure of the data in a 'blind' (unsupervised) way. A well-established technique to do so is Principal Component Analysis (PCA). PCA is particularly powerful if the biological question is related to the highest variance. Independent Component Analysis (ICA) has been proposed as an alternative to PCA as it optimizes an independence condition to give more meaningful components. However, neither PCA nor ICA can overcome both the high dimensionality and noisy characteristics of biological data.
We propose Independent Principal Component Analysis (IPCA) that combines the advantages of both PCA and ICA. It uses ICA as a denoising process of the loading vectors produced by PCA to better highlight the important biological entities and reveal insightful patterns in the data. The result is a better clustering of the biological samples on graphical representations. In addition, a sparse version is proposed that performs an internal variable selection to identify biologically relevant features (sIPCA).
On simulation studies and real data sets, we showed that IPCA offers a better visualization of the data than ICA and with a smaller number of components than PCA. Furthermore, a preliminary investigation of the list of genes selected with sIPCA demonstrates that the approach is well able to highlight relevant genes in the data with respect to the biological experiment.
IPCA and sIPCA are both implemented in the R package mixOmics dedicated to the analysis and exploration of high dimensional biological data sets, and on mixOmics' web-interface.
Motivation: High-throughput single-cell quantitative real-time polymerase chain reaction (qPCR) is a promising technique allowing for new insights in complex cellular processes. However, the PCR reaction can be detected only up to a certain detection limit, whereas failed reactions could be due to low or absent expression, and the true expression level is unknown. Because this censoring can occur for high proportions of the data, it is one of the main challenges when dealing with single-cell qPCR data. Principal component analysis (PCA) is an important tool for visualizing the structure of high-dimensional data as well as for identifying subpopulations of cells. However, to date it is not clear how to perform a PCA of censored data. We present a probabilistic approach that accounts for the censoring and evaluate it for two typical datasets containing single-cell qPCR data.
Results: We use the Gaussian process latent variable model framework to account for censoring by introducing an appropriate noise model and allowing a different kernel for each dimension. We evaluate this new approach for two typical qPCR datasets (of mouse embryonic stem cells and blood stem/progenitor cells, respectively) by performing linear and non-linear probabilistic PCA. Taking the censoring into account results in a 2D representation of the data, which better reflects its known structure: in both datasets, our new approach results in a better separation of known cell types and is able to reveal subpopulations in one dataset that could not be resolved using standard PCA.
Availability and implementation: The implementation was based on the existing Gaussian process latent variable model toolbox (https://github.com/SheffieldML/GPmat); extensions for noise models and kernels accounting for censoring are available at http://icb.helmholtz-muenchen.de/censgplvm.
Supplementary information: Supplementary data are available at Bioinformatics online.
Classification of breast cancer patients into different risk classes is very important in clinical applications. It is estimated that the advent of high-dimensional gene expression data could improve patient classification. In this study, a new method for transforming the high-dimensional gene expression data in a low-dimensional space based on wavelet transform (WT) is presented.
The proposed method was applied to three publicly available microarray data sets. After dimensionality reduction using supervised wavelet, a predictive support vector machine (SVM) model was built upon the reduced dimensional space. In addition, the proposed method was compared with the supervised principal component analysis (PCA).
The performance of supervised wavelet and supervised PCA based on selected genes was better than that of the signature genes identified in other studies. Furthermore, the supervised wavelet method generally performed better than the supervised PCA for predicting the 5-year survival status of patients with breast cancer based on microarray data. In addition, the proposed method had a relatively acceptable performance compared with the other studies.
The results suggest the possibility of developing a new tool using wavelets for the dimension reduction of microarray data sets in the classification framework.
breast cancer; microarray data; supervised wavelet; support vector machine
Multivariate image analysis tools are used for analyzing dynamic or multidimensional Positron Emission Tomography (PET) data with the aim of noise reduction, dimension reduction and signal separation. Principal Component Analysis is one of the most commonly used multivariate image analysis tools applied to dynamic PET data. Independent Component Analysis is another multivariate image analysis tool used to extract and separate signals. Because of the presence of high and variable noise levels and correlation in the different PET images, which may confound the multivariate analysis, it is essential to explore and investigate different types of pre-normalization (transformation) methods that need to be applied prior to application of these tools. In this study, we explored the performance of Principal Component Analysis (PCA) and Independent Component Analysis (ICA) to extract signals and reduce noise, thereby increasing the Signal to Noise Ratio (SNR) in a dynamic sequence of PET images, where the features of the noise are different compared with some other medical imaging techniques. Applications on computer simulated PET images were explored and compared. Application of PCA generated relatively similar results, with some minor differences, on the images with different noise characteristics. However, clear differences were seen with respect to the type of pre-normalization. ICA on images normalized using two types of normalization methods also seemed to perform relatively well but did not reach the same improvement in SNR as PCA. Furthermore, ICA seems to have a tendency under some conditions to shift information from IC1 to other independent components and to be more sensitive to the level of noise. PCA is a more stable technique than ICA and creates better results both qualitatively and quantitatively in the simulated PET images.
PCA can extract the signals from the noise rather well and is not sensitive to the type, magnitude and correlation of the noise when the input data are correctly handled by a proper pre-normalization. It is important to note that PCA, although inherently a method for separating signal information into different components, can still generate PC1 images with improved SNR as compared to mean images.
The recent explosion in procurement and availability of high-dimensional gene- and protein-expression profile datasets for cancer diagnostics has necessitated the development of sophisticated machine learning tools with which to analyze them. While some investigators are focused on identifying informative genes and proteins that play a role in specific diseases, other researchers have attempted instead to classify patients based on their expression profiles in order to prognosticate disease status. A major limitation in the ability to accurately classify these high-dimensional datasets stems from the ‘curse of dimensionality’, occurring in situations where the number of genes or peptides significantly exceeds the total number of patient samples. Previous attempts at dealing with this issue have mostly centered on the use of a dimensionality reduction (DR) scheme, Principal Component Analysis (PCA), to obtain a low-dimensional projection of the high-dimensional data. However, linear PCA and other linear DR methods, which rely on Euclidean distances to estimate object similarity, do not account for the inherent underlying nonlinear structure associated with most biomedical data. While some researchers have begun to explore nonlinear DR methods for computer vision problems such as face detection and recognition, to the best of our knowledge, few such attempts have been made for classification and visualization of high-dimensional biomedical data. The motivation behind this work is to identify the appropriate DR methods for analysis of high-dimensional gene- and protein-expression studies. Towards this end, we empirically and rigorously compare three nonlinear (Isomap, Locally Linear Embedding, Laplacian Eigenmaps) and three linear DR schemes (PCA, Linear Discriminant Analysis, Multidimensional Scaling) with the intent of determining a reduced subspace representation in which the individual object classes are more easily discriminable.
Owing to the inherent nonlinear structure of gene- and protein-expression studies, our claim is that the nonlinear DR methods provide a more truthful low-dimensional representation of the data compared to the linear DR schemes. Evaluation of the DR schemes was done by (i) assessing the discriminability of two supervised classifiers (Support Vector Machine and C4.5 Decision Trees) in the different low-dimensional data embeddings and (ii) applying 5 cluster validity measures to evaluate the size, distance and tightness of object aggregates in the low-dimensional space. For each of the 7 evaluation measures considered, statistically significant improvement in the quality of the embeddings across 10 cancer datasets via the use of 3 nonlinear DR schemes over 3 linear DR techniques was observed. Similar trends were observed when linear and nonlinear DR was applied to the high-dimensional data following feature pruning to isolate the most informative features. Qualitative evaluation of the low-dimensional data embedding obtained via the 6 DR methods further suggests that the nonlinear schemes are better able to identify potential novel classes (e.g. cancer subtypes) within the data.
Dimensionality reduction; bioinformatics; data clustering; data visualization; machine learning; manifold learning; nonlinear dimensionality reduction; gene expression; proteomics; prostate cancer; lung cancer; ovarian cancer; principal component analysis; linear discriminant analysis; multidimensional scaling; Isomap; locally linear embedding; laplacian eigenmaps; classification; support vector machine; decision trees; LLE; PCA
Motivation: Gene set analysis allows formal testing of subtle but coordinated changes in a group of genes, such as those defined by Gene Ontology (GO) or KEGG Pathway databases. We propose a new method for gene set analysis that is based on principal component analysis (PCA) of gene expression values in the gene set. PCA is an effective method for reducing high dimensionality and capturing variations in gene expression values. However, one limitation with PCA is that the latent variable identified by the first PC may be unrelated to outcome.
Results: In the proposed supervised PCA (SPCA) model for gene set analysis, the PCs are estimated from a selected subset of genes that are associated with outcome. As outcome information is used in the gene selection step, this method is supervised, thus called the Supervised PCA model. Because of the gene selection step, the test statistic in the SPCA model can no longer be approximated well using a t-distribution. We propose a two-component mixture distribution based on Gumbel extreme value distributions to account for the gene selection step. We show the proposed method compares favorably to currently available gene set analysis methods using simulated and real microarray data.
Software: The R code for the analysis used in this article is available upon request; we are currently working on implementing the proposed method in an R package.
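The two-step idea (select outcome-associated genes, then take the first PC of the selected genes) can be sketched as follows. This is a minimal illustration under stated assumptions: genes are ranked by absolute Pearson correlation with the outcome, a fixed number of genes is kept, and the function name and toy data are hypothetical. It does not implement the Gumbel mixture distribution used for inference in the proposed model.

```python
import numpy as np

def supervised_pc1(expr, outcome, n_keep):
    """First PC score computed from the n_keep genes most correlated with outcome.

    expr: samples x genes matrix; outcome: one value per sample.
    """
    xc = expr - expr.mean(axis=0)
    yc = outcome - outcome.mean()
    # Pearson correlation of every gene (column) with the outcome.
    denom = np.sqrt((xc ** 2).sum(axis=0) * (yc ** 2).sum())
    corr = (xc.T @ yc) / denom
    # Supervised step: keep the genes with the largest absolute correlation.
    keep = np.argsort(np.abs(corr))[::-1][:n_keep]
    # Unsupervised step: ordinary PCA (via SVD) restricted to the kept genes.
    u, s, _ = np.linalg.svd(xc[:, keep], full_matrices=False)
    return keep, u[:, 0] * s[0]

# Toy gene set (hypothetical): 50 genes, 5 of which track a binary outcome.
rng = np.random.default_rng(1)
outcome = np.repeat([0.0, 1.0], 20)
expr = rng.normal(size=(40, 50))
expr[:, :5] = 2.0 * outcome[:, None] + rng.normal(scale=0.5, size=(40, 5))

kept, pc1 = supervised_pc1(expr, outcome, n_keep=5)
```

Because the outcome is used to choose the genes, a statistic computed from `pc1` is optimistically biased, which is why a t-distribution is no longer an adequate reference distribution.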
Gene microarray technology is an effective tool to investigate the simultaneous activity of multiple cellular pathways from hundreds to thousands of genes. However, because data in the colossal amounts generated by DNA microarray technology are usually complex, noisy, high-dimensional, and often hindered by low statistical power, their exploitation is difficult. To overcome these problems, two kinds of unsupervised analysis methods for microarray data, principal component analysis (PCA) and independent component analysis (ICA), have been developed. PCA projects the data into a new space spanned by the principal components, which are mutually orthonormal. The constraint of mutual orthogonality and the second-order statistics technique within PCA algorithms, however, may not be applicable to the biological systems studied. Extracting and characterizing the most informative features of the biological signals requires higher-order statistics.
ICA is one of the unsupervised algorithms that can extract higher-order statistical structures from data and has been applied to DNA microarray gene expression data analysis. We applied the FastICA method to DNA microarray gene expression data from Alzheimer's disease (AD) hippocampal tissue samples, followed by gene clustering. Experimental results showed that the ICA method can improve the clustering results of AD samples and identify significant genes. More than 50 significant genes with high expression levels in severe AD were extracted, representing immunity-related protein, metal-related protein, membrane protein, lipoprotein, neuropeptide, cytoskeleton protein, cellular binding protein, and ribosomal protein. Within the aforementioned categories, our method also found 37 significant genes with low expression levels. Moreover, it is worth noting that some oncogenes and phosphorylation-related proteins are expressed in low levels. In comparison to the PCA and support vector machine recursive feature elimination (SVM-RFE) methods, which are widely used in microarray data analysis, ICA can identify more AD-related genes. Furthermore, we have validated and identified many genes that are associated with AD pathogenesis.
We demonstrated that ICA exploits higher-order statistics to identify gene expression profiles as linear combinations of elementary expression patterns that lead to the construction of potential AD-related pathogenic pathways. Our computing results also validated that the ICA model outperformed PCA and the SVM-RFE method. This report shows that ICA as a microarray data analysis tool can help us to elucidate the molecular taxonomy of AD and other multifactorial and polygenic complex diseases.
When the number of phenotypes in a genetic study is on the scale of thousands, such as in studies concerning thousands of gene expression levels, the single-trait analysis is computationally intensive, and heavy adjustment of multiple comparisons is required. Traditional multivariate genetic linkage analysis for quantitative traits focuses on mapping only a few phenotypes and is not feasible for a large number of traits. To cope with high-dimensional phenotype data, clustering analysis and principal-component analysis (PCA) are proposed to reduce the data dimensionality and to map shared genetic contributions for multiple traits. However, standard clustering analysis and PCA are applicable for independent observations. In most genetic studies, where family data are collected, these standard analyses can only be applied to founders and can lead to the loss of information. Here, we proposed a clustering method that can exploit family structure information and applied the method to 29 gene expression levels mapped to a reported hot spot on chromosome 14. We then used a PCA approach based on heritability, applicable to a small number of traits, to combine phenotypes in the clusters. Lastly, we used a penalized PCA approach based on heritability, applicable to an arbitrary number of traits, to combine 150 gene expression levels with the highest heritability. Genome-wide multipoint linkage analysis was carried out on the individual traits and on the combined traits. Two previously reported peaks on chromosomes 14 and 20 were identified. Linkage evidence was stronger for traits derived from methods that incorporate family structure information.
We consider the statistical analysis of population structure using genetic data. We show how the two most widely used approaches to modeling population structure, admixture-based models and principal components analysis (PCA), can be viewed within a single unifying framework of matrix factorization. Specifically, they can both be interpreted as approximating an observed genotype matrix by a product of two lower-rank matrices, but with different constraints or prior distributions on these lower-rank matrices. This opens the door to a large range of possible approaches to analyzing population structure, by considering other constraints or priors. In this paper, we introduce one such novel approach, based on sparse factor analysis (SFA). We investigate the effects of the different types of constraint in several real and simulated data sets. We find that SFA produces similar results to admixture-based models when the samples are descended from a few well-differentiated ancestral populations and can recapitulate the results of PCA when the population structure is more “continuous,” as in isolation-by-distance models.
Two different approaches have become widely used in the analysis of population structure: admixture-based models and principal components analysis (PCA). In admixture-based models each individual is assumed to have inherited some proportion of its ancestry from one of several distinct populations. PCA projects the individuals into a low-dimensional subspace. On the face of it, these methods seem to have little in common. Here we show how in fact both of these methods can be viewed within a single unifying framework. This viewpoint should help practitioners to better interpret and contrast the results from these methods in real data applications. It also provides a springboard to the development of novel approaches to this problem. We introduce one such novel approach, based on sparse factor analysis, which has elements in common with both admixture-based models and PCA. As we illustrate here, in some settings sparse factor analysis may provide more interpretable results than either admixture-based models or PCA.
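The unifying view can be written compactly. The block below is a sketch in assumed notation (the symbols G, Λ, F and K are introduced here for illustration and are not taken from the text above); the constraints shown are the standard forms for admixture proportions and for PCA obtained from a singular value decomposition.

```latex
% Assumed notation: G is the n x p genotype matrix (n individuals, p SNPs),
% \Lambda is n x K and F is K x p, with K much smaller than n and p.
\[
  G \approx \Lambda F
\]
% Admixture-based models: each row of \Lambda holds ancestry proportions and
% F holds allele frequencies (for genotypes coded 0/1/2 the expected genotype
% is 2 \Lambda F).
\[
  \lambda_{ik} \ge 0, \qquad \sum_{k=1}^{K} \lambda_{ik} = 1, \qquad 0 \le f_{kj} \le 1
\]
% PCA (on the centered matrix): orthogonality instead of simplex constraints.
\[
  \Lambda^{\top} \Lambda \ \text{diagonal}, \qquad F F^{\top} = I_K
\]
```

Under this reading, sparse factor analysis keeps the same factorization but replaces these constraints with a sparsity-inducing prior on one of the two matrices (assumed here to be Λ), so that each individual loads on only a few factors.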
Microarray data sets provide relative expression levels for thousands of genes for a small number, in comparison, of different experimental conditions called assays. Data mining techniques are used to extract specific information of genes as they relate to the assays. The multivariate statistical technique of principal component analysis (PCA) has proven useful in providing effective data mining methods. This article extends the PCA approach of Rollins et al. to rank the genes of microarray data sets that are expressed most differently between two biologically different groupings of assays. This method is evaluated on real and simulated data and compared to a current approach on the basis of false discovery rate (FDR) and statistical power (SP), which is the ability to correctly identify important genes.
This work developed and evaluated two new test statistics based on PCA and compared them to a popular method that is not PCA based. Both test statistics were found to be effective as evaluated in three case studies: (i) exposing E. coli cells to two different ethanol levels; (ii) application of myostatin to two groups of mice; and (iii) a simulated data study derived from the properties of (ii). The proposed method (PM) effectively identified critical genes in these studies based on comparison with the current method (CM). The simulation study supports higher identification accuracy for PM over CM for both proposed test statistics when the gene variance is constant and for one of the test statistics when the gene variance is non-constant.
PM compares quite favorably to CM in terms of lower FDR and much higher SP. Thus, PM can be quite effective in producing accurate signatures from large microarray data sets for differential expression between assay groups identified in a preliminary step of the PCA procedure and is, therefore, recommended for use in these applications.
Visualization of DNA microarray data in two or three dimensional spaces is an important exploratory analysis step in order to detect quality issues or to generate new hypotheses. Principal Component Analysis (PCA) is a widely used linear method to define the mapping between the high-dimensional data and its low-dimensional representation. During the last decade, many new nonlinear methods for dimension reduction have been proposed, but it is still unclear how well these methods capture the underlying structure of microarray gene expression data. In this study, we assessed the performance of the PCA approach and of six nonlinear dimension reduction methods, namely Kernel PCA, Locally Linear Embedding, Isomap, Diffusion Maps, Laplacian Eigenmaps and Maximum Variance Unfolding, in terms of visualization of microarray data.
A systematic benchmark, consisting of Support Vector Machine classification, cluster validation and noise evaluations, was applied to ten microarray and several simulated datasets. Significant differences between PCA and most of the nonlinear methods were observed in two and three dimensional target spaces. With an increasing number of dimensions and an increasing number of differentially expressed genes, all methods showed similar performance. PCA and Diffusion Maps were less sensitive to noise than the other nonlinear methods.
Locally Linear Embedding and Isomap showed a superior performance on all datasets. In very low-dimensional representations and with few differentially expressed genes, these two methods preserve more of the underlying structure of the data than PCA, and thus are favorable alternatives for the visualization of microarray data.
Principal component analysis (PCA) is a data analysis method that can deal with large volumes of data. Owing to the complexity and volume of the data generated by today's advanced technologies in genomics, proteomics, and metabolomics, PCA has become predominant in the medical sciences. Despite its popularity, PCA leaves much to be desired in terms of accuracy and may not be suitable for certain medical applications, such as diagnostics, where accuracy is paramount. In this study, we introduced a new PCA method, one that is carefully supervised by receiver operating characteristic (ROC) curve analysis. In order to assess its performance with respect to its ability to render an accurate differential diagnosis, and to compare its performance with that of standard PCA, we studied the striatal metabolomic profile of R6/2 Huntington disease (HD) transgenic mice, as well as that of wild type (WT) mice, using high field in vivo proton nuclear magnetic resonance (NMR) spectroscopy (9.4-Tesla). We tested both the standard PCA and our ROC-supervised PCA (using in each case both the covariance and the correlation matrix), 1) with the original R6/2 HD mice and WT mice, 2) with unknown mice, whose status had been determined via genotyping, and 3) with the ability to separate the original R6/2 mice into the two age subgroups (8 and 12 wks old). Only our ROC-supervised PCA (both with the covariance and the correlation matrix) passed all tests with a total accuracy of 100%; thus, providing evidence that it may be used for diagnostic purposes.
Diagnostic methods; principal component analysis; receiver operating characteristic (ROC) curve analysis; metabolomics; nuclear magnetic resonance spectroscopy; huntington disease
Motivation: Batch effects are due to probe-specific systematic variation between groups of samples (batches) resulting from experimental features that are not of biological interest. Principal component analysis (PCA) is commonly used as a visual tool to determine whether batch effects exist after applying a global normalization method. However, PCA yields linear combinations of the variables that contribute maximum variance and thus will not necessarily detect batch effects if they are not the largest source of variability in the data.
Results: We present an extension of PCA to quantify the existence of batch effects, called guided PCA (gPCA). We describe a test statistic that uses gPCA to test whether a batch effect exists. We apply our proposed test statistic derived using gPCA to simulated data and to two copy number variation case studies: the first study consisted of 614 samples from a breast cancer family study using Illumina Human 660 bead-chip arrays, whereas the second case study consisted of 703 samples from a family blood pressure study that used Affymetrix SNP Array 6.0. We demonstrate that our statistic has good statistical properties and is able to identify significant batch effects in two copy number variation case studies.
Conclusion: We developed a new statistic that uses gPCA to identify whether batch effects exist in high-throughput genomic data. Although our examples pertain to copy number data, gPCA is general and can be used on other data types as well.
Availability and implementation: The gPCA R package (available via CRAN) provides functionality and data to perform the methods in this article.
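A minimal Python sketch of a guided-PCA style batch statistic follows. It assumes that the guided direction is the first right singular vector of Y'X, where Y is the batch indicator matrix and X the centered data, that the statistic is the ratio of the variance of the data projected on the guided direction to the variance on the first ordinary PC, and that significance is assessed by permuting batch labels. The function names and toy data are hypothetical and do not reproduce the gPCA R package interface.

```python
import numpy as np

def gpca_delta(x, batch):
    """Variance along the batch-guided PC1 divided by variance along ordinary PC1."""
    xc = x - x.mean(axis=0)
    # Indicator matrix: one column per batch, 1 where the sample is in that batch.
    levels = np.unique(batch)
    y = (batch[:, None] == levels[None, :]).astype(float)
    # Guided PCA: SVD of Y'X; unguided PCA: SVD of X itself.
    _, _, vt_guided = np.linalg.svd(y.T @ xc, full_matrices=False)
    _, _, vt_plain = np.linalg.svd(xc, full_matrices=False)
    return np.var(xc @ vt_guided[0]) / np.var(xc @ vt_plain[0])

def gpca_pvalue(x, batch, n_perm, rng):
    """Permutation p-value: how often shuffled batch labels give a ratio as large."""
    observed = gpca_delta(x, batch)
    exceed = sum(gpca_delta(x, rng.permutation(batch)) >= observed
                 for _ in range(n_perm))
    return observed, (1 + exceed) / (1 + n_perm)

# Toy data (hypothetical): 40 samples, 30 probes, two batches of 20 samples.
rng = np.random.default_rng(2)
batch = np.repeat([0, 1], 20)
noise = rng.normal(size=(40, 30))
shifted = noise + 2.0 * batch[:, None]   # add a probe-wide batch shift

delta_null = gpca_delta(noise, batch)
delta_batch, p_batch = gpca_pvalue(shifted, batch, n_perm=99, rng=rng)
```

The ratio cannot exceed 1, because the first ordinary PC maximizes projected variance. A value near 1 indicates that the batch direction coincides with the dominant source of variation.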
Supplementary data are available at Bioinformatics online.
The ever increasing sizes of population genetic datasets pose great challenges for population structure analysis. The Tracy-Widom (TW) statistical test is widely used for detecting structure. However, it has not been adequately investigated whether the TW statistic is susceptible to type I error, especially in large, complex datasets. Non-parametric, Principal Component Analysis (PCA) based methods for resolving structure have been developed which rely on the TW test. Although PCA-based methods can resolve structure, they cannot infer ancestry. Model-based methods are still needed for ancestry analysis, but they are not suitable for large datasets. We propose a new structure analysis framework for large datasets. This includes a new heuristic for detecting structure and incorporation of the structure patterns inferred by a PCA method to complement STRUCTURE analysis.
A new heuristic called EigenDev for detecting population structure is presented. When tested on simulated data, this heuristic is robust to sample size. In contrast, the TW statistic was found to be susceptible to type I error, especially for large population samples. EigenDev is thus better-suited for analysis of large datasets containing many individuals, in which spurious patterns are likely to exist and could be incorrectly interpreted as population stratification. EigenDev was applied to the iterative pruning PCA (ipPCA) method, which resolves the underlying subpopulations. This subpopulation information was used to supervise STRUCTURE analysis to infer patterns of ancestry at an unprecedented level of resolution. To validate the new approach, a bovine and a large human genetic dataset (3945 individuals) were analyzed. We found new ancestry patterns consistent with the subpopulations resolved by ipPCA.
The EigenDev heuristic is robust to sampling and is thus superior for detecting structure in large datasets. The application of EigenDev to the ipPCA algorithm improves the estimation of the number of subpopulations and the individual assignment accuracy, especially for very large and complex datasets. Furthermore, we have demonstrated that the structure resolved by this approach complements parametric analysis, allowing a much more comprehensive account of population structure. The new version of the ipPCA software with EigenDev incorporated can be downloaded from http://www4a.biotec.or.th/GI/tools/ippca.
Principal component analysis (PCA) has gained popularity as a method for the analysis of high-dimensional genomic data. However, it is often difficult to interpret the results because the principal components are linear combinations of all variables, and the coefficients (loadings) are typically nonzero. These nonzero values also reflect poor estimation of the true vector loadings; for example, for gene expression data, biologically we expect only a portion of the genes to be expressed in any tissue, and an even smaller fraction to be involved in a particular process. Sparse PCA methods have recently been introduced for reducing the number of nonzero coefficients, but these existing methods are not satisfactory for high-dimensional data applications because they still give too many nonzero coefficients.
Here we propose a new PCA method that uses two innovations to produce an extremely sparse loading vector: (i) a random-effect model on the loadings that leads to an unbounded penalty at the origin and (ii) shrinkage of the singular values obtained from the singular value decomposition of the data matrix. We develop a stable computing algorithm by modifying the nonlinear iterative partial least squares (NIPALS) algorithm, and illustrate the method with an analysis of the NCI cancer dataset that contains 21,225 genes.
The new method has better performance than several existing methods, particularly in the estimation of the loading vectors.
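For orientation, the unmodified NIPALS iteration that the abstract above builds on can be sketched as follows. This is plain NIPALS for the leading component only. It includes neither the random-effect penalty nor the singular-value shrinkage of the proposed method. The function name, the stopping rule, and the toy data are assumptions made for illustration.

```python
import numpy as np

def nipals_first_component(X, max_iter=500, tol=1e-10):
    """Plain NIPALS for the leading principal component (rows = samples)."""
    Xc = X - X.mean(axis=0)                 # centre each variable
    # Start from the column with the largest variance.
    t = Xc[:, np.argmax(Xc.var(axis=0))].copy()
    for _ in range(max_iter):
        p = Xc.T @ t / (t @ t)              # regress columns on scores -> loadings
        p /= np.linalg.norm(p)              # keep the loading vector at unit length
        t_new = Xc @ p                      # regress rows on loadings -> scores
        converged = np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new)
        t = t_new
        if converged:
            break
    return t, p

# Toy data (hypothetical): 50 samples, 8 variables with unequal spread.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8)) * np.array([5, 3, 1, 1, 1, 1, 1, 1])
scores, loadings = nipals_first_component(X)
```

Every entry of `loadings` is generally nonzero here, which is the interpretability problem that sparse variants aim to address.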
The L1 norm has been applied in numerous variations of principal component analysis (PCA). L1-norm PCA is an attractive alternative to traditional L2-based PCA because it can impart robustness in the presence of outliers and is indicated for models where standard Gaussian assumptions about the noise may not apply. Of all the previously-proposed PCA schemes that recast PCA as an optimization problem involving the L1 norm, none provide globally optimal solutions in polynomial time. This paper proposes an L1-norm PCA procedure based on the efficient calculation of the optimal solution of the L1-norm best-fit hyperplane problem. We present a procedure called L1-PCA* based on the application of this idea that fits data to subspaces of successively smaller dimension. The procedure is implemented and tested on a diverse problem suite. Our tests show that L1-PCA* is the indicated procedure in the presence of unbalanced outlier contamination.
principal component analysis; linear programming; L1 regression
We develop a new principal components analysis (PCA) type dimension reduction method for binary data. Unlike standard PCA, which is defined on the observed data, the proposed PCA is defined on the logit transform of the success probabilities of the binary observations. Sparsity is introduced to the principal component (PC) loading vectors for enhanced interpretability and more stable extraction of the principal components. Our sparse PCA is formulated as an optimization problem with a criterion function motivated by the penalized Bernoulli likelihood. A Majorization-Minimization algorithm is developed to solve the optimization problem efficiently. The effectiveness of the proposed sparse logistic PCA method is illustrated by application to a single nucleotide polymorphism data set and a simulation study.
Binary data; Dimension reduction; MM algorithm; LASSO; PCA; Regularization; Sparsity
Molecular and epidemiological evidence demonstrate that altered gene expression and single nucleotide polymorphisms in the apoptotic pathway are linked to many cancers. Yet, few studies emphasize the interaction of variant apoptotic genes and their joint modifying effects on prostate cancer (PCA) outcomes. An exhaustive assessment of all the possible two-, three- and four-way gene-gene interactions is computationally burdensome. This statistical conundrum stems from the prohibitive amount of data needed to account for multiple hypothesis testing.
To address this issue, we systematically prioritized and evaluated individual effects and complex interactions among 172 apoptotic SNPs in relation to PCA risk and aggressive disease (i.e., Gleason score ≥ 7 and tumor stages III/IV). Single and joint modifying effects on PCA outcomes among European-American men were analyzed using statistical epistasis networks coupled with multi-factor dimensionality reduction (SEN-guided MDR). The case-control study design included 1,175 incident PCA cases and 1,111 controls from the prostate, lung, colo-rectal, and ovarian (PLCO) cancer screening trial. Moreover, a subset analysis of PCA cases consisted of 688 aggressive and 488 non-aggressive PCA cases. SNP profiles were obtained using the NCI Cancer Genetic Markers of Susceptibility (CGEMS) data portal. Main effects were assessed using logistic regression (LR) models. Prior to modeling interactions, SEN was used to pre-process our genetic data. SEN used network science to reduce our analysis from > 36 million to < 13,000 SNP interactions. Interactions were visualized, evaluated, and validated using entropy-based MDR. All parametric and non-parametric models were adjusted for age, family history of PCA, and multiple hypothesis testing.
Following LR modeling, eleven and thirteen sequence variants were associated with PCA risk and aggressive disease, respectively. However, none of these markers remained significant after we adjusted for multiple comparisons. Nevertheless, we detected a modest synergistic interaction between AKT3 rs2125230-PRKCQ rs571715 and disease aggressiveness using SEN-guided MDR (p = 0.011).
In summary, entropy-based SEN-guided MDR facilitated the logical prioritization and evaluation of apoptotic SNPs in relation to aggressive PCA. The suggestive interaction between AKT3-PRKCQ and aggressive PCA requires further validation using independent observational studies.
Prostate cancer; Apoptosis; Single nucleotide polymorphisms; Gene-gene interactions; Multifactor dimensionality reduction (MDR); Statistical epistasis networks (SEN)
With the rapid development of new genetic measurement methods, several types of genetic alterations can be quantified in a high-throughput manner. While the initial focus has been on investigating each data set separately, there is an increasing interest in studying the correlation structure between two or more data sets. Multivariate methods based on Canonical Correlation Analysis (CCA) have been proposed for integrating paired genetic data sets. The high dimensionality of microarray data imposes computational difficulties, which have been addressed for instance by studying the covariance structure of the data, or by reducing the number of variables prior to applying the CCA. In this work, we propose a new method for analyzing high-dimensional paired genetic data sets, which mainly emphasizes the correlation structure and still permits efficient application to very large data sets. The method is implemented by translating a regularized CCA to its dual form, where the computational complexity depends mainly on the number of samples instead of the number of variables. The optimal regularization parameters are chosen by cross-validation. We apply the regularized dual CCA, as well as a classical CCA preceded by a dimension-reducing Principal Components Analysis (PCA), to a paired data set of gene expression changes and copy number alterations in leukemia.
Using the correlation-maximizing methods, regularized dual CCA and PCA+CCA, we show that without pre-selection of known disease-relevant genes, and without using information about clinical class membership, an exploratory analysis singles out two patient groups, corresponding to well-known leukemia subtypes. Furthermore, the variables showing the highest relevance to the extracted features agree with previous biological knowledge concerning copy number alterations and gene expression changes in these subtypes. Finally, the correlation-maximizing methods are shown to yield results which are more biologically interpretable than those resulting from a covariance-maximizing method, and provide different insight compared to when each variable set is studied separately using PCA.
We conclude that regularized dual CCA as well as PCA+CCA are useful methods for exploratory analysis of paired genetic data sets, and can be efficiently implemented also when the number of variables is very large.
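The idea of moving a regularized CCA to its dual form, so that the cost depends on the number of samples rather than the number of variables, can be sketched as below. This is a minimal illustration that assumes a ridge-type penalty and linear Gram matrices. The exact regularization used in the work above may differ. The function name, the fixed `reg` value (chosen by cross-validation in the abstract), and the simulated data are assumptions.

```python
import numpy as np

def dual_regularized_cca(X, Y, reg=1.0):
    """First canonical pair from n x n Gram matrices (rows = paired samples)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Kx, Ky = X @ X.T, Y @ Y.T               # Gram matrices: size depends on n only
    I = np.eye(n)
    # Dual eigenproblem: (Kx + reg I)^-1 Ky (Ky + reg I)^-1 Kx alpha = rho^2 alpha
    M = np.linalg.solve(Kx + reg * I, Ky) @ np.linalg.solve(Ky + reg * I, Kx)
    vals, vecs = np.linalg.eig(M)
    top = np.argmax(vals.real)
    rho = np.sqrt(max(vals.real[top], 0.0))  # regularized canonical correlation
    alpha = vecs[:, top].real
    beta = np.linalg.solve(Ky + reg * I, Kx @ alpha) / rho
    # Map the dual coefficients back to weights on the original variables.
    return rho, X.T @ alpha, Y.T @ beta

# Simulated paired data (hypothetical): 30 samples, 200 and 150 variables,
# both driven in part by one shared latent factor z.
rng = np.random.default_rng(0)
z = rng.normal(size=(30, 1))
X = z @ rng.normal(size=(1, 200)) + rng.normal(size=(30, 200))
Y = z @ rng.normal(size=(1, 150)) + rng.normal(size=(30, 150))
rho, wx, wy = dual_regularized_cca(X, Y, reg=1.0)
```

All linear systems here are 30 × 30, even though the two data sets have 200 and 150 variables.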
The highly dimensional data produced by functional genomic (FG) studies makes it difficult to visualize relationships between gene products and experimental conditions (i.e., assays). Although dimensionality reduction methods such as principal component analysis (PCA) have been very useful, their application to identify assay-specific signatures has been limited by the lack of appropriate methodologies. This article proposes a new and powerful PCA-based method for the identification of assay-specific gene signatures in FG studies.
The proposed method (PM) is unique for several reasons. First, it is the only one, to our knowledge, that uses gene contribution, a product of the loading and expression level, to obtain assay signatures. The PM develops and exploits two types of assay-specific contribution plots, which are new to the application of PCA in the FG area. The first type plots the assay-specific gene contribution against the given order of the genes and reveals variations in distribution between assay-specific gene signatures as well as outliers within assay groups indicating the degree of importance of the most dominant genes. The second type plots the contribution of each gene in ascending or descending order against a constantly increasing index. This type of plot reveals assay-specific gene signatures defined by the inflection points in the curve. In addition, sharp regions within the signature define the genes that contribute the most to the signature. We propose and use curvature as an appropriate metric to characterize these sharp regions, thus identifying the subset of genes contributing the most to the signature. Finally, the PM uses the full dataset to determine the final gene signature, thus eliminating the chance of gene exclusion by poor screening in earlier steps. The strengths of the PM are demonstrated using a simulation study, and two studies of real DNA microarray data – a study of classification of human tissue samples and a study of E. coli cultures with different medium formulations.
We have developed a PCA-based method that effectively identifies assay-specific signatures in ranked groups of genes from the full data set, using a simpler and more efficient procedure than current approaches. Although this work demonstrates the ability of the PM to identify assay-specific signatures in DNA microarray experiments, this approach could also be useful in areas such as proteomics and metabolomics.
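The gene contribution described above (loading times expression level) and the sorted contribution curve can be sketched as follows. This is a rough illustration and not the authors' implementation. The centring, the axis rescaling before computing curvature, the function names, and the simulated data are all assumptions.

```python
import numpy as np

def assay_contributions(X, component=0):
    """Per-assay, per-gene contribution to one PC (rows = assays, cols = genes)."""
    Xc = X - X.mean(axis=0)
    Vt = np.linalg.svd(Xc, full_matrices=False)[2]
    loading = Vt[component]
    return Xc * loading, loading            # contribution = expression x loading

def sharpest_point(sorted_contrib):
    """Index of maximum curvature on a sorted contribution curve."""
    n = len(sorted_contrib)
    x = np.linspace(0.0, 1.0, n)            # rescale both axes to [0, 1]
    span = sorted_contrib.max() - sorted_contrib.min()
    y = (sorted_contrib - sorted_contrib.min()) / (span if span > 0 else 1.0)
    dy = np.gradient(y, x)
    d2y = np.gradient(dy, x)
    kappa = np.abs(d2y) / (1.0 + dy ** 2) ** 1.5
    return int(np.argmax(kappa))

# Simulated data (hypothetical): 6 assays, 40 genes; the first three assays
# over-express the first five genes.
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 40))
X[:3, :5] += 4.0
contrib, loading = assay_contributions(X)
order = np.argsort(contrib[0])[::-1]        # descending order for assay 0
knee = sharpest_point(contrib[0, order])
```

Summing one assay's contributions over all genes gives that assay's score on the chosen component, so the contribution plot decomposes the score gene by gene.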
Several tools have been developed to explore and search Gene Ontology (GO) databases allowing efficient GO enrichment analysis and GO tree visualization. Nevertheless, identification of highly specific GO-terms in complex data sets is relatively complicated and the display of GO term assignments and GO enrichment analysis by simple tables or pie charts is not optimal. Valuable information such as the hierarchical position of a single GO term within the GO tree (topological ordering), or enrichment within a complex set of biological experiments is not displayed. Pie charts based on GO tree levels are, themselves, one-dimensional graphs, which cannot properly or efficiently represent the hierarchical specificity for the biological system being studied.
Here we present a new method, which we name PCA2GO, capable of GO analysis using complex multidimensional experimental settings. We employed principal component analysis (PCA) and developed a new score, which takes into account the relative frequency of certain GO terms and their specificity (hierarchical position) within the GO graph. We evaluated the correlation between our representation score R and a standard measure of enrichment, namely p-values, to convey the versatility of our approach relative to other methods and to point out differences between our method and commonly used enrichment analyses. Although p-values and the R score formally measure different quantities, they should be correlated, because the relative frequencies of GO term occurrences within a dataset are an indirect measure of the number of proteins related to each term. Therefore they are also related to enrichment. We showed that our score enables us to identify more specific GO terms, i.e., those positioned further down the GO graph, than other common tools used for this purpose. PCA2GO allows visualization and detection of multidimensional dependencies both within the acyclic graph (GO tree) and the experimental settings. Our method is intended for the analysis of several experimental sets, not for a single set as with standard enrichment tools. To demonstrate the usefulness of our approach, we performed a PCA2GO analysis of a fractionated cardiomyocyte protein dataset, which was identified by gel-enhanced liquid chromatography-mass spectrometry (GeLC-MS). The analysis enabled us to detect distinct groups of proteins, which accurately reflect properties of biochemical cell fractions.
We conclude that PCA2GO is an alternative efficient GO analysis tool with unique features for detection and visualization of multidimensional dependencies within the dataset under study. PCA2GO reveals strongly correlated GO terms within the experimental setting (in this case different fractions) by PCA group formation and improves detection of more specific GO terms within experiment dependent GO term groups than standard p value calculations.
Detecting structure in population genetics and case-control studies is important, as it exposes phenomena such as ecoclines, admixture and stratification. Principal Component Analysis (PCA) is a linear dimension-reduction technique commonly used for this purpose, but it struggles to reveal complex, nonlinear data patterns. In this paper we introduce non-centred Minimum Curvilinear Embedding (ncMCE), a nonlinear method to overcome this problem. Our analyses show that ncMCE can separate individuals into ethnic groups in cases in which PCA fails to reveal any clear structure. This increased discrimination power arises from ncMCE's ability to better capture the phylogenetic signal in the samples, whereas PCA better reflects their geographic relation. We also demonstrate how ncMCE can discover interesting patterns, even when the data has been poorly pre-processed. The juxtaposition of PCA and ncMCE visualisations provides a new standard of analysis with utility for discovering and validating significant linear/nonlinear complementary patterns in genetic data.
Principal-component analysis (PCA) has been used for decades to summarize human genetic variation across geographic regions and to infer population migration history. Reduction of spurious associations due to population structure is crucial for the success of disease association studies. Recently, PCA has also become a popular method for detecting population structure and correcting for population stratification in disease association studies. Inspired by manifold learning, we propose a novel method based on spectral graph theory. Regarding each study subject as a node with suitably defined weights for its edges to close neighbors, one can form a weighted graph. We suggest using the spectrum of the associated graph Laplacian operator, namely, Laplacian eigenfunctions, to infer population structures instead of principal components (PCs). For the genome-wide association data for the North American Rheumatoid Arthritis Consortium (NARAC) provided by Genetic Analysis Workshop 16, Laplacian eigenfunctions revealed more meaningful structures of the underlying population than PCA. The proposed method is connected to PCA, and it naturally includes PCA as a special case. Our simple method is computationally fast and is suitable for disease studies at the genome-wide scale.
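The graph construction described above can be sketched as follows. This is a simplified illustration and not the authors' procedure. It uses a dense Gaussian-kernel graph instead of restricting edges to close neighbors, an unnormalized Laplacian, and a median-distance bandwidth. All of these choices, the function name, and the simulated data are assumptions.

```python
import numpy as np

def laplacian_embedding(G, n_components=2):
    """Leading nontrivial Laplacian eigenvectors (rows of G = study subjects)."""
    n = G.shape[0]
    Gc = G - G.mean(axis=0)
    sq = (Gc ** 2).sum(axis=1)
    # Pairwise squared Euclidean distances between subjects.
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Gc @ Gc.T, 0.0)
    sigma2 = np.median(D2[np.triu_indices(n, k=1)])   # bandwidth heuristic
    W = np.exp(-D2 / sigma2)                # edge weights decay with distance
    np.fill_diagonal(W, 0.0)                # no self-loops
    L = np.diag(W.sum(axis=1)) - W          # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    # Skip the trivial constant eigenvector that belongs to eigenvalue 0.
    return vals[1:n_components + 1], vecs[:, 1:n_components + 1]

# Simulated data (hypothetical): two groups of 20 subjects, 5 features each.
rng = np.random.default_rng(2)
group_a = rng.normal(0.0, 1.0, size=(20, 5))
group_b = rng.normal(6.0, 1.0, size=(20, 5))
G = np.vstack([group_a, group_b])
eigvals, embedding = laplacian_embedding(G)
```

The first column of `embedding` plays the role that the first PC would play in a PCA-based structure analysis.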
Proteomics may help us better understand the changes in multiple proteins involved in the oncogenesis and progression of prostate cancer (PCa) and identify more diagnostic and prognostic biomarkers. The aim of this study was to screen for biomarkers of PCa by proteomic analysis using isobaric tags for relative and absolute quantification (iTRAQ).
Patients undergoing prostate biopsies were classified into 3 groups according to pathological results: benign prostate hyperplasia (BPH, n = 20), PCa (n = 20) and BPH with local prostatic intraepithelial neoplasm (PIN, n = 10). All specimens from these patients were then analyzed by iTRAQ and two-dimensional liquid chromatography-tandem mass spectrometry (2DLC-MS/MS). The Gene Ontology (GO) functions and the transcription regulation networks of the differentially expressed proteins were analyzed with MetaCore software. Western blotting and immunohistochemical staining were used to analyze the proteins of interest.
A total of 760 proteins were identified from 13787 distinct peptides, including two common proteins that enjoy clinical application: prostate specific antigen (PSA) and prostatic acid phosphatase (PAP). Proteins that were differentially expressed between the PCa and BPH groups were further analyzed. Compared with BPH, 20 proteins were significantly up-regulated (>1.5-fold) while 26 were significantly down-regulated (<0.66-fold) in PCa. In terms of the GO database, the differentially expressed proteins were divided into 3 categories: cellular component (CC), molecular function (MF) and biological process (BP). The top 5 transcription regulation networks of the differentially expressed proteins were initiated through activation of SP1, p53, YY1, androgen receptor (AR) and c-Myc. The overexpression of periostin in PCa was verified by western blotting and immunohistochemical staining.
Our study indicates that iTRAQ technology is a new strategy for global proteomic analysis of PCa tissues. The significant up-regulation of periostin in PCa compared to BPH suggests that periostin may be not only a promising biomarker for the prognosis of PCa but also a potential target for therapeutic intervention.
iTRAQ; Periostin; Proteomics; Prostate cancer