High-throughput genomic and proteomic technologies are widely used in cancer research to build better predictive models of diagnosis, prognosis and therapy, to identify and characterize key signalling networks and to find new targets for drug development. These technologies present investigators with the task of extracting meaningful statistical and biological information from high-dimensional data spaces, wherein each sample is defined by hundreds or thousands of measurements, usually concurrently obtained. The properties of high dimensionality are often poorly understood or overlooked in data modelling and analysis. From the perspective of translational science, this Review discusses the properties of high-dimensional data spaces that arise in genomic and proteomic studies and the challenges they can pose for data analysis and interpretation.
High-dimensional data, especially those emerging from genomics and proteomics studies, pose significant challenges to traditional classification algorithms, whose performance may deteriorate substantially because of the high dimensionality and the many noisy features in these data. To address these problems, pre-classification feature selection and aggregating algorithms have been proposed. However, most feature selection procedures either fail to consider potential interactions among the features or tend to overfit the data. Aggregating algorithms, e.g. the bagging predictor, the boosting algorithm, the random subspace method, and the Random Forests algorithm, are promising for handling high-dimensional data. However, little attention has been paid to the optimal assignment of weights to individual classifiers, and this has prevented these algorithms from achieving better classification accuracy. In this article, we formulate the weight assignment problem and propose a heuristic optimization solution.
We have applied the proposed weight assignment procedures to the random subspace method to develop a weighted random subspace method. Several public gene expression and mass spectrometry data sets from the Kent Ridge biomedical data repository have been analyzed with this method. We have found that it can achieve significant improvement over the common equal-weight assignment scheme.
Classification; Aggregating algorithm; Voting weight; Random subspace projection
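As a toy illustration of the weighted-voting idea in the abstract above, here is a minimal Python sketch of a weighted random subspace ensemble. The validation-accuracy-proportional weights are an illustrative stand-in for the heuristic weight optimization the authors propose, and all names and parameters are hypothetical:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

def weighted_random_subspace(X, y, n_learners=50, subspace_frac=0.1, seed=0):
    """Train base learners on random feature subspaces and weight their votes.

    The weights here are simply proportional to held-out accuracy; the paper
    instead proposes a heuristic optimization of these voting weights.
    """
    rng = np.random.default_rng(seed)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=seed)
    k = max(1, int(subspace_frac * X.shape[1]))
    learners, weights = [], []
    for _ in range(n_learners):
        feats = rng.choice(X.shape[1], size=k, replace=False)
        clf = DecisionTreeClassifier().fit(X_tr[:, feats], y_tr)
        learners.append((feats, clf))
        weights.append(clf.score(X_val[:, feats], y_val))  # held-out accuracy
    weights = np.asarray(weights) / np.sum(weights)

    def predict(X_new):
        classes = np.unique(y)
        votes = np.zeros((X_new.shape[0], len(classes)))
        for (feats, clf), w in zip(learners, weights):
            pred = clf.predict(X_new[:, feats])
            for j, c in enumerate(classes):
                votes[:, j] += w * (pred == c)   # weighted vote per class
        return classes[votes.argmax(axis=1)]
    return predict
```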
The analysis of complex proteomic and genomic profiles involves the identification of significant markers within a set of hundreds or even thousands of variables that represent a high-dimensional problem space. The occurrence of noise, redundancy or combinatorial interactions in the profile makes the selection of relevant variables harder.
Here we propose a method to select variables based on their estimated relevance to hidden patterns. Our method combines a weighted-kernel discriminant with an iterative stochastic probability estimation algorithm to discover the relevance distribution over the set of variables. We verified the ability of our method to select predefined relevant variables in synthetic proteome-like data and then assessed its performance on biological high-dimensional problems. Experiments were run on serum proteomic datasets of infectious diseases. The resulting variable subsets achieved classification accuracies of 99% on Human African Trypanosomiasis, 91% on Tuberculosis, and 91% on Malaria serum proteomic profiles with fewer than 20% of the variables selected. Our method scaled up to dimensionalities several orders of magnitude higher, as shown with gene expression microarray datasets on which we obtained classification accuracies close to 90% with fewer than 1% of the total number of variables.
Our method consistently found relevant variables attaining high classification accuracies across synthetic and biological datasets. Notably, it yielded very compact subsets compared to the original number of variables, which should simplify downstream biological experimentation.
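A loose sketch, under simplifying assumptions, of what iterative stochastic relevance estimation can look like: a k-nearest-neighbour scorer stands in for the weighted-kernel discriminant, and the multiplicative update rule is illustrative rather than the authors' algorithm.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def estimate_relevance(X, y, n_iter=200, subset_size=20, lr=0.1, seed=0):
    """Iteratively sharpen a probability distribution over variables.

    Stochastic sketch: sample variable subsets from the current distribution,
    score them by cross-validated accuracy, and shift probability mass toward
    variables that appear in well-scoring subsets.
    """
    rng = np.random.default_rng(seed)
    p = np.full(X.shape[1], 1.0 / X.shape[1])   # relevance distribution
    for _ in range(n_iter):
        subset = rng.choice(X.shape[1], size=subset_size, replace=False, p=p)
        score = cross_val_score(KNeighborsClassifier(), X[:, subset], y, cv=3).mean()
        update = np.zeros_like(p)
        update[subset] = score
        p = (1 - lr) * p + lr * update / update.sum()
    return p   # higher p[i] = variable i estimated as more relevant
```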
New high-throughput sequencing technologies have brought forth opportunities for unbiased analysis of thousands of rare genomic variants in genome-wide association studies of complex diseases. Because single rare variants with appreciable effect sizes are hard to detect at the population level, existing methods mostly aggregate the effects of multiple markers by collapsing the rare variants in genes (or genomic regions). We hypothesize that a higher level of aggregation can further improve association signal strength. Using the Genetic Analysis Workshop 17 simulated data, we test a two-step strategy that first applies a collapsing method in a gene-level analysis and then aggregates the gene-level test results by performing an enrichment analysis in gene sets. We find that the gene set approach, which combines signals across multiple genes, outperforms testing individual genes separately, and that the power of the gene set enrichment test is further improved by proper adjustment of the statistics to account for gene-wise differences.
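A minimal sketch of the two-step strategy, assuming a simple burden score for the gene-level collapsing step and Fisher's combination for the gene-set step; the workshop analysis used different collapsing and enrichment tests, so this only conveys the structure:

```python
import numpy as np
from scipy import stats

def gene_burden_pvalue(genotypes, phenotype):
    """Step 1: collapse rare variants within a gene into a per-sample burden
    score (count of rare alleles) and test case/control association with a
    simple two-sample t-test; real collapsing tests differ."""
    burden = genotypes.sum(axis=1)              # rare-allele count per sample
    case, ctrl = burden[phenotype == 1], burden[phenotype == 0]
    return stats.ttest_ind(case, ctrl, equal_var=False).pvalue

def gene_set_pvalue(gene_pvalues):
    """Step 2: aggregate gene-level results across a gene set, here with
    Fisher's combination; the paper's enrichment test with gene-wise
    adjustment is a different aggregator."""
    stat = -2.0 * np.log(np.asarray(gene_pvalues)).sum()
    return stats.chi2.sf(stat, df=2 * len(gene_pvalues))
```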
The ‘omics’ approaches – genomics, proteomics and metabolomics – are based on high-throughput, high-information-content analysis. Using these approaches, as opposed to targeting one or a few analytes, a holistic understanding of the composition of a sample can be obtained. These approaches have revolutionized sample-analysis and data-processing protocols. In metabolomic studies, hundreds of small molecules are simultaneously analyzed using analytical platforms (e.g., gas chromatography-mass spectrometry (GC-MS) or liquid chromatography coupled to tandem mass spectrometry (LC-MS2)). This philosophy of holistic analysis and the application of high-throughput, high-information-content analysis offer several advantages. In this article, we compare the conventional analytical approach of one or a few analytes per sample to the LC-MS2-based metabolomics-type approach in the context of pharmaceutical and environmental analysis.
Chromatography; Environmental; LC-MS2; Metabolomics; Pharmaceutical
Researchers in the field of bioinformatics often face the challenge of combining several ordered lists in a proper and efficient manner. Rank aggregation techniques offer a general and flexible framework that allows one to objectively perform the necessary aggregation. With the rapid growth of high-throughput genomic and proteomic studies, the potential utility of rank aggregation in the context of meta-analysis becomes even more apparent. One of the major strengths of rank-based aggregation is the ability to combine lists coming from different sources and platforms, for example different microarray chips, which may or may not be directly comparable otherwise.
The RankAggreg package provides two methods for combining the ordered lists: the Cross-Entropy method and the Genetic Algorithm. Two examples of rank aggregation using the package are given in the manuscript: one in the context of clustering based on gene expression, and the other one in the context of meta-analysis of prostate cancer microarray experiments.
The two examples described in the manuscript clearly show the utility of the RankAggreg package in the current bioinformatics context where ordered lists are routinely produced as a result of modern high-throughput technologies.
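For intuition, here is a baseline rank aggregation in Python using Borda scores. Note that RankAggreg itself searches for the list minimizing a distance criterion (e.g., Spearman footrule) via the Cross-Entropy method or a Genetic Algorithm rather than computing Borda counts; this is only a simple reference point, and the gene names are invented.

```python
from collections import defaultdict

def borda_aggregate(ranked_lists):
    """Combine several ordered lists by mean Borda score (higher = better)."""
    scores, counts = defaultdict(float), defaultdict(int)
    for lst in ranked_lists:
        n = len(lst)
        for rank, item in enumerate(lst):
            scores[item] += n - rank
            counts[item] += 1
    return sorted(scores, key=lambda it: scores[it] / counts[it], reverse=True)

# e.g., top-gene lists from two different microarray platforms:
print(borda_aggregate([["TP53", "MYC", "EGFR"], ["MYC", "TP53", "KRAS"]]))
```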
In class prediction problems using microarray data, gene selection is essential to improve the prediction accuracy and to identify potential marker genes for a disease. Among the numerous existing methods for gene selection, support vector machine-based recursive feature elimination (SVM-RFE) has become one of the leading methods and is widely used. The SVM-based approach performs gene selection using the weight vector of the hyperplane constructed from the samples on the margin. However, its performance can be easily affected by noise and outliers when applied to noisy, small-sample-size microarray data.
In this paper, we propose a recursive gene selection method using the discriminant vector of the maximum margin criterion (MMC), which is a variant of classical linear discriminant analysis (LDA). To overcome the computational drawback of classical LDA and the problem of high dimensionality, we present efficient and stable algorithms for MMC-based RFE (MMC-RFE). The MMC-RFE algorithms naturally extend to multi-class cases. The performance of MMC-RFE was extensively compared with that of SVM-RFE using nine cancer microarray datasets, including four multi-class datasets.
Our extensive comparison has demonstrated that for binary-class datasets MMC-RFE tends to show intermediate performance between hard-margin SVM-RFE and SVM-RFE with a properly chosen soft-margin parameter. Notably, MMC-RFE achieves significantly better performance with a smaller number of genes than SVM-RFE for multi-class datasets. The results suggest that MMC-RFE is less sensitive to noise and outliers due to the use of average margin, and thus may be useful for biomarker discovery from noisy data.
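A compact sketch of the generic SVM-RFE loop for the binary case, which is the baseline that MMC-RFE modifies by replacing the SVM weight vector with the MMC discriminant vector. The ranking criterion (squared weights) is standard; the elimination fraction is an illustrative choice.

```python
import numpy as np
from sklearn.svm import SVC

def svm_rfe(X, y, n_keep=50, drop_frac=0.1):
    """Recursive feature elimination with a linear SVM: repeatedly fit,
    rank features by the squared weights of the separating hyperplane,
    and discard the lowest-ranked fraction."""
    active = np.arange(X.shape[1])
    while len(active) > n_keep:
        w = SVC(kernel="linear", C=1.0).fit(X[:, active], y).coef_.ravel()
        n_drop = max(1, int(drop_frac * len(active)))
        order = np.argsort(w ** 2)             # smallest weights first
        active = np.delete(active, order[:n_drop])
    return active                              # indices of retained genes
```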
Metabolomic data analysis becomes increasingly challenging when dealing with clinical samples with diverse demographic and genetic backgrounds and various pathological conditions or treatments. Although many classification tools, such as projection to latent structures (PLS), support vector machine (SVM), linear discriminant analysis (LDA), and random forest (RF), have been successfully used in metabolomics, their performance, including strengths and limitations, in clinical data analysis has remained unclear to researchers because these tools have not been systematically evaluated. In this paper we comparatively evaluated the four classifiers, PLS, SVM, LDA, and RF, in the analysis of clinical metabolomic data derived from a gas chromatography-mass spectrometry platform applied to healthy subjects and patients diagnosed with colorectal cancer, performing cross-validation, R2/Q2 plots, receiver operating characteristic curves, variable reduction, and Pearson correlation. RF outperforms the other three classifiers on the given clinical data sets, highlighting its comparative advantages as a suitable classification and biomarker selection tool for clinical metabolomic data analysis.
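A minimal sketch of this kind of cross-validated comparison, using scikit-learn stand-ins; PLS-DA is omitted because scikit-learn has no direct PLS classifier, and `X`, `y` denote a metabolite matrix and class labels supplied by the user.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

def compare_classifiers(X, y, cv=5):
    """Cross-validated accuracy for several classifiers on the same data."""
    models = {
        "LDA": LinearDiscriminantAnalysis(),
        "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
        "RF": RandomForestClassifier(n_estimators=500, random_state=0),
    }
    return {name: cross_val_score(m, X, y, cv=cv).mean()
            for name, m in models.items()}
```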
Bioactivity profiling using high-throughput in vitro assays can reduce the cost and time required for toxicological screening of environmental chemicals and can also reduce the need for animal testing. Several public efforts are aimed at discovering patterns or classifiers in high-dimensional bioactivity space that predict tissue, organ or whole animal toxicological endpoints. Supervised machine learning is a powerful approach to discover combinatorial relationships in complex in vitro/in vivo datasets. We present a novel model to simulate complex chemical-toxicology data sets and use this model to evaluate the relative performance of different machine learning (ML) methods.
The classification performance of Artificial Neural Networks (ANN), K-Nearest Neighbors (KNN), Linear Discriminant Analysis (LDA), Naïve Bayes (NB), Recursive Partitioning and Regression Trees (RPART), and Support Vector Machines (SVM) in the presence and absence of filter-based feature selection was analyzed using K-way cross-validation testing and independent validation on simulated in vitro assay data sets with varying levels of model complexity, number of irrelevant features and measurement noise. While the prediction accuracy of all ML methods decreased as non-causal (irrelevant) features were added, some ML methods performed better than others. In the limit of using a large number of features, ANN and SVM were always in the top performing set of methods while RPART and KNN (k = 5) were always in the poorest performing set. The addition of measurement noise and irrelevant features decreased the classification accuracy of all ML methods, with LDA suffering the greatest performance degradation. LDA performance is especially sensitive to the use of feature selection. Filter-based feature selection generally improved performance, most strikingly for LDA.
We have developed a novel simulation model to evaluate machine learning methods for the analysis of data sets in which in vitro bioassay data is being used to predict in vivo chemical toxicology. From our analysis, we can recommend that several ML methods, most notably SVM and ANN, are good candidates for use in real world applications in this area.
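The effect of irrelevant features and of filter-based feature selection can be illustrated with a toy simulation. This is not the authors' simulation model; the generator and all parameters below are invented for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def simulate_assay_data(n=200, n_causal=10, n_irrelevant=490, noise=1.0, seed=0):
    """Toy stand-in: class labels depend on a few causal assay features;
    the remaining features are pure noise."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    causal = rng.normal(size=(n, n_causal)) + y[:, None]            # signal
    irrelevant = rng.normal(scale=noise, size=(n, n_irrelevant))    # noise
    return np.hstack([causal, irrelevant]), y

X, y = simulate_assay_data()
plain = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5).mean()
filtered = cross_val_score(
    make_pipeline(SelectKBest(f_classif, k=20), LinearDiscriminantAnalysis()),
    X, y, cv=5).mean()
print(f"LDA without filter: {plain:.2f}, with ANOVA filter: {filtered:.2f}")
```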
Mass spectrometry-based untargeted metabolomics often results in the observation of hundreds to thousands of features that are differentially regulated between sample classes. A major challenge in interpreting the data is distinguishing metabolites that are causally associated with the phenotype of interest from those that are unrelated but altered in downstream pathways as an effect. To facilitate this distinction, here we describe new software called metaXCMS for performing second-order (“meta”) analysis of untargeted metabolomics data from multiple sample groups representing different models of the same phenotype. While the current version of XCMS was designed for the direct comparison of two sample groups, metaXCMS enables meta-analysis of an unlimited number of sample classes to facilitate prioritization of the data and increase the probability of identifying metabolites causally related to the phenotype of interest. metaXCMS is used to import XCMS results that are subsequently filtered, realigned, and ultimately compared to identify shared metabolites that are up- or down-regulated across all sample groups. We demonstrate the software’s utility by identifying histamine as a metabolite that is commonly altered in three different models of pain. metaXCMS is freely available at http://metlin.scripps.edu/metaxcms/.
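A toy version of the final intersection step described above, assuming features have already been filtered and realigned; metaXCMS additionally matches features within m/z and retention-time tolerances before this comparison, and the feature IDs below are invented.

```python
def shared_direction(features_by_group):
    """Keep features regulated in the same direction in every sample group.

    features_by_group: list of dicts mapping feature ID -> 'up' or 'down'.
    """
    common = set.intersection(*(set(d) for d in features_by_group))
    return {f for f in common
            if len({d[f] for d in features_by_group}) == 1}

groups = [{"m233": "up", "m180": "down"}, {"m233": "up", "m180": "up"}]
print(shared_direction(groups))   # {'m233'}: up-regulated in all groups
```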
We consider the supervised classification setting, in which the data consist of p features measured on n observations, each of which belongs to one of K classes. Linear discriminant analysis (LDA) is a classical method for this problem. However, in the high-dimensional setting where p ≫ n, LDA is not appropriate for two reasons. First, the standard estimate for the within-class covariance matrix is singular, and so the usual discriminant rule cannot be applied. Second, when p is large, it is difficult to interpret the classification rule obtained from LDA, since it involves all p features. We propose penalized LDA, a general approach for penalizing the discriminant vectors in Fisher’s discriminant problem in a way that leads to greater interpretability. The discriminant problem is not convex, so we use a minorization-maximization approach to optimize it efficiently when convex penalties are applied to the discriminant vectors. In particular, we consider the use of L1 and fused lasso penalties. Our proposal is equivalent to recasting Fisher’s discriminant problem as a biconvex problem. We evaluate the performance of the resulting methods in a simulation study and on three gene expression data sets. We also survey past methods for extending LDA to the high-dimensional setting, and explore their relationships with our proposal.
classification; feature selection; high dimensional; lasso; linear discriminant analysis; supervised learning
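A crude two-class illustration of what a sparse discriminant vector looks like, assuming a diagonal within-class covariance and a single soft-thresholding step; the paper's penalized LDA instead solves the penalized Fisher problem by minorization-maximization, so this only conveys the flavor of L1-induced sparsity.

```python
import numpy as np

def sparse_discriminant_direction(X, y, penalty=0.1, reg=1e-3):
    """Diagonal-regularized Fisher direction followed by an L1 soft-threshold.

    Zero entries correspond to features excluded from the classification
    rule, which is the interpretability gain sought above.
    """
    mu1, mu0 = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    centered = np.concatenate([X[y == 1] - mu1, X[y == 0] - mu0])
    within_var = centered.var(axis=0)                 # diagonal covariance
    w = (mu1 - mu0) / (within_var + reg)              # diagonal LDA direction
    return np.sign(w) * np.maximum(np.abs(w) - penalty, 0.0)
```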
Metabolomics is an emerging field that is based on the quantitative measurement of as many small organic molecules occurring in a biological sample as possible. Due to recent technical advances, metabolomics can now be used widely as an analytical high-throughput technology in drug testing and in epidemiological metabolome- and genome-wide association studies. Analogous to chip-based gene expression analyses, the enormous amount of data produced by modern kit-based metabolomics experiments poses new challenges regarding their biological interpretation in the context of various sample phenotypes. We developed metaP-server to facilitate data interpretation. metaP-server provides automated and standardized data analysis for quantitative metabolomics data, covering the following steps from data acquisition to biological interpretation: (i) data quality checks, (ii) estimation of reproducibility and batch effects, (iii) hypothesis tests for multiple categorical phenotypes, (iv) correlation tests for metric phenotypes, (v) optionally including all possible pairs of metabolite concentration ratios, (vi) principal component analysis (PCA), and (vii) mapping of metabolites onto colored KEGG pathway maps. Graphical output is clickable and cross-linked to sample and metabolite identifiers. Interactive coloring of PCA and bar plots by phenotype facilitates on-line data exploration. For users of commercial metabolomics kits, cross-references to the HMDB, LipidMaps, KEGG, PubChem, and CAS databases are provided. metaP-server is freely accessible at http://metabolomics.helmholtz-muenchen.de/metap2/.
Motivation: The availability of modern sequencing techniques has led to a rapid increase in the number of reconstructed metabolic networks. Using these models as a platform for the analysis of high-throughput transcriptomic, proteomic and metabolomic data can provide valuable insight into conditional changes in the metabolic activity of an organism. While transcriptomics and proteomics provide important insights into the hierarchical regulation of metabolic flux, metabolomics sheds light on the actual enzyme activity through metabolic regulation and mass-action effects. Here we introduce a new method, termed integrative omics-metabolic analysis (IOMA), that quantitatively integrates proteomic and metabolomic data with genome-scale metabolic models to more accurately predict metabolic flux distributions. The method is formulated as a quadratic programming (QP) problem that seeks a steady-state flux distribution in which the flux through reactions with measured proteomic and metabolomic data is as consistent as possible with kinetically derived flux estimations.
Results: IOMA is shown to successfully predict the metabolic state of human erythrocytes (compared to kinetic model simulations), showing a significant advantage over the commonly used methods flux balance analysis and minimization of metabolic adjustment. Thereafter, IOMA is shown to correctly predict metabolic fluxes in Escherichia coli under different gene knockouts for which both metabolomic and proteomic data are available, achieving higher prediction accuracy than the extant methods. Considering the lack of high-throughput flux measurements, while high-throughput metabolomic and proteomic data are becoming readily available, we expect IOMA to significantly contribute to future research of cellular metabolism.
Contacts: firstname.lastname@example.org; email@example.com
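A toy sketch of the underlying idea, fitting a steady-state flux vector that stays close to kinetically derived estimates for measured reactions. Here the steady-state constraint S v = 0 is handled as a heavily weighted least-squares term via SciPy rather than as a genome-scale QP, and the small network below is invented.

```python
import numpy as np
from scipy.optimize import lsq_linear

def fit_fluxes(S, v_est, measured, bounds=(-10.0, 10.0)):
    """Find fluxes v with S v ~= 0 that match kinetic estimates v_est
    for the reactions listed in `measured` (column indices of S)."""
    n = S.shape[1]
    W = np.zeros((len(measured), n))
    for row, j in enumerate(measured):
        W[row, j] = 1.0
    A = np.vstack([1e4 * S, W])                  # soft S v = 0, then data fit
    b = np.concatenate([np.zeros(S.shape[0]), v_est])
    return lsq_linear(A, b, bounds=bounds).x

# toy chain of 3 metabolites and 4 reactions; reaction 2 has a kinetic estimate
S = np.array([[1, -1, 0, 0], [0, 1, -1, 0], [0, 0, 1, -1]], dtype=float)
v = fit_fluxes(S, v_est=np.array([2.0]), measured=[2])
print(np.round(v, 3))   # near-equal fluxes close to 2 through the chain
```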
Metabolomics is an ‘omics’ approach that aims to analyze all metabolites in a biological sample comprehensively. The detailed metabolite profiling of thousands of plant samples has great potential for directly elucidating plant metabolic processes. However, both a comprehensive analysis and a high throughput are difficult to achieve at the same time due to the wide diversity of metabolites in plants. Here, we have established a novel and practical metabolomics methodology for quantifying hundreds of targeted metabolites in a high-throughput manner. Multiple reaction monitoring (MRM) using tandem quadrupole mass spectrometry (TQMS), which monitors both the specific precursor ions and product ions of each metabolite, is a standard technique in targeted metabolomics, as it enables high sensitivity, reproducibility and a broad dynamic range. In this study, we optimized the MRM conditions for specific compounds by performing automated flow injection analyses with TQMS. Based on a total of 61,920 spectra for 860 authentic compounds, the MRM conditions of 497 compounds were successfully optimized. These were applied to high-throughput automated analysis of biological samples using TQMS coupled with ultra performance liquid chromatography (UPLC). By this analysis, approximately 100 metabolites were quantified in each of 14 plant accessions from Brassicaceae, Gramineae and Fabaceae. A hierarchical cluster analysis based on the metabolite accumulation patterns clearly showed differences among the plant families, and family-specific metabolites could be predicted using a batch-learning self-organizing map analysis. Thus, the automated widely targeted metabolomics approach established here should pave the way for large-scale metabolite profiling and comparative metabolomics.
Automated; Comparative metabolomics; Liquid chromatography–mass spectrometry; Metabolite accumulation pattern; Multiple reaction monitoring; Targeted metabolomics
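The hierarchical cluster analysis of accumulation patterns mentioned above is easy to sketch; the data below are random placeholders standing in for the roughly 100 metabolites quantified across 14 plant accessions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
profiles = rng.normal(size=(14, 100))    # 14 accessions x ~100 metabolites
Z = linkage(pdist(profiles, metric="correlation"), method="average")
print(fcluster(Z, t=3, criterion="maxclust"))  # e.g., 3 family-like groups
```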
We are focusing specifically on fast data analysis and retrieval in bioinformatics that will have a direct impact on the quality of human health and the environment. The exponential growth of data generated in biology research, from small atoms to big ecosystems, necessitates an increasingly large computational component to perform analyses. Novel DNA sequencing technologies and complementary high-throughput approaches, such as proteomics, genomics, metabolomics, and metagenomics, drive data-intensive bioinformatics. While individual research centers or universities could once provide for these applications, this is no longer the case. Today, only specialized national centers can deliver the level of computing resources required to meet the challenges posed by rapid data growth and the resulting computational demand. Consequently, we are developing massively parallel applications to analyze the growing flood of biological data and contribute to the rapid discovery of novel knowledge.
The efforts of previous National Science Foundation (NSF) projects provided for the generation of parallel modules for widely used bioinformatics applications on the Kraken supercomputer. We have profiled and optimized the code of some of the scientific community's most widely used desktop and small-cluster-based applications, including BLAST from the National Center for Biotechnology Information (NCBI), HMMER, and MUSCLE; scaled them to tens of thousands of cores on high-performance computing (HPC) architectures; made them robust and portable to next-generation architectures; and incorporated these parallel applications in science gateways with a web-based portal.
This paper will discuss the various developmental stages, challenges, and solutions involved in taking bioinformatics applications from the desktop to petascale with a front-end portal for very-large-scale data analysis in the life sciences.
This research will help to bridge the gap between the rate of data generation and the speed at which scientists can study this data. The ability to rapidly analyze data at such a large scale is having a significant, direct impact on science achieved by collaborators who are currently using these tools on supercomputers.
Motivation: With the availability of many ‘omics’ data, such as transcriptomics, proteomics or metabolomics, the integrative or joint analysis of multiple datasets from different technology platforms is becoming crucial to unravel the relationships between different biological functional levels. However, the development of such an analysis is a major computational and technical challenge as most approaches suffer from high data dimensionality. New methodologies need to be developed and validated.
Results: integrOmics efficiently performs integrative analyses of two types of ‘omics’ variables that are measured on the same samples. It includes a regularized version of canonical correlation analysis to highlight correlations between the two datasets, and a sparse version of partial least squares (PLS) regression that performs simultaneous variable selection in both datasets. The usefulness of both approaches has been demonstrated previously and successfully applied in various integrative studies.
Availability: integrOmics is freely available from http://CRAN.R-project.org/ or from the web site companion (http://math.univ-toulouse.fr/biostat) that provides full documentation and tutorials.
Supplementary information: Supplementary data are available at Bioinformatics online.
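integrOmics is an R package; as a rough Python analogue of integrating two 'omics' blocks measured on the same samples, here is an unregularized CCA on synthetic data. The regularized CCA and sparse PLS described above add the regularization and variable selection needed for p ≫ n data, which plain CCA lacks.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n = 50
latent = rng.normal(size=(n, 2))                  # shared biological signal
X = latent @ rng.normal(size=(2, 30)) + 0.5 * rng.normal(size=(n, 30))  # e.g. transcripts
Y = latent @ rng.normal(size=(2, 20)) + 0.5 * rng.normal(size=(n, 20))  # e.g. metabolites

cca = CCA(n_components=2).fit(X, Y)
Xc, Yc = cca.transform(X, Y)
for k in range(2):
    r = np.corrcoef(Xc[:, k], Yc[:, k])[0, 1]
    print(f"canonical correlation {k + 1}: {r:.2f}")
```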
This paper presents a novel classification via aggregated regression algorithm – dubbed CAVIAR – and its application to the OASIS MRI brain image database. The CAVIAR algorithm simultaneously combines a set of weak learners based on the assumption that the weight combination for the final strong hypothesis depends on both the weak learners and the training data. A regularization scheme using the nearest neighbor method is imposed in the testing stage to avoid overfitting. A closed-form solution to the cost function is derived for this algorithm. We use a novel feature – the histogram of the deformation field between the MRI brain scan and the atlas, which captures the structural changes in the scan with respect to the atlas brain – and this allows us to automatically discriminate between various classes within OASIS using CAVIAR. We empirically show that CAVIAR significantly increases the performance of the weak classifiers by showcasing the performance of our technique on OASIS.
aggregated regression; classifier ensemble; OASIS; dementia
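A generic illustration of solving for aggregation weights in closed form: a ridge-regularized least-squares combiner over weak-learner outputs. This is in the spirit of CAVIAR's closed-form weight solution, not the paper's exact cost function, and all names are hypothetical.

```python
import numpy as np

def aggregate_weights(F, y, lam=1.0):
    """Closed-form ridge solution for combining weak-learner outputs.

    F: (n_samples, n_learners) matrix of weak-learner predictions on the
    training data; y: targets. The strong hypothesis on new data is then
    sign(F_new @ w) for binary labels in {-1, +1}.
    """
    n_learners = F.shape[1]
    return np.linalg.solve(F.T @ F + lam * np.eye(n_learners), F.T @ y)
```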
Protein aggregation is the phenomenon of protein self-association potentially leading to detrimental effects on physiology, which is closely related to numerous human diseases such as Alzheimer's and Parkinson's disease. Despite progress in understanding the mechanism of protein aggregation, how natural selection against protein aggregation acts on subunits of protein complexes and on proteins with different contributions to organism fitness remains largely unknown. Here, we perform a proteome-wide analysis by using an experimentally validated algorithm TANGO and utilizing sequence, interactomic and phenotype-based functional genomic data from yeast, fly, and nematode. We find that proteins that are capable of forming homooligomeric complex have lower aggregation propensity compared with proteins that do not function as homooligomer. Further, proteins that are essential to the fitness of an organism have lower aggregation propensity compared with nonessential ones. Our finding suggests that the selection force against protein aggregation acts across different hierarchies of biological system.
natural selection; protein aggregation; functional genomics; proteome; Drosophila melanogaster; Caenorhabditis elegans
Linear discriminant analysis (LDA) is one of the most popular classification algorithms for brain-computer interfaces (BCI). LDA assumes that the data are Gaussian with equal covariance matrices for the classes concerned; however, this assumption usually does not hold in actual BCI applications, where heteroscedastic class distributions are commonly observed. This paper proposes an enhanced version of LDA, namely z-score linear discriminant analysis (Z-LDA), which introduces a new decision boundary definition strategy to handle heteroscedastic class distributions. Z-LDA defines the decision boundary through a z-score that uses both the mean and the standard deviation of the projected data, which allows the boundary to adapt to heteroscedastic distributions. Results from both a simulation dataset and two actual BCI datasets consistently show that Z-LDA achieves significantly higher average classification accuracies than conventional LDA, indicating the superiority of the newly proposed decision boundary definition strategy.
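A minimal two-class sketch of the z-score decision rule described above, built on an ordinary LDA projection; this follows the stated idea (per-class mean and standard deviation of the projected data) rather than the paper's exact implementation.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

class ZLDA:
    """Project with LDA, then assign the class whose projected distribution
    (mean, standard deviation) yields the smaller absolute z-score."""

    def fit(self, X, y):
        self.lda = LinearDiscriminantAnalysis().fit(X, y)
        z = self.lda.transform(X).ravel()       # 1-D projection (2 classes)
        self.classes_ = np.unique(y)
        self.mu = np.array([z[y == c].mean() for c in self.classes_])
        self.sd = np.array([z[y == c].std() for c in self.classes_])
        return self

    def predict(self, X):
        z = self.lda.transform(X).ravel()
        zscores = np.abs((z[:, None] - self.mu) / self.sd)  # n_samples x 2
        return self.classes_[zscores.argmin(axis=1)]
```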
With the availability of large-scale genome-wide association study (GWAS) data, choosing an optimal set of SNPs for disease susceptibility prediction is a challenging task. This study aimed to use single nucleotide polymorphisms (SNPs) to predict psoriasis by searching GWAS data.
In total we had 2,798 samples and 451,724 SNPs. The search for a set of SNPs to predict susceptibility to psoriasis consisted of two steps. The first was to identify the top 1,000 SNPs with high accuracy for prediction of psoriasis from the GWAS dataset. The second was to search for an optimal SNP subset for predicting psoriasis. The sequential information bottleneck (sIB) method was compared with classical linear discriminant analysis (LDA) for classification performance.
The best test harmonic mean of sensitivity and specificity for predicting psoriasis by sIB was 0.674 (95% CI: 0.650-0.698), while only 0.520 (95% CI: 0.472-0.524) was achieved by LDA. Our results indicate that the new classifier sIB performs better than LDA in this study.
The fact that a small set of SNPs can predict disease status with an average accuracy of 68% makes it possible to use SNP data for psoriasis prediction.
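The accuracy summary used above, the harmonic mean of sensitivity and specificity, is easy to state as a worked formula; the counts in the example are invented.

```python
def harmonic_mean_sens_spec(tp, fn, tn, fp):
    """Harmonic mean of sensitivity and specificity: 2*Se*Sp / (Se + Sp)."""
    sens = tp / (tp + fn)    # true positive rate
    spec = tn / (tn + fp)    # true negative rate
    return 2 * sens * spec / (sens + spec)

# e.g., 70% sensitivity and 65% specificity:
print(round(harmonic_mean_sens_spec(70, 30, 65, 35), 3))   # 0.674
```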
Many studies based on gene expression data have been reported in great detail; however, one major challenge for methodologists is the choice of classification method. The main purpose of this research was to compare the performance of linear discriminant analysis (LDA) and its modifications for the classification of cancer based on gene expression data.
The classification performance of LDA and its modifications was evaluated by applying these methods to six public cancer gene expression datasets. The methods were LDA, prediction analysis for microarrays (PAM), shrinkage centroid regularized discriminant analysis (SCRDA), shrinkage linear discriminant analysis (SLDA), and shrinkage diagonal discriminant analysis (SDDA). The procedures were performed with R (version 2.8.0).
PAM picked out fewer feature genes than the other methods on most datasets, the Brain dataset excepted. Of the two shrinkage discriminant analysis methods, SLDA selected more genes than SDDA on most datasets, the 2-class lung cancer dataset excepted. Comparing SLDA with SCRDA, SLDA selected more genes than SCRDA on the 2-class lung cancer, SRBCT, and Brain datasets; the opposite held for the remaining datasets. The average test error of the LDA modifications was lower than that of LDA.
The classification performance of the LDA modifications was superior to that of traditional LDA with respect to average error, and there was no significant difference between these modifications.
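PAM's nearest shrunken centroid rule has a close scikit-learn analogue, which illustrates why it tends to select few genes: shrinking class centroids toward the overall mean zeroes out uninformative features. (SLDA, SDDA, and SCRDA instead shrink the discriminant or covariance estimates; they are not shown.) The toy data and threshold below are invented.

```python
import numpy as np
from sklearn.neighbors import NearestCentroid
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100)                # toy labels
X = rng.normal(size=(100, 1000))                # toy "expression" matrix
X[:, :10] += y[:, None]                         # 10 informative genes

clf = NearestCentroid(shrink_threshold=0.5)     # larger threshold = fewer genes used
print(cross_val_score(clf, X, y, cv=5).mean())
```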
It has been unclear whether optimal experimental design accounts of data selection may offer insight into evidence acquisition tasks in which the learner's beliefs change greatly during the course of learning. Data from Rehder and Hoffman's eye movement version of Shepard, Hovland and Jenkins' classic concept learning task provide an opportunity to address these issues. We introduce a principled probabilistic concept-learning model that describes the development of subjects' beliefs on that task. We use that learning model, together with a sampling function inspired by the theory of optimal experimental design, to predict subjects' eye movements on the active learning version of that task. Results show that the same rational sampling function can predict eye movements early in learning, when uncertainty is high, as well as late in learning when the learner is certain of the true category.
Saccadic eye movement; Information theory; Bayesian reasoning; Concept formation; Optimal experimental design; Intuitive statistics
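The core optimal-experimental-design quantity behind such sampling functions, expected information gain, has a compact worked form; the two-hypothesis example below is invented and the paper's actual sampling function is only inspired by this idea.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def expected_information_gain(prior, likelihoods):
    """Expected entropy reduction over hypotheses from observing one binary
    feature: prior over hypotheses, likelihoods[h] = P(feature = 1 | h)."""
    p1 = (prior * likelihoods).sum()             # P(feature = 1)
    post1 = prior * likelihoods / p1             # posterior if feature = 1
    post0 = prior * (1 - likelihoods) / (1 - p1) # posterior if feature = 0
    return entropy(prior) - (p1 * entropy(post1) + (1 - p1) * entropy(post0))

# two equally likely concepts that mostly disagree about the feature:
print(expected_information_gain(np.array([0.5, 0.5]), np.array([0.9, 0.1])))
```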
Genomic and other high dimensional analyses often require one to summarize multiple related variables by a single representative. This task is also variously referred to as collapsing, combining, reducing, or aggregating variables. Examples include summarizing several probe measurements corresponding to a single gene, representing the expression profiles of a co-expression module by a single expression profile, and aggregating cell-type marker information to de-convolute expression data. Several standard statistical summary techniques can be used, but network methods also provide useful alternatives for finding representatives. Currently, few collapsing functions have been developed and widely applied.
We introduce the R function collapseRows that implements several collapsing methods and evaluate its performance in three applications. First, we study a crucial step of the meta-analysis of microarray data: the merging of independent gene expression data sets, which may have been measured on different platforms. Toward this end, we collapse multiple microarray probes for a single gene and then merge the data by gene identifier. We find that choosing the probe with the highest average expression leads to best between-study consistency. Second, we study methods for summarizing the gene expression profiles of a co-expression module. Several gene co-expression network analysis applications show that the optimal collapsing strategy depends on the analysis goal. Third, we study aggregating the information of cell type marker genes when the aim is to predict the abundance of cell types in a tissue sample based on gene expression data ("expression deconvolution"). We apply different collapsing methods to predict cell type abundances in peripheral human blood and in mixtures of blood cell lines. Interestingly, the most accurate prediction method involves choosing the most highly connected "hub" marker gene. Finally, to facilitate biological interpretation of collapsed gene lists, we introduce the function userListEnrichment, which assesses the enrichment of gene lists for known brain and blood cell type markers, and for other published biological pathways.
The R function collapseRows implements several standard and network-based collapsing methods. In various genomic applications we provide evidence that both types of methods are robust and biologically relevant tools.
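collapseRows is an R function; as a pandas sketch with hypothetical inputs, here is the "highest average expression" collapsing strategy that the evaluation above found best for between-study consistency. `expr` is assumed to have probes as rows and samples as columns, and `probe_to_gene` maps probe IDs to gene symbols.

```python
import pandas as pd

def collapse_max_mean(expr: pd.DataFrame, probe_to_gene: pd.Series) -> pd.DataFrame:
    """For each gene, keep the probe with the highest average expression."""
    mean_expr = expr.mean(axis=1)
    best = mean_expr.groupby(probe_to_gene).idxmax()  # best probe per gene
    out = expr.loc[best]
    out.index = best.index                            # re-index by gene symbol
    return out
```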
Metabolomics, or metabonomics, refers to the quantitative analysis of all metabolites present within a biological sample and is generally carried out using NMR spectroscopy or Mass Spectrometry. Such analysis produces a set of peaks, or features, indicative of the metabolic composition of the sample and may be used as a basis for sample classification. Feature selection may be employed to improve classification accuracy or aid model explanation by establishing a subset of class discriminating features. Factors such as experimental noise, choice of technique and threshold selection may adversely affect the set of selected features retrieved. Furthermore, the high dimensionality and multi-collinearity inherent within metabolomics data may exacerbate discrepancies between the set of features retrieved and those required to provide a complete explanation of metabolite signatures. Given these issues, the latter in particular, we present the MetaFIND application for 'post-feature selection' correlation analysis of metabolomics data.
In our evaluation we show how MetaFIND may be used to elucidate metabolite signatures from the set of features selected by diverse techniques over two metabolomics datasets. Importantly, we also show how MetaFIND may augment standard feature selection and aid the discovery of additional significant features, including those which represent novel class discriminating metabolites. MetaFIND also supports the discovery of higher level metabolite correlations.
Standard feature selection techniques may fail to capture the full set of relevant features in the case of high dimensional, multi-collinear metabolomics data. We show that the MetaFIND 'post-feature selection' analysis tool may aid metabolite signature elucidation, feature discovery and inference of metabolic correlations.
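A small sketch of a 'post-feature selection' correlation sweep in the spirit of MetaFIND (not its actual implementation): flag unselected features that are strongly correlated with selected ones, since multi-collinear companions may belong to the same metabolite signature even though the selector returned only one of them.

```python
import numpy as np

def correlated_companions(X, selected, threshold=0.8):
    """For each selected feature (column index of X), list unselected
    features whose absolute Pearson correlation exceeds the threshold.
    Computes the full feature-feature correlation matrix, so it suits
    moderate dimensionalities."""
    corr = np.corrcoef(X, rowvar=False)
    companions = {}
    for j in selected:
        r = np.abs(corr[j])
        companions[j] = [k for k in np.where(r >= threshold)[0]
                         if k != j and k not in selected]
    return companions
```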
The study of the human DNA methylome has gained particular interest in the last few years. Researchers can nowadays investigate the potential role of DNA methylation in common disorders by taking advantage of new high-throughput technologies. Among these, Illumina Infinium assays can interrogate the methylation levels of hundreds of thousands of CpG sites, offering an ideal solution for genome-wide methylation profiling. However, like for other high-throughput technologies, the main bottleneck remains at the stage of data analysis rather than data production.
We have developed HumMeth27QCReport, an R package for researchers who want to quickly analyse their Illumina Infinium methylation arrays. This package automates quality control steps by generating a report including sample-independent and sample-dependent quality plots, and performs primary analysis of raw methylation calls by computing data normalization, statistics, and sample similarities. The package is available in the CRAN repository and can be integrated into any Galaxy instance through ad-hoc scripts accessible from the Galaxy Tool Shed.
Our package provides users of the Illumina Infinium Methylation assays with a simplified, automated, open-source quality control and primary analysis of their methylation data. Moreover, to enhance its use by experimental researchers, the tool is distributed along with the scripts necessary for its implementation in the Galaxy workbench. Finally, although it was originally developed for the HumanMethylation27 assay, we have verified its compatibility with data generated with the HumanMethylation450 BeadChip.