In most analyses of large-scale genomic data sets, differential expression analysis is typically assessed by testing for differences in the mean of the distributions between 2 groups. A recent finding by Tomlins and others (2005) is of a different type of pattern of differential expression in which a fraction of samples in one group have overexpression relative to samples in the other group. In this work, we describe a general mixture model framework for the assessment of this type of expression, called outlier profile analysis. We start by considering the single-gene situation and establishing results on identifiability. We propose 2 nonparametric estimation procedures that have natural links to familiar multiple testing procedures. We then develop multivariate extensions of this methodology to handle genome-wide measurements. The proposed methodologies are compared using simulation studies as well as data from a prostate cancer gene expression study.
Bonferroni correction; DNA microarray; False discovery rate; Goodness of fit; Multiple comparisons; Uniform distribution
An active area in cancer biomarker research is the development of statistical methods to identify expression signatures reflecting the heterogeneity of cancer across affected individuals. Tomlins et al.  observed heterogeneous patterns of oncogene activation within several cancer types, and introduced a statistical method called Cancer Outlier Profile Analysis (COPA) to identify “cancer outlier genes”. Several related statistical approaches have since been developed, but the operating characteristics of these procedures (e.g. power, false positive rate), have not yet been fully characterized, especially in a proteomics setting. Here, we use simulation to identify the degree to which an outlier pattern of differential expression must hold in order for outlier-based approaches to be more effective than mean-based approaches. We also propose a diagnostic procedure that characterizes the potentially unequal levels of differential expression in the tails and in the center of a distribution of expression values. We find that for sample sizes and effect sizes typical of proteomics studies, the outlier pattern must be strong in order for outlier-based analysis to provide a meaningful benefit. This is corroborated by analysis of proteomics data from a melanoma study, in which the differential expression is most often present throughout the distribution, rather than being concentrated in the tails, albeit with a few proteins showing expression patterns consistent with outlier expression.
Numerous nonparametric approaches have been proposed in literature to detect differential gene expression in the setting of two user-defined groups. However, there is a lack of nonparametric procedures to analyze microarray data with multiple factors attributing to the gene expression. Furthermore, incorporating interaction effects in the analysis of microarray data has long been of great interest to biological scientists, little of which has been investigated in the nonparametric framework.
In this paper, we propose a set of nonparametric tests to detect treatment effects, clinical covariate effects, and interaction effects for multifactorial microarray data. When the distribution of expression data is skewed or heavy-tailed, the rank tests are substantially more powerful than the competing parametric F tests. On the other hand, in the case of light or medium-tailed distributions, the rank tests appear to be marginally less powerful than the parametric competitors.
The proposed rank tests enable us to detect differential gene expression and establish interaction effects for microarray data with various non-normally distributed expression measurements across genome. In the presence of outliers, they are advantageous alternative approaches to the existing parametric F tests due to the robustness feature.
Heterogeneously and differentially expressed genes (hDEG) are a common phenomenon due to bio-logical diversity. A hDEG is often observed in gene expression experiments (with two experimental conditions) where it is highly expressed in a few experimental samples, or in drug trial experiments for cancer studies with drug resistance heterogeneity among the disease group. These highly expressed samples are called outliers. Accurate detection of outliers among hDEGs is then desirable for dis- ease diagnosis and effective drug design. The standard approach for detecting hDEGs is to choose the appropriate subset of outliers to represent the experimental group. However, existing methods typically overlook hDEGs with very few outliers.
We present in this paper a simple algorithm for detecting hDEGs by sequentially testing for potential outliers with respect to a tight cluster of non- outliers, among an ordered subset of the experimental samples. This avoids making any restrictive assumptions about how the outliers are distributed. We use simulated and real data to illustrate that the proposed algorithm achieves a good separation between the tight cluster of low expressions and the outliers for hDEGs.
The proposed algorithm assesses each potential outlier in relation to the cluster of potential outliers without making explicit assumptions about the outlier distribution. Simulated examples and and breast cancer data sets are used to illustrate the suitability of the proposed algorithm for identifying hDEGs with small numbers of outliers.
Cancer; Outlier; Differentially expressed genes; Microarray
Post-genomic molecular biology has resulted in an explosion of data, providing measurements for large numbers of genes, proteins and metabolites. Time series experiments have become increasingly common, necessitating the development of novel analysis tools that capture the resulting data structure. Outlier measurements at one or more time points present a significant challenge, while potentially valuable replicate information is often ignored by existing techniques.
We present a generative model-based Bayesian hierarchical clustering algorithm for microarray time series that employs Gaussian process regression to capture the structure of the data. By using a mixture model likelihood, our method permits a small proportion of the data to be modelled as outlier measurements, and adopts an empirical Bayes approach which uses replicate observations to inform a prior distribution of the noise variance. The method automatically learns the optimum number of clusters and can incorporate non-uniformly sampled time points. Using a wide variety of experimental data sets, we show that our algorithm consistently yields higher quality and more biologically meaningful clusters than current state-of-the-art methodologies. We highlight the importance of modelling outlier values by demonstrating that noisy genes can be grouped with other genes of similar biological function. We demonstrate the importance of including replicate information, which we find enables the discrimination of additional distinct expression profiles.
By incorporating outlier measurements and replicate values, this clustering algorithm for time series microarray data provides a step towards a better treatment of the noise inherent in measurements from high-throughput genomic technologies. Timeseries BHC is available as part of the R package 'BHC' (version 1.5), which is available for download from Bioconductor (version 2.9 and above) via http://www.bioconductor.org/packages/release/bioc/html/BHC.html?pagewanted=all.
Cancer outlier profile analysis (COPA) has proven to be an effective approach to analyzing cancer expression data, leading to the discovery of the TMPRSS2 and ETS family gene fusion events in prostate cancer. However, the original COPA algorithm did not identify down-regulated outliers, and the currently available R package implementing the method is similarly restricted to the analysis of over-expressed outliers. Here we present a modified outlier detection method, mCOPA, which contains refinements to the outlier-detection algorithm, identifies both over- and under-expressed outliers, is freely available, and can be applied to any expression dataset.
We compare our method to other feature-selection approaches, and demonstrate that mCOPA frequently selects more-informative features than do differential expression or variance-based feature selection approaches, and is able to recover observed clinical subtypes more consistently. We demonstrate the application of mCOPA to prostate cancer expression data, and explore the use of outliers in clustering, pathway analysis, and the identification of tumour suppressors. We analyse the under-expressed outliers to identify known and novel prostate cancer tumour suppressor genes, validating these against data in Oncomine and the Cancer Gene Index. We also demonstrate how a combination of outlier analysis and pathway analysis can identify molecular mechanisms disrupted in individual tumours.
We demonstrate that mCOPA offers advantages, compared to differential expression or variance, in selecting outlier features, and that the features so selected are better able to assign samples to clinically annotated subtypes. Further, we show that the biology explored by outlier analysis differs from that uncovered in differential expression or variance analysis. mCOPA is an important new tool for the exploration of cancer datasets and the discovery of new cancer subtypes, and can be combined with pathway and functional analysis approaches to discover mechanisms underpinning heterogeneity in cancers.
Cancer; Outliers; Expression data; Expression profile; Cluster; Subtype; Heterogeneous; Bioinformatics; Percentile; Feature selection
Genome-wide gene expression profiling has become standard for assessing potential liabilities as well as for elucidating mechanisms of toxicity of drug candidates under development. Analysis of microarray data is often challenging due to the lack of a statistical model that is amenable to biological variation in a small number of samples. Here we present a novel non-parametric algorithm that requires minimal assumptions about the data distribution. Our method for determining differential expression consists of two steps: 1) We apply a nominal threshold on fold change and platform p-value to designate whether a gene is differentially expressed in each treated and control sample relative to the averaged control pool, and 2) We compared the number of samples satisfying criteria in step 1 between the treated and control groups to estimate the statistical significance based on a null distribution established by sample permutations. The method captures group effect without being too sensitive to anomalies as it allows tolerance for potential non-responders in the treatment group and outliers in the control group. Performance and results of this method were compared with the Significant Analysis of Microarrays (SAM) method. These two methods were applied to investigate hepatic transcriptional responses of wild-type (PXR+/+) and pregnane X receptor-knockout (PXR−/−) mice after 96 h exposure to CMP013, an inhibitor of β-secretase (β-site of amyloid precursor protein cleaving enzyme 1 or BACE1). Our results showed that CMP013 led to transcriptional changes in hallmark PXR-regulated genes and induced a cascade of gene expression changes that explained the hepatomegaly observed only in PXR+/+ animals. Comparison of concordant expression changes between PXR+/+ and PXR−/− mice also suggested a PXR-independent association between CMP013 and perturbations to cellular stress, lipid metabolism, and biliary transport.
In much of the analysis of high-throughput genomic data, “interesting” genes have been selected based on assessment of differential expression between two groups or generalizations thereof. Most of the literature focuses on changes in mean expression or the entire distribution. In this article, we explore the use of C(α) tests, which have been applied in other genomic data settings. Their use for the outlier expression problem, in particular with continuous data, is problematic but nevertheless motivates new statistics that give an unsupervised analog to previously developed outlier profile analysis approaches. Some simulation studies are used to evaluate the proposal. A bivariate extension is described that can accommodate data from two platforms on matched samples. The proposed methods are applied to data from a prostate cancer study.
biomarkers; genomic data integration; heterogeneity; microarray; mixture model; tumor subtypes
The purpose of this study is to develop a statistical methodology to handle a large proportion of artifactual outliers in a population pharmacokinetic (PK) modeling. The motivating PK data were obtained from a population PK study to examine associations between PK parameters such as clearance of dexmedetomidine and cytochrome P450 2A6 phenotypes. The blood samples were sparsely sampled from patients in intensive care units (ICUs) while different doses of dexmedetomidine were continuously infused. Conventional population PK analysis of these data revealed several challenges and intricacies. Especially, there was strong evidence that some plasma drug concentrations were artifactually high and likely contaminated with the infused drug due to blood sampling processes that are sometimes unavoidable in an ICU setting. If not addressed, or if arbitrarily excluded, these outlying values could lead to biased estimates of PK parameters and miss important relationships between PK parameters and covariates due to increased variability. We propose a novel population PK model, a Bayesian hierarchical nonlinear mixture model, to accommodate the artifactual outliers using a finite mixture as the residual error model. Our results showed that the proposed model handles the outliers well. We also conducted simulation studies with a varying proportion of the outliers. These simulation results showed that the proposed model can accommodate the outliers well so that the estimated PK parameters are less biased.
finite mixture; outlier; nonlinear mixed effect model; pharmacogenetics; pharmacokinetics; NONMEM
Detection of differential gene expression using microarray technology has received considerable interest in cancer research studies. Recently, many researchers discovered that oncogenes may be activated in some but not all samples in a given disease group. The existing statistical tools for detecting differentially expressed genes in a subset of the disease group mainly include cancer outlier profile analysis (COPA), outlier sum (OS), outlier robust t-statistic (ORT) and maximum ordered subset t-statistics (MOST). In this study, another approach named Least Sum of Ordered Subset Square t-statistic (LSOSS) is proposed. The results of our simulation studies indicated that LSOSS often has more power than previous statistical methods. When applied to real human breast and prostate cancer data sets, LSOSS was competitive in terms of the biological relevance of top ranked genes. Furthermore, a modified hierarchical clustering method was developed to classify the heterogeneous gene activation patterns of human breast cancer samples based on the significant genes detected by LSOSS. Three classes of gene activation patterns, which correspond to estrogen receptor (ER)+, ER− and a mixture of ER+ and ER−, were detected and each class was assigned a different gene signature.
differential gene expression; cancer; outlier
Time course gene expression experiments are an increasingly popular method for exploring biological processes. Temporal gene expression profiles provide an important characterization of gene function, as biological systems are both developmental and dynamic. With such data it is possible to study gene expression changes over time and thereby to detect differential genes. Much of the early work on analyzing time series expression data relied on methods developed originally for static data and thus there is a need for improved methodology. Since time series expression is a temporal process, its unique features such as autocorrelation between successive points should be incorporated into the analysis.
This work aims to identify genes that show different gene expression profiles across time. We propose a statistical procedure to discover gene groups with similar profiles using a nonparametric representation that accounts for the autocorrelation in the data. In particular, we first represent each profile in terms of a Fourier basis, and then we screen out genes that are not differentially expressed based on the Fourier coefficients. Finally, we cluster the remaining gene profiles using a model-based approach in the Fourier domain. We evaluate the screening results in terms of sensitivity, specificity, FDR and FNR, compare with the Gaussian process regression screening in a simulation study and illustrate the results by application to yeast cell-cycle microarray expression data with alpha-factor synchronization.
The key elements of the proposed methodology: (i) representation of gene profiles in the Fourier domain; (ii) automatic screening of genes based on the Fourier coefficients and taking into account autocorrelation in the data, while controlling the false discovery rate (FDR); (iii) model-based clustering of the remaining gene profiles.
Using this method, we identified a set of cell-cycle-regulated time-course yeast genes. The proposed method is general and can be potentially used to identify genes which have the same patterns or biological processes, and help facing the present and forthcoming challenges of data analysis in functional genomics.
Meta-analysis typically involves combining the estimates from independent studies in order to estimate a parameter of interest across a population of studies. However, outliers often occur even under the random effects model. The presence of such outliers could substantially alter the conclusions in a meta-analysis. This paper proposes a methodology for identifying and, if desired, downweighting studies that do not appear representative of the population they are thought to represent under the random effects model.
An outlier is taken as an observation (study result) with an inflated random effect variance. We used the likelihood ratio test statistic as an objective measure for determining whether observations have inflated variance and are therefore considered outliers. A parametric bootstrap procedure was used to obtain the sampling distribution of the likelihood ratio test statistics and to account for multiple testing. Our methods were applied to three illustrative and contrasting meta-analytic data sets.
For the three meta-analytic data sets our methods gave robust inferences when the identified outliers were downweighted.
The proposed methodology provides a means to identify and, if desired, downweight outliers in meta-analysis. It does not eliminate them from the analysis however and we consider the proposed approach preferable to simply removing any or all apparently outlying results. We do not however propose that our methods in any way replace or diminish the standard random effects methodology that has proved so useful, rather they are helpful when used in conjunction with the random effects model.
The distribution of genetic variation among populations is conveniently measured by Wright’s FST, which is a scaled variance taking on values in [0,1]. For certain types of genetic markers, and for single-nucleotide polymorphisms (SNPs) in particular, it is reasonable to presume that allelic differences at most loci are selectively neutral. For such loci, the distribution of genetic variation among populations is determined by the size of local populations, the pattern and rate of migration among those populations, and the rate of mutation. Because the demographic parameters (population sizes and migration rates) are common across all autosomal loci, locus-specific estimates of FST will depart from a common distribution only for loci with unusually high or low rates of mutation or for loci that are closely associated with genomic regions having a relationship with fitness. Thus, loci that are statistical outliers showing significantly more among-population differentiation than others may mark genomic regions subject to diversifying selection among the sample populations. Similarly, statistical outliers showing significantly less differentiation among populations than others may mark genomic regions subject to stabilizing selection across the sample populations. We propose several Bayesian hierarchical models to estimate locus-specific effects on FST, and we apply these models to single nucleotide polymorphism data from the HapMap project. Because loci that are physically associated with one another are likely to show similar patterns of variation, we introduce conditional autoregressive models to incorporate the local correlation among loci for high-resolution genomic data. We estimate the posterior distributions of model parameters using Markov chain Monte Carlo (MCMC) simulations. Model comparison using several criteria, including DIC and LPML, reveals that a model with locus- and population-specific effects is superior to other models for the data used in the analysis. To detect statistical outliers we propose an approach that measures divergence between the posterior distributions of locus-specific effects and the common FST with the Kullback-Leibler divergence measure. We calibrate this measure by comparing values with those produced from the divergence between a biased and a fair coin. We conduct a simulation study to illustrate the performance of our approach for detecting loci subject to stabilizing/divergent selection, and we apply the proposed models to low- and high-resolution SNP data from the HapMap project. Model comparison using DIC and LPML reveals that CAR models are superior to alternative models for the high resolution data. For both low and high resolution data, we identify statistical outliers that are associated with known genes.
Bayesian approach; Hierarchical model; SNP; Wright’s Fst; MCMC
Using hybrid approach for gene selection and classification is common as results obtained are generally better than performing the two tasks independently. Yet, for some microarray datasets, both classification accuracy and stability of gene sets obtained still have rooms for improvement. This may be due to the presence of samples with wrong class labels (i.e. outliers). Outlier detection algorithms proposed so far are either not suitable for microarray data, or only solve the outlier detection problem on their own.
We tackle the outlier detection problem based on a previously proposed Multiple-Filter-Multiple-Wrapper (MFMW) model, which was demonstrated to yield promising results when compared to other hybrid approaches (Leung and Hung, 2010). To incorporate outlier detection and overcome limitations of the existing MFMW model, three new features are introduced in our proposed MFMW-outlier approach: 1) an unbiased external Leave-One-Out Cross-Validation framework is developed to replace internal cross-validation in the previous MFMW model; 2) wrongly labeled samples are identified within the MFMW-outlier model; and 3) a stable set of genes is selected using an L1-norm SVM that removes any redundant genes present. Six binary-class microarray datasets were tested. Comparing with outlier detection studies on the same datasets, MFMW-outlier could detect all the outliers found in the original paper (for which the data was provided for analysis), and the genes selected after outlier removal were proven to have biological relevance. We also compared MFMW-outlier with PRAPIV (Zhang et al., 2006) based on same synthetic datasets. MFMW-outlier gave better average precision and recall values on three different settings. Lastly, artificially flipped microarray datasets were created by removing our detected outliers and flipping some of the remaining samples' labels. Almost all the ‘wrong’ (artificially flipped) samples were detected, suggesting that MFMW-outlier was sufficiently powerful to detect outliers in high-dimensional microarray datasets.
Different data types can offer complementary perspectives on the same biological phenomenon. In cancer studies, for example, data on copy number alterations indicate losses and amplifications of genomic regions in tumours, while transcriptomic data point to the impact of genomic and environmental events on the internal wiring of the cell. Fusing different data provides a more comprehensive model of the cancer cell than that offered by any single type. However, biological signals in different patients exhibit diverse degrees of concordance due to cancer heterogeneity and inherent noise in the measurements. This is a particularly important issue in cancer subtype discovery, where personalised strategies to guide therapy are of vital importance. We present a nonparametric Bayesian model for discovering prognostic cancer subtypes by integrating gene expression and copy number variation data. Our model is constructed from a hierarchy of Dirichlet Processes and addresses three key challenges in data fusion: (i) To separate concordant from discordant signals, (ii) to select informative features, (iii) to estimate the number of disease subtypes. Concordance of signals is assessed individually for each patient, giving us an additional level of insight into the underlying disease structure. We exemplify the power of our model in prostate cancer and breast cancer and show that it outperforms competing methods. In the prostate cancer data, we identify an entirely new subtype with extremely poor survival outcome and show how other analyses fail to detect it. In the breast cancer data, we find subtypes with superior prognostic value by using the concordant results. These discoveries were crucially dependent on our model's ability to distinguish concordant and discordant signals within each patient sample, and would otherwise have been missed. We therefore demonstrate the importance of taking a patient-specific approach, using highly-flexible nonparametric Bayesian methods.
The goal of personalised medicine is to develop accurate diagnostic tests that identify patients who can benefit from targeted therapies. To achieve this goal it is necessary to stratify cancer patients into homogeneous subtypes according to which molecular aberrations their tumours exhibit. Prominent approaches for subtype definition combine information from different molecular levels, for example data on DNA copy number changes with data on mRNA expression changes. This is called data fusion. We contribute to this field by proposing a unified model that fuses different data types, finds informative features and estimates the number of subtypes in the data. The main strength of our model comes from the fact that we assess for each patient whether the different data agree on a subtype or not. Competing methods combine the data without checking for concordance of signals. On a breast cancer and a prostate cancer data set we show that concordance of signals has strong influence on subtype definition and that our model allows to define prognostic subtypes that would have been missed otherwise.
We propose a Bayesian method for multiple hypothesis testing in random effects models that uses Dirichlet process (DP) priors for a nonparametric treatment of the random effects distribution. We consider a general model formulation which accommodates a variety of multiple treatment conditions. A key feature of our method is the use of a product of spiked distributions, i.e., mixtures of a point-mass and continuous distributions, as the centering distribution for the DP prior. Adopting these spiked centering priors readily accommodates sharp null hypotheses and allows for the estimation of the posterior probabilities of such hypotheses. Dirichlet process mixture models naturally borrow information across objects through model-based clustering while inference on single hypotheses averages over clustering uncertainty. We demonstrate via a simulation study that our method yields increased sensitivity in multiple hypothesis testing and produces a lower proportion of false discoveries than other competitive methods. While our modeling framework is general, here we present an application in the context of gene expression from microarray experiments. In our application, the modeling framework allows simultaneous inference on the parameters governing differential expression and inference on the clustering of genes. We use experimental data on the transcriptional response to oxidative stress in mouse heart muscle and compare the results from our procedure with existing nonparametric Bayesian methods that provide only a ranking of the genes by their evidence for differential expression.
Bayesian nonparametrics; differential gene expression; Dirichlet process prior; DNA microarray; mixture priors; model-based clustering; multiple hypothesis testing
Cluster analysis is the automated search for groups of homogeneous observations in a data set. A popular modeling approach for clustering is based on finite normal mixture models, which assume that each cluster is modeled as a multivariate normal distribution. However, the normality assumption that each component is symmetric is often unrealistic. Furthermore, normal mixture models are not robust against outliers; they often require extra components for modeling outliers and/or give a poor representation of the data. To address these issues, we propose a new class of distributions, multivariate t distributions with the Box-Cox transformation, for mixture modeling. This class of distributions generalizes the normal distribution with the more heavy-tailed t distribution, and introduces skewness via the Box-Cox transformation. As a result, this provides a unified framework to simultaneously handle outlier identification and data transformation, two interrelated issues. We describe an Expectation-Maximization algorithm for parameter estimation along with transformation selection. We demonstrate the proposed methodology with three real data sets and simulation studies. Compared with a wealth of approaches including the skew-t mixture model, the proposed t mixture model with the Box-Cox transformation performs favorably in terms of accuracy in the assignment of observations, robustness against model misspecification, and selection of the number of components.
Box-Cox transformation; Clustering; EM algorithm; Outliers; Robustness; Skewness
Estimation of genewise variance arises from two important applications in microarray data analysis: selecting significantly differentially expressed genes and validation tests for normalization of microarray data. We approach the problem by introducing a two-way nonparametric model, which is an extension of the famous Neyman-Scott model and is applicable beyond microarray data. The problem itself poses interesting challenges because the number of nuisance parameters is proportional to the sample size and it is not obvious how the variance function can be estimated when measurements are correlated. In such a high-dimensional nonparametric problem, we proposed two novel nonparametric estimators for genewise variance function and semiparametric estimators for measurement correlation, via solving a system of nonlinear equations. Their asymptotic normality is established. The finite sample property is demonstrated by simulation studies. The estimators also improve the power of the tests for detecting statistically differentially expressed genes. The methodology is illustrated by the data from MicroArray Quality Control (MAQC) project.
Genewise variance estimation; gene selection; local linear regression; nonparametric model; correlation correction; validation test
Functional data are increasingly encountered in scientific studies, and their high dimensionality and complexity lead to many analytical challenges. Various methods for functional data analysis have been developed, including functional response regression methods that involve regression of a functional response on univariate/multivariate predictors with nonparametrically represented functional coefficients. In existing methods, however, the functional regression can be sensitive to outlying curves and outlying regions of curves, so is not robust. In this paper, we introduce a new Bayesian method, robust functional mixed models (R-FMM), for performing robust functional regression within the general functional mixed model framework, which includes multiple continuous or categorical predictors and random effect functions accommodating potential between-function correlation induced by the experimental design. The underlying model involves a hierarchical scale mixture model for the fixed effects, random effect and residual error functions. These modeling assumptions across curves result in robust nonparametric estimators of the fixed and random effect functions which down-weight outlying curves and regions of curves, and produce statistics that can be used to flag global and local outliers. These assumptions also lead to distributions across wavelet coefficients that have outstanding sparsity and adaptive shrinkage properties, with great flexibility for the data to determine the sparsity and the heaviness of the tails. Together with the down-weighting of outliers, these within-curve properties lead to fixed and random effect function estimates that appear in our simulations to be remarkably adaptive in their ability to remove spurious features yet retain true features of the functions. We have developed general code to implement this fully Bayesian method that is automatic, requiring the user to only provide the functional data and design matrices. It is efficient enough to handle large data sets, and yields posterior samples of all model parameters that can be used to perform desired Bayesian estimation and inference. Although we present details for a specific implementation of the R-FMM using specific distributional choices in the hierarchical model, 1D functions, and wavelet transforms, the method can be applied more generally using other heavy-tailed distributions, higher dimensional functions (e.g. images), and using other invertible transformations as alternatives to wavelets.
Adaptive LASSO; Bayesian methods; False discovery rate; Functional Data Analysis; Mixed models; Robust regression; Scale mixtures of normals; Sparsity Priors; Variable Selection; Wavelets
Microarray data enables the high-throughput survey of mRNA expression profiles at the genomic level; however, the data presents a challenging statistical problem because of the large number of transcripts with small sample sizes that are obtained. To reduce the dimensionality, various Bayesian or empirical Bayes hierarchical models have been developed. However, because of the complexity of the microarray data, no model can explain the data fully. It is generally difficult to scrutinize the irregular patterns of expression that are not expected by the usual statistical gene by gene models.
As an extension of empirical Bayes (EB) procedures, we have developed the β-empirical Bayes (β-EB) approach based on a β-likelihood measure which can be regarded as an ’evidence-based’ weighted (quasi-) likelihood inference. The weight of a transcript t is described as a power function of its likelihood, fβ(yt|θ). Genes with low likelihoods have unexpected expression patterns and low weights. By assigning low weights to outliers, the inference becomes robust. The value of β, which controls the balance between the robustness and efficiency, is selected by maximizing the predictive β0-likelihood by cross-validation. The proposed β-EB approach identified six significant (p<10−5) contaminated transcripts as differentially expressed (DE) in normal/tumor tissues from the head and neck of cancer patients. These six genes were all confirmed to be related to cancer; they were not identified as DE genes by the classical EB approach. When applied to the eQTL analysis of Arabidopsis thaliana, the proposed β-EB approach identified some potential master regulators that were missed by the EB approach.
The simulation data and real gene expression data showed that the proposed β-EB method was robust against outliers. The distribution of the weights was used to scrutinize the irregular patterns of expression and diagnose the model statistically. When β-weights outside the range of the predicted distribution were observed, a detailed inspection of the data was carried out. The β-weights described here can be applied to other likelihood-based statistical models for diagnosis, and may serve as a useful tool for transcriptome and proteome studies.
In recent years, genome-wide association studies (GWAS) and gene-expression profiling have generated a large number of valuable datasets for assessing how genetic variations are related to disease outcomes. With such datasets, it is often of interest to assess the overall effect of a set of genetic markers, assembled based on biological knowledge. Genetic marker-set analyses have been advocated as more reliable and powerful approaches compared with the traditional marginal approaches (Curtis and others, 2005. Pathways to the analysis of microarray data. TRENDS in Biotechnology
23, 429–435; Efroni and others, 2007. Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS One
2, 425). Procedures for testing the overall effect of a marker-set have been actively studied in recent years. For example, score tests derived under an Empirical Bayes (EB) framework (Liu and others, 2007. Semiparametric regression of multidimensional genetic pathway data: least-squares kernel machines and linear mixed models. Biometrics
63, 1079–1088; Liu and others, 2008. Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models. BMC bioinformatics
9, 292–2; Wu and others, 2010. Powerful SNP-set analysis for case-control genome-wide association studies. American Journal of Human Genetics
86, 929) have been proposed as powerful alternatives to the standard Rao score test (Rao, 1948. Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation. Mathematical Proceedings of the Cambridge Philosophical Society, 44, 50–57). The advantages of these EB-based tests are most apparent when the markers are correlated, due to the reduction in the degrees of freedom. In this paper, we propose an adaptive score test which up- or down-weights the contributions from each member of the marker-set based on the Z-scores of their effects. Such an adaptive procedure gains power over the existing procedures when the signal is sparse and the correlation among the markers is weak. By combining evidence from both the EB-based score test and the adaptive test, we further construct an omnibus test that attains good power in most settings. The null distributions of the proposed test statistics can be approximated well either via simple perturbation procedures or via distributional approximations. Through extensive simulation studies, we demonstrate that the proposed procedures perform well in finite samples. We apply the tests to a breast cancer genetic study to assess the overall effect of the FGFR2 gene on breast cancer risk.
Adaptive procedures; Empirical Bayes; GWAS; Pathway analysis; Score test; SNP sets
Improving efficiency for regression coefficients and predicting trajectories of individuals are two important aspects in analysis of longitudinal data. Both involve estimation of the covariance function. Yet, challenges arise in estimating the covariance function of longitudinal data collected at irregular time points. A class of semiparametric models for the covariance function is proposed by imposing a parametric correlation structure while allowing a nonparametric variance function. A kernel estimator is developed for the estimation of the nonparametric variance function. Two methods, a quasi-likelihood approach and a minimum generalized variance method, are proposed for estimating parameters in the correlation structure. We introduce a semiparametric varying coefficient partially linear model for longitudinal data and propose an estimation procedure for model coefficients by using a profile weighted least squares approach. Sampling properties of the proposed estimation procedures are studied and asymptotic normality of the resulting estimators is established. Finite sample performance of the proposed procedures is assessed by Monte Carlo simulation studies. The proposed methodology is illustrated by an analysis of a real data example.
Kernel regression; local linear regression; profile weighted least squares; semiparametric varying coefficient model
Outlier detection and environmental association analysis are common methods to search for loci or genomic regions exhibiting signals of adaptation to environmental factors. However, a validation of outlier loci and corresponding allele distribution models through functional molecular biology or transplant/common garden experiments is rarely carried out. Here, we employ another method for validation, namely testing outlier loci in specifically designed, independent data sets. Previously, an outlier locus associated with three different habitat types had been detected in Arabis alpina. For the independent validation data set, we sampled 30 populations occurring in these three habitat types across five biogeographic regions of the Swiss Alps. The allele distribution model found in the original study could not be validated in the independent test data set: The outlier locus was no longer indicative of habitat-mediated selection. We propose several potential causes of this failure of validation, of which unaccounted genetic structure and technical issues in the original data set used to detect the outlier locus were most probable. Thus, our study shows that validating outlier loci and allele distribution models in independent data sets is a helpful tool in ecological genomics which, in the case of positive validation, adds confidence to outlier loci and their association with environmental factors or, in the case of failure of validation, helps to explain inconsistencies.
Allele distribution model; environmental association analysis; genome scan; natural selection; SNaPshot®
Bioinformatics data analysis is often using linear mixture model representing samples as additive mixture of components. Properly constrained blind matrix factorization methods extract those components using mixture samples only. However, automatic selection of extracted components to be retained for classification analysis remains an open issue.
The method proposed here is applied to well-studied protein and genomic datasets of ovarian, prostate and colon cancers to extract components for disease prediction. It achieves average sensitivities of: 96.2 (sd = 2.7%), 97.6% (sd = 2.8%) and 90.8% (sd = 5.5%) and average specificities of: 93.6% (sd = 4.1%), 99% (sd = 2.2%) and 79.4% (sd = 9.8%) in 100 independent two-fold cross-validations.
We propose an additive mixture model of a sample for feature extraction using, in principle, sparseness constrained factorization on a sample-by-sample basis. As opposed to that, existing methods factorize complete dataset simultaneously. The sample model is composed of a reference sample representing control and/or case (disease) groups and a test sample. Each sample is decomposed into two or more components that are selected automatically (without using label information) as control specific, case specific and not differentially expressed (neutral). The number of components is determined by cross-validation. Automatic assignment of features (m/z ratios or genes) to particular component is based on thresholds estimated from each sample directly. Due to the locality of decomposition, the strength of the expression of each feature across the samples can vary. Yet, they will still be allocated to the related disease and/or control specific component. Since label information is not used in the selection process, case and control specific components can be used for classification. That is not the case with standard factorization methods. Moreover, the component selected by proposed method as disease specific can be interpreted as a sub-mode and retained for further analysis to identify potential biomarkers. As opposed to standard matrix factorization methods this can be achieved on a sample (experiment)-by-sample basis. Postulating one or more components with indifferent features enables their removal from disease and control specific components on a sample-by-sample basis. This yields selected components with reduced complexity and generally, it increases prediction accuracy.
Meta-analysis of gene expression microarray datasets presents significant challenges for statistical analysis. We developed and validated a new bioinformatic method for the identification of genes upregulated in subsets of samples of a given tumour type (‘outlier genes’), a hallmark of potential oncogenes.
A new statistical method (the gene tissue index, GTI) was developed by modifying and adapting algorithms originally developed for statistical problems in economics. We compared the potential of the GTI to detect outlier genes in meta-datasets with four previously defined statistical methods, COPA, the OS statistic, the t-test and ORT, using simulated data. We demonstrated that the GTI performed equally well to existing methods in a single study simulation. Next, we evaluated the performance of the GTI in the analysis of combined Affymetrix gene expression data from several published studies covering 392 normal samples of tissue from the central nervous system, 74 astrocytomas, and 353 glioblastomas. According to the results, the GTI was better able than most of the previous methods to identify known oncogenic outlier genes. In addition, the GTI identified 29 novel outlier genes in glioblastomas, including TYMS and CDKN2A. The over-expression of these genes was validated in vivo by immunohistochemical staining data from clinical glioblastoma samples. Immunohistochemical data were available for 65% (19 of 29) of these genes, and 17 of these 19 genes (90%) showed a typical outlier staining pattern. Furthermore, raltitrexed, a specific inhibitor of TYMS used in the therapy of tumour types other than glioblastoma, also effectively blocked cell proliferation in glioblastoma cell lines, thus highlighting this outlier gene candidate as a potential therapeutic target.
Taken together, these results support the GTI as a novel approach to identify potential oncogene outliers and drug targets. The algorithm is implemented in an R package (Text S1).