In most analyses of large-scale genomic data sets, differential expression analysis is typically assessed by testing for differences in the mean of the distributions between 2 groups. A recent finding by Tomlins and others (2005) is of a different type of pattern of differential expression in which a fraction of samples in one group have overexpression relative to samples in the other group. In this work, we describe a general mixture model framework for the assessment of this type of expression, called outlier profile analysis. We start by considering the single-gene situation and establishing results on identifiability. We propose 2 nonparametric estimation procedures that have natural links to familiar multiple testing procedures. We then develop multivariate extensions of this methodology to handle genome-wide measurements. The proposed methodologies are compared using simulation studies as well as data from a prostate cancer gene expression study.
Bonferroni correction; DNA microarray; False discovery rate; Goodness of fit; Multiple comparisons; Uniform distribution
An active area in cancer biomarker research is the development of statistical methods to identify expression signatures reflecting the heterogeneity of cancer across affected individuals. Tomlins et al.  observed heterogeneous patterns of oncogene activation within several cancer types, and introduced a statistical method called Cancer Outlier Profile Analysis (COPA) to identify “cancer outlier genes”. Several related statistical approaches have since been developed, but the operating characteristics of these procedures (e.g. power, false positive rate), have not yet been fully characterized, especially in a proteomics setting. Here, we use simulation to identify the degree to which an outlier pattern of differential expression must hold in order for outlier-based approaches to be more effective than mean-based approaches. We also propose a diagnostic procedure that characterizes the potentially unequal levels of differential expression in the tails and in the center of a distribution of expression values. We find that for sample sizes and effect sizes typical of proteomics studies, the outlier pattern must be strong in order for outlier-based analysis to provide a meaningful benefit. This is corroborated by analysis of proteomics data from a melanoma study, in which the differential expression is most often present throughout the distribution, rather than being concentrated in the tails, albeit with a few proteins showing expression patterns consistent with outlier expression.
A local modal estimation procedure is proposed for the regression function in a non-parametric regression model. A distinguishing characteristic of the proposed procedure is that it introduces an additional tuning parameter that is automatically selected using the observed data in order to achieve both robustness and efficiency of the resulting estimate. We demonstrate both theoretically and empirically that the resulting estimator is more efficient than the ordinary local polynomial regression estimator in the presence of outliers or heavy tail error distribution (such as t-distribution). Furthermore, we show that the proposed procedure is as asymptotically efficient as the local polynomial regression estimator when there are no outliers and the error distribution is a Gaussian distribution. We propose an EM type algorithm for the proposed estimation procedure. A Monte Carlo simulation study is conducted to examine the finite sample performance of the proposed method. The simulation results confirm the theoretical findings. The proposed methodology is further illustrated via an analysis of a real data example.
Adaptive regression; Local polynomial regression; M-estimator; Modal regression; Robust nonparametric regression
Cluster analysis is the automated search for groups of homogeneous observations in a data set. A popular modeling approach for clustering is based on finite normal mixture models, which assume that each cluster is modeled as a multivariate normal distribution. However, the normality assumption that each component is symmetric is often unrealistic. Furthermore, normal mixture models are not robust against outliers; they often require extra components for modeling outliers and/or give a poor representation of the data. To address these issues, we propose a new class of distributions, multivariate t distributions with the Box-Cox transformation, for mixture modeling. This class of distributions generalizes the normal distribution with the more heavy-tailed t distribution, and introduces skewness via the Box-Cox transformation. As a result, this provides a unified framework to simultaneously handle outlier identification and data transformation, two interrelated issues. We describe an Expectation-Maximization algorithm for parameter estimation along with transformation selection. We demonstrate the proposed methodology with three real data sets and simulation studies. Compared with a wealth of approaches including the skew-t mixture model, the proposed t mixture model with the Box-Cox transformation performs favorably in terms of accuracy in the assignment of observations, robustness against model misspecification, and selection of the number of components.
Box-Cox transformation; Clustering; EM algorithm; Outliers; Robustness; Skewness
Detection of differential gene expression using microarray technology has received considerable interest in cancer research studies. Recently, many researchers discovered that oncogenes may be activated in some but not all samples in a given disease group. The existing statistical tools for detecting differentially expressed genes in a subset of the disease group mainly include cancer outlier profile analysis (COPA), outlier sum (OS), outlier robust t-statistic (ORT) and maximum ordered subset t-statistics (MOST). In this study, another approach named Least Sum of Ordered Subset Square t-statistic (LSOSS) is proposed. The results of our simulation studies indicated that LSOSS often has more power than previous statistical methods. When applied to real human breast and prostate cancer data sets, LSOSS was competitive in terms of the biological relevance of top ranked genes. Furthermore, a modified hierarchical clustering method was developed to classify the heterogeneous gene activation patterns of human breast cancer samples based on the significant genes detected by LSOSS. Three classes of gene activation patterns, which correspond to estrogen receptor (ER)+, ER− and a mixture of ER+ and ER−, were detected and each class was assigned a different gene signature.
differential gene expression; cancer; outlier
Meta-analysis typically involves combining the estimates from independent studies in order to estimate a parameter of interest across a population of studies. However, outliers often occur even under the random effects model. The presence of such outliers could substantially alter the conclusions in a meta-analysis. This paper proposes a methodology for identifying and, if desired, downweighting studies that do not appear representative of the population they are thought to represent under the random effects model.
An outlier is taken as an observation (study result) with an inflated random effect variance. We used the likelihood ratio test statistic as an objective measure for determining whether observations have inflated variance and are therefore considered outliers. A parametric bootstrap procedure was used to obtain the sampling distribution of the likelihood ratio test statistics and to account for multiple testing. Our methods were applied to three illustrative and contrasting meta-analytic data sets.
For the three meta-analytic data sets our methods gave robust inferences when the identified outliers were downweighted.
The proposed methodology provides a means to identify and, if desired, downweight outliers in meta-analysis. It does not eliminate them from the analysis however and we consider the proposed approach preferable to simply removing any or all apparently outlying results. We do not however propose that our methods in any way replace or diminish the standard random effects methodology that has proved so useful, rather they are helpful when used in conjunction with the random effects model.
Estimation of genewise variance arises from two important applications in microarray data analysis: selecting significantly differentially expressed genes and validation tests for normalization of microarray data. We approach the problem by introducing a two-way nonparametric model, which is an extension of the famous Neyman-Scott model and is applicable beyond microarray data. The problem itself poses interesting challenges because the number of nuisance parameters is proportional to the sample size and it is not obvious how the variance function can be estimated when measurements are correlated. In such a high-dimensional nonparametric problem, we proposed two novel nonparametric estimators for genewise variance function and semiparametric estimators for measurement correlation, via solving a system of nonlinear equations. Their asymptotic normality is established. The finite sample property is demonstrated by simulation studies. The estimators also improve the power of the tests for detecting statistically differentially expressed genes. The methodology is illustrated by the data from MicroArray Quality Control (MAQC) project.
Genewise variance estimation; gene selection; local linear regression; nonparametric model; correlation correction; validation test
Motivated by an analysis of a real data set in ecology, we consider a class of partially nonlinear models where both of a nonparametric component and a parametric component present. We develop two new estimation procedures to estimate the parameters in the parametric component. Consistency and asymptotic normality of the resulting estimators are established. We further propose an estimation procedure and a generalized F test procedure for the nonparametric component in the partially nonlinear models. Asymptotic properties of the newly proposed estimation procedure and the test statistic are derived. Finite sample performance of the proposed inference procedures are assessed by Monte Carlo simulation studies. An application in ecology is used to illustrate the proposed methods.
Local linear regression; partial linear models; profile least squares; semiparametric models
Numerous nonparametric approaches have been proposed in literature to detect differential gene expression in the setting of two user-defined groups. However, there is a lack of nonparametric procedures to analyze microarray data with multiple factors attributing to the gene expression. Furthermore, incorporating interaction effects in the analysis of microarray data has long been of great interest to biological scientists, little of which has been investigated in the nonparametric framework.
In this paper, we propose a set of nonparametric tests to detect treatment effects, clinical covariate effects, and interaction effects for multifactorial microarray data. When the distribution of expression data is skewed or heavy-tailed, the rank tests are substantially more powerful than the competing parametric F tests. On the other hand, in the case of light or medium-tailed distributions, the rank tests appear to be marginally less powerful than the parametric competitors.
The proposed rank tests enable us to detect differential gene expression and establish interaction effects for microarray data with various non-normally distributed expression measurements across genome. In the presence of outliers, they are advantageous alternative approaches to the existing parametric F tests due to the robustness feature.
Cancer outlier profile analysis (COPA) has proven to be an effective approach to analyzing cancer expression data, leading to the discovery of the TMPRSS2 and ETS family gene fusion events in prostate cancer. However, the original COPA algorithm did not identify down-regulated outliers, and the currently available R package implementing the method is similarly restricted to the analysis of over-expressed outliers. Here we present a modified outlier detection method, mCOPA, which contains refinements to the outlier-detection algorithm, identifies both over- and under-expressed outliers, is freely available, and can be applied to any expression dataset.
We compare our method to other feature-selection approaches, and demonstrate that mCOPA frequently selects more-informative features than do differential expression or variance-based feature selection approaches, and is able to recover observed clinical subtypes more consistently. We demonstrate the application of mCOPA to prostate cancer expression data, and explore the use of outliers in clustering, pathway analysis, and the identification of tumour suppressors. We analyse the under-expressed outliers to identify known and novel prostate cancer tumour suppressor genes, validating these against data in Oncomine and the Cancer Gene Index. We also demonstrate how a combination of outlier analysis and pathway analysis can identify molecular mechanisms disrupted in individual tumours.
We demonstrate that mCOPA offers advantages, compared to differential expression or variance, in selecting outlier features, and that the features so selected are better able to assign samples to clinically annotated subtypes. Further, we show that the biology explored by outlier analysis differs from that uncovered in differential expression or variance analysis. mCOPA is an important new tool for the exploration of cancer datasets and the discovery of new cancer subtypes, and can be combined with pathway and functional analysis approaches to discover mechanisms underpinning heterogeneity in cancers.
Cancer; Outliers; Expression data; Expression profile; Cluster; Subtype; Heterogeneous; Bioinformatics; Percentile; Feature selection
Molecular heterogeneity of cancer, partially caused by various chromosomal aberrations or gene mutations, can yield substantial heterogeneity in gene expression profile in cancer samples. To detect cancer-related genes which are active only in a subset of cancer samples or cancer outliers, several methods have been proposed in the context of multiple testing. Such cancer outlier analyses will generally suffer from a serious lack of power, compared with the standard multiple testing setting where common activation of genes across all cancer samples is supposed. In this paper, we consider information sharing across genes and cancer samples, via a parametric normal mixture modeling of gene expression levels of cancer samples across genes after a standardization using the reference, normal sample data. A gene-based statistic for gene selection is developed on the basis of a posterior probability of cancer outlier for each cancer sample. Some efficiency improvement by using our method was demonstrated, even under settings with misspecified, heavy-tailed t-distributions. An application to a real dataset from hematologic malignancies is provided.
Improving efficiency for regression coefficients and predicting trajectories of individuals are two important aspects in analysis of longitudinal data. Both involve estimation of the covariance function. Yet, challenges arise in estimating the covariance function of longitudinal data collected at irregular time points. A class of semiparametric models for the covariance function is proposed by imposing a parametric correlation structure while allowing a nonparametric variance function. A kernel estimator is developed for the estimation of the nonparametric variance function. Two methods, a quasi-likelihood approach and a minimum generalized variance method, are proposed for estimating parameters in the correlation structure. We introduce a semiparametric varying coefficient partially linear model for longitudinal data and propose an estimation procedure for model coefficients by using a profile weighted least squares approach. Sampling properties of the proposed estimation procedures are studied and asymptotic normality of the resulting estimators is established. Finite sample performance of the proposed procedures is assessed by Monte Carlo simulation studies. The proposed methodology is illustrated by an analysis of a real data example.
Kernel regression; local linear regression; profile weighted least squares; semiparametric varying coefficient model
In this paper we consider the problem of estimating nonparametric panel data models with fixed effects. We introduce an iterative nonparametric kernel estimator. We also extend the estimation method to the case of a semiparametric partially linear fixed effects model. To determine whether a parametric, semiparametric or nonparametric model is appropriate, we propose test statistics to test between the three alternatives in practice. We further propose a test statistic for testing the null hypothesis of random effects against fixed effects in a nonparametric panel data regression model. Simulations are used to examine the finite sample performance of the proposed estimators and the test statistics.
Fixed effects models; Model specification tests; Nonparametric kernel method; Panel data; Partially linear model; Profile method; Random effects models; Semiparametric efficiency bound
We consider the problem of high-dimensional regression under non-constant error variances. Despite being a common phenomenon in biological applications, heteroscedasticity has, so far, been largely ignored in high-dimensional analysis of genomic data sets. We propose a new methodology that allows non-constant error variances for high-dimensional estimation and model selection. Our method incorporates heteroscedasticity by simultaneously modeling both the mean and variance components via a novel doubly regularized approach. Extensive Monte Carlo simulations indicate that our proposed procedure can result in better estimation and variable selection than existing methods when heteroscedasticity arises from the presence of predictors explaining error variances and outliers. Further, we demonstrate the presence of heteroscedasticity in and apply our method to an expression quantitative trait loci (eQTLs) study of 112 yeast segregants. The new procedure can automatically account for heteroscedasticity in identifying the eQTLs that are associated with gene expression variations and lead to smaller prediction errors. These results demonstrate the importance of considering heteroscedasticity in eQTL data analysis.
Generalized least squares; Heteroscedasticity; Large p small n; Model selection; Sparse regression; Variance estimation
Heterogeneously and differentially expressed genes (hDEG) are a common phenomenon due to bio-logical diversity. A hDEG is often observed in gene expression experiments (with two experimental conditions) where it is highly expressed in a few experimental samples, or in drug trial experiments for cancer studies with drug resistance heterogeneity among the disease group. These highly expressed samples are called outliers. Accurate detection of outliers among hDEGs is then desirable for dis- ease diagnosis and effective drug design. The standard approach for detecting hDEGs is to choose the appropriate subset of outliers to represent the experimental group. However, existing methods typically overlook hDEGs with very few outliers.
We present in this paper a simple algorithm for detecting hDEGs by sequentially testing for potential outliers with respect to a tight cluster of non- outliers, among an ordered subset of the experimental samples. This avoids making any restrictive assumptions about how the outliers are distributed. We use simulated and real data to illustrate that the proposed algorithm achieves a good separation between the tight cluster of low expressions and the outliers for hDEGs.
The proposed algorithm assesses each potential outlier in relation to the cluster of potential outliers without making explicit assumptions about the outlier distribution. Simulated examples and and breast cancer data sets are used to illustrate the suitability of the proposed algorithm for identifying hDEGs with small numbers of outliers.
Cancer; Outlier; Differentially expressed genes; Microarray
A nonparametric bootstrap was used to obtain an interval estimate of Pearson’s r, and test the null hypothesis that there was no association between 5th grade students’ positive substance use expectancies and their intentions to not use substances. The students were participating in a substance use prevention program in which the unit of randomization was a public middle school. The bootstrap estimate indicated that expectancies explained 21% of the variability in students’ intentions (r = 0.46, 95% CI = [0.40, 0.50]). This case study illustrates the use of a nonparametric bootstrap with cluster randomized data and the danger posed if outliers are not identified and addressed. Editors’ Strategic Implications: Prevention researchers will benefit from the authors’ detailed description of this nonparametric bootstrap approach for cluster randomized data and their thoughtful discussion of the potential impact of cluster sizes and outliers.
Cluster randomization; Nonparametric bootstrap; Confidence interval; Pearson’s r
In this article, we study nonparametric conditional quantile estimation via neural network structure. We proposed an estimation method that combines quantile regression and neural network (robust neural network, RNN). It provides good smoothing performance in the presence of outliers and can be used to construct prediction bands. A Majorization-Minimization (MM) algorithm was developed for optimization. Monte Carlo simulation study is conducted to assess the performance of RNN. Comparison with other nonparametric regression methods (e.g., local linear regression and regression splines) in real data application demonstrate the advantage of the newly proposed procedure.
Conditional quantile; Nonparametric regression
In many cases biomedical data sets contain outliers that make it difficult to achieve reliable knowledge discovery. Data analysis without removing outliers could lead to wrong results and provide misleading information.
We propose a new outlier detection method based on Kullback-Leibler (KL) divergence. The original concept of KL divergence was designed as a measure of distance between two distributions. Stemming from that, we extend it to biological sample outlier detection by forming sample sets composed of nearest neighbors. KL divergence is defined between two sample sets with and without the test sample. To handle the non-linearity of sample distribution, original data is mapped into a higher feature space. We address the singularity problem due to small sample size during KL divergence calculation. Kernel functions are applied to avoid direct use of mapping functions. The performance of the proposed method is demonstrated on a synthetic data set, two public microarray data sets, and a mass spectrometry data set for liver cancer study. Comparative studies with Mahalanobis distance based method and one-class support vector machine (SVM) are reported showing that the proposed method performs better in finding outliers.
Our idea was derived from Markov blanket algorithm that is a feature selection method based on KL divergence. That is, while Markov blanket algorithm removes redundant and irrelevant features, our proposed method detects outliers. Compared to other algorithms, our proposed method shows better or comparable performance for small sample and high-dimensional biological data. This indicates that the proposed method can be used to detect outliers in biological data sets.
RNA-seq, a massive parallel-sequencing-based transcriptome profiling method, provides digital data in the form of aligned sequence read counts. The comparative analyses of the data require appropriate statistical methods to estimate the differential expression of transcript variants across different cell/tissue types and disease conditions.
We developed a novel nonparametric empirical Bayesian-based approach (NPEBseq) to model the RNA-seq data. The prior distribution of the Bayesian model is empirically estimated from the data without any parametric assumption, and hence the method is “nonparametric” in nature. Based on this model, we proposed a method for detecting differentially expressed genes across different conditions. We also extended this method to detect differential usage of exons from RNA-seq data. The evaluation of NPEBseq on both simulated and publicly available RNA-seq datasets and comparison with three popular methods showed improved results for experiments with or without biological replicates.
NPEBseq can successfully detect differential expression between different conditions not only at gene level but also at exon level from RNA-seq datasets. In addition, NPEBSeq performs significantly better than current methods and can be applied to genome-wide RNA-seq datasets. Sample datasets and R package are available at http://bioinformatics.wistar.upenn.edu/NPEBseq.
In many statistical applications, data are collected over time, and they are likely correlated. In this paper, we investigate how to incorporate the correlation information into the local linear regression. Under the assumption that the error process is an auto-regressive process, a new estimation procedure is proposed for the nonparametric regression by using local linear regression method and the profile least squares techniques. We further propose the SCAD penalized profile least squares method to determine the order of auto-regressive process. Extensive Monte Carlo simulation studies are conducted to examine the finite sample performance of the proposed procedure, and to compare the performance of the proposed procedures with the existing one. From our empirical studies, the newly proposed procedures can dramatically improve the accuracy of naive local linear regression with working-independent error structure. We illustrate the proposed methodology by an analysis of real data set.
Auto-regressive error; local linear regression; partially linear model; profile least squares; SCAD
In a time-course microarray experiment, the expression level for each gene is observed across a number of time-points in order to characterize the temporal trajectories of the gene-expression profiles. For many of these experiments, the scientific aim is the identification of genes for which the trajectories depend on an experimental or phenotypic factor. There is an extensive recent body of literature on statistical methodology for addressing this analytical problem. Most of the existing methods are based on estimating the time-course trajectories using parametric or non-parametric mean regression methods. The sensitivity of these regression methods to outliers, an issue that is well documented in the statistical literature, should be of concern when analyzing microarray data.
In this paper, we propose a robust testing method for identifying genes whose expression time profiles depend on a factor. Furthermore, we propose a multiple testing procedure to adjust for multiplicity.
Through an extensive simulation study, we will illustrate the performance of our method. Finally, we will report the results from applying our method to a case study and discussing potential extensions.
We developed a quality assurance (QA) tool, namely microarray outlier filter (MOF), and have applied it to our microarray datasets for the identification of problematic arrays. Our approach is based on the comparison of the arrays using the correlation coefficient and the number of outlier spots generated on each array to reveal outlier arrays. For a human universal reference (HUR) dataset, which is used as a technical control in our standard hybridization procedure, 3 outlier arrays were identified out of 35 experiments. For a human blood dataset, 12 outlier arrays were identified from 185 experiments. In general, arrays from human blood samples displayed greater variation in their gene expression profiles than arrays from HUR samples. As a result, MOF identified two distinct patterns in the occurrence of outlier arrays. These results demonstrate that this methodology is a valuable QA practice to identify questionable microarray data prior to downstream analysis.
In high-throughput studies involving genetic data such as from gene expression microarrays, differential expression analysis between two or more experimental conditions has been a very common analytical task. Much of the resulting literature on multiple comparisons has paid relatively little attention to the choice of test statistic. In this article, we focus on the issue of choice of test statistic based on a special pattern of differential expression. The approach here is based on recasting multiple comparisons procedures for assessing outlying expression values. A major complication is that the resulting p-values are discrete; some theoretical properties of sequential testing procedures in this context are explored. We propose the use of q-value estimation procedures in this setting. Data from a gene expression profiling experiment in prostate cancer are used to illustrate the methodology.
The use of conditional logistic regression models to analyze matched case-control data has become standard in statistical analysis. However, methods to test the fit of these models has primarily focused on influential observations and the presence of outliers, while little attention has been given to the functional form of the covariates. In this paper we present methods to test the functional form of the covariates in the conditional logistic regression model, these methods are based on nonparametric smoothers. We assess the performance of the proposed methods via simulation studies and illustrate an example of their use on data from a community based intervention.
Conditional logistic regression; Kernel smoother; Model diagnostics
OBJECTIVE. The goal of this study was to develop unbiased risk-assessment models to be used for paying health plans on the basis of enrollee health status and use propensity. We explored the risk structure of adult employed HMO members using self-reported morbidities, functional status, perceived health status, and demographic characteristics. DATA SOURCES/STUDY SETTING. Data were collected on a random sample of members of a large, federally qualified, prepaid group practice, hospital-based HMO located in the Pacific Northwest. STUDY DESIGN. Multivariate linear nonparametric techniques were used to estimate risk weights on demographic, morbidity, and health status factors at the individual level. The dependent variable was annual real total health plan expense for covered services for the year following the survey. Repeated random split-sample validation techniques minimized outlier influences and avoided inappropriate distributional assumptions required by parametric techniques. DATA COLLECTION/EXTRACTION METHODS. A mail questionnaire containing an abbreviated medical history and the RAND-36 Health Survey was administered to a 5 percent sample of adult subscribers and their spouses in 1990 and 1991, with an overall 44 percent response rate. Utilization data were extracted from HMO automated information systems. Annual expenses were computed by weighting all utilization elements by standard unit costs for the HMO. PRINCIPAL FINDINGS. Prevalence of such major chronic diseases as heart disease, diabetes, depression, and asthma improve prediction of future medical expense; functional health status and morbidities are each better than simple demographic factors alone; functional and perceived health status as well as demographic characteristics and diagnoses together yield the best prediction performance and reduce opportunities for selection bias. We also found evidence of important interaction effects between functional/perceived health status scales and disease classes. CONCLUSIONS. Self-reported morbidities and functional health status are useful risk measures for adults. Risk-assessment research should focus on combining clinical information with social survey techniques to capitalize on the strengths of both approaches. Disease-specific functional health status scales should be developed and tested to capture the most information for prediction.