Motivation: Permutation tests have become a standard tool for assessing the statistical significance of an event under investigation. The statistical significance, expressed as a P-value, is calculated as the fraction of permutation values that are at least as extreme as the original statistic, which was derived from the non-permuted data. This empirical method directly couples both the minimal obtainable P-value and the resolution of the P-value to the number of permutations, and therefore requires a very large number of permutations when small P-values are to be estimated accurately. This is computationally expensive and often infeasible.
Results: A method of computing P-values based on tail approximation is presented. The tail of the distribution of permutation values is approximated by a generalized Pareto distribution. A good fit and thus accurate P-value estimates can be obtained with a drastically reduced number of permutations when compared with the standard empirical way of computing P-values.
Availability: The Matlab code can be obtained from the corresponding author on request.
Supplementary information: Supplementary data are available at Bioinformatics online.
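The tail-approximation idea above can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the function names are invented, a simple moment-based fit of the generalized Pareto distribution is used for brevity, and the fit quality is not diagnosed as a production method would.

```python
import numpy as np

def gpd_sf(z, xi, beta):
    """Survival function of a generalized Pareto distribution."""
    if abs(xi) < 1e-12:
        return np.exp(-z / beta)
    return max(0.0, 1.0 + xi * z / beta) ** (-1.0 / xi)

def tail_pvalue(stat, perms, n_exceed=250):
    """Approximate a small P-value by fitting a GPD to the permutation tail.

    Moment estimators are used here purely for simplicity; likelihood-based
    fitting is the more common choice in practice.
    """
    perms = np.sort(perms)
    n = len(perms)
    k = min(n_exceed, n // 10)           # number of exceedances used for the fit
    t = perms[n - k]                     # threshold: the k-th largest value
    exc = perms[n - k:] - t              # exceedances over the threshold
    m, v = exc.mean(), exc.var(ddof=1)
    xi = 0.5 * (1.0 - m * m / v)         # moment estimate of the shape
    beta = 0.5 * m * (m * m / v + 1.0)   # moment estimate of the scale
    if stat <= t:                        # statistic not in the modeled tail:
        return np.mean(perms >= stat)    # fall back to the empirical P-value
    return (k / n) * gpd_sf(stat - t, xi, beta)

rng = np.random.default_rng(0)
perms = rng.standard_normal(10_000)      # stand-in permutation values
p = tail_pvalue(4.0, perms)              # beyond essentially all permutations
```

With 10,000 permutations the smallest empirical P-value is 10⁻⁴, yet the fitted tail still yields a usable estimate well below that resolution.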
We present a new method to efficiently estimate very large numbers of p-values using empirically constructed null distributions of a test statistic. The need to evaluate a very large number of p-values is increasingly common with modern genomic data, and when interaction effects are of interest, the number of tests can easily run into billions. When the asymptotic distribution is not easily available, permutations are typically used to obtain p-values but these can be computationally infeasible in large problems. Our method constructs a prediction model to obtain a first approximation to the p-values and uses Bayesian methods to choose a fraction of these to be refined by permutations. We apply and evaluate our method on the study of association between 2-way interactions of genetic markers and colorectal cancer using the data from the first phase of a large, genome-wide case–control study. The results show enormous computational savings as compared to evaluating a full set of permutations, with little decrease in accuracy.
Bayesian testing; Genome-wide association studies; Interaction effects; Permutation distribution; p-value distribution; Random Forest
The resampling-based test, which often relies on permutation or bootstrap procedures, has been widely used for statistical hypothesis testing when the asymptotic distribution of the test statistic is unavailable or unreliable. It requires repeated calculation of the test statistic on a large number of simulated data sets to assess the significance level, and thus can become very computationally intensive. Here, we propose an efficient p-value evaluation procedure by adapting the stochastic approximation Markov chain Monte Carlo algorithm. The new procedure can be used easily to estimate the p-value for any resampling-based test. We show through numerical simulations that the proposed procedure can be 100–500 000 times as efficient (in terms of computing time) as the standard resampling-based procedure when evaluating a test statistic with a small p-value (e.g. less than 10⁻⁶). With its computational burden reduced by the proposed procedure, the versatile resampling-based test becomes computationally feasible for a much wider range of applications. We demonstrate the new method by applying it to a large-scale genetic association study of prostate cancer.
Bootstrap procedures; Genetic association studies; p-value; Resampling-based tests; Stochastic approximation Markov chain Monte Carlo
Large-scale genetic association studies can test hundreds of thousands of genetic markers for association with a trait. Since the genetic markers may be correlated, a Bonferroni correction is typically too stringent a correction for multiple testing. Permutation testing is a standard statistical technique for determining statistical significance when performing multiple correlated tests for genetic association. However, permutation testing for large-scale genetic association studies is computationally demanding and calls for optimized algorithms and software. PRESTO is a new software package for genetic association studies that performs fast computation of multiple-testing adjusted P-values via permutation of the trait.
PRESTO is an order of magnitude faster than other existing permutation testing software, and can analyze a large genome-wide association study (500 K markers, 5 K individuals, 1 K permutations) in approximately one hour of computing time. PRESTO has several unique features that are useful in a wide range of studies: it reports empirical null distributions for the top-ranked statistics (i.e. order statistics), it performs user-specified combinations of allelic and genotypic tests, it performs stratified analysis when sampled individuals are from multiple populations and each individual's population of origin is specified, and it determines significance levels for one and two-stage genotyping designs. PRESTO is designed for case-control studies, but can also be applied to trio data (parents and affected offspring) if transmitted parental alleles are coded as case alleles and untransmitted parental alleles are coded as control alleles.
PRESTO is a platform-independent software package that performs fast and flexible permutation testing for genetic association studies. The PRESTO executable file, Java source code, example data, and documentation are freely available at .
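The trait-permutation scheme underlying multiple-testing adjusted P-values can be illustrated with a generic maxT-style adjustment. The sketch below is not PRESTO: it uses a plain difference-in-means statistic per marker and omits PRESTO's allelic/genotypic tests, stratified analysis, and two-stage designs.

```python
import numpy as np

def maxT_adjusted_pvalues(geno, pheno, n_perm=1000, rng=None):
    """maxT-style multiple-testing adjusted P-values by permuting the trait.

    geno:  (n_individuals, n_markers) matrix of marker scores
    pheno: (n_individuals,) binary trait
    The per-marker statistic here is an absolute difference in means,
    chosen only to keep the sketch short.
    """
    rng = rng or np.random.default_rng(1)
    def stats(y):
        return np.abs(geno[y == 1].mean(axis=0) - geno[y == 0].mean(axis=0))
    obs = stats(pheno)
    max_null = np.empty(n_perm)
    for i in range(n_perm):          # permute the trait, keep genotypes fixed
        max_null[i] = stats(rng.permutation(pheno)).max()
    # adjusted P-value: fraction of permutations whose maximum beats each marker
    return (1 + (max_null[:, None] >= obs[None, :]).sum(axis=0)) / (1 + n_perm)

rng = np.random.default_rng(1)
geno = rng.integers(0, 3, size=(200, 50)).astype(float)
pheno = rng.integers(0, 2, size=200)
geno[:, 0] += pheno                  # plant a true association at marker 0
adj = maxT_adjusted_pvalues(geno, pheno, n_perm=200)
```

Because each marker is compared against the permutation distribution of the genome-wide maximum, correlation between markers is handled automatically, which is the advantage over a Bonferroni correction.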
Tests for association between a haplotype and disease are commonly performed using a likelihood ratio test for heterogeneity between case and control haplotype frequencies. Using data from a study of association between heroin dependence and the DRD2 gene, we obtained estimated haplotype frequencies and the associated likelihood ratio statistic using two different computer programs, MLOCUS and GENECOUNTING. We also carried out permutation testing to assess the empirical significance of the results obtained.
Both programs yielded similar, though not identical, estimates of the haplotype frequencies. MLOCUS produced a p value of 1.8×10⁻¹⁵ and GENECOUNTING produced a p value of 5.4×10⁻⁴. Permutation testing produced a p value of 2.8×10⁻⁴.
The very large differences between the likelihood ratio statistics from the two programs may reflect the fact that the haplotype frequencies for the combined group are not constrained to equal the weighted averages of the case and control frequencies, as they would be if the haplotypes were directly observed rather than estimated. Minor differences in haplotype frequency estimates can result in very large differences in the likelihood ratio statistic and its associated p value.
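The constraint discussed above can be made concrete for directly observed haplotypes, where pooled counts force the combined frequencies to be exactly the weighted averages of the group frequencies. The sketch below is illustrative (it is neither MLOCUS nor GENECOUNTING, and it skips the EM estimation step that causes the discrepancy in the first place).

```python
import math
from collections import Counter

def multinomial_loglik(counts):
    """Log-likelihood of observed haplotype counts at the MLE frequencies."""
    n = sum(counts.values())
    return sum(c * math.log(c / n) for c in counts.values() if c > 0)

def heterogeneity_lrt(case_haps, control_haps):
    """Likelihood ratio statistic for case/control haplotype-frequency
    heterogeneity, with directly observed haplotypes."""
    cases, controls = Counter(case_haps), Counter(control_haps)
    pooled = cases + controls        # pooled counts: frequencies are exactly
                                     # the weighted average of the two groups
    lr = 2 * (multinomial_loglik(cases) + multinomial_loglik(controls)
              - multinomial_loglik(pooled))
    df = len(pooled) - 1             # compared against a chi-square
    return lr, df

lr, df = heterogeneity_lrt(["AB"] * 30 + ["ab"] * 10,
                           ["AB"] * 15 + ["ab"] * 25)
```

When frequencies must instead be estimated by EM from unphased genotypes, the pooled-group estimates need not satisfy this weighted-average identity, which is how small frequency differences can blow up the statistic.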
Gene Set Enrichment Analysis (GSEA) is a computational method for the statistical evaluation of sorted lists of genes or proteins. GSEA was originally developed for interpreting microarray gene expression data, but it can be applied to any sorted list of genes. Given the gene list and an arbitrary biological category, GSEA evaluates whether the genes of the considered category are randomly distributed or accumulated at the top or bottom of the list. Usually, significance scores (p-values) of GSEA are computed by nonparametric permutation tests, a time-consuming procedure that yields only estimates of the p-values.
We present a novel dynamic programming algorithm for calculating exact significance values of unweighted Gene Set Enrichment Analyses. Our algorithm avoids typical problems of nonparametric permutation tests, such as varying findings in different runs caused by the random sampling procedure. Another advantage of the presented dynamic programming algorithm is its runtime and memory efficiency. To test our algorithm, we applied it not only to simulated data sets, but additionally evaluated expression profiles of squamous cell lung cancer tissue and autologous unaffected tissue.
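For an unweighted analysis, "exact" simply means evaluating the enrichment statistic over all equally likely placements of the category's genes in the ranked list. The brute-force sketch below makes this explicit for tiny lists; the paper's contribution is a dynamic program that obtains the same null distribution in polynomial time, which is not reproduced here.

```python
from itertools import combinations

def es_unweighted(n, hit_positions):
    """Unweighted GSEA enrichment score: maximal deviation of a running
    sum that steps up at category genes and down elsewhere."""
    k = len(hit_positions)
    hits = set(hit_positions)
    up, down = 1.0 / k, 1.0 / (n - k)
    s = best = 0.0
    for i in range(n):
        s += up if i in hits else -down
        best = max(best, abs(s))
    return best

def exact_pvalue(n, hit_positions):
    """Exact P-value by enumerating all C(n, k) placements of the k
    category genes; feasible only for tiny n."""
    k = len(hit_positions)
    obs = es_unweighted(n, hit_positions)
    total = extreme = 0
    for placement in combinations(range(n), k):
        total += 1
        if es_unweighted(n, placement) >= obs - 1e-12:
            extreme += 1
    return extreme / total

p = exact_pvalue(10, [0, 1, 2])   # three category genes at the very top
```

For a 10-gene list with the 3 category genes ranked first, only the all-top and all-bottom placements are as extreme, so the exact p-value is 2/120; rerunning gives the identical answer, unlike a sampled permutation test.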
In microarray experiments with small sample sizes, it is a challenge to estimate p-values accurately and decide cutoff p-values for gene selection appropriately. Although permutation-based methods have proved to have greater sensitivity and specificity than the regular t-test, their p-values are highly discrete due to the limited number of permutations available in very small sample sizes. Furthermore, estimated permutation-based p-values for true nulls are highly correlated and not uniformly distributed between zero and one, making it difficult to use current false discovery rate (FDR)-controlling methods.
We propose a model-based information sharing method (MBIS) that, after an appropriate data transformation, utilizes information shared among genes. We use a normal distribution to model the mean differences of true nulls across two experimental conditions. The parameters of the model are then estimated using all the data in hand. Based on this model, p-values that are uniformly distributed under the true nulls are calculated. Then, since FDR-controlling methods are generally not well suited to microarray data with very small sample sizes, we select genes at a given cutoff p-value and then estimate the false discovery rate.
Simulation studies and analysis using real microarray data show that the proposed method, MBIS, is more powerful and reliable than current methods. It has wide application to a variety of situations.
Pairwise comparison of time series data for both local and time-lagged relationships is a computationally challenging problem relevant to many fields of inquiry. The Local Similarity Analysis (LSA) statistic identifies the existence of local and lagged relationships, but determining significance through a p-value has been algorithmically cumbersome due to an intensive permutation test, shuffling rows and columns and repeatedly calculating the statistic. Furthermore, this p-value is calculated under the assumption of normality, a statistical luxury dissociated from most real-world datasets.
To improve the performance of LSA on big datasets, an asymptotic upper bound on the p-value calculation was derived without the assumption of normality. This change in the bound calculation markedly improved computational speed from O(pm²n) to O(m²n), where p is the number of permutations in a permutation test, m is the number of time series, and n is the length of each time series. The bounding process is implemented as a computationally efficient software package, FASTLSA, written in C and optimized for threading on multi-core computers, improving its practical computation time. We computationally compare our approach to previous implementations of LSA, demonstrate broad applicability by analyzing time series data from public health, microbial ecology, and social media, and visualize resulting networks using the Cytoscape software.
The FASTLSA software package expands the boundaries of LSA allowing analysis on datasets with millions of co-varying time series. Mapping metadata onto force-directed graphs derived from FASTLSA allows investigators to view correlated cliques and explore previously unrecognized network relationships. The software is freely available for download at: http://www.cmde.science.ubc.ca/hallam/fastLSA/.
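The LSA statistic itself is a small dynamic program: for each candidate lag, track the maximal local partial sum of the product of the two series. The sketch below is illustrative only; it uses a plain z-score standardization in place of the rank-based normal transform, and it computes the statistic, not FASTLSA's p-value bound.

```python
import numpy as np

def lsa_statistic(x, y, max_lag=3):
    """Local similarity: over all lags up to max_lag, the largest local
    partial sum (positive or negative) of the elementwise product of the
    standardized series, scaled by the series length."""
    n = len(x)
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    best = 0.0
    for d in range(-max_lag, max_lag + 1):       # shift y by d relative to x
        xa = x[max(0, d): n + min(0, d)]
        ya = y[max(0, -d): n - max(0, d)]
        pos = neg = 0.0
        for p in xa * ya:                        # Kadane-style running maxima
            pos = max(0.0, pos + p)
            neg = max(0.0, neg - p)
            best = max(best, pos, neg)
    return best / n

x = np.sin(np.arange(50) / 3.0)
s_self = lsa_statistic(x, x)                     # perfect local similarity
```

A series compared with itself attains the maximal score of 1 at lag zero, while an unrelated series scores well below it; the permutation test in the original formulation would shuffle the series and recompute this statistic many times.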
Motivation: Resampling methods, such as permutation and bootstrap, have been widely used to generate an empirical distribution for assessing the statistical significance of a measurement. However, to obtain a very low P-value, a large number of resamples is required, where computing speed, memory and storage consumption become bottlenecks, sometimes making the computation impossible even on a computer cluster.
Results: We have developed a multiple-stage P-value calculation program called FastPval that can efficiently calculate very low (down to 10⁻⁹) P-values from a large number of resampled measurements. With only two input files and a few parameter settings from the user, the program can compute P-values from an empirical distribution very efficiently, even on a personal computer. When tested on the order of 10⁹ resampled data points, our method used only 52.94% of the time required by the conventional method, implemented with standard quicksort and binary search algorithms, and consumed only 0.11% of the memory and storage. Furthermore, our method can be applied to extra-large datasets that the conventional method fails to handle. The accuracy of the method was tested on data generated from Normal, Poisson and Gumbel distributions and was found to be no different from that of the exact ranking approach.
Availability: The FastPval executable file, the java GUI and source code, and the java web start server with example data and introduction, are available at http://wanglab.hku.hk/pvalue
Supplementary information: Supplementary data are available at Bioinformatics online and http://wanglab.hku.hk/pvalue/.
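The conventional method referred to above (sort the resampled values once, then binary-search the rank of each observed value) can be written in a few lines, and it makes the resolution limit of empirical P-values explicit. This baseline is what multi-stage schemes improve upon; the function name is illustrative.

```python
import bisect
import numpy as np

def empirical_pvalues(observed, resampled):
    """Conventional empirical P-values: sort the resampled measurements
    once, then binary-search the rank of each observed value.

    P = (# resampled >= observed) / N, so the smallest attainable
    nonzero P-value is 1/N, and the whole null sample must be kept
    in memory, which is the bottleneck for N on the order of 10^9.
    """
    null = sorted(resampled)
    n = len(null)
    # bisect_left counts values < x, so n minus that counts values >= x
    return [(n - bisect.bisect_left(null, x)) / n for x in observed]

rng = np.random.default_rng(2)
null = rng.standard_normal(100_000)
pvals = empirical_pvalues([0.0, 2.0, 10.0], null)
```

The third observed value lies beyond every resample, so its empirical P-value collapses to zero, illustrating why the extreme tail needs either more resamples or a dedicated multi-stage treatment.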
Pomelo II (http://pomelo2.bioinfo.cnio.es) is an open-source, web-based, freely available tool for the analysis of gene (and protein) expression and tissue array data. Pomelo II implements: permutation-based tests for class comparisons (t-test, ANOVA) and regression; survival analysis using the Cox model; contingency table analysis with Fisher's exact test; linear models (of which the t-test and ANOVA are special cases) that allow additional covariates for complex experimental designs and use empirical Bayes moderated statistics. Permutation-based and Cox model analysis use parallel computing, which permits taking advantage of multicore CPUs and computing clusters. Access to, and further analysis of, additional biological information and annotations (PubMed references, Gene Ontology terms, KEGG and Reactome pathways) are available either for individual genes (from clickable links in tables and figures) or sets of genes. The source code is available, allowing for extending and reusing the software. A comprehensive test suite is also available, and covers both the user interface and the numerical results. The possibility of including additional covariates, parallelization of computation, open-source availability of the code and the comprehensive test suite make Pomelo II a unique tool.
Four applications of permutation tests to the single-mediator model are described and evaluated in this study. Permutation tests work by rearranging data in many possible ways in order to estimate the sampling distribution for the test statistic. The four applications to mediation evaluated here are the permutation test of ab, the permutation joint significance test, and the noniterative and iterative permutation confidence intervals for ab. A Monte Carlo simulation study was used to compare these four tests with the four best available tests for mediation found in previous research: the joint significance test, the distribution of the product test, and the percentile and bias-corrected bootstrap tests. We compared the different methods on Type I error, power, and confidence interval coverage. The noniterative permutation confidence interval for ab was the best performer among the new methods. It successfully controlled Type I error, had power nearly as good as the most powerful existing methods, and had better coverage than any existing method. The iterative permutation confidence interval for ab had lower power than some existing methods, but it performed better than any other method in terms of coverage. The permutation confidence interval methods are recommended when estimating a confidence interval is a primary concern. SPSS and SAS macros that estimate these confidence intervals are provided.
Mediation; Permutation test
Challenges of satisfying parametric assumptions in genomic settings with thousands or millions of tests have led investigators to combine powerful False Discovery Rate (FDR) approaches with computationally expensive but exact permutation testing. We describe a computationally efficient permutation-based approach that includes a tractable estimator of the proportion of true null hypotheses, the variance of the log of tail-area FDR, and a confidence interval (CI) estimator, which accounts for the number of permutations conducted and dependencies between tests. The CI estimator applies a binomial distribution and an overdispersion parameter to counts of positive tests. The approach is general with regard to the distribution of the test statistic, it performs favorably in comparison to other approaches, and reliable FDR estimates are demonstrated with as few as 10 permutations. An application of this approach to relate sleep patterns to gene expression patterns in mouse hypothalamus yielded a set of 11 transcripts associated with 24 h REM sleep [FDR = 0.15 (0.08, 0.26)]. Two of the corresponding genes, Sfrp1 and Sfrp4, are involved in wnt signaling and several others, Irf7, Ifit1, Iigp2, and Ifih1, have links to interferon signaling. These genes would have been overlooked had a typical a priori FDR threshold such as 0.05 or 0.1 been applied. The CI provides the flexibility for choosing a significance threshold based on tolerance for false discoveries and precision of the FDR estimate. That is, it frees the investigator to use a more data-driven approach to define significance, such as the minimum estimated FDR, an option that is especially useful for weak effects, often observed in studies of complex diseases.
false discovery rates; multiple testing; simultaneous inference; gene expression; sleep
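The tail-area FDR that the interval estimator above wraps can be sketched directly from permutation counts. This simplified version fixes the proportion of true nulls at 1 and omits the paper's overdispersed-binomial confidence interval; the function name is illustrative.

```python
import numpy as np

def tail_area_fdr(obs_stats, perm_stats, threshold):
    """Permutation-based tail-area FDR at a threshold.

    FDR(t) ~= pi0 * (mean # permuted statistics >= t per permutation)
              / (# observed statistics >= t).
    pi0 is fixed at 1 here for simplicity; the paper estimates it from
    the data and attaches a CI that accounts for overdispersion.
    """
    r = np.sum(obs_stats >= threshold)           # observed rejections
    if r == 0:
        return 0.0
    # perm_stats: (n_perm, n_tests) matrix of permuted statistics
    false_pos = np.mean(np.sum(perm_stats >= threshold, axis=1))
    return min(1.0, false_pos / r)

rng = np.random.default_rng(3)
obs = rng.standard_normal(1000)
obs[:20] += 4.0                                  # 20 truly associated transcripts
perm = rng.standard_normal((10, 1000))           # only 10 permutations, as above
fdr = tail_area_fdr(obs, perm, threshold=3.0)
```

Because the numerator averages false-positive counts over permutations rather than ranking each test individually, a handful of permutations already stabilizes the estimate, which is the point of the "as few as 10 permutations" claim.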
Most existing association tests for genome-wide association studies (GWAS) fail to account for genetic heterogeneity. Zhou and Pan proposed a binomial mixture model based association test to account for possible genetic heterogeneity in case-control studies. The idea is elegant; however, the proposed test requires an EM-type iterative algorithm to identify the penalized maximum likelihood estimates and a permutation method to assess p-values. The intensive computational burden induced by the EM algorithm and the permutation becomes prohibitive for direct application to genome-wide association studies. This paper develops a likelihood ratio test (LRT) for genome-wide association studies under genetic heterogeneity based on a more general alternative mixture model. In particular, a closed-form formula for the likelihood ratio test statistic is derived to avoid the EM-type iterative numerical evaluation. Moreover, an explicit asymptotic null distribution is also obtained, which avoids using permutation to obtain p-values. Thus, the proposed LRT is easy to implement for genome-wide association studies. Furthermore, numerical studies demonstrate that the LRT has power advantages over the commonly used Armitage trend test and other existing association tests under genetic heterogeneity. A breast cancer GWAS data set is used to illustrate the newly proposed LRT.
association test; binomial mixture model; complex disease; genetic heterogeneity; genomewide association study
Motivation: ChIPseq is rapidly becoming a common technique for investigating protein–DNA interactions. However, results from individual experiments provide a limited understanding of chromatin structure, as various chromatin factors cooperate in complex ways to orchestrate transcription. To quantify chromatin interactions, it is thus necessary to devise a robust similarity metric applicable to ChIPseq data. Unfortunately, moving past simple overlap calculations to give statistically rigorous comparisons of ChIPseq datasets often involves arbitrary choices of distance metrics, with significance being estimated by computationally intensive permutation tests whose statistical power may be sensitive to non-biological experimental and post-processing variation.
Results: We show that it is in fact possible to compare ChIPseq datasets through the efficient computation of exact P-values for proximity. Our method is insensitive to non-biological variation in datasets such as peak width, and can rigorously model peak location biases by evaluating similarity conditioned on a restricted set of genomic regions (such as mappable genome or promoter regions).
Applying our method to the well-studied dataset of Chen et al. (2008), we elucidate novel interactions which conform well with our biological understanding. By comparing ChIPseq data in an asymmetric way, we are able to observe clear interaction differences between cofactors such as p300 and factors that bind DNA directly.
Availability: Source code is available for download at http://sonorus.princeton.edu/IntervalStats/IntervalStats.tar.gz
Supplementary data are available at Bioinformatics online.
Multifactor dimensionality reduction (MDR) was developed as a nonparametric and model-free data mining method for detecting, characterizing, and interpreting epistasis in the absence of significant main effects in genetic and epidemiologic studies of complex traits such as disease susceptibility. The goal of MDR is to change the representation of the data using a constructive induction algorithm to make nonadditive interactions easier to detect using any classification method such as naïve Bayes or logistic regression. Traditionally, MDR-constructed variables have been evaluated with a naïve Bayes classifier that is combined with 10-fold cross validation to obtain an estimate of predictive accuracy or generalizability of epistasis models. We have traditionally used permutation testing to statistically evaluate the significance of models obtained through MDR. The advantage of permutation testing is that it controls for false-positives due to multiple testing. The disadvantage is that permutation testing is computationally expensive. This is an important issue that arises in the context of detecting epistasis on a genome-wide scale. The goal of the present study was to develop and evaluate several alternatives to large-scale permutation testing for assessing the statistical significance of MDR models. Using data simulated from 70 different epistasis models, we compared the power and type I error rate of MDR using a 1000-fold permutation test with hypothesis testing using an extreme value distribution (EVD). We find that this new hypothesis testing method provides a reasonable alternative to the computationally expensive 1000-fold permutation test and is 50 times faster. We then demonstrate this new method by applying it to a genetic epidemiology study of bladder cancer susceptibility that was previously analyzed using MDR and assessed using a 1000-fold permutation test.
Extreme Value Distribution; Permutation Testing; Power; Type I Error; Bladder Cancer; Data Mining
Bayesian network models are commonly used to model gene expression data. Some applications require a comparison of the network structure of a set of genes between varying phenotypes. In principle, separately fit models can be directly compared, but it is difficult to assign statistical significance to any observed differences. There would therefore be an advantage to the development of a rigorous hypothesis test for homogeneity of network structure. In this paper, a generalized likelihood ratio test based on Bayesian network models is developed, with significance level estimated using permutation replications. In order to be computationally feasible, a number of algorithms are introduced. First, a method for approximating multivariate distributions due to Chow and Liu (1968) is adapted, permitting the polynomial-time calculation of a maximum likelihood Bayesian network with maximum indegree of one. Second, sequential testing principles are applied to the permutation test, allowing significant reduction of computation time while preserving reported error rates used in multiple testing. The method is applied to gene-set analysis, using two sets of experimental data, and some advantage to a pathway modelling approach to this problem is reported.
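The Chow-Liu step above can be made concrete for binary expression indicators: score every pair of variables by empirical mutual information and keep a maximum-weight spanning tree, so each node retains at most one parent. This is a generic sketch of the 1968 construction under an assumed binary encoding, not the paper's adapted version or its sequential permutation test.

```python
import numpy as np
from itertools import combinations

def mutual_information(a, b):
    """Empirical mutual information (in nats) between two binary variables."""
    mi = 0.0
    for va in (0, 1):
        for vb in (0, 1):
            pab = np.mean((a == va) & (b == vb))
            pa, pb = np.mean(a == va), np.mean(b == vb)
            if pab > 0:
                mi += pab * np.log(pab / (pa * pb))
    return mi

def chow_liu_tree(data):
    """Chow-Liu tree: maximum-weight spanning tree on pairwise mutual
    information (Prim's algorithm); each node gets at most one parent."""
    n_var = data.shape[1]
    w = np.zeros((n_var, n_var))
    for i, j in combinations(range(n_var), 2):
        w[i, j] = w[j, i] = mutual_information(data[:, i], data[:, j])
    in_tree, edges = {0}, []
    while len(in_tree) < n_var:
        i, j = max(((i, j) for i in in_tree for j in range(n_var)
                    if j not in in_tree), key=lambda e: w[e])
        edges.append((i, j))         # j's single parent in the tree is i
        in_tree.add(j)
    return edges

rng = np.random.default_rng(7)
n = 500
def flip(v, p=0.1):                  # copy a binary signal with 10% noise
    return (v + (rng.random(len(v)) < p)) % 2
x0 = rng.integers(0, 2, n)
x1, x2 = flip(x0), None
x2 = flip(x1)                        # chain x0 -> x1 -> x2
x3 = rng.integers(0, 2, n)           # unrelated variable
edges = chow_liu_tree(np.column_stack([x0, x1, x2, x3]))
```

On the simulated chain the recovered tree follows the true dependency path, which is what makes the maximum-indegree-one restriction computable in polynomial time.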
Genome-wide association studies commonly involve simultaneous tests of millions of single nucleotide polymorphisms (SNP) for disease association. The SNPs in nearby genomic regions, however, are often highly correlated due to linkage disequilibrium (LD, a genetic term for correlation). Simple Bonferroni correction for multiple comparisons is therefore too conservative. Permutation tests, which are often employed in practice, are both computationally expensive for genome-wide studies and limited in scope. We present an accurate and computationally efficient method, based on Poisson de-clumping heuristics, for approximating genome-wide significance of SNP associations. Compared with permutation tests and other multiple comparison adjustment approaches, our method computes the most accurate and robust p-value adjustments for millions of correlated comparisons within seconds. We demonstrate analytically that the accuracy and the efficiency of our method are nearly independent of the sample size, the number of SNPs, and the scale of p-values to be adjusted. In addition, our method can be easily adopted to estimate false discovery rate. When applied to genome-wide SNP datasets, we observed highly variable p-value adjustment results evaluated from different genomic regions. The variation in adjustments along the genome, however, is well conserved between the European and the African populations. The p-value adjustments are significantly correlated with LD among SNPs, recombination rates, and SNP densities. Given the large variability of sequence features in the genome, we further discuss a novel approach of using SNP-specific (local) thresholds to detect genome-wide significant associations. This article has supplementary material online.
Genome-wide association study; Multiple comparison; Poisson approximation
Large-scale statistical analyses have become hallmarks of post-genomic era biological research due to advances in high-throughput assays and the integration of large biological databases. One accompanying issue is the simultaneous estimation of p-values for a large number of hypothesis tests. In many applications, a parametric assumption in the null distribution such as normality may be unreasonable, and resampling-based p-values are the preferred procedure for establishing statistical significance. Using resampling-based procedures for multiple testing is computationally intensive and typically requires large numbers of resamples.
We present a new approach to more efficiently assign resamples (such as bootstrap samples or permutations) within a nonparametric multiple testing framework. We formulated a Bayesian-inspired approach to this problem and devised an algorithm that adapts the assignment of resamples iteratively with negligible space and running-time overhead. In two experimental studies, a breast cancer microarray dataset and a genome-wide association study dataset for Parkinson's disease, we demonstrated that our differential allocation procedure is substantially more accurate than the traditional uniform resample allocation.
Our experiments demonstrate that using a more sophisticated allocation strategy can improve our inference for hypothesis testing without a drastic increase in the amount of computation on randomized data. Moreover, we gain more improvement in efficiency when the number of tests is large. R code for our algorithm and the shortcut method are available at .
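One way to picture Bayesian-guided differential allocation is to keep a Beta posterior on each test's p-value and send the next batch of permutations to the tests whose significance is still ambiguous. This is an illustrative allocation rule under those assumptions, not the paper's algorithm, and it approximates the Beta CDF with a normal curve to stay dependency-free.

```python
import numpy as np
from math import erf, sqrt

def ambiguity(hits, total, alpha):
    """Score a test by how close its posterior probability of having a
    true P-value below alpha is to 0.5, under a Beta(1 + hits,
    1 + total - hits) posterior (crude normal approximation to the CDF).

    hits: permutations whose statistic beat the observed one."""
    a, b = 1.0 + hits, 1.0 + total - hits
    m = a / (a + b)                                 # posterior mean
    s = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))  # posterior sd
    prob_signif = 0.5 * (1.0 + erf((alpha - m) / (s * sqrt(2))))
    return -abs(prob_signif - 0.5)                  # 0 = maximally ambiguous

def allocate_resamples(hits, totals, alpha, batch):
    """Give the next batch of permutations to the most ambiguous tests."""
    scores = [ambiguity(h, t, alpha) for h, t in zip(hits, totals)]
    return list(np.argsort(scores)[::-1][:batch])

# three tests after 100 permutations each: clearly significant,
# borderline, and clearly null
chosen = allocate_resamples([0, 5, 50], [100, 100, 100], alpha=0.05, batch=1)
```

The clearly significant and clearly null tests waste no further permutations; only the borderline test (5 hits in 100, hovering near alpha = 0.05) receives more, which is the intuition behind differential allocation.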
Summary: A large number of websites and applications perform significance testing for gene categories/pathways in microarray data. Many of these packages fail to account for expression correlation between transcripts, with a resultant inflation in Type I error. Array permutation and other resampling-based approaches have been proposed as solutions to this problem. SAFEGUI provides a user-friendly graphical interface for the assessment of categorical significance in microarray studies, while properly accounting for the effects of correlations among genes. SAFEGUI incorporates both permutation and more recently proposed bootstrap algorithms that are demonstrated to be more powerful in detecting differential expression across categories of genes.
Contact: firstname.lastname@example.org; email@example.com
Circular Binary Segmentation (CBS) is a permutation-based algorithm for array Comparative Genomic Hybridization (aCGH) data analysis. CBS accurately segments data by detecting change-points using a maximal-t test, but an extensive computational burden is involved in evaluating the significance of change-points using permutations. A recent implementation utilizing a hybrid method and early stopping rules (hybrid CBS) to improve speed was subsequently proposed. However, a time analysis revealed that a major portion of the computation time of the hybrid CBS was still spent on permutation. In addition, what the hybrid method provides is an approximation of an upper or lower bound on the significance, not an approximation of the significance of the change-points itself.
We developed a novel model-based algorithm, extreme-value-based CBS (eCBS), which limits permutations and provides robust results without loss of accuracy. Thousands of aCGH datasets under the null hypothesis were simulated in advance based on a variety of non-normal assumptions, and the corresponding maximal-t distribution was modeled by the Generalized Extreme Value (GEV) distribution. The modeling results, which associate characteristics of aCGH data with the GEV parameters, constitute lookup tables (the eXtreme model). Using the eXtreme model, the significance of change-points can be evaluated in constant time through a table lookup.
A novel algorithm, eCBS, was developed in this study. The current implementation of eCBS consistently outperforms the hybrid CBS 4× to 20× in computation time without loss of accuracy. Source codes, supplementary materials, supplementary figures, and supplementary tables can be found at http://ntumaps.cgm.ntu.edu.tw/eCBSsupplementary.
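The extreme-value modeling idea can be sketched by fitting the null distribution of a maximal statistic once and then reading p-values from the fitted curve. The sketch below uses the Gumbel special case of the GEV (shape 0) with a moment fit, purely for brevity; eCBS fits the full GEV and stores the parameters in lookup tables keyed on data characteristics.

```python
import numpy as np

EULER = 0.5772156649  # Euler-Mascheroni constant

def fit_gumbel(maxima):
    """Moment fit of a Gumbel distribution (GEV with shape 0) to a
    sample of null maximal statistics."""
    scale = maxima.std(ddof=1) * np.sqrt(6) / np.pi
    loc = maxima.mean() - EULER * scale
    return loc, scale

def gumbel_pvalue(x, loc, scale):
    """P(max > x) under the fitted Gumbel model: constant-time, no
    further permutations needed once the fit is stored."""
    return 1.0 - np.exp(-np.exp(-(x - loc) / scale))

rng = np.random.default_rng(4)
# null maximal statistics: max over 100 positions in each of 2000 simulations
maxima = rng.standard_normal((2000, 100)).max(axis=1)
loc, scale = fit_gumbel(maxima)
p_model = gumbel_pvalue(3.5, loc, scale)
p_empirical = np.mean(maxima >= 3.5)
```

The fitted model agrees with the empirical tail where both are available, but unlike the empirical estimate it extends smoothly beyond the largest simulated maximum, which is what makes a precomputed lookup table usable for very small significance levels.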
Motivation: Permutation testing is very popular for analyzing microarray data to identify differentially expressed (DE) genes; estimating false discovery rates (FDRs) is a very popular way to address the inherent multiple testing problem. However, combining these approaches may be problematic when sample sizes are unequal.
Results: With unbalanced data, permutation tests may not be suitable because they do not test the hypothesis of interest. In addition, permutation tests can be biased. Using biased P-values to estimate the FDR can produce unacceptable bias in those estimates. Results also show that the approach of pooling permutation null distributions across genes can produce invalid P-values, since even non-DE genes can have different permutation null distributions. We encourage researchers to use statistics that have been shown to reliably discriminate DE genes, but caution that associated P-values may be either invalid, or a less-effective metric for discriminating DE genes.
Supplementary information: Supplementary data are available at Bioinformatics online.
Many analyses of gene expression data involve hypothesis tests of an interaction term between two fixed effects, typically tested using a residual variance. In expression studies, the issue of variance heteroscedasticity has received much attention, and previous work has focused on either between-gene or within-gene heteroscedasticity. However, in a single experiment, heteroscedasticity may exist both within and between genes. Here we develop flexible shrinkage error estimators that account for both between-gene and within-gene heteroscedasticity and use them to construct F-like test statistics for testing interactions, with cutoff values obtained by permutation. Such permutation tests are not straightforward, and several permutation schemes are investigated here.
Our proposed test statistics are compared with other existing shrinkage-type test statistics through extensive simulation studies and a real data example. The results show that the choice of permutation procedures has dramatically more influence on detection power than the choice of F or F-like test statistics. When both types of gene heteroscedasticity exist, our proposed test statistics can control preselected type-I errors and are more powerful. Raw data permutation is not valid in this setting. Whether unrestricted or restricted residual permutation should be used depends on the specific type of test statistic.
The F-like test statistic that uses the proposed flexible shrinkage error estimator, which accounts for both types of gene heteroscedasticity, combined with unrestricted residual permutation provides a statistically valid and powerful test. We therefore recommend that it always be applied when testing an interaction term in the analysis of real gene expression data.
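To make the residual-permutation idea concrete, the sketch below implements an unrestricted residual-permutation scheme for a two-way interaction test: residuals of the additive (main-effects-only) model are permuted freely and re-added to the additive fit before the interaction statistic is recomputed (a ter Braak-style scheme). This is a minimal sketch with illustrative names; the statistic is supplied by the caller and is not the authors' shrinkage-based F-like statistic, and the additive fit assumes balanced data.

```python
import numpy as np

def residual_permutation_pvalue(y, A, B, stat_fn, n_perm=999, seed=0):
    """Unrestricted residual permutation for an interaction test.

    Residuals from the additive model (grand mean + row effect +
    column effect, balanced data) are permuted, re-added to the
    additive fit, and stat_fn is recomputed on each pseudo-dataset.
    All names here are illustrative, not from the original paper.
    """
    rng = np.random.default_rng(seed)
    grand = y.mean()
    # Additive fit per observation: row mean + column mean - grand mean.
    fit = (np.array([y[A == a].mean() for a in A])
           + np.array([y[B == b].mean() for b in B]) - grand)
    resid = y - fit
    obs = stat_fn(y, A, B)
    exceed = sum(stat_fn(fit + rng.permutation(resid), A, B) >= obs
                 for _ in range(n_perm))
    # +1 correction keeps the empirical P-value strictly positive.
    return (exceed + 1) / (n_perm + 1)
```

Here `stat_fn(y, A, B)` can be any interaction statistic, e.g. the interaction sum of squares; plugging in a shrinkage-based F-like statistic would follow the same pattern.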
Bayesian shrinkage analysis is the state-of-the-art method for whole-genome analysis of quantitative traits. It can estimate the genetic effects for the entire genome using a dense marker map; the technique is now called genomic selection. A nice property of shrinkage analysis is that it can estimate QTL effects that explain as little as 2% of the phenotypic variance in a typical sample size of 300–500 individuals. In most cases, QTL can be detected by simple visual inspection of the estimated effects across the genome because the false positive rate is low. As a Bayesian method, no significance test is needed. However, it is still desirable to attach some measure of confidence to the estimated QTL effects. We propose using a permutation test to derive empirical thresholds for declaring QTL significant under a predetermined genome-wide type I error rate. With the permutation test, Bayesian shrinkage analysis can be routinely used for QTL detection.
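The empirical genome-wide threshold can be sketched as follows: the phenotype vector is permuted, the maximum per-locus statistic across the genome is recorded for each permutation, and the (1 − α) quantile of these maxima serves as the significance threshold (the classic Churchill–Doerge scheme). Function and argument names below are illustrative, not taken from the authors' software.

```python
import numpy as np

def genomewide_threshold(phenotype, stat_fn, n_perm=1000, alpha=0.05, seed=0):
    """Empirical genome-wide threshold under a preset type I error rate.

    stat_fn maps a (possibly permuted) phenotype vector to a vector of
    per-locus test statistics; permuting the phenotype breaks the
    marker-phenotype association while preserving marker correlations.
    """
    rng = np.random.default_rng(seed)
    max_stats = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(phenotype)
        max_stats[i] = np.max(stat_fn(perm))  # maximum over all loci
    return np.quantile(max_stats, 1.0 - alpha)
```

Any effect estimate exceeding this threshold is then declared a significant QTL at genome-wide level α.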
Motivation: High-dimensional data are frequently generated in genome-wide association studies (GWAS) and other studies. It is important to identify features, such as single nucleotide polymorphisms (SNPs) in GWAS, that are associated with a disease. Random forests are a useful approach for this purpose via their variable importance score, but this importance score has several shortcomings. We propose an alternative importance measure that overcomes these shortcomings.
Results: We characterized the effect of multiple SNPs under various models using our proposed importance measure in random forests, which uses the maximal conditional chi-square (MCC) as a measure of association between a SNP and the trait conditional on other SNPs. Based on this importance measure, we employed a permutation test to estimate empirical P-values of SNPs. Our method was compared with a univariate test and with permutation tests using the Gini and permutation importance. In simulations, the proposed method consistently outperformed the other methods in identifying risk SNPs. In a GWAS of age-related macular degeneration, the proposed method confirmed two significant SNPs (at a genome-wide adjusted level of 0.05). Further analysis showed that these two SNPs conform to a heterogeneity model. Compared with the existing importance measures, the MCC importance measure is more sensitive to complex effects of risk SNPs because it uses conditional information across SNPs. The permutation test with the MCC importance measure provides an efficient way to identify candidate SNPs in GWAS and facilitates understanding of the etiological links between genetic variants and complex diseases.
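The empirical P-value computation underlying this (and most) permutation tests can be sketched generically. In the sketch below, `stat_fn` stands in for the MCC importance or any other test statistic, and all names are illustrative; the +1 correction in numerator and denominator is the standard device that keeps the estimate strictly positive.

```python
import numpy as np

def permutation_pvalue(x, y, stat_fn, n_perm=999, seed=0):
    """Empirical P-value: the fraction of permuted statistics at least
    as extreme as the observed one, with the (b + 1) / (B + 1)
    correction to avoid reporting an exact zero.

    x: feature values, y: labels permuted under the null, stat_fn: any
    statistic where larger values mean stronger association.
    """
    rng = np.random.default_rng(seed)
    observed = stat_fn(x, y)
    exceed = 0
    for _ in range(n_perm):
        if stat_fn(x, rng.permutation(y)) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)
```

Note that with B permutations the smallest attainable P-value is 1/(B + 1), which is exactly the resolution limit that motivates tail-approximation methods.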
Supplementary information: Supplementary data are available at Bioinformatics online.
One important application of microarray experiments is to identify differentially expressed genes. Often, small and negative expression levels are clipped to an arbitrarily chosen cutoff value before a statistical test is carried out, which yields two types of data: truncated values and original observations. The truncated values are not just another point on the continuum of possible values; it is therefore appropriate to combine two statistical tests in a two-part model rather than to use standard statistical methods. A similar situation occurs when DNA methylation data are investigated: there are null values (undetectable methylation) and observed positive values. For these data, we propose a two-part permutation test.
The proposed permutation test yields smaller p-values than the original two-part test. We found this for both DNA methylation data and microarray data. A simulation study confirmed this result and showed that the two-part permutation test is, on average, more powerful. The new test also reduces, without any loss of power, to a standard test when there are no null or truncated values.
The two-part permutation test can be used in routine analyses since it reduces to a standard test when there are positive values only. Further advantages of the new test are that it opens the possibility to use other test statistics to construct the two-part test and that it avoids the use of any asymptotic distribution. The latter advantage is particularly important for the analysis of microarrays since sample sizes are usually small.
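A minimal sketch of such a two-part permutation test follows, assuming z-type components: a proportion comparison on the zero/positive split and a t-like comparison of the strictly positive values, combined as a sum of squared z-scores and calibrated by permutation. Since the framework leaves the component statistics open, this is one possible instantiation with illustrative names, not the authors' exact test.

```python
import numpy as np

def two_part_stat(x, y):
    """Two-part statistic: squared z-score for the difference in the
    proportion of positive (detected) values, plus a squared t-like
    z-score for the difference in means of the positive values."""
    x0, x1 = x[y == 0], x[y == 1]
    # Part 1: two-sample proportion comparison on the zero/positive split.
    p0, p1 = np.mean(x0 > 0), np.mean(x1 > 0)
    p = np.mean(x > 0)
    se = np.sqrt(p * (1 - p) * (1 / len(x0) + 1 / len(x1)))
    z_prop = 0.0 if se == 0 else (p1 - p0) / se
    # Part 2: Welch-type comparison restricted to strictly positive values.
    a, b = x0[x0 > 0], x1[x1 > 0]
    if len(a) > 1 and len(b) > 1:
        se2 = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
        z_cont = 0.0 if se2 == 0 else (b.mean() - a.mean()) / se2
    else:
        z_cont = 0.0
    return z_prop ** 2 + z_cont ** 2

def two_part_perm_test(x, y, n_perm=999, seed=0):
    """Permutation P-value for the two-part statistic; no asymptotic
    distribution is needed, which matters for small sample sizes."""
    rng = np.random.default_rng(seed)
    obs = two_part_stat(x, y)
    exceed = sum(two_part_stat(x, rng.permutation(y)) >= obs
                 for _ in range(n_perm))
    return (exceed + 1) / (n_perm + 1)
```

When all values are positive, the proportion part contributes zero in every permutation and the procedure collapses to a standard permutation test on the continuous statistic, mirroring the reduction property described above.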