Correlated sampling Monte Carlo methods can shorten computing times in brachytherapy treatment planning. Monte Carlo efficiency is typically estimated via the efficiency gain, defined as the reduction in computing time achieved by correlated sampling relative to conventional Monte Carlo methods when equal statistical uncertainties have been reached. Determining the uncertainty of the efficiency gain arising from random effects, however, is not a straightforward task, especially when the error distribution is non-normal. The purpose of this study is to evaluate the applicability of the F distribution and standardized uncertainty propagation methods (widely used in metrology to estimate the uncertainty of physical measurements) for predicting confidence intervals about efficiency gain estimates derived from single Monte Carlo runs using fixed-collision correlated sampling in a simplified brachytherapy geometry. A bootstrap-based algorithm was used to simulate the probability distribution of the efficiency gain estimates, and the shortest 95% confidence interval was estimated from this distribution. The corresponding relative uncertainty was found to be as large as 37% for this particular problem. The uncertainty propagation framework predicted confidence intervals reasonably well; however, its main disadvantage was that the uncertainties of the input quantities had to be calculated in a separate run via a Monte Carlo method. The F distribution noticeably underestimated the confidence interval. These discrepancies were influenced by several photons with large statistical weights that made extremely large contributions to the scored absorbed dose difference. The mechanism of acquiring high statistical weights in the fixed-collision correlated sampling method was explained and a mitigation strategy was proposed.
Monte Carlo; correlated sampling; efficiency; uncertainty; bootstrap
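As an illustration of the bootstrap procedure described above, the following sketch resamples per-history scores from a correlated-sampling run and a conventional run and extracts the shortest 95% interval from the bootstrap replicates. The per-history arrays, run times and the simple efficiency-gain expression are illustrative placeholders, not the study's actual data or implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def efficiency_gain(d_corr, d_conv, t_corr, t_conv):
    """Gain = (variance of the mean * run time) of conventional MC over that
    of correlated sampling, estimated from per-history scores."""
    var_corr = d_corr.var(ddof=1) / d_corr.size
    var_conv = d_conv.var(ddof=1) / d_conv.size
    return (var_conv * t_conv) / (var_corr * t_corr)

def shortest_interval(samples, level=0.95):
    """Shortest interval containing `level` of the bootstrap replicates."""
    s = np.sort(samples)
    k = int(np.ceil(level * s.size))
    widths = s[k - 1:] - s[:s.size - k + 1]
    i = np.argmin(widths)
    return s[i], s[i + k - 1]

# Illustrative per-history scores and run times (placeholders, not real data).
d_corr = rng.normal(1.0, 0.2, 10_000)   # correlated-sampling dose differences
d_conv = rng.normal(1.0, 1.0, 10_000)   # conventional MC dose scores
t_corr, t_conv = 1.0, 1.2               # run times in arbitrary units

boot = np.array([
    efficiency_gain(rng.choice(d_corr, d_corr.size, replace=True),
                    rng.choice(d_conv, d_conv.size, replace=True),
                    t_corr, t_conv)
    for _ in range(2000)
])
print("gain estimate:", efficiency_gain(d_corr, d_conv, t_corr, t_conv))
print("shortest 95% CI:", shortest_interval(boot))
```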
The Monte Carlo method provides the most accurate dose calculations on a patient computed tomography (CT) geometry. The increase in accuracy is, at least in part, due to the fact that instead of treating human tissues as water of various densities, as in analytical algorithms, the Monte Carlo method allows human tissues to be characterized by elemental composition and mass density, and hence allows the accurate consideration of all relevant electromagnetic and nuclear interactions. On the other hand, the algorithm that converts CT Hounsfield numbers to tissue materials for Monte Carlo dose calculation introduces uncertainties; there is no simple one-to-one correspondence between Hounsfield numbers and tissue materials. To investigate the effects of Hounsfield number conversion on proton Monte Carlo dose calculations, clinical proton treatment plans were simulated using the Geant4 Monte Carlo code. Three Hounsfield number to material conversion methods were studied. The results were compared in the form of dose-volume histograms of the gross tumor volume and clinical target volume. The differences found are generally small but can be dosimetrically significant. Further, different methods may cause deviations in the predicted proton beam range, in particular for deep proton fields. Typically, slight discrepancies in mass density assignments play only a minor role in the target region, whereas more significant effects are caused by differences in the assigned elemental compositions. In the presence of large tissue inhomogeneities, as in head and neck treatments, treatment planning decisions could be affected by these differences because of deviations in the predicted tumor coverage. Outside the target area, differences in elemental composition and mass density assignments may both play a role. This can lead to pronounced effects for organs at risk, in particular in the spread-out Bragg peak penumbra or distal regions. In addition, the significance of the elemental composition effect (dose to water vs. dose to tissue) is tissue-type dependent and is also affected by nuclear reactions.
Geant4; Monte Carlo; proton therapy; CT Hounsfield conversion
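The kind of conversion being compared can be illustrated by a piecewise Hounsfield-number lookup; the breakpoints, density ramps and material labels below are hypothetical placeholders and do not reproduce any of the three methods studied.

```python
# Hypothetical piecewise Hounsfield-number conversion of the kind compared above.
# Breakpoints, densities and material labels are illustrative only.
def hu_to_material(hu):
    """Map a CT Hounsfield number to (material label, mass density in g/cm^3)."""
    if hu < -950:
        return "air", 0.001
    if hu < -100:
        return "lung", 0.001 + (hu + 950) * (0.5 - 0.001) / 850.0
    if hu < 100:
        return "soft tissue", 1.0 + hu * 0.001
    return "bone", 1.1 + (hu - 100) * 0.0005

for hu in (-1000, -700, -30, 40, 800):
    print(hu, hu_to_material(hu))
```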
Randomization is a hallmark of clinical trials. If a trial entails very few subjects and has many prognostic factors (or many factor levels) to be balanced, minimization is a more efficient method to achieve balance than simple randomization. We propose a novel minimization method, the ‘two-way minimization’. The method separately calculates the ‘imbalance in the total numbers of subjects’ and the ‘imbalance in the distributions of prognostic factors’. To allocate a subject, it then chooses, with a specified probability, to minimize one of these two aspects of imbalance. As such, it is a method that is both treatment-adaptive and covariate-adaptive. We perform Monte Carlo simulations to examine its statistical properties. The two-way minimization (with proper regression adjustment for the force-balanced prognostic factors) has the correct type I error rates. It also produces point estimates that are unbiased and variance estimates that are accurate. When there are important prognostic factors to be balanced in the study, the method achieves the highest power and the smallest variance among randomization methods that are resistant to selection bias. The allocation can be done in real time and the subsequent data analysis is straightforward. The two-way minimization is recommended to balance prognostic factors in small trials.
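A minimal sketch of one allocation step in this spirit is shown below, assuming two arms and categorical prognostic factors; the imbalance measures and the probabilities used to choose between the two aspects of imbalance are illustrative choices, not the authors' exact specification.

```python
import random

def two_way_minimization(assignments, covariates, new_covariates,
                         p_total=0.5, p_favour=0.8):
    """Allocate one incoming subject to arm 0 or 1.

    assignments:    arm labels (0/1) of previously enrolled subjects
    covariates:     their prognostic factor-level tuples
    new_covariates: factor-level tuple of the incoming subject
    p_total:        probability of minimizing the total-number imbalance
    p_favour:       probability of actually taking the imbalance-reducing arm
    """
    n = [assignments.count(0), assignments.count(1)]

    def factor_imbalance(arm):
        # Imbalance in factor-level counts if the subject joined `arm`.
        imb = 0
        for j, level in enumerate(new_covariates):
            counts = [sum(1 for a, c in zip(assignments, covariates)
                          if a == k and c[j] == level) for k in (0, 1)]
            counts[arm] += 1
            imb += abs(counts[0] - counts[1])
        return imb

    if random.random() < p_total:          # minimize total-number imbalance
        preferred = 0 if n[0] < n[1] else 1 if n[1] < n[0] else random.randint(0, 1)
    else:                                   # minimize factor-distribution imbalance
        i0, i1 = factor_imbalance(0), factor_imbalance(1)
        preferred = 0 if i0 < i1 else 1 if i1 < i0 else random.randint(0, 1)
    return preferred if random.random() < p_favour else 1 - preferred

print(two_way_minimization([0, 1, 1], [(0, 1), (1, 0), (1, 1)], (1, 0)))
```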
The spatial and space-time scan statistics are commonly applied for the detection of geographical disease clusters. Monte Carlo hypothesis testing is typically used to test whether the geographical clusters are statistically significant as there is no known way to calculate the null distribution analytically. In Monte Carlo hypothesis testing, simulated random data are generated multiple times under the null hypothesis, and the p-value is r/(R + 1), where R is the number of simulated random replicates of the data and r is the rank of the test statistic from the real data compared to the same test statistics calculated from each of the random data sets. A drawback to this powerful technique is that each additional digit of p-value precision requires ten times as many replicated datasets, and the additional processing can lead to excessive run times.
We propose a new method for obtaining more precise p-values with a given number of replicates. The collection of test statistics from the random replicates is used to estimate the true distribution of the test statistic under the null hypothesis by fitting a continuous distribution to these observations. The choice of distribution is critical, and for the spatial and space-time scan statistics, the extreme value Gumbel distribution performs very well while the gamma, normal and lognormal distributions perform poorly. From the fitted Gumbel distribution, we show that it is possible to estimate the analytical p-value with great precision even when the test statistic is far out in the tail beyond any of the test statistics observed in the simulated replicates. In addition, Gumbel-based rejection probabilities have smaller variability than Monte Carlo-based rejection probabilities, suggesting that the proposed approach may result in greater power than the true Monte Carlo hypothesis test for a given number of replicates.
For large data sets, it is often advantageous to replace computer-intensive Monte Carlo hypothesis testing with this new method of fitting a Gumbel distribution to random data sets generated under the null, in order to reduce computation time and obtain much more precise p-values and slightly higher statistical power.
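Both p-value computations can be sketched as follows, assuming the replicate test statistics have already been computed; scipy.stats.gumbel_r supplies the extreme-value fit, and the numerical values are placeholders.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Illustrative test statistics: one observed value and R = 999 null replicates.
t_obs = 11.2
t_null = rng.gumbel(loc=6.0, scale=1.2, size=999)   # placeholder null replicates

# Conventional Monte Carlo p-value: rank of the observed statistic.
R = t_null.size
r = 1 + np.sum(t_null >= t_obs)            # rank among {replicates, observed}
p_mc = r / (R + 1)

# Gumbel-based p-value: fit an extreme-value distribution to the replicates.
loc, scale = stats.gumbel_r.fit(t_null)
p_gumbel = stats.gumbel_r.sf(t_obs, loc=loc, scale=scale)

print(f"Monte Carlo p = {p_mc:.4f}, Gumbel-based p = {p_gumbel:.6f}")
```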
Coalescent theory is a powerful tool for population geneticists as well as molecular biologists interested in understanding the patterns and levels of DNA variation. Using coalescent Monte Carlo simulations it is possible to obtain the empirical distributions for a number of statistics across a wide range of evolutionary models; these distributions can be used to test evolutionary hypotheses using experimental data. The mlcoalsim application presented here (based on a version of the ms program, Hudson, 2002) adds important new features to improve methodology (uncertainty and conditional methods for mutation and recombination), models (including strong positive selection, finite sites and heterogeneity in mutation and recombination rates) and analyses (calculating a number of statistics used in population genetics and P-values for observed data). One of the most important features of mlcoalsim is the analysis of multilocus data in linked and independent regions. In summary, mlcoalsim is an integrated software application aimed at researchers interested in molecular evolution. mlcoalsim is written in ANSI C and is available at: http://www.ub.es/softevol/mlcoalsim.
Neutrality tests; Rejection algorithm; Population Genetics; Multilocus analyses; Coalescent simulations
Model choice is one of the most crucial aspects of any statistical data analysis. It is well known that most models are just approximations to the true data-generating process, but among such model approximations it is our goal to select the “best” one. Researchers typically consider a finite number of plausible models in statistical applications, and the related statistical inference depends on the chosen model. Hence model comparison is required to identify the “best” model among several such candidate models. This article considers the problem of model selection for spatial data. The issue of model selection for spatial models has been addressed in the literature by the use of traditional information criterion based methods, even though such criteria were developed under the assumption of independent observations. We evaluate the performance of some of the popular model selection criteria via Monte Carlo simulation experiments using small to moderate samples. In particular, we compare the performance of some of the most popular information criteria, such as the Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), and Corrected AIC (AICc), in selecting the true model. The ability of these criteria to select the correct model is evaluated under several scenarios. This comparison is made using various spatial covariance models ranging from stationary isotropic to nonstationary models.
model selection; spatial models; information criteria
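For reference, the three criteria compared are simple functions of the maximized log-likelihood; the sketch below computes them for hypothetical candidate fits (the log-likelihood values and parameter counts are placeholders, and a fitted spatial covariance model would supply them in practice).

```python
import numpy as np

def information_criteria(loglik, k, n):
    """AIC, BIC and AICc from a maximized log-likelihood.

    loglik: maximized log-likelihood of the fitted (e.g. spatial) model
    k:      number of estimated parameters
    n:      number of observations
    """
    aic = -2.0 * loglik + 2.0 * k
    bic = -2.0 * loglik + k * np.log(n)
    aicc = aic + 2.0 * k * (k + 1) / (n - k - 1)
    return {"AIC": aic, "BIC": bic, "AICc": aicc}

# Example: pick the candidate with the smallest criterion value.
fits = {"exponential": (-152.3, 3), "Matern": (-150.9, 4)}   # (loglik, k), illustrative
n = 100
for name, (ll, k) in fits.items():
    print(name, information_criteria(ll, k, n))
```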
In-vivo measurement of bone lead by means of K-X ray fluorescence (KXRF) is the preferred biological marker of chronic exposure to lead. Unfortunately, the considerable measurement error associated with KXRF estimations can introduce bias in estimates of the effect of bone lead when this variable is included as the exposure in a regression model. Estimates of uncertainty reported by the KXRF instrument reflect the variance of the measurement error and, although they can be used to correct the measurement error bias, they are seldom used in epidemiological statistical analyses. Errors-in-variables (EIV) regression allows for correction of bias caused by measurement error in predictor variables, based on knowledge of the reliability of such variables. The authors propose a way to obtain reliability coefficients for bone lead measurements from uncertainty data reported by the KXRF instrument and compare, by use of Monte Carlo simulations, results obtained using EIV regression models versus those obtained by standard procedures. Results of the simulations show that Ordinary Least Squares (OLS) regression models provide severely biased estimates of effect, and that EIV provides nearly unbiased estimates. Although EIV effect estimates are more imprecise, their mean squared error is much smaller than that of OLS estimates. In conclusion, EIV is a better alternative than OLS for estimating the effect of bone lead when measured by KXRF.
Lead; KXRF; measurement error; errors-in-variables model; simulations
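The correction at issue can be sketched with simulated data: a reliability coefficient is formed from the instrument-reported error variances and the observed exposure variance, and the attenuated OLS slope is divided by it. The data-generating values below are illustrative only, not the study's simulation settings.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# Simulated "true" bone lead, KXRF measurements with reported per-measurement
# uncertainty, and an outcome that depends on true exposure (all illustrative).
true_pb = rng.normal(20.0, 8.0, n)
meas_sd = rng.uniform(3.0, 6.0, n)            # uncertainties reported by the instrument
obs_pb = true_pb + rng.normal(0.0, meas_sd)
outcome = 0.5 * true_pb + rng.normal(0.0, 5.0, n)

# Naive OLS slope is attenuated by measurement error.
beta_ols = np.cov(obs_pb, outcome)[0, 1] / np.var(obs_pb, ddof=1)

# Reliability coefficient from the reported uncertainties, then the EIV correction.
reliability = 1.0 - np.mean(meas_sd**2) / np.var(obs_pb, ddof=1)
beta_eiv = beta_ols / reliability

print(f"OLS slope {beta_ols:.3f}, EIV-corrected slope {beta_eiv:.3f} (true 0.5)")
```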
To use a Monte Carlo simulation to predict postoperative results with the AcrySof® Toric lens, evaluating the likelihood of over- or under-correction using various toric lens selection criteria.
Keratometric data were obtained from a large patient population with preoperative corneal astigmatism <= 2.50D (2,000 eyes). The probability distributions for toric marking accuracy, surgically induced astigmatism and lens rotation were estimated using available data. Anticipated residual astigmatism was calculated using a Monte Carlo simulation under two different lens selection scenarios.
This simulation demonstrated that random errors in alignment, surgically induced astigmatism and lens rotation slightly reduced the overall effect of the toric lens. Residual astigmatism was statistically significantly higher under the simulation of surgery relative to an exact calculation (p < 0.05). The simulation also demonstrated that more aggressive lens selection criteria could produce clinically significant reductions in residual astigmatism in a high percentage of patients.
Monte Carlo simulation suggests that surgical variability and lens orientation/rotation variability may combine to produce small reductions in the correction achieved with the AcrySof® Toric® IOL. Adopting more aggressive lens selection criteria may yield significantly lower residual astigmatism values for many patients, with negligible overcorrections. Surgeons are encouraged to evaluate their AcrySof® Toric® outcomes to determine if they should modify their individual lens selection criteria, or their default surgically induced astigmatism value, to benefit their patients.
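A simulation of this kind can be sketched by representing astigmatism as double-angle power vectors, so that alignment error and postoperative rotation reduce the achieved correction. The sketch below omits surgically induced astigmatism for brevity, and the lens power and error distributions are placeholders rather than the study's inputs.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

def to_vector(mag, axis_deg):
    """Double-angle power-vector representation of an astigmatism."""
    a = np.deg2rad(2.0 * axis_deg)
    return np.column_stack([mag * np.cos(a), mag * np.sin(a)])

# Illustrative inputs: corneal astigmatism and the cylinder the lens corrects.
corneal = to_vector(np.full(n, 1.50), np.zeros(n))          # 1.50 D at 0 degrees
lens_cyl = 1.50

# Random alignment error (marking + placement) plus postoperative rotation.
misalign = rng.normal(0.0, 3.0, n) + rng.normal(0.0, 3.0, n)   # degrees, illustrative
correction = to_vector(np.full(n, lens_cyl), misalign)

residual = np.linalg.norm(corneal - correction, axis=1)
print(f"mean residual astigmatism {residual.mean():.3f} D, "
      f"95th percentile {np.percentile(residual, 95):.3f} D")
```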
Statistical power calculations constitute an essential first step in the planning of scientific studies. If sufficient summary statistics are available, power calculations are in principle straightforward and computationally light. In designs that comprise distinct groups (e.g., MZ and DZ twins), sufficient statistics can be calculated within each group and analyzed in a multi-group model. However, when the number of possible groups is prohibitively large (say, in the hundreds), power calculations on the basis of the summary statistics become impractical. In that case, researchers may resort to Monte Carlo-based power studies, which involve the simulation of hundreds or thousands of replicate samples for each specified set of population parameters. Here we present exact data simulation as a third method of power calculation. Exact data simulation involves a transformation of raw data so that the data fit the hypothesized model exactly. As in power calculation with summary statistics, exact data simulation is computationally light, while the number of groups in the analysis has little bearing on the practicality of the method. The method is applied to three genetic designs for illustrative purposes.
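The core idea, transforming raw draws so that the sample moments equal the hypothesized population values exactly, can be sketched for a mean and covariance structure as follows; the genetic designs in the paper involve model-implied covariances, but the transformation is the same in spirit.

```python
import numpy as np

rng = np.random.default_rng(4)

def exact_simulate(n, mu, sigma):
    """Return an n-by-p data matrix whose sample mean and covariance equal
    mu and sigma exactly (so the hypothesized model fits the data perfectly)."""
    p = len(mu)
    z = rng.standard_normal((n, p))
    z -= z.mean(axis=0)                                  # exact zero mean
    # Whiten the raw draws, then re-colour with the target covariance.
    l_obs = np.linalg.cholesky(np.cov(z, rowvar=False, ddof=1))
    l_target = np.linalg.cholesky(sigma)
    x = z @ np.linalg.inv(l_obs).T @ l_target.T
    return x + np.asarray(mu)

x = exact_simulate(50, [0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]])
print(np.round(x.mean(axis=0), 10))
print(np.round(np.cov(x, rowvar=False, ddof=1), 10))
```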
Protein-DNA docking is a very challenging problem in structural bioinformatics and has important implications in a number of applications, such as structure-based prediction of transcription factor binding sites and rational drug design. Protein-DNA docking is very computationally demanding due to the high cost of energy calculation and the statistical nature of conformational sampling algorithms. More importantly, experiments show that the docking quality depends on the coverage of the conformational sampling space. It is therefore desirable to accelerate the computation of the docking algorithm, not only to reduce computing time, but also to improve docking quality.
In an attempt to accelerate the sampling process and to improve the docking performance, we developed a graphics processing unit (GPU)-based protein-DNA docking algorithm. The algorithm employs a potential-based energy function to describe the binding affinity of a protein-DNA pair, and integrates Monte-Carlo simulation and a simulated annealing method to search through the conformational space. Algorithmic techniques were developed to improve the computation efficiency and scalability on GPU-based high performance computing systems.
The effectiveness of our approach is tested on a non-redundant set of 75 TF-DNA complexes and a newly developed TF-DNA docking benchmark. We demonstrated that the GPU-based docking algorithm can significantly accelerate the simulation process and thereby improve the chance of finding near-native TF-DNA complex structures. This study also suggests that further improvement in protein-DNA docking research would require efforts from two integral aspects: improvement in computation efficiency and energy function design.
We present a high performance computing approach for improving the prediction accuracy of protein-DNA docking. The GPU-based docking algorithm accelerates the search of the conformational space and thus increases the chance of finding more near-native structures. To the best of our knowledge, this is the first effort to apply GPUs or GPU clusters to the protein-DNA docking problem.
The field of complex biomechanical modeling has begun to rely on Monte Carlo techniques to investigate the effects of parameter variability and measurement uncertainty on model outputs, search for optimal parameter combinations, and define model limitations. However, advanced stochastic methods to perform data-driven explorations, such as Markov chain Monte Carlo (MCMC), become necessary as the number of model parameters increases. Here, we demonstrate the feasibility of, and what is to our knowledge the first use of, an MCMC approach to improve the fitness of realistically large biomechanical models. We used a Metropolis–Hastings algorithm to search increasingly complex parameter landscapes (3, 8, 24, and 36 dimensions) to uncover underlying distributions of anatomical parameters of a “truth model” of the human thumb on the basis of simulated kinematic data (thumbnail location, orientation, and linear and angular velocities) polluted by zero-mean, uncorrelated multivariate Gaussian “measurement noise.” Driven by these data, ten Markov chains searched each model parameter space for the subspace that best fit the data (posterior distribution). As expected, the convergence time increased, more local minima were found, and marginal distributions broadened as the parameter space complexity increased. In the 36-D scenario, some chains found local minima, but the majority of chains converged to the true posterior distribution (confirmed using a cross-validation dataset), thus demonstrating the feasibility and utility of these methods for realistically large biomechanical problems.
Bayesian statistics; biomechanical model; Markov chain Monte Carlo (MCMC); Metropolis–Hastings algorithm; parameter estimation; thumb
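A compact random-walk Metropolis–Hastings sketch for a generic log-posterior is shown below; the three-dimensional Gaussian target, proposal scale and chain length are placeholders standing in for the biomechanical model and simulated kinematic data.

```python
import numpy as np

rng = np.random.default_rng(5)

def metropolis_hastings(log_post, theta0, n_steps=5000, step=0.1):
    """Random-walk Metropolis-Hastings sampling of log_post, starting at theta0."""
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    chain = np.empty((n_steps, theta.size))
    for i in range(n_steps):
        proposal = theta + step * rng.standard_normal(theta.size)
        lp_prop = log_post(proposal)
        if np.log(rng.random()) < lp_prop - lp:   # accept with prob min(1, ratio)
            theta, lp = proposal, lp_prop
        chain[i] = theta
    return chain

# Toy 3-D Gaussian "posterior" standing in for the data-driven model fit.
log_post = lambda t: -0.5 * np.sum((t - np.array([1.0, -2.0, 0.5]))**2)
chain = metropolis_hastings(log_post, np.zeros(3))
print(chain[1000:].mean(axis=0))   # posterior means after burn-in
```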
It is often desirable to separate voxels that contain signal from tissue along with measurement noise from those that contain purely measurement noise. Generally, this separation, called thresholding, utilizes only the magnitude portion of the images. Recently, methods have been developed that utilize both the magnitude and phase for thresholding voxels. This manuscript is an extension of previous work and uses the bivariate normality of the real and imaginary values with phase-coupled means. A likelihood ratio statistic is derived that simplifies to a more familiar form that is F-distributed in large samples. It is shown that in small samples, critical values from Monte Carlo simulation can be used to threshold this statistic with the proper Type I and Type II error rates. The method is applied to susceptibility weighted magnetic resonance images and shown to produce increased tissue contrast.
Complex; Magnitude; Phase; Threshold; SWI
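Obtaining small-sample critical values by Monte Carlo can be sketched as follows; the statistic used here is a simple placeholder, not the likelihood ratio statistic derived in the manuscript.

```python
import numpy as np

rng = np.random.default_rng(6)

def mc_critical_value(statistic, n, n_sim=20_000, alpha=0.05):
    """Upper-alpha critical value of `statistic` under pure complex Gaussian noise,
    estimated by Monte Carlo for a small sample size n."""
    null_stats = np.empty(n_sim)
    for i in range(n_sim):
        re = rng.standard_normal(n)
        im = rng.standard_normal(n)
        null_stats[i] = statistic(re, im)
    return np.quantile(null_stats, 1.0 - alpha)

# Placeholder statistic: squared mean complex signal relative to noise power.
def stat(re, im):
    return (re.mean()**2 + im.mean()**2) / ((re.var(ddof=1) + im.var(ddof=1)) / 2)

crit = mc_critical_value(stat, n=8)
print(f"Monte Carlo 5% critical value for n = 8: {crit:.4f}")
# A voxel is kept (declared signal) when its statistic exceeds `crit`.
```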
Uncertainty in comparative analyses can come from at least two sources: a) phylogenetic uncertainty in the tree topology or branch lengths, and b) uncertainty due to intraspecific variation in trait values, either due to measurement error or natural individual variation. Most phylogenetic comparative methods do not account for such uncertainties. Not accounting for these sources of uncertainty leads to false perceptions of precision (confidence intervals will be too narrow) and inflated significance in hypothesis testing (e.g. p-values will be too small). Although there is some application-specific software for fitting Bayesian models accounting for phylogenetic error, more general and flexible software is desirable.
We developed models to directly incorporate phylogenetic uncertainty into a range of analyses that biologists commonly perform, using a Bayesian framework and Markov Chain Monte Carlo analyses.
We demonstrate applications in linear regression, quantification of phylogenetic signal, and measurement error models. Phylogenetic uncertainty was incorporated by applying a prior distribution for the phylogeny, where this distribution consisted of the posterior tree sets from Bayesian phylogenetic tree estimation programs. The models were analysed using simulated data sets, and applied to a real data set on plant traits, from rainforest plant species in Northern Australia. Analyses were performed using the free and open source software OpenBUGS and JAGS.
Incorporating phylogenetic uncertainty through an empirical prior distribution of trees leads to more precise estimation of regression model parameters than using a single consensus tree and enables a more realistic estimation of confidence intervals. In addition, models incorporating measurement errors and/or individual variation, in one or both variables, are easily formulated in the Bayesian framework. We show that BUGS is a useful, flexible general purpose tool for phylogenetic comparative analyses, particularly for modelling in the face of phylogenetic uncertainty and accounting for measurement error or individual variation in explanatory variables. Code for all models is provided in the BUGS model description language.
In DNA microarray gene expression profiling studies, a fundamental task is to extract statistically significant genes that meet a given research hypothesis. Currently, the Venn diagram is a frequently used method for identifying overlapping genes that meet the investigator's research hypotheses. However, this simple operation of intersecting multiple gene lists, known as an Intersection-Union Test (IUT), is performed without knowledge of the incurred change in the Type I error rate and can lead to a loss of discovery power.
We developed an IUT adjustment procedure, called the Relaxed IUT (RIUT), which is proved to be less conservative and more powerful for intersecting independent tests than the traditional Venn diagram approach. The advantage of the RIUT procedure over the traditional IUT is demonstrated by empirical Monte Carlo simulation and two real toxicogenomic gene expression case studies. Notably, the enhanced power of RIUT enables it to identify overlapping gene sets, leading to the identification of certain known related pathways that were not detected using the traditional IUT method.
We showed that the traditional IUT via a Venn diagram is generally conservative, which may lead to a loss of discovery power in DNA microarray studies. RIUT proved to be a more powerful alternative for performing IUTs to identify overlapping genes from multiple gene lists derived from microarray gene expression profiling.
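For context, the traditional IUT rejects only when every component test rejects, so its p-value is the maximum of the component p-values; the sketch below shows that baseline (the relaxed adjustment itself is not reproduced here).

```python
import numpy as np

def iut_pvalue(pvals):
    """Traditional intersection-union test: the overlap is declared significant
    only if all component tests reject, so p = max of the component p-values."""
    return np.max(pvals, axis=0)

# Per-gene p-values from two independent comparisons (illustrative numbers).
p_list1 = np.array([0.001, 0.030, 0.200, 0.004])
p_list2 = np.array([0.010, 0.045, 0.015, 0.300])
p_iut = iut_pvalue(np.vstack([p_list1, p_list2]))
print("IUT p-values:", p_iut)            # a gene "overlaps" if p_iut < 0.05
```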
The Monte Carlo simulation of sequence evolution is routinely used to assess the performance of phylogenetic inference methods and sequence alignment algorithms. Progress in the field of molecular evolution fuels the need for more realistic and hence more complex simulations, adapted to particular situations, yet current software makes unreasonable assumptions such as homogeneous substitution dynamics or a uniform distribution of indels across the simulated sequences. This calls for an extensible simulation framework written in a high-level functional language, offering new functionality and making it easy to incorporate further complexity.
PhyloSim is an extensible framework for the Monte Carlo simulation of sequence evolution, written in R, using the Gillespie algorithm to integrate the actions of many concurrent processes such as substitutions, insertions and deletions. Uniquely among sequence simulation tools, PhyloSim can simulate arbitrarily complex patterns of rate variation and multiple indel processes, and allows for the incorporation of selective constraints on indel events. User-defined complex patterns of mutation and selection can be easily integrated into simulations, allowing PhyloSim to be adapted to specific needs.
Close integration with R and the wide range of features implemented offer unmatched flexibility, making it possible to simulate sequence evolution under a wide range of realistic settings. We believe that PhyloSim will be useful to future studies involving simulated alignments.
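A minimal Gillespie-style sketch with competing substitution, insertion and deletion events is given below; the rates, alphabet handling and single-base indels are placeholders for PhyloSim's much richer process models.

```python
import random

random.seed(7)

def gillespie_sequence(seq, t_max, sub_rate=1.0, ins_rate=0.05, del_rate=0.05):
    """Evolve `seq` until time t_max with exponential waiting times between events."""
    alphabet = "ACGT"
    seq, t = list(seq), 0.0
    while seq:
        total = len(seq) * (sub_rate + ins_rate + del_rate)
        t += random.expovariate(total)            # waiting time to the next event
        if t > t_max:
            break
        pos = random.randrange(len(seq))
        u = random.random() * (sub_rate + ins_rate + del_rate)
        if u < sub_rate:                          # substitution
            seq[pos] = random.choice(alphabet.replace(seq[pos], ""))
        elif u < sub_rate + ins_rate:             # insertion of one base
            seq.insert(pos, random.choice(alphabet))
        else:                                     # deletion of one base
            del seq[pos]
    return "".join(seq)

print(gillespie_sequence("ACGTACGTACGTACGT", t_max=0.5))
```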
The chest-wall layer underneath breast tissue consists of muscles and bones, which induce distortion in near-infrared diffused waves measured at distant source–detector pairs when reflection geometry is used. A priori information on chest-wall depth obtained from coregistered real-time ultrasound can be used to assist in the removal of distant measurements. We applied Monte Carlo simulation to a simple two-layer model consisting of breast tissue and a chest wall to investigate chest-wall-induced distortion. The Monte Carlo method indicates that, when more than 50% of the received photons travel through the breast tissue layer before being detected, the detected signal may be useful for image reconstruction. The results of phantom experiments obtained from the two-layer model further validate the distortion problem and demonstrate imaging improvement after distant measurements have been filtered out. Clinical examples have shown similar imaging improvements on reconstructed absorption maps. Clinical data obtained from 20 patients with the chest-wall depths of less than 2 cm from the skin surface suggest that the cutoff distances of distorted measurements are largely related to the chest-wall depth and are relatively independent of the optical properties of tissue.
Cronbach's alpha is widely used as the preferred index of reliability for medical postgraduate examinations. A value of 0.8-0.9 is seen by providers and regulators alike as an adequate demonstration of acceptable reliability for any assessment. Of the other statistical parameters, the Standard Error of Measurement (SEM) is mainly seen as useful only in determining the accuracy of a pass mark. However, the alpha coefficient depends both on the SEM and on the ability range (standard deviation, SD) of candidates taking an exam. This study investigated the extent to which the necessarily narrower ability range of candidates taking the second of the three-part MRCP(UK) diploma examinations biases assessment of reliability and the SEM.
a) The interrelationships of standard deviation (SD), SEM and reliability were investigated in a Monte Carlo simulation of 10,000 candidates taking a postgraduate examination. b) Reliability and SEM were studied in the MRCP(UK) Part 1 and Part 2 Written Examinations from 2002 to 2008. c) Reliability and SEM were studied in eight Specialty Certificate Examinations introduced in 2008-9.
The Monte Carlo simulation showed, as expected, that restricting the range of an assessment only to those who had already passed it dramatically reduced the reliability but did not affect the SEM of a simulated assessment. The analysis of the MRCP(UK) Part 1 and Part 2 written examinations showed that the MRCP(UK) Part 2 written examination had a lower reliability than the Part 1 examination, but, despite that lower reliability, the Part 2 examination also had a smaller SEM (indicating a more accurate assessment). The Specialty Certificate Examinations had small Ns and, as a result, wide variability in their reliabilities, but their SEMs were comparable with those of MRCP(UK) Part 2.
An emphasis upon assessing the quality of assessments primarily in terms of reliability alone can produce a paradoxical and distorted picture, particularly in the situation where a narrower range of candidate ability is an inevitable consequence of being able to take a second part examination only after passing the first part examination. Reliability also shows problems when numbers of candidates in examinations are low and sampling error affects the range of candidate ability. SEM is not subject to such problems; it is therefore a better measure of the quality of an assessment and is recommended for routine use.
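The relationship at issue is SEM = SD * sqrt(1 - reliability), so truncating the ability range shrinks the SD and the reliability while leaving the SEM largely unchanged. A small simulation in the spirit of part (a), with arbitrary parameter values and two parallel sittings used to estimate reliability, illustrates this:

```python
import numpy as np

rng = np.random.default_rng(8)
n, true_sem = 10_000, 5.0                          # illustrative values

ability = rng.normal(60.0, 12.0, n)                # candidate ability
sitting1 = ability + rng.normal(0.0, true_sem, n)  # two parallel sittings of the exam
sitting2 = ability + rng.normal(0.0, true_sem, n)

def summarize(mask, label):
    sd = sitting2[mask].std(ddof=1)
    reliability = np.corrcoef(sitting1[mask], sitting2[mask])[0, 1]
    sem = sd * np.sqrt(1.0 - reliability)          # SEM = SD * sqrt(1 - reliability)
    print(f"{label}: SD = {sd:.1f}, reliability = {reliability:.2f}, SEM = {sem:.1f}")

summarize(np.ones(n, dtype=bool), "all candidates")
summarize(sitting1 > 60.0, "restricted range (passed the first sitting)")
```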
Motivation: High-throughput sequencing enables expression analysis at the level of individual transcripts. The analysis of transcriptome expression levels and differential expression (DE) estimation requires a probabilistic approach to properly account for ambiguity caused by shared exons and finite read sampling as well as the intrinsic biological variance of transcript expression.
Results: We present Bayesian inference of transcripts from sequencing data (BitSeq), a Bayesian approach for estimation of transcript expression level from RNA-seq experiments. Inferred relative expression is represented by Markov chain Monte Carlo samples from the posterior probability distribution of a generative model of the read data. We propose a novel method for DE analysis across replicates which propagates uncertainty from the sample-level model while modelling biological variance using an expression-level-dependent prior. We demonstrate the advantages of our method using simulated data as well as an RNA-seq dataset with technical and biological replication for both studied conditions.
Availability: The implementation of the transcriptome expression estimation and differential expression analysis, BitSeq, has been written in C++ and Python. The software is available online from http://code.google.com/p/bitseq/; version 0.4 was used for generating the results presented in this article.
Supplementary data are available at Bioinformatics online.
Two analytical procedures for identifying young children as categorizers, the Monte Carlo Simulation and the Probability Estimate Model, were compared. Using a sequential touching method, children aged 12, 18, 24, and 30 months were given seven object sets representing different levels of categorical classification. From their touching performance, the probability that children were categorizing was then determined independently using the Monte Carlo Simulation and the Probability Estimate Model. The two analytical procedures resulted in different percentages of children being classified as categorizers. Results using the Monte Carlo Simulation were more consistent with group-level analyses than results using the Probability Estimate Model. These findings recommend using the Monte Carlo Simulation for determining individual categorizer classification.
categorization; categorizer classification; toddlers
In continuous wave (CW) electron paramagnetic resonance imaging (EPRI), a high-quality reconstructed image along with fast and reliable data acquisition is highly desirable for many biological applications. An accurate representation of a uniform distribution of projection data is necessary to ensure high reconstruction quality. The current techniques for data acquisition suffer from nonuniformities or local anisotropies in the distribution of projection data and present a poor approximation of a true uniform and isotropic distribution. In this work, we have implemented a technique based on the Quasi-Monte Carlo method to acquire projections with a more uniform and isotropic distribution of data over a 3D acquisition space. The proposed technique exhibits improvements in reconstruction quality in terms of both mean-square error and visual judgment. The effectiveness of the suggested technique is demonstrated using computer simulations and 3D EPRI experiments. The technique is robust and exhibits consistent performance for different object configurations and orientations.
Projection Acquisition; Artifacts; Reconstruction; EPRI
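Generating more uniformly distributed projection directions with a low-discrepancy sequence can be sketched by mapping two-dimensional Halton points onto the unit sphere; the bases and the area-preserving mapping are standard, but the acquisition-specific details of the proposed technique are not reproduced here.

```python
import numpy as np

def halton(n, base):
    """First n terms of the Halton (radical-inverse) sequence in the given base."""
    seq = np.zeros(n)
    for i in range(1, n + 1):
        f, value, k = 1.0, 0.0, i
        while k > 0:
            f /= base
            value += f * (k % base)
            k //= base
        seq[i - 1] = value
    return seq

def quasi_uniform_directions(n):
    """Map 2-D Halton points (bases 2 and 3) to near-uniform directions on the sphere."""
    u, v = halton(n, 2), halton(n, 3)
    theta = np.arccos(1.0 - 2.0 * u)        # polar angle, area-preserving
    phi = 2.0 * np.pi * v                   # azimuth
    return np.column_stack([np.sin(theta) * np.cos(phi),
                            np.sin(theta) * np.sin(phi),
                            np.cos(theta)])

dirs = quasi_uniform_directions(200)        # e.g. 200 projection directions
print(dirs[:3])
```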
Classification studies with high-dimensional measurements and relatively small sample sizes are increasingly common. Prospective analysis of the role of sample size in the performance of such studies is important for study design and interpretation of results, but the complexity of typical pattern discovery methods makes this problem challenging. The approach developed here combines Monte Carlo methods and new approximations for linear discriminant analysis, assuming multivariate normal distributions. Monte Carlo methods are used to sample the distribution of which features are selected for a classifier and the mean and variance of features given that they are selected. Given selected features, the linear discriminant problem involves different distributions of training data and generalization data, for which two approximations are compared: one based on a Taylor series approximation of the generalization error and the other on approximating the discriminant scores as normally distributed. Combining the Monte Carlo and approximation approaches to different aspects of the problem allows efficient estimation of expected generalization error without full simulations of the entire sampling and analysis process. To evaluate the method and investigate realistic study design questions, full simulations are used to ask how the validation error rate depends on the strength and number of informative features, the number of noninformative features, the sample size, and the number of features allowed into the pattern. Both approximation methods perform well for most cases, but only the normal discriminant score approximation performs well for cases of very many weakly informative or uninformative dimensions. The simulated cases show that many realistic study designs will typically estimate substantially suboptimal patterns and may have low probability of statistically significant validation results.
Biomarker discovery; Experimental design; Generalization error; Genomic; Pattern recognition; Proteomic
In [J. Halton, Sequential Monte Carlo, Proc. Camb. Phil. Soc. 58 (1962); J. Halton, Sequential Monte Carlo Techniques for the Solution of Linear Systems, J. Sci. Comp. 9 (1994) 213–257] Halton introduced a strategy to be used in Monte Carlo algorithms for the efficient solution of certain matrix problems. We showed in [R. Kong, J. Spanier, Sequential correlated sampling methods for some transport problems, in: Harold Niederreiter, Jerome Spanier (Eds.), Monte Carlo and Quasi-Monte Carlo Methods 1998: Proceedings of a Conference at the Claremont Graduate University, Springer-Verlag, New York, 2000; R. Kong, J. Spanier, Error analysis of sequential Monte Carlo methods for transport problems, in: Harold Niederreiter, Jerome Spanier (Eds.), Monte Carlo and Quasi-Monte Carlo Methods 1998: Proceedings of a Conference at the Claremont Graduate University, Springer-Verlag, New York, 2000] how Halton’s method based on correlated sampling can be extended to continuous transport problems, and we established its geometric convergence for a family of transport problems in slab geometry. In our algorithm, random walks are processed in batches, called stages, each stage producing a small correction that is added to the approximate solution developed from the previous stages. In this paper, we demonstrate that strict error reduction from stage to stage can be assured under rather general conditions, and we illustrate this rapid convergence numerically for a simple family of two-dimensional transport problems.
Transport equation; Geometrically convergent Monte Carlo algorithms
A local modal estimation procedure is proposed for the regression function in a non-parametric regression model. A distinguishing characteristic of the proposed procedure is that it introduces an additional tuning parameter that is automatically selected using the observed data in order to achieve both robustness and efficiency of the resulting estimate. We demonstrate both theoretically and empirically that the resulting estimator is more efficient than the ordinary local polynomial regression estimator in the presence of outliers or heavy-tailed error distributions (such as the t-distribution). Furthermore, we show that the proposed procedure is asymptotically as efficient as the local polynomial regression estimator when there are no outliers and the error distribution is Gaussian. We propose an EM-type algorithm for the proposed estimation procedure. A Monte Carlo simulation study is conducted to examine the finite sample performance of the proposed method. The simulation results confirm the theoretical findings. The proposed methodology is further illustrated via an analysis of a real data example.
Adaptive regression; Local polynomial regression; M-estimator; Modal regression; Robust nonparametric regression
Investigation of global clustering patterns across regions is very important in spatial data analysis. Moran's I is a widely used spatial statistic for detecting global spatial patterns such as an east-west trend or an unusually large cluster. Here, we intend to improve Moran's I for evaluating global clustering patterns by including the weight function in the variance, introducing a population density (PD) weight function in the statistics, and conducting Monte Carlo simulation for testing. We compare our modified Moran's I with Oden's I*pop for simulated data with homogeneous populations. The proposed method is applied to a census tract data set.
We present a modified version of Moran's I which includes information about the strength of the neighboring association when estimating the variance for the statistic. We provide a power analysis on Moran's I, a modified version of Moran's I, and I*pop in a simulation study. Data were simulated under two common spatial correlation scenarios of local and global clustering.
For simulated data with a large cluster pattern, the modified Moran's I has the highest power (43.4%) compared to Moran's I (39.9%) and I*pop (12.4%) when the adjacent weight function is used with 5%, 10%, 15%, 20%, or 30% of the total population as the geographic range for the cluster.
For two global clustering patterns, the modified Moran's I (power > 25.3%) performed better than both Moran's I (> 24.6%) and I*pop (> 7.9%) with the adjacent weight function. With the population density weight function, all methods performed equally well.
In the real data example, all statistics indicate the existence of a global clustering pattern in a leukemia data set. The modified Moran's I has the lowest p-value (.0014) followed by Moran's I (.0156) and I*pop (.011).
Our power analysis and simulation study show that the modified Moran's I achieved higher power than Moran's I and I*pop for evaluating global and local clustering patterns on geographic data with homogeneous populations. The inclusion of the PD weight function, which in turn redefines the neighbors, seems to have a large impact on the power of detecting global clustering patterns. Our methods to improve the original version of Moran's I for homogeneous populations can also be extended to some alternative versions of Moran's I developed for heterogeneous populations.
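For reference, the standard Moran's I that the modification builds on, together with a Monte Carlo (permutation) test, can be sketched as follows; the adjacency matrix and values are illustrative, and neither the modified variance nor the PD weight function is reproduced here.

```python
import numpy as np

def morans_i(x, w):
    """Standard Moran's I for values x and spatial weight matrix w."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    z = x - x.mean()
    s0 = w.sum()
    return (x.size / s0) * (z @ w @ z) / (z @ z)

def permutation_pvalue(x, w, n_perm=999, rng=np.random.default_rng(9)):
    """Monte Carlo (permutation) test of the observed Moran's I."""
    i_obs = morans_i(x, w)
    i_perm = np.array([morans_i(rng.permutation(x), w) for _ in range(n_perm)])
    return i_obs, (1 + np.sum(i_perm >= i_obs)) / (n_perm + 1)

# Four regions on a line with binary adjacency weights (illustrative).
w = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
x = np.array([10.0, 12.0, 30.0, 33.0])     # e.g. standardized incidence values
print(permutation_pvalue(x, w))
```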
In many statistical applications, data are collected over time and are likely correlated. In this paper, we investigate how to incorporate the correlation information into local linear regression. Under the assumption that the error process is an auto-regressive process, a new estimation procedure is proposed for the nonparametric regression by using the local linear regression method and the profile least squares technique. We further propose the SCAD-penalized profile least squares method to determine the order of the auto-regressive process. Extensive Monte Carlo simulation studies are conducted to examine the finite sample performance of the proposed procedure and to compare the performance of the proposed procedures with the existing one. From our empirical studies, the newly proposed procedures can dramatically improve the accuracy of naive local linear regression with a working-independence error structure. We illustrate the proposed methodology by an analysis of a real data set.
Auto-regressive error; local linear regression; partially linear model; profile least squares; SCAD