Traditional permutation (TradPerm) tests are usually considered the gold standard for multiple testing corrections. However, they can be difficult to complete for the meta-analyses of genetic association studies based on multiple single nucleotide polymorphism loci as they depend on individual-level genotype and phenotype data to perform random shuffles, which are not easy to obtain. Most meta-analyses have therefore been performed using summary statistics from previously published studies. To carry out a permutation using only genotype counts without changing the size of the TradPerm P-value, we developed a Monte Carlo permutation (MCPerm) method. First, for each study included in the meta-analysis, we used a two-step hypergeometric distribution to generate a random number of genotypes in cases and controls. We then carried out a meta-analysis using these random genotype data. Finally, we obtained the corrected permutation P-value of the meta-analysis by repeating the entire process N times. We used five real datasets and five simulation datasets to evaluate the MCPerm method and our results showed the following: (1) MCPerm requires only the summary statistics of the genotype, without the need for individual-level data; (2) Genotype counts generated by our two-step hypergeometric distributions had the same distributions as genotype counts generated by shuffling; (3) MCPerm had almost exactly the same permutation P-values as TradPerm (r = 0.999; P<2.2e-16); (4) The calculation speed of MCPerm is much faster than that of TradPerm. In summary, MCPerm appears to be a viable alternative to TradPerm, and we have developed it as a freely available R package at CRAN: http://cran.r-project.org/web/packages/MCPerm/index.html.
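The two-step hypergeometric shuffle can be sketched in Python. This is an illustrative reading of the procedure for a biallelic SNP with genotype counts (AA, Aa, aa) — the function names and the exact splitting order are assumptions, not the MCPerm package's API:

```python
import random

def hypergeom_draw(n_success, n_total, n_draws, rng):
    """Draw from Hypergeometric(n_total, n_success, n_draws) by explicit
    sampling without replacement from a 0/1 pool."""
    pool = [1] * n_success + [0] * (n_total - n_success)
    return sum(rng.sample(pool, n_draws))

def mcperm_shuffle(case_counts, control_counts, rng=random):
    """One MCPerm-style shuffle of genotype counts (AA, Aa, aa).
    Step 1: hypergeometric split of the AA total between cases and controls.
    Step 2: hypergeometric split of the Aa total among the remaining slots.
    Row sums (case/control sizes) and genotype column totals are preserved."""
    totals = [c + d for c, d in zip(case_counts, control_counts)]
    n_case, n_total = sum(case_counts), sum(totals)
    aa_case = hypergeom_draw(totals[0], n_total, n_case, rng)
    rem_case, rem_total = n_case - aa_case, n_total - totals[0]
    het_case = hypergeom_draw(totals[1], rem_total, rem_case, rng)
    new_case = [aa_case, het_case, rem_case - het_case]
    new_control = [t - c for t, c in zip(totals, new_case)]
    return new_case, new_control
```

Repeating the shuffle N times for every included study and recomputing the meta-analysis statistic each time yields the permutation P-value without touching individual-level data.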
Power laws are theoretically interesting probability distributions that are also frequently used to describe empirical data. In recent years, effective statistical methods for fitting power laws have been developed, but appropriate use of these techniques requires significant programming and statistical insight. In order to greatly decrease the barriers to using good statistical methods for fitting power law distributions, we developed the powerlaw Python package. This software package provides easy commands for basic fitting and statistical analysis of distributions. Notably, it also seeks to support a variety of user needs by being exhaustive in the options available to the user. The source code is publicly available and easily extensible.
Data arising from social systems are often highly complex, involving non-linear relationships between the macro-level variables that characterize these systems. We present a method for analyzing this type of longitudinal or panel data using differential equations. We identify the best non-linear functions that capture interactions between variables, employing Bayes factors to decide how many interaction terms should be included in the model. This method penalizes overly complicated models and identifies models with the most explanatory power. We illustrate our approach on the classic example of relating democracy and economic growth, identifying non-linear relationships between these two variables. We show how multiple variables and variable lags can be accounted for and provide a toolbox in R to implement our approach.
Complex networks abound in physical, biological and social sciences. Quantifying a network’s topological structure facilitates network exploration and analysis, and network comparison, clustering and classification. A number of Wiener type indices have recently been incorporated as distance-based descriptors of complex networks, such as in the R package QuACN. Wiener type indices are known to depend both on the network’s number of nodes and on its topology. To apply these indices to measure the similarity of networks with different numbers of nodes, normalization is needed to correct for the effect of network size. This paper aims to fill this gap. Moreover, we introduce a generalized Wiener index that subsumes a very wide class of Wiener type indices, including all known Wiener type indices. We identify the maximum and minimum of this generalized index over the set of networks with a fixed number of nodes, and use these extremes to define its normalized version. The normalized generalized Wiener indices were demonstrated, in a number of experiments, to significantly improve hierarchical clustering over their non-normalized counterparts.
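A minimal sketch of the idea for the classical Wiener index, using a min-max normalization against the known extremes over connected n-node graphs: the complete graph minimizes it at n(n-1)/2 and the path graph maximizes it at (n+1)n(n-1)/6. The paper's normalization of general Wiener type indices may differ in detail:

```python
from collections import deque

def wiener_index(adj):
    """Sum of shortest-path distances over all unordered node pairs,
    computed by BFS from each node (adj: dict node -> list of neighbors)."""
    total = 0
    for s in adj:
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(dist.values())
    return total // 2  # each pair was counted from both endpoints

def normalized_wiener(adj):
    """Rescale W(G) to [0, 1] using the extremes over connected n-node graphs:
    complete graph (minimum) and path graph (maximum)."""
    n = len(adj)
    w_min = n * (n - 1) // 2
    w_max = (n + 1) * n * (n - 1) // 6
    return (wiener_index(adj) - w_min) / (w_max - w_min)
```

With this rescaling, networks of different sizes become directly comparable: a 4-node path scores 1.0 and a 4-node complete graph scores 0.0 regardless of n.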
Focusing on iron-related T2-hypointensity, which is associated with normal aging and neurodegenerative processes, we present two practicable approaches, based on Bayesian inference, for preprocessing and statistical analysis of a complex set of structural MRI data. In particular, Markov Chain Monte Carlo methods were used to simulate posterior distributions. First, we developed a segmentation algorithm that uses outlier detection based on model checking techniques within a Bayesian mixture model. Second, we developed an analytical tool comprising a Bayesian regression model with smoothness priors (in the form of Gaussian Markov random fields), mitigating the necessity to smooth data prior to statistical analysis. For validation, we used simulated data and MRI data of 27 healthy controls. We first observed robust segmentation of both simulated T2-hypointensities and gray-matter regions known to be T2-hypointense. Second, simulated data and images of segmented T2-hypointensity were analyzed. We found not only robust identification of simulated effects but also a biologically plausible age-related increase of T2-hypointensity primarily within the dentate nucleus but also within the globus pallidus, substantia nigra, and red nucleus. Our results indicate that fully Bayesian inference can successfully be applied for preprocessing and statistical analysis of structural MRI data.
This paper examines the multiple atlas random diffeomorphic orbit model in Computational Anatomy (CA) for parameter estimation and segmentation of subcortical and ventricular neuroanatomy in magnetic resonance imagery. We assume that there exist multiple magnetic resonance image (MRI) atlases, each atlas containing a collection of locally-defined charts in the brain generated via manual delineation of the structures of interest. We focus on maximum a posteriori estimation of high dimensional segmentations of MR within the class of generative models representing the observed MRI as a conditionally Gaussian random field, conditioned on the atlas charts and the diffeomorphic change of coordinates of each chart that generates it. The charts and their diffeomorphic correspondences are unknown and viewed as latent or hidden variables. We demonstrate that the expectation-maximization (EM) algorithm arises naturally, yielding the likelihood-fusion equation which the a posteriori estimator of the segmentation labels maximizes. The likelihoods being fused are modeled as conditionally Gaussian random fields with mean fields a function of each atlas chart under its diffeomorphic change of coordinates onto the target. The conditional-mean in the EM algorithm specifies the convex weights with which the chart-specific likelihoods are fused. The multiple atlases with the associated convex weights imply that the posterior distribution is a multi-modal representation of the measured MRI. Segmentation results for subcortical and ventricular structures are demonstrated within populations of demented subjects, including the use of multiple atlases across multiple disease groups.
Typical data visualizations result from linear pipelines that start by characterizing data using a model or algorithm to reduce the dimension and summarize structure, and end by displaying the data in a reduced dimensional form. Sensemaking may take place at the end of the pipeline when users have an opportunity to observe, digest, and internalize any information displayed. However, some visualizations mask meaningful data structures when model or algorithm constraints (e.g., parameter specifications) contradict information in the data. Yet, due to the linearity of the pipeline, users do not have a natural means to adjust the displays. In this paper, we present a framework for creating dynamic data displays that rely on both mechanistic data summaries and expert judgement. The key is that we develop both the theory and methods of a new human-data interaction, which we refer to as “Visual to Parametric Interaction” (V2PI). With V2PI, the pipeline becomes bi-directional in that users are embedded in the pipeline; users learn from visualizations and the visualizations adjust to expert judgement. We demonstrate the utility of V2PI and a bi-directional pipeline with two examples.
The application of Preventive Maintenance (PM) and Statistical Process Control (SPC) are important practices to achieve high product quality, small frequency of failures, and cost reduction in a production process. However, there are some points that have not been explored in depth regarding their joint application. First, most SPC is performed with the X-bar control chart, which does not fully consider the variability of the production process. Second, many studies of the design of control charts consider just the economic aspect, while statistical restrictions must be considered to achieve charts with low probabilities of false detection of failures. Third, the effect of PM on processes with different failure probability distributions has not been studied. Hence, this paper covers these points, presenting the Economic Statistical Design (ESD) of joint X-bar-S control charts with a cost model that integrates PM with general failure distributions. Experiments showed statistically significant reductions in costs when PM is performed on processes with high failure rates, as well as reductions in the sampling frequency of units for testing under SPC.
Sample size calculations are an important part of research to balance the use of resources and to avoid undue harm to participants. Effect sizes are an integral part of these calculations and meaningful values are often unknown to the researcher. General recommendations for effect sizes have been proposed for several commonly used statistical procedures. For the analysis of 2×2 tables, recommendations have been given for the phi coefficient, the correlation coefficient for binary data; however, it is well known that phi suffers from poor statistical properties. The odds ratio is not problematic, although recommendations based on objective reasoning do not exist. This paper proposes odds ratio recommendations that are anchored to phi for fixed marginal probabilities. It will further be demonstrated that the marginal assumptions can be relaxed, resulting in more general results.
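Both effect sizes are simple functions of the four cell counts of a 2×2 table; a minimal sketch (the anchoring of odds ratio recommendations to phi at fixed marginals is the paper's contribution and is not reproduced here):

```python
import math

def phi_and_odds_ratio(a, b, c, d):
    """Phi coefficient and odds ratio for a 2x2 table [[a, b], [c, d]].
    Assumes all margins and the products b*c are nonzero."""
    phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    odds_ratio = (a * d) / (b * c)
    return phi, odds_ratio
```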
Artisanal fisheries in the Mediterranean, especially in Italy, have been poorly investigated. There is a long history of fishing in this region, and it remains an important economic activity in many localities. Our research entails both a comprehensive review of the relevant literature and 58 field interviews with practitioners on plants used in fishing activities along the Western Mediterranean Italian coastal regions. The aims were to record traditional knowledge on plants used in fisheries in these regions, to define selection criteria for plant species used in artisanal fisheries, considering the ecology and intrinsic properties of plants, and to discuss the pattern of diffusion of shared uses in these areas.
Information was gathered both from a general review of the ethnobotanical literature and from original data. A total of 58 semi-structured interviews were carried out in Liguria, Latium, Campania and Sicily (Italy). Information on plant uses related to fisheries was collected and analyzed through chi-square residual analysis and correspondence analysis in relation to habitat, life form and chorology.
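Pearson standardized residuals, the quantity behind chi-square residual analysis, can be sketched as follows. This is an illustration of the standard technique, not the authors' code:

```python
import math

def chi_square_residuals(table):
    """Pearson standardized residuals (obs - exp) / sqrt(exp) for a
    contingency table (list of rows), plus the chi-square statistic,
    which is the sum of the squared residuals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    resid = [[(table[i][j] - row_totals[i] * col_totals[j] / grand)
              / math.sqrt(row_totals[i] * col_totals[j] / grand)
              for j in range(len(table[0]))] for i in range(len(table))]
    chi2 = sum(r * r for row in resid for r in row)
    return resid, chi2
```

Large positive residuals flag cells (e.g., a habitat/usage combination) that are over-represented relative to independence.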
A total of 60 plants were reported as being used in the fisheries of the Western Italian Mediterranean coastal regions, with 141 different uses mentioned. Of these 141 uses, 32 are shared among different localities. A multivariate statistical analysis was performed on the entire dataset, yielding specific selection criteria for the different usage categories (the plants' uses can be classified into 11 main categories). For some uses, species are selected for their intrinsic features (e.g., woody) or for their habitat (e.g., riverine). The majority of uses were found to be obsolete (42%), and the interviews show that traditional fishery knowledge is in decline. There are several reasons for this, such as climate change, costs, and the reduction of fish stocks.
Our research relates the functional characteristics of plants used in artisanal fisheries to their habitats, and discusses the distribution of these uses. This research is the first comprehensive outline of the role of plants in artisanal fisheries and of traditional fishery knowledge in the Mediterranean, specifically in Italy.
Ethnobotany; Traditional ecological knowledge; Traditional fishery knowledge
An important problem in reproductive medicine is deciding when people who have failed to become pregnant without medical assistance should begin investigation and treatment. This study describes a computational approach to determining what can be deduced about a couple's future chances of pregnancy from the number of menstrual cycles over which they have been trying to conceive. The starting point is that a couple's fertility is inherently uncertain. This uncertainty is modelled as a probability distribution for the chance of conceiving in each menstrual cycle. We have developed a general numerical computational method, which uses Bayes' theorem to generate a posterior distribution for a couple's chance of conceiving in each cycle, conditional on the number of previous cycles of attempted conception. When various metrics of a couple's expected chances of pregnancy were computed as a function of the number of cycles over which they had been trying to conceive, we found good fits to observed data on time to pregnancy for different populations. The commonly-used standard of 12 cycles of non-conception as an indicator of subfertility was found to be reasonably robust, though a larger or smaller number of cycles may be more appropriate depending on the population from which a couple is drawn and the precise subfertility metric which is most relevant, for example the probability of conception in the next cycle or the next 12 cycles. We have also applied our computational method to model the impact of female reproductive ageing. Results indicate that, for women over the age of 35, it may be appropriate to start investigation and treatment more quickly than for younger women. Ignoring reproductive decline during the period of attempted conception added up to two cycles to the computed number of cycles before reaching a metric of subfertility.
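The Bayesian update described above is straightforward to sketch on a discrete grid: the likelihood of n non-conception cycles at per-cycle fecundability p is (1 - p)^n, and the posterior mean gives the expected chance of conceiving in the next cycle. This is a simplified illustration that ignores reproductive ageing; the uniform prior grid in the test is an assumption, not the paper's fitted distribution:

```python
def posterior_next_cycle_chance(n_failed_cycles, prior):
    """Posterior mean chance of conceiving in the next cycle, given
    n previous cycles without conception, computed on a discrete grid.
    prior: list of (p, weight) pairs over per-cycle fecundability p."""
    # Bayes' theorem: posterior weight ∝ prior weight × (1 - p)^n
    weights = [w * (1 - p) ** n_failed_cycles for p, w in prior]
    total = sum(weights)
    return sum(p * w for (p, _), w in zip(prior, weights)) / total
```

With a flat prior, twelve failed cycles pull the expected per-cycle chance from 0.5 down to roughly 0.13, illustrating why 12 cycles of non-conception works as a subfertility indicator.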
Combining information from various image features has become a standard technique in concept recognition tasks. However, the optimal way of fusing the resulting kernel functions is usually unknown in practical applications. Multiple kernel learning (MKL) techniques make it possible to determine an optimal linear combination of such similarity matrices. Classical approaches to MKL promote sparse mixtures. Unfortunately, 1-norm regularized MKL variants are often observed to be outperformed by an unweighted sum kernel. The main contributions of this paper are the following: we apply a recently developed non-sparse MKL variant to state-of-the-art concept recognition tasks from the application domain of computer vision. We provide insights on the benefits and limits of non-sparse MKL and compare it against its direct competitors, the sum-kernel SVM and sparse MKL. We report empirical results for the PASCAL VOC 2009 Classification and ImageCLEF2010 Photo Annotation challenge data sets. Data sets (kernel matrices) as well as further information are available at http://doc.ml.tu-berlin.de/image_mkl/ (accessed 25 June 2012).
Accurate measurements are essential in medicine. An important parameter in determining the quality of a medical instrument is agreement with a gold standard. Various statistical methods have been used to test for agreement. Some of these methods have been shown to be inappropriate. This can result in misleading conclusions about the validity of an instrument. The Bland-Altman method is the most popular method, judging by the many citations of the article proposing it. However, the number of citations does not necessarily mean that this method has been applied in agreement research. No previous study has been conducted to look into this. This is the first systematic review to identify statistical methods used to test for agreement of medical instruments. The proportion of various statistical methods found in this review will also reflect the proportion of medical instruments that have been validated using those particular methods in current clinical practice.
Five electronic databases were searched between 2007 and 2009 to look for agreement studies. A total of 3,260 titles were initially identified. Only 412 titles were potentially related, and finally 210 fitted the inclusion criteria. The Bland-Altman method is the most popular method with 178 (85%) studies having used this method, followed by the correlation coefficient (27%) and means comparison (18%). Some of the inappropriate methods highlighted by Altman and Bland since the 1980s are still in use.
This study finds that the Bland-Altman method is the most popular method used in agreement research. There are still inappropriate applications of statistical methods in some studies. It is important for a clinician or medical researcher to be aware of this issue because misleading conclusions from inappropriate analyses will jeopardize the quality of the evidence, which in turn will influence quality of care given to patients in the future.
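For reference, the Bland-Altman method itself amounts to computing the bias (mean difference between the two instruments) and the 95% limits of agreement; a minimal sketch:

```python
import statistics

def bland_altman(x, y):
    """Bland-Altman analysis of paired measurements from two instruments:
    returns the bias (mean of the differences) and the 95% limits of
    agreement, bias +/- 1.96 * SD of the differences."""
    diffs = [a - b for a, b in zip(x, y)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)
```

Unlike a correlation coefficient, this quantifies how far apart two instruments' readings can be expected to lie, which is the question agreement studies actually ask.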
Markov modeling provides an effective approach for modeling ion channel kinetics. There are several search algorithms for global fitting of macroscopic or single-channel currents across different experimental conditions. Here we present a particle swarm optimization (PSO)-based approach which, when used in combination with golden section search (GSS), can fit macroscopic voltage responses with a high degree of accuracy (errors within 1%) and a reasonable amount of calculation time (less than 10 hours for 20 free parameters) on a desktop computer. We also describe a method for initial value estimation of the model parameters, which appears to favor identification of the global optimum and can further reduce the computational cost. The PSO-GSS algorithm is applicable for kinetic models of arbitrary topology and size and compatible with common stimulation protocols, which provides a convenient approach for establishing kinetic models at the macroscopic level.
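The golden section search component is standard and easy to sketch for a one-dimensional unimodal objective; the PSO stage and the channel-model objective function are not reproduced here:

```python
import math

def golden_section_search(f, a, b, tol=1e-8):
    """Minimize a unimodal function f on [a, b] by golden section search:
    shrink the bracket by the golden ratio each iteration, reusing one
    interior evaluation point."""
    invphi = (math.sqrt(5) - 1) / 2  # 1/phi, about 0.618
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c
            c = b - invphi * (b - a)
        else:
            a, c = c, d
            d = a + invphi * (b - a)
    return (a + b) / 2
```

In a PSO-GSS hybrid, a line search like this typically refines each candidate along a chosen direction after the swarm proposes it.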
Kernel density estimation is a widely used method for estimating a distribution based on a sample of points drawn from that distribution. Generally, in practice some form of error contaminates the sample of observed points. Such error can be the result of imprecise measurements or observation bias. Often this error is negligible and may be disregarded in analysis. In cases where the error is non-negligible, estimation methods should be adjusted to reduce the resulting bias. Several modifications of kernel density estimation have been developed to address specific forms of errors. One form of error that has not yet been addressed is the case where observations are nominally placed at the centers of areas from which the points are assumed to have been drawn, where these areas are of varying sizes. In this scenario, bias arises because the size of the error can vary among points: some subset of points can be known to have smaller error than another subset, or the form of the error may change among points. This paper proposes a “contingent kernel density estimation” technique to address this form of error. This new technique adjusts the standard kernel on a point-by-point basis in an adaptive response to the changing structure and magnitude of error. In this paper, equations for our contingent kernel technique are derived, the technique is validated using numerical simulations, and an example using the geographic locations of social-networking users is presented to demonstrate the utility of the method.
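The spirit of the adjustment — a kernel whose bandwidth changes point by point with the assumed error scale — can be sketched as follows. This is a generic per-observation-bandwidth Gaussian KDE for illustration, not the paper's exact contingent kernel:

```python
import math

def contingent_style_kde(x, points, bandwidths):
    """Gaussian kernel density estimate at x, with a separate bandwidth
    for each observation (e.g., proportional to each point's area of
    uncertainty). Each kernel integrates to 1, so the estimate is a
    proper density."""
    return sum(math.exp(-0.5 * ((x - p) / h) ** 2) / (h * math.sqrt(2 * math.pi))
               for p, h in zip(points, bandwidths)) / len(points)
```

A point known to lie somewhere in a large area gets a wide kernel, while a precisely located point gets a narrow one.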
The problem of genotyping polyploids is extremely important for the creation of genetic maps and assembly of complex plant genomes. Despite its significance, polyploid genotyping still remains largely unsolved and suffers from a lack of statistical formality. In this paper a graphical Bayesian model for SNP genotyping data is introduced. This model can infer genotypes even when the ploidy of the population is unknown. We also introduce an algorithm for finding the exact maximum a posteriori genotype configuration with this model. This algorithm is implemented in a freely available web-based software package SuperMASSA. We demonstrate the utility, efficiency, and flexibility of the model and algorithm by applying them to two different platforms, each of which is applied to a polyploid data set: Illumina GoldenGate data from potato and Sequenom MassARRAY data from sugarcane. Our method achieves state-of-the-art performance on both data sets and can be trivially adapted to use models that utilize prior information about any platform or species.
Recently, the construction of networks from time series data has gained widespread interest. In this paper, we develop this area further by introducing a network construction procedure for pseudoperiodic time series. We call such networks episode networks, in which an episode corresponds to a temporal interval of a time series, and which defines a node in the network. Our model includes a number of features which distinguish it from current methods. First, the proposed construction procedure is a parametric model which allows it to adapt to the characteristics of the data; the length of an episode being the parameter. As a direct consequence, networks of minimal size containing the maximal information about the time series can be obtained. In this paper, we provide an algorithm to determine the optimal value of this parameter. Second, we employ estimates of mutual information values to define the connectivity structure among the nodes in the network to exploit efficiently the nonlinearities in the time series. Finally, we apply our method to data from electroencephalogram (EEG) experiments and demonstrate that the constructed episode networks capture discriminative information from the underlying time series that may be useful for diagnostic purposes.
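The connectivity step above relies on mutual information estimates between episodes. A plug-in estimator for discretized sequences can be sketched as follows; the episode construction and EEG-specific processing are not shown:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in mutual information estimate (in bits) between two discrete
    sequences of equal length: sum over joint symbols of
    p(a,b) * log2( p(a,b) / (p(a) p(b)) )."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())
```

Continuous signals would first be discretized (e.g., by binning amplitudes), after which high mutual information between two episodes justifies an edge between their nodes.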
Given the expanding availability of scientific data and tools to analyze them, combining different assessments of the same piece of information has become increasingly important for the social, biological, and even physical sciences. This task demands, to begin with, a method-independent standard, such as the P-value, that can be used to assess the reliability of a piece of information. Good's formula and Fisher's method combine independent P-values with, respectively, unequal and equal weights. Both approaches may be regarded as limiting instances of a general case of combining P-values from several groups: P-values within each group are weighted equally, while the weight varies by group. When some of the weights become nearly degenerate, as cautioned by Good, numeric instability occurs in the computation of the combined P-values. We deal explicitly with this difficulty by deriving a controlled expansion, in powers of differences in inverse weights, that provides both accurate statistics and stable numerics. We illustrate the utility of this systematic approach with a few examples. In addition, we also provide an alternative derivation for the probability distribution function of the general case and show how the analytic formula obtained reduces to both Good's and Fisher's methods as special cases. A C++ program, which computes the combined P-values with equal numerical stability regardless of whether the weights are (nearly) degenerate or not, is available for download at our group website http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads/CoinedPValues.html.
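Fisher's method, the equal-weights special case, is easy to state: X = -2 Σ ln p_i follows a chi-square distribution with 2k degrees of freedom under the null, and for even degrees of freedom the chi-square survival function has a closed form. A sketch of this special case only (not the authors' C++ program, which handles the weighted general case and its numerics):

```python
import math

def fisher_combined_p(pvalues):
    """Fisher's method for combining k independent P-values.
    X = -2 * sum(ln p_i) ~ chi-square(2k); for even df = 2k the survival
    function is exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!."""
    k = len(pvalues)
    x = -2.0 * sum(math.log(p) for p in pvalues)
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= (x / 2) / i  # next term (x/2)^i / i! built incrementally
        total += term
    return math.exp(-x / 2) * total
```

For k = 2 this reduces to the familiar closed form q(1 - ln q) with q = p1 * p2.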
In this paper, we outline a model of graph (or network) dynamics based on two ingredients. The first ingredient is a Markov chain on the space of possible graphs. The second ingredient is a semi-Markov counting process of renewal type. The model consists in subordinating the Markov chain to the semi-Markov counting process. In simple words, this means that the chain transitions occur at random time instants called epochs. The model is quite rich and its possible connections with algebraic geometry are briefly discussed. Moreover, for the sake of simplicity, we focus on the space of undirected graphs with a fixed number of nodes. However, in an example, we present an interbank market model where it is meaningful to use directed graphs or even weighted graphs.
Several research fields frequently deal with the analysis of diverse classification results of the same entities. This requires objective detection of the overlaps and divergences between the resulting clusters. The congruence between classifications can be quantified by clustering agreement measures, including pairwise agreement measures. Several measures have been proposed, and the importance of obtaining confidence intervals for the point estimate in the comparison of these measures has been highlighted. A broad range of methods can be used for the estimation of confidence intervals. However, evidence is lacking about which methods are appropriate for the calculation of confidence intervals for most clustering agreement measures. Here we evaluate the resampling techniques of bootstrap and jackknife for the calculation of confidence intervals for clustering agreement measures. Contrary to what has been shown for some statistics, simulations showed that the jackknife performs better than the bootstrap at accurately estimating confidence intervals for pairwise agreement measures, especially when the agreement between partitions is low. The coverage of the jackknife confidence interval is robust to changes in cluster number and cluster size distribution.
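A leave-one-out jackknife confidence interval for a pairwise agreement measure can be sketched as follows, using the Rand index as the example measure (the paper's simulations cover more measures and interval constructions):

```python
import math

def rand_index(labels_a, labels_b):
    """Rand index: fraction of item pairs on which two partitions agree
    (both together, or both apart)."""
    n = len(labels_a)
    agree = sum(1 for i in range(n) for j in range(i + 1, n)
                if (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j]))
    return agree / (n * (n - 1) / 2)

def jackknife_ci(labels_a, labels_b, z=1.96):
    """Leave-one-out jackknife 95% confidence interval for the Rand index:
    SE^2 = (n-1)/n * sum over items of (theta_(-i) - mean theta_(-i))^2."""
    n = len(labels_a)
    theta = rand_index(labels_a, labels_b)
    loo = [rand_index(labels_a[:i] + labels_a[i + 1:],
                      labels_b[:i] + labels_b[i + 1:]) for i in range(n)]
    mean_loo = sum(loo) / n
    se = math.sqrt((n - 1) / n * sum((t - mean_loo) ** 2 for t in loo))
    return theta - z * se, theta + z * se
```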
Linear trend analysis of time series is a standard procedure in many scientific disciplines. If the number of data points is large, a trend may be statistically significant even if the data are scattered far from the trend line. This study introduces and tests a quality criterion for time trends referred to as statistical meaningfulness, which is a stricter quality criterion for trends than high statistical significance. The time series is divided into intervals and interval mean values are calculated. Thereafter, r2 and p values are calculated from regressions concerning time and interval mean values. If r2≥0.65 at p≤0.05 in any of these regressions, then the trend is regarded as statistically meaningful. Out of ten investigated time series from different scientific disciplines, five displayed statistically meaningful trends. A Microsoft Excel application (add-in) was developed which can perform statistical meaningfulness tests and which may increase the operationality of the test. The presented method for distinguishing statistically meaningful trends should be reasonably uncomplicated for researchers with basic statistics skills and may thus be useful for determining which trends are worth analysing further, for instance with respect to causal factors. The method can also be used for determining which segments of a time trend may be particularly worthwhile to focus on.
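A sketch of the screening rule in Python (the published tool is a Microsoft Excel add-in; this illustration hard-codes five intervals by default and omits the p ≤ 0.05 check, which would require a t-distribution):

```python
def interval_means(series, n_intervals):
    """Split a series into consecutive equal-size intervals (any remainder
    at the end is dropped) and return each interval's mean."""
    size = len(series) // n_intervals
    return [sum(series[i * size:(i + 1) * size]) / size
            for i in range(n_intervals)]

def r_squared(y):
    """r^2 of a simple linear regression of y against its index 0..n-1."""
    n = len(y)
    xs = list(range(n))
    mx, my = sum(xs) / n, sum(y) / n
    sxy = sum((x - mx) * (v - my) for x, v in zip(xs, y))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((v - my) ** 2 for v in y)
    return sxy * sxy / (sxx * syy)

def is_meaningful_trend(series, n_intervals=5, r2_threshold=0.65):
    """One step of the screening rule: regress interval means on time
    and require r^2 >= 0.65 (p-value check omitted in this sketch)."""
    return r_squared(interval_means(series, n_intervals)) >= r2_threshold
```

A steadily rising series passes; a V-shaped series with a near-zero net slope fails, even though a naive regression on all points might still report a significant trend.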
To comprehend the hierarchical organization of large integrated systems, we introduce the hierarchical map equation, which reveals multilevel structures in networks. In this information-theoretic approach, we exploit the duality between compression and pattern detection; by compressing a description of a random walker as a proxy for real flow on a network, we find regularities in the network that induce this system-wide flow. Finding the shortest multilevel description of the random walker therefore gives us the best hierarchical clustering of the network — the optimal number of levels and modular partition at each level — with respect to the dynamics on the network. With a novel search algorithm, we extract and illustrate the rich multilevel organization of several large social and biological networks. For example, from the global air traffic network we uncover countries and continents, and from the pattern of scientific communication we reveal more than 100 scientific fields organized in four major disciplines: life sciences, physical sciences, ecology and earth sciences, and social sciences. In general, we find shallow hierarchical structures in globally interconnected systems, such as neural networks, and rich multilevel organizations in systems with highly separated regions, such as road networks.
We introduce a generalization of the well-known ARCH process, widely used for generating uncorrelated stochastic time series with long-term non-Gaussian distributions and long-lasting correlations in the (instantaneous) standard deviation exhibiting a clustering profile. Specifically, inspired by the fact that in a variety of systems impacting events are hardly forgotten, we split the process into two different regimes: one for regular periods, in which the average volatility of the fluctuations within a certain window of time stays below a given threshold, and another for periods in which the local standard deviation exceeds that threshold. In the former situation we use standard rules for heteroscedastic processes, whereas in the latter case the system starts recalling past values that surpassed the threshold. Our results show that for appropriate parameter values the model is able to provide fat-tailed probability density functions and strong persistence of the instantaneous variance, characterized by large values of the Hurst exponent, which are ubiquitous features in complex systems.
Many real-world networks tend to be very dense. Particular examples of interest arise in the construction of networks that represent pairwise similarities between objects. In these cases, the networks under consideration are weighted, generally with positive weights between any two nodes. Visualization and analysis of such networks, especially when the number of nodes is large, can pose significant challenges which are often met by reducing the edge set. Any effective “sparsification” must retain and reflect the important structure in the network. A common method is to simply apply a hard threshold, keeping only those edges whose weight exceeds some predetermined value. A more principled approach is to extract the multiscale “backbone” of a network by retaining statistically significant edges through hypothesis testing on a specific null model, or by appropriately transforming the original weight matrix before applying some sort of threshold. Unfortunately, approaches such as these can fail to capture multiscale structure in which there can be small but locally statistically significant similarity between nodes. In this paper, we introduce a new method for backbone extraction that does not rely on any particular null model, but instead uses the empirical distribution of similarity weight to determine and then retain statistically significant edges. We show that our method adapts to the heterogeneity of local edge weight distributions in several paradigmatic real world networks, and in doing so retains their multiscale structure with relatively insignificant additional computational costs. We anticipate that this simple approach will be of great use in the analysis of massive, highly connected weighted networks.
This paper explores relationships between classical and parametric measures of graph (or network) complexity. Classical measures are based on vertex decompositions induced by equivalence relations. Parametric measures, on the other hand, are constructed by using information functions to assign probabilities to the vertices. The inequalities established in this paper relating classical and parametric measures lay a foundation for systematic classification of entropy-based measures of graph complexity.