Large-scale statistical analyses have become hallmarks of post-genomic era biological research due to advances in high-throughput assays and the integration of large biological databases. One accompanying issue is the simultaneous estimation of p-values for a large number of hypothesis tests. In many applications, a parametric assumption on the null distribution, such as normality, may be unreasonable, and resampling-based p-values are preferred for establishing statistical significance. However, using resampling-based procedures for multiple testing is computationally intensive and typically requires large numbers of resamples.
We present a new approach to assigning resamples (such as bootstrap samples or permutations) more efficiently within a nonparametric multiple testing framework. We formulate a Bayesian-inspired approach to this problem and devise an algorithm that adapts the assignment of resamples iteratively with negligible space and running-time overhead. In two experimental studies, a breast cancer microarray dataset and a genome-wide association study dataset for Parkinson's disease, we demonstrate that our differential allocation procedure is substantially more accurate than the traditional uniform allocation of resamples.
Our experiments demonstrate that a more sophisticated allocation strategy can improve inference for hypothesis testing without a drastic increase in the amount of computation on randomized data. Moreover, the efficiency gains grow with the number of tests. R code for our algorithm and the shortcut method are available at .
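As a rough, hypothetical illustration of differential allocation (not the authors' algorithm, whose Bayesian details are not given here), the R sketch below spends extra permutations in rounds on the tests whose estimated p-values are most ambiguous relative to a fixed threshold; the data, the threshold and the allocation score are all made up for the example.

```r
## Minimal sketch: adaptive allocation of permutations across m tests.
## Hypothetical rule: each round, give extra permutations to the tests whose
## estimated p-values are least certain to fall above or below alpha.
set.seed(1)
m <- 200; n <- 30
x <- matrix(rnorm(m * n), m, n)
x[1:10, 1:15] <- x[1:10, 1:15] + 1            # 10 truly shifted tests
grp <- rep(c(0, 1), each = n / 2)
stat <- function(y) abs(mean(y[grp == 1]) - mean(y[grp == 0]))
t_obs <- apply(x, 1, stat)

alpha <- 0.05
B <- rep(50L, m)                              # initial permutations per test
hits <- sapply(1:m, function(i)
  sum(replicate(B[i], stat(x[i, sample(n)])) >= t_obs[i]))

for (round in 1:20) {
  p_hat <- (hits + 1) / (B + 1)
  se    <- sqrt(p_hat * (1 - p_hat) / B)
  score <- dnorm((p_hat - alpha) / pmax(se, 1e-8))  # ambiguity near alpha
  pick  <- order(score, decreasing = TRUE)[1:20]    # 20 most ambiguous tests
  for (i in pick) {
    hits[i] <- hits[i] + sum(replicate(100, stat(x[i, sample(n)])) >= t_obs[i])
    B[i]    <- B[i] + 100L
  }
}
p_final <- (hits + 1) / (B + 1)               # clearly non-significant tests end
                                              # up with few resamples
```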
Genetic mapping has been used as a tool to study the genetic architecture of complex traits by localizing their underlying quantitative trait loci (QTLs). Statistical methods for genetic mapping rely on a key assumption: that traits follow a specified parametric distribution. In practice, however, real data may not perfectly follow that distribution.
Here, we derive a robust statistical approach for QTL mapping that accommodates a certain degree of misspecification of the true model by incorporating the integrated squared error into the genetic mapping framework. A hypothesis test is formulated by defining a new test statistic, the energy difference.
Simulation studies were performed to investigate the statistical properties of this approach and to compare them with those of traditional maximum likelihood and non-parametric QTL mapping approaches. Lastly, analyses of real examples were conducted to demonstrate the usefulness of the new approach in a practical genetic setting.
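The abstract does not give the form of the energy-difference statistic, so the R sketch below only illustrates the minimum integrated-squared-error (L2E) idea it builds on: fit a one-component model and a two-genotype mixture by minimizing the ISE criterion and take the difference of the minimized criteria as an energy-difference-style statistic. All model choices here are assumptions for illustration; in practice the null distribution of such a statistic would be calibrated, e.g., by resampling.

```r
## Hypothetical L2E / integrated-squared-error fits: a single normal versus a
## two-genotype normal mixture.  L2E(theta) = int f(x;theta)^2 dx
##                                            - (2/n) sum_i f(x_i; theta).
l2e_normal <- function(par, x) {
  mu <- par[1]; s <- exp(par[2])
  1 / (2 * s * sqrt(pi)) - 2 * mean(dnorm(x, mu, s))
}
l2e_mix <- function(par, x) {
  p  <- plogis(par[1]); mu <- par[2:3]; s <- exp(par[4])
  w  <- c(p, 1 - p)
  ## int f^2 dx for a normal mixture via the Gaussian product identity
  int_f2 <- sum(outer(w, w) *
                outer(mu, mu, function(a, b) dnorm(a, b, sqrt(2) * s)))
  int_f2 - 2 * mean(w[1] * dnorm(x, mu[1], s) + w[2] * dnorm(x, mu[2], s))
}
set.seed(2)
x <- c(rnorm(120, 0, 1), rnorm(80, 1.5, 1))   # trait with a diallelic QTL
fit0 <- optim(c(mean(x), log(sd(x))), l2e_normal, x = x)
fit1 <- optim(c(0, 0, 1.5, log(sd(x))), l2e_mix, x = x)
energy_diff <- fit0$value - fit1$value        # larger => mixture fits better
```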
The resampling-based test, which often relies on permutation or bootstrap procedures, has been widely used for statistical hypothesis testing when the asymptotic distribution of the test statistic is unavailable or unreliable. It requires repeated calculation of the test statistic on a large number of simulated data sets to assess the significance level, and thus can become very computationally intensive. Here, we propose an efficient p-value evaluation procedure by adapting the stochastic approximation Markov chain Monte Carlo algorithm. The new procedure can be used easily to estimate the p-value for any resampling-based test. We show through numerical simulations that the proposed procedure can be 100 to 500,000 times as efficient (in terms of computing time) as the standard resampling-based procedure when evaluating a test statistic with a small p-value (e.g. less than 10^-6). With its computational burden reduced by the proposed procedure, the versatile resampling-based test becomes computationally feasible for a much wider range of applications. We demonstrate the new method by applying it to a large-scale genetic association study of prostate cancer.
Bootstrap procedures; Genetic association studies; p-value; Resampling-based tests; Stochastic approximation Markov chain Monte Carlo
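The sketch below is a toy, simplified rendering of the idea (not the authors' implementation): bins of the permutation statistic serve as the regions of a SAMC-style sampler, self-adjusting log-weights force the chain into the extreme tail, and the tail probability is read off the learned weights. The binning, gain sequence and run length are arbitrary choices for the example.

```r
## Toy SAMC-style estimate of a small permutation p-value for a two-sample
## mean difference.
set.seed(3)
n <- 40; grp <- rep(c(0, 1), each = n / 2)
y <- rnorm(n); y[grp == 1] <- y[grp == 1] + 1.6
stat <- function(lab) mean(y[lab == 1]) - mean(y[lab == 0])
t_obs <- stat(grp)

breaks <- c(-Inf, seq(-t_obs, t_obs, length.out = 30), Inf)
m <- length(breaks) - 1
bin <- function(s) findInterval(s, breaks)

theta <- numeric(m)                    # self-adjusting log-weights per region
lab <- sample(grp); j <- bin(stat(lab))
n_iter <- 2e5
for (it in 1:n_iter) {
  lab2 <- lab
  swap <- c(sample(which(lab == 0), 1), sample(which(lab == 1), 1))
  lab2[swap] <- lab2[rev(swap)]        # propose: exchange one 0 and one 1 label
  j2 <- bin(stat(lab2))
  if (log(runif(1)) < theta[j] - theta[j2]) { lab <- lab2; j <- j2 }
  gain <- 1000 / max(1000, it)         # decaying gain sequence
  theta <- theta - gain / m            # update: theta_i += gain*(I(i==j) - 1/m)
  theta[j] <- theta[j] + gain
}
p_hat <- exp(theta - max(theta)); p_hat <- p_hat / sum(p_hat)
p_samc <- sum(p_hat[bin(t_obs):m])     # est. Pr(T >= t_obs) under permutation
```

A naive permutation estimate of a p-value near 10^-6 would need millions of resamples; here the weight adjustment drives the chain to visit the tail bins directly, which is the source of the quoted efficiency gains.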
Long-range chromosomal associations between genomic regions, and their repositioning in the 3D space of the nucleus, are now considered key contributors to the regulation of gene expression, and important links have been highlighted with other genomic features involved in DNA rearrangements. Recent Chromosome Conformation Capture (3C) measurements performed with high-throughput sequencing (Hi-C) and molecular dynamics studies show a large correlation between colocalization and coregulation of genes, but this important research is hampered by the lack of biologist-friendly analysis and visualisation software. Here, we describe NuChart, an R package that allows the user to annotate and statistically analyse a list of input genes with information drawn from Hi-C data, integrating knowledge about genomic features that are involved in the spatial organization of chromosomes. NuChart works directly with sequenced reads to identify the related Hi-C fragments, with the aim of creating gene-centric neighbourhood graphs on which multi-omics features can be mapped. Predictions about CTCF binding sites, isochores and cryptic Recombination Signal Sequences are provided directly with the package for mapping, although other annotation data in BED format can be used (such as methylation profiles and histone patterns). Gene expression data can be automatically retrieved and processed from the Gene Expression Omnibus and ArrayExpress repositories to highlight the expression profile of genes in the identified neighbourhood. Moreover, statistical inferences about the graph structure, and correlations between its topology and multi-omics features, can be performed using Exponential-family Random Graph Models. The Hi-C fragment visualisation provided by NuChart allows the comparison of cells in different conditions, thus offering the possibility of identifying novel biomarkers. NuChart is compliant with the Bioconductor standard and is freely available at ftp://fileserver.itb.cnr.it/nuchart.
Genome-wide investigations for identifying the genes underlying complex traits are considered agnostic in terms of prior assumptions about the responsible DNA alterations. The agreement between genome-wide association studies (GWAS) and genome-wide linkage scans (GWLS) has not been explored to date. In this study, a genomic convergence approach combining GWAS and GWLS was implemented for the first time in order to identify genomic loci supported by both methods. A database of 376 GWLS and 102 GWAS covering 19 complex traits was created. Data on the location and statistical significance of each genetic marker were extracted from articles or web-based databases. Convergence was quantified as the proportion of significant GWAS markers located within linked regions. Convergence was variable (0–73.3%) and was significantly higher than expected by chance for only two of the 19 phenotypes. Seventy-five loci of interest were identified which, being supported by independent lines of evidence, could merit prioritization in future investigations. Although convergence is supportive of genuine effects, the lack of agreement between GWLS and GWAS also indicates that these studies are designed to answer different questions and are not equally well suited to deciphering the genetics of complex traits.
genomic convergence; linkage; association; complex traits
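A minimal sketch of the convergence calculation, with made-up positions on a single hypothetical chromosome: the observed proportion of significant GWAS markers inside linked regions is compared against the chance expectation given by the genomic coverage of those regions.

```r
## Convergence = share of significant GWAS hits inside GWLS-linked regions,
## tested against the coverage fraction of the map (positions in Mb).
set.seed(4)
genome_len <- 250
linked <- rbind(c(30, 55), c(120, 140), c(200, 215))   # linked regions (Mb)
hits <- runif(40, 0, genome_len)                        # significant GWAS SNPs

in_linked <- sapply(hits, function(x) any(x >= linked[, 1] & x <= linked[, 2]))
convergence <- mean(in_linked)

coverage <- sum(linked[, 2] - linked[, 1]) / genome_len # chance expectation
binom.test(sum(in_linked), length(hits), p = coverage,
           alternative = "greater")
```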
Microarrays are widely used for examining differential gene expression, identifying single nucleotide polymorphisms, and detecting methylation loci. Multiple testing methods in microarray data analysis aim at controlling both Type I and Type II error rates; however, real microarray data do not always fit the distributional assumptions of these methods. Smyth's ubiquitous parametric method, for example, inadequately accommodates violations of normality assumptions, resulting in inflated Type I error rates. The Significance Analysis of Microarrays, another widely used method, is based on a permutation test and is robust to non-normally distributed data; however, its fold-change criteria are problematic and can critically alter the conclusions of a study as a result of compositional changes in the control data set used in the analysis. We propose a novel approach combining resampling with empirical Bayes methods: the Resampling-based empirical Bayes Methods. This approach not only reduces false discovery rates for non-normally distributed microarray data, but is also insensitive to the fold-change threshold, since no control data set needs to be selected. Through simulation studies, sensitivities, specificities, total rejections, and false discovery rates are compared across Smyth's parametric method, the Significance Analysis of Microarrays, and the Resampling-based empirical Bayes Methods. Differences in false discovery rate control between the approaches are illustrated through a preterm delivery methylation study. The results show that the Resampling-based empirical Bayes Methods offer significantly higher specificity and lower false discovery rates than Smyth's parametric method when data are not normally distributed. They also offer higher statistical power than the Significance Analysis of Microarrays when the proportion of significantly differentially expressed genes is large, for both normally and non-normally distributed data. Finally, the Resampling-based empirical Bayes Methods are generalizable to the analysis of next-generation sequencing RNA-seq data.
The significance of genetic association with a marker has traditionally been evaluated through statistics that are standardized so that their null distributions conform to known ones. Distributional assumptions are often required in this standardization procedure. Based on the observation that the phenotype remains the same regardless of the marker being investigated, we propose a simple statistic that does not need such standardization, together with a resampling procedure to assess its genome-wide significance. This method has been applied to replicate 2 of the Genetic Analysis Workshop 17 simulated data on unrelated individuals in an attempt to map phenotype Q2. However, none of the selected SNPs are in disease-causing genes. This may be due to the weak effect that each genetic factor has on Q2.
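The statistic itself is not spelled out in the abstract; the sketch below shows the standard way such a resampling procedure obtains genome-wide significance, permuting the phenotype once per replicate and recording the maximum unstandardized score over all markers. The data are simulated for illustration.

```r
## Genome-wide significance by permuting the phenotype and recording the
## maximum (unstandardized) association score across all markers.
set.seed(5)
n <- 100; m <- 500
geno <- matrix(rbinom(n * m, 2, 0.3), n, m)       # SNP dosages
q2 <- rnorm(n) + 0.5 * geno[, 17]                 # phenotype, one causal SNP

score <- function(y) abs(crossprod(geno, y - mean(y)))  # no standardization
obs <- score(q2)

B <- 500
max_null <- replicate(B, max(score(sample(q2))))  # one permutation, all markers
p_gw <- sapply(obs, function(s) (1 + sum(max_null >= s)) / (B + 1))
which(p_gw < 0.05)                                # genome-wide significant SNPs
```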
In eukaryotes, most DNA-binding proteins exert their action as members of large effector complexes. The presence of these complexes is revealed in high-throughput genome-wide assays by the co-occurrence of the binding sites of different complex components. Resampling tests are one route by which the statistical significance of apparent co-occurrence can be assessed.
We have investigated two resampling approaches for evaluating the statistical significance of binding-site co-occurrence. The permutation test approach was found to yield overly favourable p-values, while the independent resampling approach had the opposite effect and is of little use in practical terms. We have developed a new, pragmatically devised hybrid approach that, when applied to the experimental results of a Polycomb/Trithorax study, yielded p-values consistent with the findings of that study. We extended our investigations to the FL method developed by Haiminen et al., which derives its null distribution from all binding sites within a dataset, and show that the p-value computed for a pair of factors by this method can depend on which other factors are included in that dataset. Both our hybrid method and the FL method appeared to yield plausible estimates of the statistical significance of co-occurrences, although our hybrid method was more conservative when applied to the Polycomb/Trithorax dataset.
A high-performance parallelized implementation of the hybrid method is available.
We propose a new resampling-based co-occurrence significance test and demonstrate that it performs as well as or better than existing methods on a large experimentally-derived dataset. We believe it can be usefully applied to data from high-throughput genome-wide techniques such as ChIP-chip or DamID. The Cooccur package, which implements our approach, accompanies this paper.
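For orientation, here is a baseline resampling co-occurrence test of the kind the hybrid method refines: the observed number of factor-A sites with a factor-B site within a window is compared against placements of B's sites drawn uniformly at random (the "independent resampling" end of the spectrum). All coordinates are simulated.

```r
## Baseline resampling test for binding-site co-occurrence.
set.seed(6)
genome <- 1e6; w <- 500                 # co-occurrence window (bp)
sites_a <- sort(runif(80, 0, genome))
sites_b <- sort(c(sites_a[1:25] + rnorm(25, 0, 100),  # 25 genuinely close pairs
                  runif(55, 0, genome)))

n_cooc <- function(a, b) sum(sapply(a, function(x) any(abs(x - b) <= w)))
obs <- n_cooc(sites_a, sites_b)

B <- 1000
null <- replicate(B, n_cooc(sites_a, runif(length(sites_b), 0, genome)))
p <- (1 + sum(null >= obs)) / (B + 1)
```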
The use of current high-throughput genetic, genomic and post-genomic data leads to the simultaneous evaluation of a large number of statistical hypotheses and, at the same time, to the multiple-testing problem. As an alternative to the overly conservative Family-Wise Error Rate (FWER), the False Discovery Rate (FDR) has emerged over the last ten years as more appropriate for handling this problem. However, one drawback of the FDR is that it is tied to a given rejection region for the statistics considered, attributing the same value to statistics close to the rejection boundary and to those far from it. As a result, the local FDR has recently been proposed to quantify the specific probability that a given null hypothesis is true.
In this context we present a semi-parametric approach based on kernel estimators, which is applied to different types of high-throughput biological data, such as patterns in DNA sequences, gene expression and genome-wide association studies.
The proposed method has several practical advantages over existing approaches: it accommodates complex heterogeneity in the alternative hypothesis, takes into account prior information (from expert judgment or previous studies) by allowing a semi-supervised mode, and deals with truncated distributions such as those obtained in Monte Carlo simulations. The method has been implemented and is available through the R package kerfdr via CRAN or at .
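A minimal sketch of the two-component local FDR idea underlying such kernel approaches (this is not the kerfdr code): the marginal density of z-scores is estimated by a kernel density, the null component is the theoretical N(0,1), and lfdr(z) = pi0 * f0(z) / f(z), with a crude central-window estimate of pi0.

```r
## Kernel-based local FDR: lfdr(z) = pi0 * f0(z) / f(z).
set.seed(7)
z <- c(rnorm(900), rnorm(100, 3))        # 10% of tests from the alternative

f_hat <- density(z, n = 1024)            # kernel estimate of the marginal f
f_at  <- approx(f_hat$x, f_hat$y, xout = z)$y
pi0   <- min(1, mean(abs(z) < 1) / (pnorm(1) - pnorm(-1)))  # crude pi0
lfdr  <- pmin(1, pi0 * dnorm(z) / f_at)

sum(lfdr < 0.2)                          # tests flagged as interesting
```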
Significance analysis plays a major role in identifying and ranking genes, transcription factor binding sites, DNA methylation regions, and other high-throughput features associated with illness. We propose a new approach, called gene set bagging, for measuring the probability that a gene set will replicate in future studies. Gene set bagging involves resampling the original high-throughput data, performing gene-set analysis on the resampled data, and checking whether biological categories replicate in the bagged samples.
Using both simulated and publicly available genomics data, we demonstrate that significant categories in a gene set enrichment analysis may be unstable when subjected to resampling. We show that our method estimates the replication probability (R), the probability that a gene set will replicate as a significant result in future studies, and show in simulations that this measure reflects replication better than each set's p-value.
Our results suggest that gene lists based on p-values are not necessarily stable, and therefore additional steps like gene set bagging may improve biological inference on gene sets.
Gene set enrichment analysis; Gene expression; DNA methylation; Gene ontology
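A compact sketch of the bagging loop under simplifying assumptions (a plain two-group design and a deliberately simple enrichment test standing in for a full gene-set analysis): arrays are resampled with replacement within groups, the set-level test is rerun on each bagged data set, and R is estimated as the fraction of replicates in which the set remains significant.

```r
## Gene set bagging: resample arrays, redo the gene-set test, report the
## fraction of bagged replicates in which the set is significant (R).
set.seed(8)
n <- 20; g <- 1000
expr <- matrix(rnorm(g * n), g, n)
grp  <- rep(c(0, 1), each = n / 2)
set_idx <- 1:50
expr[set_idx, grp == 1] <- expr[set_idx, grp == 1] + 0.4  # modest enrichment

set_p <- function(e) {                   # toy stand-in for a gene-set analysis
  tt <- apply(e, 1, function(y) t.test(y[grp == 1], y[grp == 0])$statistic)
  t.test(abs(tt[set_idx]), abs(tt[-set_idx]))$p.value
}

B <- 50
bagged <- replicate(B, {
  idx <- c(sample(which(grp == 0), replace = TRUE),   # bag within each group
           sample(which(grp == 1), replace = TRUE))
  set_p(expr[, idx])
})
R_hat <- mean(bagged < 0.05)             # estimated replication probability
```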
Tumor tissues generally exhibit aberrations in DNA copy number that are associated with the development and progression of cancer. Genotyping methods such as array-based comparative genomic hybridization (aCGH) provide a means of identifying copy number variation across the entire genome. To address some of the shortfalls of existing methods of DNA copy number data analysis, including strong model assumptions, lack of accounting for the sampling variability of estimators, and the assumption that clones are independent, we propose a simple graphical approach, based on moving averages, to assess population-level genetic alterations over the entire genome. Furthermore, existing methods primarily focus on segmentation and do not examine the association of covariates with genetic instability. In our methods, covariates are incorporated through a possibly misspecified working model, and the sampling variability of estimators is approximated using a resampling method based on perturbing observed processes. Our proposal, which is applicable to partial, entire or multiple chromosomes, is illustrated through application to aCGH studies of two brain tumor types, meningioma and glioma.
aCGH data; Moving average; Perturbation method; Gaussian process; Genomic data
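A sketch of the two ingredients, using simulated log-ratios: a moving average smooths the group-level profile along the genome, and the perturbation resampling approximates the estimator's sampling variability by reweighting each subject's centered contribution with an independent N(0,1) multiplier (a wild-bootstrap-style scheme; the paper's exact construction may differ).

```r
## Moving-average copy-number profile with perturbation-based variability.
set.seed(9)
n <- 25; m <- 400                     # subjects x clones (ordered by position)
lr <- matrix(rnorm(n * m, 0, 0.3), n, m)
lr[, 150:200] <- lr[, 150:200] + 0.5  # shared gain region

ma <- function(v, k = 11) stats::filter(v, rep(1 / k, k), sides = 2)
mu_hat <- ma(colMeans(lr))            # smoothed group profile

B <- 200
pert <- replicate(B, {
  g <- rnorm(n)                       # perturbation weights, one per subject
  ma(colMeans(lr) + colMeans(g * sweep(lr, 2, colMeans(lr))))
})
band <- apply(pert, 1, sd, na.rm = TRUE)  # pointwise SD along the genome
```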
The Epstein-Barr virus (EBV) double-stranded DNA genome is subject to extensive epigenetic regulation. Large consortia and individual labs have generated a vast number of genome-wide data sets on human lymphoblastoid and other cell lines latently infected with EBV. Analysis of these data sets reveals important new information on the organization of host and viral chromosome structure and on epigenetic modifications. We discuss the mapping of these data sets and the subsequent insights into the chromatin structure and transcription factor binding patterns on latent EBV genomes. Colocalization of multiple histone modifications and transcription factors at regulatory loci is considered in the context of the biology and regulation of EBV.
Epstein-Barr virus; gammaherpesvirus; chromatin; histone modification; CTCF; OriP
Community detection helps us simplify the complex configuration of networks, but communities are reliable only if they are statistically significant. To detect statistically significant communities, a common approach is to resample the original network and analyze the communities. But resampling assumes independence between samples, while the components of a network are inherently dependent. Therefore, we must understand how breaking dependencies between resampled components affects the results of the significance analysis. Here we use scientific communication as a model system to analyze this effect. Our dataset includes citations among articles published in journals in the years 1984–2010. We compare parametric resampling of citations with non-parametric article resampling. While citation resampling breaks link dependencies, article resampling maintains such dependencies. We find that citation resampling underestimates the variance of link weights. Moreover, this underestimation explains most of the differences in the significance analysis of ranking and clustering. Therefore, when only link weights are available and article resampling is not an option, we suggest a simple parametric resampling scheme that generates link-weight variances close to the link-weight variances of article resampling. Nevertheless, when we highlight and summarize important structural changes in science, the more dependencies we can maintain in the resampling scheme, the earlier we can predict structural change.
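A toy contrast between the two schemes, with simulated per-article citation counts for a single journal-journal link: bootstrapping whole articles preserves the burstiness of citations within articles, while treating the aggregate link weight as Poisson (independent citations) does not.

```r
## Article resampling vs independent-citation (Poisson) resampling of a
## single link weight; the Poisson scheme underestimates the variance when
## citations cluster within articles.
set.seed(10)
n_art <- 500
cites_per_art <- rpois(n_art, 2) *
  sample(c(1, 5), n_art, TRUE, c(0.9, 0.1))   # occasional citation bursts
w_obs <- sum(cites_per_art)                   # observed link weight

B <- 2000
w_article  <- replicate(B, sum(sample(cites_per_art, replace = TRUE)))
w_citation <- rpois(B, w_obs)                 # independent-citation model

var(w_article); var(w_citation)               # article variance is larger
```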
Fluoroquinolones are an important class of antibiotics for the treatment of infections arising from the gram-positive respiratory pathogen Streptococcus pneumoniae. Although there is evidence supporting interspecific lateral DNA transfer of fluoroquinolone target loci, no studies have specifically been designed to assess the role of intraspecific lateral transfer of these genes in the spread of fluoroquinolone resistance. This study involves a comparative evolutionary perspective, in which the evolutionary history of a diverse set of S. pneumoniae clinical isolates is reconstructed from an expanded multilocus sequence typing data set, with putative recombinants excluded. This control history is then assessed against networks of each of the four fluoroquinolone target loci from the same isolates. The results indicate that although the majority of fluoroquinolone target loci from this set of 60 isolates are consistent with a clonal dissemination hypothesis, 3 to 10% of the sequences are consistent with an intraspecific lateral transfer hypothesis. Also evident were examples of interspecific transfer, with two isolates possessing a parE-parC gene region arising from viridans group streptococci. The Spain 23F-1 clone is the most dominant fluoroquinolone-nonsusceptible clone in this set of isolates, and the analysis suggests that its members act as frequent donors of fluoroquinolone-nonsusceptible loci. Although the majority of fluoroquinolone target gene sequences in this set of isolates can be explained on the basis of clonal dissemination, a significant number are more parsimoniously explained by intraspecific lateral DNA transfer, and in situations of high S. pneumoniae population density, such events could be an important means of resistance spread.
Genes for complex disorders have proven hard to find using linkage analysis. The results rarely reach the desired level of significance, and researchers have often failed to replicate positive findings. There is, however, a wealth of information from other scientific approaches that enables the formation of hypotheses about groups of genes or genomic regions likely to be enriched in disease loci. Examples include genes belonging to specific pathways or producing proteins that interact with known risk factors, genes that show altered expression levels in patients, or even the group of top-scoring locations in a linkage study. We show here that such a hypothesis of enrichment for disease loci can be tested using genome-wide linkage data, provided that these data are independent of the data used to generate the hypothesis. Our method is based on the fact that non-parametric linkage analyses are expected to show increased scores at each of the disease loci, although the increase may not rise above the noise of stochastic variation. By using a summary statistic and calculating its empirical significance, we show that enrichment hypotheses can be tested with power higher than that of the linkage scan data to identify individual loci. Via simulated linkage scans for a number of different models, we gain insight into the interpretation of genome scan results and test the power of our proposed method. We present an application of the method to real data from a late-onset Alzheimer's disease linkage scan as a proof of principle.
linkage; genome scan; complex disorder; genes; group
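A bare-bones sketch of the summary-statistic test on simulated NPL scores: the scores at the hypothesized loci are summed and compared with sums over randomly drawn locus sets of the same size. (Real linkage scores are autocorrelated along the genome, so a region-preserving null, e.g. circular shifts, would be preferable to the independent locus draws used here.)

```r
## Empirical significance of a summary statistic over a hypothesized locus set.
set.seed(11)
n_loci <- 1000
npl <- rnorm(n_loci)                           # genome-wide NPL scores
npl[c(101:110, 501:510)] <- npl[c(101:110, 501:510)] + 0.8  # true disease loci

hyp <- c(101:110, 501:510, 801:810)            # loci from external evidence
s_obs <- sum(npl[hyp])

B <- 10000
s_null <- replicate(B, sum(npl[sample(n_loci, length(hyp))]))
p_enrich <- (1 + sum(s_null >= s_obs)) / (B + 1)
```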
One mechanism by which disease-associated DNA variation can alter disease risk is altering gene expression. However, linkage disequilibrium (LD) between variants, mostly single-nucleotide polymorphisms (SNPs), means it is not sufficient to show that a particular variant associates with both disease and expression, as there could be two distinct causal variants in LD. Here, we describe a formal statistical test of colocalization and apply it to type 1 diabetes (T1D)-associated regions identified mostly through genome-wide association studies and expression quantitative trait loci (eQTLs) discovered in a recently determined large monocyte expression data set from the Gutenberg Health Study (1370 individuals), with confirmation sought in an additional data set from the Cardiogenics Transcriptome Study (558 individuals). We excluded 39 out of 60 overlapping eQTLs in 49 T1D regions from possible colocalization and identified 21 coincident eQTLs, representing 21 genes in 14 distinct T1D regions. Our results reflect the importance of monocyte (and their derivatives, macrophage and dendritic cell) gene expression in human T1D and support the candidacy of several genes as causal factors in autoimmune pancreatic beta-cell destruction, including AFF3, CD226, CLECL1, DEXI, FKRP, PRKD2, RNLS, SMARCE1 and SUOX, in addition to the recently described GPR183 (EBI2) gene.
Two- or three-dimensional wavelet transforms have been considered as a basis for multiple hypothesis testing of parametric maps derived from functional magnetic resonance imaging (fMRI) experiments. Most previous approaches have assumed that the noise variance is equally distributed across levels of the transform. Here we show that this assumption is unrealistic; fMRI parameter maps typically have more similarity to a 1/f-type spatial covariance, with greater variance in 2D wavelet coefficients representing lower spatial frequencies, or coarser spatial features, in the maps. To address this issue we resample the fMRI time series data in the wavelet domain (using a 1D discrete wavelet transform [DWT]) to produce a set of permuted parametric maps that are decomposed (using a 2D DWT) to estimate level-specific variances of the 2D wavelet coefficients under the null hypothesis. These resampling-based estimates of the “wavelet variance spectrum” are substituted in a Bayesian bivariate shrinkage operator to denoise the observed 2D wavelet coefficients, which are then inverted to reconstitute the observed, denoised map in the spatial domain. Multiple hypothesis testing controlling the false discovery rate in the observed, denoised maps then proceeds in the spatial domain, using thresholds derived from an independent set of permuted, denoised maps. We show empirically that this more realistic, resampling-based algorithm for wavelet-based denoising and multiple hypothesis testing has good Type I error control and can detect experimentally engendered signals in data acquired during auditory-linguistic processing.
Bayes; multiple comparisons; nonparametric; permutation; wavelets
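The 1D part of the scheme, resampling a time series in the wavelet domain, can be sketched with a hand-rolled Haar DWT (the paper's choice of wavelet and the fMRI specifics are not reproduced here): detail coefficients are permuted within each level and the transform is inverted, which approximately preserves the level-wise, 1/f-like variance profile.

```r
## Wavelet-domain resampling ("wavestrapping") with a hand-rolled Haar DWT.
haar_dwt <- function(x) {              # length of x must be a power of 2
  d <- list(); s <- x
  while (length(s) > 1) {
    o <- s[c(TRUE, FALSE)]; e <- s[c(FALSE, TRUE)]
    d[[length(d) + 1]] <- (o - e) / sqrt(2)   # detail coefficients
    s <- (o + e) / sqrt(2)                    # smooth coefficients
  }
  list(smooth = s, details = d)
}
haar_idwt <- function(w) {             # exact inverse of haar_dwt
  s <- w$smooth
  for (lev in rev(seq_along(w$details))) {
    dd <- w$details[[lev]]
    out <- numeric(2 * length(s))
    out[c(TRUE, FALSE)] <- (s + dd) / sqrt(2)
    out[c(FALSE, TRUE)] <- (s - dd) / sqrt(2)
    s <- out
  }
  s
}
set.seed(12)
x <- as.numeric(arima.sim(list(ar = 0.7), 256))  # autocorrelated "fMRI" series
w <- haar_dwt(x)
w$details <- lapply(w$details, function(d) d[sample.int(length(d))])
x_surrogate <- haar_idwt(w)            # one resampled (permuted) series
```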
Non-parametric bootstrapping is a widely used statistical procedure for assessing the confidence of model parameters based on the empirical distribution of the observed data and, as such, it has become a common method for assessing tree confidence in phylogenetics. Traditional non-parametric bootstrapping does not weight the trees inferred from resampled (i.e., pseudo-replicated) sequences. Hence, the quality of these trees is not taken into account when computing bootstrap scores associated with the clades of the original phylogeny. As a consequence, trees with different bootstrap support, or those providing a different fit to the corresponding pseudo-replicated sequences (the fit quality can be expressed through the LS, ML or parsimony score), traditionally contribute in the same way to the computation of the bootstrap support of the original phylogeny.
In this article, we discuss the idea of applying weighted bootstrapping to phylogenetic reconstruction by weighting each phylogeny inferred from resampled sequences. Tree weights can be based either on the least-squares (LS) tree estimate or on the average secondary bootstrap score (SBS) associated with each resampled tree, where secondary bootstrapping consists of estimating the bootstrap scores of the trees inferred from resampled data. The LS- and SBS-based bootstrapping procedures were designed to take into account the quality of each "pseudo-replicated" phylogeny in the final tree estimation. A simulation study was carried out to evaluate the performance of five weighting strategies: LS-based and SBS-based bootstrapping, LS-based and SBS-based bootstrapping with data normalization, and the traditional unweighted bootstrapping.
The simulations, conducted with two real data sets and the five weighting strategies, suggest that SBS-based bootstrapping with data normalization usually exhibits larger bootstrap scores and higher robustness than the four competing strategies, including the traditional bootstrapping. The high robustness of the normalized SBS could be particularly useful in situations where the observed sequences have been affected by noise or have undergone massive insertion or deletion events. The results provided by the four other strategies were very similar regardless of the noise level, thus also demonstrating the stability of the traditional bootstrapping method.
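The weighting idea reduces to replacing the equal 1/B contribution of each replicate tree with a fit-based weight. A toy sketch with simulated replicates follows; the inverse-LS weighting used here is one possible choice, and the paper's SBS-based weights and normalization are not reproduced.

```r
## Weighted vs traditional bootstrap support for one clade.
set.seed(13)
B <- 100
clade_present <- rbinom(B, 1, 0.7)      # clade recovered in replicate trees?
ls_score <- rexp(B, 1) +                # fit of each replicate tree
  ifelse(clade_present == 1, 0, 0.5)    # (lower score = better fit)

w <- 1 / ls_score; w <- w / sum(w)      # normalized fit-based weights
support_weighted   <- sum(w * clade_present)
support_unweighted <- mean(clade_present)  # traditional bootstrap support
```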
Initiation of DNA replication in higher eukaryotes is still a matter of controversy. Some evidence suggests that it occurs at specific sites, but data obtained using two-dimensional (2D) agarose gel electrophoresis have led to the notion that it may occur at random within broad zones. This hypothesis is primarily based on the observation that several contiguous DNA fragments generate a mixture of the so-called 'bubble' and 'simple Y' patterns in neutral/neutral 2D gels. The interpretation that this mixture of hybridisation patterns is indicative of random initiation of DNA synthesis relies on the assumption that replicative intermediates (RIs) containing an internal bubble, where initiation occurred at different relative positions, generate comigrating signals. The latter, however, had yet to be proven. We investigated this problem by analysing together, in the same 2D gel, populations of pBR322 RIs that were digested with different restriction endonucleases, each cutting the monomer only once but at different locations. DNA synthesis begins at a specific site in pBR322 and progresses in a unidirectional manner; thus, the main difference between these sets of RIs was the relative position of the origin. The results clearly showed that populations of RIs containing an internal bubble, where initiation occurred at different relative positions, do not generate signals that co-migrate all the way in 2D gels. Despite this observation, our results support the notion that random initiation is indeed responsible for the peculiar 'bubble' signal observed in several metazoan eukaryotes.
We consider the problem of evaluating a statistical hypothesis when some model characteristics are non-identifiable from the observed data. Such a scenario is common in meta-analysis when assessing publication bias, and in longitudinal studies when evaluating a covariate effect while dropouts are likely to be non-ignorable. One possible approach to this problem is to fix a minimal set of sensitivity parameters conditional upon which the hypothesized parameters are identifiable. Here, we extend this idea and show how to evaluate the hypothesis of interest using an infimum statistic over the whole support of the sensitivity parameter. We characterize the limiting distribution of the statistic as a process in the sensitivity parameter, which involves a careful theoretical analysis of its behavior under model misspecification. In practice, we suggest a nonparametric bootstrap procedure to implement this infimum test, as well as to construct confidence bands for simultaneous pointwise tests across all values of the sensitivity parameter, adjusting for multiple testing. The methodology's practical utility is illustrated in an analysis of a longitudinal psychiatric study.
Data missing not at random; Functional estimators; Infimum statistic; Model misspecification; Nonparametric bootstrap; Profile estimators; Sensitivity analysis; Uniform convergence
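A stylized sketch of the infimum test for a mean with non-ignorable dropout: the sensitivity parameter delta shifts the mean imputed for the missing outcomes, a Wald statistic is profiled over the delta support, the infimum is taken, and a bootstrap under the centered (null) distribution supplies the critical value. The model and statistic are deliberate simplifications of the paper's general framework.

```r
## Infimum test over a sensitivity-parameter grid, bootstrap calibrated.
set.seed(14)
n <- 200
y <- rnorm(n, 0.4); miss <- rbinom(n, 1, 0.3) == 1   # 30% dropout
deltas <- seq(-1, 1, by = 0.1)                       # support of delta

wald <- function(yobs, nmiss, d) {     # crude z for the identified mean,
  N  <- length(yobs) + nmiss           # missing imputed at mean(yobs) + d
  mu <- (sum(yobs) + nmiss * (mean(yobs) + d)) / N
  mu / (sd(yobs) / sqrt(N))
}
t_inf <- min(sapply(deltas, function(d) abs(wald(y[!miss], sum(miss), d))))

B <- 500
t_boot <- replicate(B, {
  yb <- sample(y[!miss], replace = TRUE) - mean(y[!miss])  # null: mean zero
  min(sapply(deltas, function(d) abs(wald(yb, sum(miss), d))))
})
p_inf <- mean(t_boot >= t_inf)         # reject only if significant for all delta
```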
Much forensic inference based upon DNA evidence is made assuming Hardy-Weinberg Equilibrium (HWE) for the genetic loci being used. Several statistical tests to detect and measure deviation from HWE have been devised, and their limitations become more obvious when testing for deviation within multiallelic DNA loci. The most popular methods, the chi-square and likelihood-ratio tests, are based on asymptotic results and cannot guarantee good performance in the presence of low-frequency genotypes. Since the dimension of the parameter space increases quadratically with the number of alleles, some authors suggest applying sequential methods, where the multiallelic case is reformulated as a sequence of “biallelic” tests. However, in this approach it is not obvious how to assess the overall evidence for the original hypothesis, nor is it clear how to establish the significance level for its acceptance or rejection. In this work, we introduce a straightforward method for the multiallelic HWE test which overcomes these issues of sequential methods. The core theory for the proposed method is given by the Full Bayesian Significance Test (FBST), an intuitive Bayesian approach that does not assign positive probabilities to zero-measure sets when testing sharp hypotheses. We compare the performance of the FBST to the chi-square, likelihood-ratio and Markov chain tests in three numerical experiments. The results suggest that the FBST is a robust and high-performance method for the HWE test, even in the presence of several alleles and small sample sizes.
Hardy-Weinberg equilibrium; significance tests; FBST
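For the biallelic case the FBST is short enough to sketch in full: with a uniform prior the posterior of the genotype probabilities is Dirichlet(counts + 1), the supremum of the posterior density over the Hardy-Weinberg curve is found by a one-dimensional optimization over the allele frequency, and the e-value is the posterior mass of points whose density does not exceed that supremum (small values count as evidence against HWE). The counts below are hypothetical.

```r
## FBST e-value for biallelic HWE.
set.seed(15)
counts <- c(AA = 18, Aa = 55, aa = 27)
alpha  <- counts + 1                     # Dirichlet posterior, uniform prior

log_post <- function(p)                  # Dirichlet log-density at p
  lgamma(sum(alpha)) - sum(lgamma(alpha)) + sum((alpha - 1) * log(p))

## supremum of the posterior over H: theta(q) = (q^2, 2q(1-q), (1-q)^2)
f_star <- optimize(function(q)
    log_post(c(q^2, 2 * q * (1 - q), (1 - q)^2)),
  interval = c(1e-6, 1 - 1e-6), maximum = TRUE)$objective

M <- 1e5                                 # posterior draws via the gamma trick
g <- matrix(rgamma(3 * M, shape = rep(alpha, each = M)), nrow = M)
p <- g / rowSums(g)
ld <- lgamma(sum(alpha)) - sum(lgamma(alpha)) +
      (log(p) %*% (alpha - 1))[, 1]
ev <- mean(ld <= f_star)                 # e-value in favour of HWE
```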
Metabolic fingerprinting studies rely on interpretations drawn from low-dimensional representations of spectral data generated by methods of multivariate analysis such as PCA and PLS-DA. The growth of metabolic fingerprinting and of chemometric analyses involving these low-dimensional scores plots necessitates quantitative statistical measures to describe significant differences between experimental groups. Our updated version of the PCAtoTree software provides methods to reliably visualize and quantify separations in scores plots, through dendrograms employing both nonparametric and parametric hypothesis testing to assess node significance, as well as scores plots identifying 95% confidence ellipsoids for all experimental groups.
PCA; PLS-DA; MVA; UPGMA; Hierarchical clustering; Bootstrapping; Statistical hypothesis testing; Mahalanobis distance; Hotelling T2 distribution; Metabolomics
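A minimal version of the scores-plot ellipse, using the chi-square approximation to the Hotelling T2 region (the software's exact scaling may differ):

```r
## 95% confidence ellipse for one group's PCA scores.
conf_ellipse <- function(scores, level = 0.95, npts = 200) {
  ctr <- colMeans(scores); S <- cov(scores)
  r   <- sqrt(qchisq(level, df = 2))        # large-sample radius
  ang <- seq(0, 2 * pi, length.out = npts)
  circ <- cbind(cos(ang), sin(ang)) * r
  t(ctr + t(circ %*% chol(S)))              # map unit circle through cov factor
}
set.seed(16)
sc <- MASS::mvrnorm(40, c(0, 0), matrix(c(4, 1.5, 1.5, 1), 2))
plot(sc, xlab = "PC1", ylab = "PC2")
lines(conf_ellipse(sc), col = "red")
```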
Large-scale sequencing of genomes has enabled the inference of phylogenies based on the evolution of genomic architecture, under such events as rearrangements, duplications, and losses. Many evolutionary models and associated algorithms have been designed over the last few years and have found use in comparative genomics and phylogenetic inference. However, the assessment of phylogenies built from such data has not been properly addressed to date. The standard method used in sequence-based phylogenetic inference is the bootstrap, but it relies on a large number of homologous characters that can be resampled; yet in the case of rearrangements, the entire genome is a single character. Alternatives such as the jackknife suffer from the same problem, while likelihood tests cannot be applied in the absence of well established probabilistic models.
We present a new approach to the assessment of distance-based phylogenetic inference from whole-genome data; our approach combines features of the jackknife and the bootstrap and remains nonparametric. For each feature of our method, we give an equivalent feature in the sequence-based framework; we also present the results of extensive experimental testing, in both sequence-based and genome-based frameworks. Through the feature-by-feature comparison and the experimental results, we show that our bootstrapping approach is on par with the classic phylogenetic bootstrap used in sequence-based reconstruction, and we establish the clear superiority of the classic bootstrap for sequence data and of our corresponding new approach for rearrangement data over proposed variants. Finally, we test our approach on a small dataset of mammalian genomes, verifying that the support values match current thinking about the respective branches.
Our method is the first to provide a standard of assessment to match that of the classic phylogenetic bootstrap for aligned sequences. Its support values follow a similar scale and its receiver-operating characteristics are nearly identical, indicating that it provides similar levels of sensitivity and specificity. Thus our assessment method makes it possible to conduct phylogenetic analyses on whole genomes with the same degree of confidence as for analyses on aligned sequences. Extensions to search-based inference methods such as maximum parsimony and maximum likelihood are possible, but remain to be thoroughly tested.
Bootstrap; Jackknife; Phylogenetic reconstruction; Rearrangement; Gene order; Comparative genomics
In this paper, we carry out an in-depth investigation of diagnostic measures for assessing the influence of observations and model misspecification in the presence of missing covariate data for generalized linear models. Our diagnostic measures include case-deletion measures and conditional residuals. We use the conditional residuals to construct goodness-of-fit statistics for testing possible misspecifications in model assumptions, including the sampling distribution. We develop specific strategies for incorporating missing data into goodness-of-fit statistics in order to increase the power of detecting model misspecification. A resampling method is proposed to approximate the p-value of the goodness-of-fit statistics. Simulation studies are conducted to evaluate our methods and a real data set is analysed to illustrate the use of our various diagnostic measures.
Cook’s distance; goodness-of-fit; missing covariates; residuals
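A complete-data analogue of two of these ingredients, in base R: case-deletion influence via Cook's distance, and a parametric-bootstrap p-value for a simple goodness-of-fit statistic, here the sum of squared Pearson residuals of a logistic model. The paper's missing-covariate machinery is not reproduced.

```r
## Case-deletion diagnostics and a resampled goodness-of-fit p-value for a GLM.
set.seed(17)
n <- 300
x <- rnorm(n); y <- rbinom(n, 1, plogis(-0.5 + x))
fit <- glm(y ~ x, family = binomial)

infl <- cooks.distance(fit)                  # case-deletion influence measure
gof  <- function(f) sum(residuals(f, type = "pearson")^2)

B <- 500
g_null <- replicate(B, {
  yb <- rbinom(n, 1, fitted(fit))            # simulate from the fitted model
  gof(glm(yb ~ x, family = binomial))
})
p_gof <- (1 + sum(g_null >= gof(fit))) / (B + 1)
```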
There are currently a number of competing techniques for low-level processing of oligonucleotide array data. The choice of technique has a profound effect on subsequent statistical analyses, but there is no method to assess whether a particular technique is appropriate for a specific data set, without reference to external data.
We analyzed coregulation between genes in order to detect insufficient normalization between arrays, where coregulation is measured in terms of statistical correlation. In a large collection of genes, a random pair of genes should on average have zero correlation, which allows a correlation test. For all data sets that we evaluated, and for the three most commonly used low-level processing procedures, MAS5, RMA and MBEI, housekeeping-gene normalization failed the test. For a real clinical data set, RMA and MBEI showed significant correlation for absent genes. We also found that a second round of normalization on the probe-set level improved normalization significantly throughout.
Previous evaluation of low-level processing in the literature has been limited to artificial spike-in and mixture data sets. In the absence of a known gold-standard, the correlation criterion allows us to assess the appropriateness of low-level processing of a specific data set and the success of normalization for subsets of genes.
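A minimal version of the correlation criterion on simulated data: correlations between randomly paired genes should average zero after adequate normalization, so a residual array effect shows up as a significantly nonzero mean.

```r
## Correlation test for residual array effects after normalization.
set.seed(18)
g <- 2000; a <- 30
expr <- matrix(rnorm(g * a), g, a)
expr <- expr + rep(rnorm(a, 0, 0.3), each = g)  # an uncorrected array effect

pairs <- matrix(sample(g, 2000), ncol = 2)      # 1000 random gene pairs
rho <- apply(pairs, 1, function(ij) cor(expr[ij[1], ], expr[ij[2], ]))
t.test(rho)                                     # mean should be ~0 if normalized
```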