Accuracy in the diagnosis of breast cancer and classification of cancer subtypes has improved over the years with the development of well-established immunohistopathological criteria. More recently, diagnostic gene-sets at the mRNA expression level have been tested as better predictors of disease state. However, breast cancer is heterogeneous in nature; thus extraction of differentially expressed gene-sets that stably distinguish normal tissue from various pathologies poses challenges. Meta-analysis of high-throughput expression data using a collection of statistical methodologies leads to the identification of robust tumor gene expression signatures.
A resampling-based meta-analysis strategy was developed that combines resampling with distribution statistics to assess the significance of differential expression between sample classes. Two independent microarray datasets that contain normal breast, invasive ductal carcinoma (IDC), and invasive lobular carcinoma (ILC) samples were used for the meta-analysis. Expression of genes selected from the gene list for classification of normal breast samples and breast tumors encompassing both the ILC and IDC subtypes was tested on 10 independent primary IDC samples and matched non-tumor controls by real-time qRT-PCR. Other existing breast cancer microarray datasets were used in support of the resampling-based meta-analysis.
The two independent microarray studies were found to be comparable, although differing in their experimental methodologies (Pearson correlation coefficient, R = 0.9389 and R = 0.8465 for ductal and lobular samples, respectively). The resampling-based meta-analysis has led to the identification of a highly stable set of genes for classification of normal breast samples and breast tumors encompassing both the ILC and IDC subtypes. The expression results of the selected genes obtained through real-time qRT-PCR supported the meta-analysis results.
The proposed meta-analysis approach has the ability to detect a set of differentially expressed genes with the least amount of within-group variability, thus providing highly stable gene lists for class prediction. Increased statistical power and stringent filtering criteria used in the present study also make identification of novel candidate genes possible and may provide further insight to improve our understanding of breast cancer development.
The number of available algorithms to infer a biological network from a dataset of high-throughput measurements is overwhelming and keeps growing. However, evaluating their performance is unfeasible unless a ‘gold standard’ is available to measure how close the reconstructed network is to the ground truth. One measure of this is the stability of these predictions to data resampling approaches. We introduce NetSI, a family of Network Stability Indicators, to assess quantitatively the stability of a reconstructed network in terms of inference variability due to data subsampling. In order to evaluate network stability, the main NetSI methods use a global/local network metric in combination with a resampling (bootstrap or cross-validation) procedure. In addition, we provide two normalized variability scores over data resampling to measure edge weight stability and node degree stability, and then introduce a stability ranking for edges and nodes. A complete implementation of the NetSI indicators, including the Hamming-Ipsen-Mikhailov (HIM) network distance adopted in this paper, is available in the R package nettools. We demonstrate the use of the NetSI family by measuring network stability on four datasets against alternative network reconstruction methods. First, the effect of sample size on stability of inferred networks is studied in a gold standard framework on yeast-like data from the GeneNetWeaver simulator. We also consider the impact of varying modularity on a set of structurally different networks (50 nodes, from 2 to 10 modules), and then of complex feature covariance structure, showing the different behaviours of standard reconstruction methods based on Pearson correlation, Maximum Information Coefficient (MIC) and False Discovery Rate (FDR) strategy.
Finally, we demonstrate a strong combined effect of different reconstruction methods and phenotype subgroups on a hepatocellular carcinoma miRNA microarray dataset (240 subjects), and we validate the analysis on a second dataset (166 subjects) with good reproducibility.
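The idea of measuring stability under data resampling can be illustrated with a minimal sketch. It is not the nettools implementation: the function names (`infer_network`, `network_distance`, `stability`) are hypothetical, the network is inferred as absolute Pearson correlation, and a mean absolute edge-weight difference stands in for the HIM distance, which is considerably more involved.

```python
import random
import statistics
from itertools import combinations


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either variable is constant."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def infer_network(rows):
    """Return {(i, j): |r_ij|} for all variable pairs; each row is one sample."""
    cols = list(zip(*rows))
    return {(i, j): abs(pearson(cols[i], cols[j]))
            for i, j in combinations(range(len(cols)), 2)}


def network_distance(a, b):
    """Mean absolute edge-weight difference (a simple stand-in, NOT HIM)."""
    return statistics.fmean([abs(a[e] - b[e]) for e in a])


def stability(rows, n_resamples=50, seed=0):
    """Bootstrap the samples, re-infer the network, summarize variability."""
    rng = random.Random(seed)
    nets = [infer_network([rng.choice(rows) for _ in rows])
            for _ in range(n_resamples)]
    # Global indicator: mean pairwise distance between resampled networks.
    global_indicator = statistics.fmean(
        [network_distance(a, b) for a, b in combinations(nets, 2)])
    # Edge-level indicator: spread of each edge weight across resamples.
    edge_spread = {e: statistics.pstdev([n[e] for n in nets]) for e in nets[0]}
    return global_indicator, edge_spread


# Toy data (assumed): variable 1 tracks variable 0, variable 2 is independent.
rng = random.Random(1)
rows = []
for _ in range(40):
    x = rng.gauss(0, 1)
    rows.append((x, x + rng.gauss(0, 0.1), rng.gauss(0, 1)))
global_indicator, edge_spread = stability(rows)
```

In this toy setting the strongly supported edge (0, 1) should vary far less across resamples than the spurious edge (0, 2), which is the kind of contrast an edge stability ranking is meant to expose.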
Microarrays are widely used for examining differential gene expression, identifying single nucleotide polymorphisms, and detecting methylation loci. Multiple testing methods in microarray data analysis aim at controlling both Type I and Type II error rates; however, real microarray data do not always fit their distribution assumptions. Smyth's ubiquitous parametric method, for example, inadequately accommodates violations of normality assumptions, resulting in inflated Type I error rates. The Significance Analysis of Microarrays, another widely used microarray data analysis method, is based on a permutation test and is robust to non-normally distributed data; however, its fold change criteria are problematic and can critically alter the conclusions of a study as a result of compositional changes of the control data set in the analysis. We propose a novel approach, combining resampling with empirical Bayes methods: the Resampling-based empirical Bayes Methods. This approach not only reduces false discovery rates for non-normally distributed microarray data, but it is also impervious to the fold change threshold since no control data set selection is needed. Through simulation studies, sensitivities, specificities, total rejections, and false discovery rates are compared across Smyth's parametric method, the Significance Analysis of Microarrays, and the Resampling-based empirical Bayes Methods. Differences in false discovery rate control among the approaches are illustrated through a preterm delivery methylation study. The results show that the Resampling-based empirical Bayes Methods offer significantly higher specificity and lower false discovery rates compared to Smyth's parametric method when data are not normally distributed.
The Resampling-based empirical Bayes Methods also offer higher statistical power than the Significance Analysis of Microarrays method when the proportion of significantly differentially expressed genes is large for both normally and non-normally distributed data. Finally, the Resampling-based empirical Bayes Methods are generalizable to next generation sequencing RNA-seq data analysis.
Community detection helps us simplify the complex configuration of networks, but communities are reliable only if they are statistically significant. To detect statistically significant communities, a common approach is to resample the original network and analyze the communities. But resampling assumes independence between samples, while the components of a network are inherently dependent. Therefore, we must understand how breaking dependencies between resampled components affects the results of the significance analysis. Here we use scientific communication as a model system to analyze this effect. Our dataset includes citations among articles published in journals in the years 1984–2010. We compare parametric resampling of citations with non-parametric article resampling. While citation resampling breaks link dependencies, article resampling maintains such dependencies. We find that citation resampling underestimates the variance of link weights. Moreover, this underestimation explains most of the differences in the significance analysis of ranking and clustering. Therefore, when only link weights are available and article resampling is not an option, we suggest a simple parametric resampling scheme that generates link-weight variances close to the link-weight variances of article resampling. Nevertheless, when we highlight and summarize important structural changes in science, the more dependencies we can maintain in the resampling scheme, the earlier we can predict structural change.
Gene class, ontology, or pathway testing analysis has become increasingly popular in microarray data analysis. Such approaches allow the integration of gene annotation databases, such as Gene Ontology and KEGG Pathway, to formally test for subtle but coordinated changes at a system level. Higher power in gene class testing is gained by combining weak signals from a number of individual genes in each pathway. We propose an alternative approach for gene-class testing based on mixed models, a class of statistical models that:
provides the ability to model and borrow strength across genes that are both up and down in a pathway,
operates within a well-established statistical framework amenable to direct control of false positive or false discovery rates,
exhibits improved power over widely used methods under normal location-based alternative hypotheses, and
handles complex experimental designs for which permutation resampling is difficult.
We compare the properties of this mixed models approach with the nonparametric method GSEA and the parametric method PAGE using a simulation study, and illustrate its application with a diabetes data set and a dose-response data set.
In microarray data analysis, when statistical testing is applied to each gene individually, one is often left with too many significant genes that are difficult to interpret or too few genes after a multiple comparison adjustment. Gene-class, or pathway-level testing, integrates gene annotation data such as Gene Ontology and tests for coordinated changes at the system level. These approaches can both increase power for detecting differential expression and allow for better understanding of the underlying biological processes associated with variations in outcome. We propose an alternative pathway analysis method based on mixed models, and show this method provides useful inferences beyond those available in currently popular methods, with improved power and the ability to handle complex experimental designs.
Michiels et al. (Lancet 2005; 365: 488–92) employed a resampling strategy to show that the genes identified as predictors of prognosis from resamplings of a single gene expression dataset are highly variable. The genes most frequently identified in the separate resamplings were put forward as a 'gold standard'. On a higher level, breast cancer datasets collected by different institutions can be considered as resamplings from the underlying breast cancer population. The limited overlap between published prognostic signatures confirms the trend of signature instability identified by the resampling strategy. Six breast cancer datasets, totaling 947 samples, all measured on the Affymetrix platform, are currently available. This provides a unique opportunity to employ a substantial dataset to investigate the effects of pooling datasets on classifier accuracy, signature stability and enrichment of functional categories.
We show that the resampling strategy produces a suboptimal ranking of genes, which cannot be considered to be a 'gold standard'. When pooling breast cancer datasets, we observed a synergetic effect on the classification performance in 73% of the cases. We also observe a significant positive correlation between the number of datasets that are pooled, the validation performance, the number of genes selected, and the enrichment of specific functional categories. In addition, we have evaluated the support for five explanations that have been postulated for the limited overlap of signatures.
The limited overlap of current signature genes can be attributed to small sample size. Pooling datasets results in more accurate classification and a convergence of signature genes. We therefore advocate the analysis of new data within the context of a compendium, rather than analysis in isolation.
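A resampling strategy of the kind discussed above can be sketched as follows. This is an illustration under stated assumptions, not the procedure of Michiels et al.: genes are ranked by absolute Pearson correlation with a binary outcome, the top `top_k` genes are selected in each random training subset, and the function name `selection_frequency` and all parameter values are hypothetical.

```python
import random
import statistics
from collections import Counter


def abs_corr(x, y):
    """Absolute Pearson correlation; 0.0 when either input is constant."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def selection_frequency(expr, outcome, top_k, n_resamples=200,
                        train_frac=0.5, seed=0):
    """Fraction of random training subsets in which each gene is selected.

    expr is a list of genes, each a list of per-sample expression values.
    """
    rng = random.Random(seed)
    n = len(outcome)
    n_train = max(3, int(train_frac * n))
    counts = Counter()
    for _ in range(n_resamples):
        idx = rng.sample(range(n), n_train)
        y = [outcome[i] for i in idx]
        scores = [(abs_corr([gene[i] for i in idx], y), g)
                  for g, gene in enumerate(expr)]
        scores.sort(reverse=True)
        counts.update(g for _, g in scores[:top_k])
    return {g: counts[g] / n_resamples for g in range(len(expr))}


# Toy data (assumed): gene 0 carries a strong signal, genes 1-9 are noise.
rng = random.Random(2)
outcome = [0] * 15 + [1] * 15
expr = [[4 * y + rng.gauss(0, 1) for y in outcome]]
expr += [[rng.gauss(0, 1) for _ in outcome] for _ in range(9)]
freq = selection_frequency(expr, outcome, top_k=2)
```

In this toy example the informative gene is selected in nearly every resampling, while the second slot rotates among the noise genes. That rotation is a small-scale picture of the signature instability described above.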
Large-scale statistical analyses have become hallmarks of post-genomic era biological research due to advances in high-throughput assays and the integration of large biological databases. One accompanying issue is the simultaneous estimation of p-values for a large number of hypothesis tests. In many applications, a parametric assumption in the null distribution such as normality may be unreasonable, and resampling-based p-values are the preferred procedure for establishing statistical significance. Using resampling-based procedures for multiple testing is computationally intensive and typically requires large numbers of resamples.
We present a new approach to more efficiently assign resamples (such as bootstrap samples or permutations) within a nonparametric multiple testing framework. We formulated a Bayesian-inspired approach to this problem, and devised an algorithm that adapts the assignment of resamples iteratively with negligible space and running time overhead. In two experimental studies, a breast cancer microarray dataset and a genome wide association study dataset for Parkinson's disease, we demonstrated that our differential allocation procedure is substantially more accurate compared to the traditional uniform resample allocation.
Our experiments demonstrate that using a more sophisticated allocation strategy can improve our inference for hypothesis testing without a drastic increase in the amount of computation on randomized data. Moreover, we gain more improvement in efficiency when the number of tests is large. R code for our algorithm and the shortcut method are available at .
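To make the notion of non-uniform resample allocation concrete, the sketch below uses a simple sequential stopping rule in the style of Besag and Clifford: permutations for a given test stop as soon as a fixed number `h` of exceedances is observed. This is a swapped-in, well-known technique chosen for brevity, not the Bayesian-inspired algorithm of the abstract; the function names and default values are assumptions.

```python
import random
import statistics


def mean_diff(values, labels):
    """Absolute difference in group means for a two-class labelling."""
    a = [v for v, lab in zip(values, labels) if lab == 1]
    b = [v for v, lab in zip(values, labels) if lab == 0]
    return abs(statistics.fmean(a) - statistics.fmean(b))


def sequential_perm_pvalue(values, labels, max_perms=2000, h=20, seed=0):
    """Permutation p-value that stops early once h exceedances are seen.

    Returns (p_value, number_of_permutations_used).
    """
    rng = random.Random(seed)
    observed = mean_diff(values, labels)
    shuffled = list(labels)
    exceed = 0
    for m in range(1, max_perms + 1):
        rng.shuffle(shuffled)
        if mean_diff(values, shuffled) >= observed:
            exceed += 1
            if exceed == h:
                return h / m, m  # clearly non-significant: stop early
    return (exceed + 1) / (max_perms + 1), max_perms


labels = [0] * 10 + [1] * 10
# Assumed toy tests: one with no group difference, one with a large one.
null_values = list(range(1, 11)) + list(range(1, 11))
strong_values = [i / 10 for i in range(10)] + [10 + i / 10 for i in range(10)]

p_null, used_null = sequential_perm_pvalue(null_values, labels)
p_strong, used_strong = sequential_perm_pvalue(strong_values, labels)
```

The clearly null test consumes only a handful of permutations, while the strongly significant test uses the full budget, so computation concentrates on the tests whose p-values need the most resolution.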
Combinatorial gene perturbations provide rich information for a systematic exploration of genetic interactions. Despite successful applications to bacteria and yeast, the scalability of this approach remains a major challenge for higher organisms such as humans. Here, we report a novel experimental and computational framework to efficiently address this challenge by limiting the ‘search space’ for important genetic interactions. We propose to integrate rich phenotypes of multiple single gene perturbations to robustly predict functional modules, which can subsequently be subjected to further experimental investigations such as combinatorial gene silencing. We present posterior association networks (PANs) to predict functional interactions between genes estimated using a Bayesian mixture modelling approach. The major advantage of this approach over conventional hypothesis tests is that prior knowledge can be incorporated to enhance predictive power. We demonstrate in a simulation study and on biological data, that integrating complementary information greatly improves prediction accuracy. To search for significant modules, we perform hierarchical clustering with multiscale bootstrap resampling. We demonstrate the power of the proposed methodologies in applications to Ewing's sarcoma and human adult stem cells using publicly available and custom generated data, respectively. In the former application, we identify a gene module including many confirmed and highly promising therapeutic targets. Genes in the module are also significantly overrepresented in signalling pathways that are known to be critical for proliferation of Ewing's sarcoma cells. In the latter application, we predict a functional network of chromatin factors controlling epidermal stem cell fate. Further examinations using ChIP-seq, ChIP-qPCR and RT-qPCR reveal that the basis of their genetic interactions may arise from transcriptional cross regulation. 
A Bioconductor package implementing PAN is freely available online at http://bioconductor.org/packages/release/bioc/html/PANR.html.
Synthetic genetic interactions estimated from combinatorial gene perturbation screens provide systematic insights into synergistic interactions of genes in a biological process. However, this approach lacks scalability for large-scale genetic interaction profiling in metazoan organisms such as humans. We contribute to this field by proposing a more scalable and affordable approach, which takes advantage of multiple single gene perturbation data to predict coherent functional modules, followed by genetic interaction investigation using combinatorial perturbations. We developed a versatile computational framework (PAN) to robustly predict functional interactions and search for significant functional modules from rich phenotyping screens of single gene perturbations under different conditions or from multiple cell lines. PAN features a Bayesian mixture model to assess statistical significance of functional associations, the capability to incorporate prior knowledge, and a generalized approach to search for significant functional modules by multiscale bootstrap resampling. In applications to Ewing's sarcoma and human adult stem cells, we demonstrate the general applicability and prediction power of PAN to both public and custom generated screening data.
In sub-Saharan Africa, bovine tuberculosis (bTB) is a potential hazard for animal and human health. The goal of this study was to improve our understanding of bTB epidemiology in Burkina Faso and especially Mycobacterium bovis transmission within and between the bovine and human populations.
Twenty-six M. bovis strains were isolated from 101 cattle carcasses with suspected bTB lesions during routine meat inspections at the Bobo Dioulasso and Ouagadougou slaughterhouses. In addition, 7 M. bovis strains were isolated from 576 patients with pulmonary tuberculosis. Spoligotyping, RDAf1 deletion and MIRU-VNTR typing were used for strain genotyping. The isolation of M. bovis strains was confirmed by spoligotyping and 12 spoligotype signatures were detected. Together, the spoligotyping and MIRU-VNTR data allowed the 33 M. bovis isolates to be grouped into seven clusters, including isolates exclusively from cattle (5) or humans (1) or from both (1). Moreover, these data (genetic analyses and phenetic tree) showed that the M. bovis isolates belonged to the African 1 (Af1) clonal complex (81.8%) and the putative African 5 (Af5) clonal complex (18.2%), in agreement with the results of RDAf1 deletion typing.
This is the first detailed molecular characterization of M. bovis strains from humans and cattle in Burkina Faso. The distribution of the two Af1 and putative Af5 clonal complexes is comparable to what has been reported in neighbouring countries. Furthermore, the strain genetic profiles suggest that M. bovis circulates across the borders and that the Burkina Faso strains originate from different countries, but have a country-specific evolution. The genetic characterization suggests that, currently, M. bovis transmission occurs mainly between cattle, occasionally between cattle and humans and potentially between humans. This study emphasizes the bTB risk in cattle but also in humans and the difficulty of setting up proper disease control strategies in Burkina Faso.
Bovine tuberculosis is an infectious disease caused by Mycobacterium bovis in livestock and wild animals. Humans can acquire this germ by the aerogenous route when in close contact with infected animals, by consuming unpasteurized dairy products from infected animals, and also through the skin when handling infected carcasses. For the present study in Burkina Faso, M. bovis strains were collected from slaughtered animals during routine veterinary inspection at the slaughterhouses of Bobo Dioulasso and Ouagadougou and also from patients with suspected pulmonary tuberculosis. The isolates were genetically characterized using three techniques: spoligotyping, MIRU-VNTR and RDAf1 deletion analysis. Our results highlight two aspects of M. bovis epidemiology that are crucial for disease control: i) M. bovis circulates between Burkina Faso and its neighbouring countries and ii) M. bovis is transmitted mainly between cattle, but also between cattle and humans, and potentially between humans. This study stresses the need to develop an efficient strategy to control M. bovis transmission, but also the difficulty of implementing control measures because of the complex epidemiology of bovine tuberculosis in Burkina Faso.
The evaluation of statistical significance has become a critical process in identifying differentially expressed genes in microarray studies. Classical p-value adjustment methods for multiple comparisons such as family-wise error rate (FWER) have been found to be too conservative in analyzing large-screening microarray data, and the False Discovery Rate (FDR), the expected proportion of false positives among all positives, has been recently suggested as an alternative for controlling false positives. Several statistical approaches have been used to estimate and control FDR, but these may not provide reliable FDR estimation when applied to microarray data sets with a small number of replicates.
We propose a rank-invariant resampling (RIR) based approach to FDR evaluation. Our proposed method generates a biologically relevant null distribution, which maintains similar variability to observed microarray data. We compare the performance of our RIR-based FDR estimation with that of four other popular methods. Our approach outperforms the other methods both in simulated and real microarray data.
We found that the SAM's random shuffling and SPLOSH approaches were liberal and the other two theoretical methods were too conservative while our RIR approach provided more accurate FDR estimation than the other approaches.
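For reference, the FDR defined above is most commonly controlled with the Benjamini-Hochberg step-up procedure. The sketch below implements that standard procedure only; it is not the RIR method of this abstract, and the function name and example p-values are illustrative.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four genes.
pvals = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(pvals)
significant = [i for i, q in enumerate(adj) if q <= 0.05]
```

Rejecting every hypothesis whose adjusted value is at most 0.05 controls the expected proportion of false positives among the rejections at 5% under the procedure's assumptions (independent or positively dependent tests). With few replicates, the per-gene p-values feeding into it may themselves be unreliable, which is the concern raised above.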
In the analysis of networks we frequently require the statistical significance of some network statistic, such as measures of similarity for the properties of interacting nodes. The structure of the network may introduce dependencies among the nodes and it will in general be necessary to account for these dependencies in the statistical analysis. To this end we require some form of Null model of the network: generally rewired replicates of the network are generated which preserve only the degree (number of interactions) of each node. We show that this can fail to capture important features of network structure, and may result in unrealistic significance levels, when potentially confounding additional information is available.
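The degree-preserving rewiring described above is usually carried out by repeated double-edge swaps. The following is a minimal sketch for a simple undirected graph; the function names and the toy edge list are assumptions, and the code conditions on the degree sequence only (no annotation information).

```python
import random
from collections import Counter


def degrees(edges):
    """Degree of every node in an undirected edge list."""
    counts = Counter()
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    return counts


def rewire(edges, n_swaps, seed=0):
    """Degree-preserving randomization by double-edge swaps.

    Picks two edges (a, b) and (c, d) and replaces them with (a, d) and
    (c, b), rejecting swaps that would create self-loops or duplicates.
    """
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if rng.random() < 0.5:  # consider both orientations of the 2nd edge
            c, d = d, c
        if a == d or c == b:
            continue  # would create a self-loop
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges


# Toy network (assumed): a 10-node ring with two chords.
original = [(i, (i + 1) % 10) for i in range(10)] + [(0, 5), (2, 7)]
randomized = rewire(original, n_swaps=20)
```

Computing a network statistic on many such replicates gives its null distribution under the degree-only model. A model that also conditions on annotations would need to restrict which swaps are allowed.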
We present a new network resampling Null model which takes into account the degree sequence as well as available biological annotations. Using gene ontology information as an illustration we show how this information can be accounted for in the resampling approach, and the impact such information has on the assessment of statistical significance of correlations and motif-abundances in the Saccharomyces cerevisiae protein interaction network. An algorithm, GOcardShuffle, is introduced to allow for the efficient construction of an improved Null model for network data.
We use the protein interaction network of S. cerevisiae; correlations between the evolutionary rates and expression levels of interacting proteins and their statistical significance were assessed for Null models which condition on different aspects of the available data. The novel GOcardShuffle approach results in a Null model for annotated network data which appears to better describe the properties of real biological networks.
An improved approach for the statistical analysis of biological network data, which conditions on the available biological information, leads to qualitatively different results compared to approaches which ignore such annotations. In particular, we demonstrate that the effects of the biological organization of the network can be sufficient to explain the observed similarity of interacting proteins.
In prognostic studies, the lasso technique is attractive since it improves the quality of predictions by shrinking regression coefficients, compared to predictions based on a model fitted via unpenalized maximum likelihood. Since some coefficients are set to zero, parsimony is achieved as well. It is unclear whether the performance of a model fitted using the lasso still shows some optimism. Bootstrap methods have been advocated to quantify optimism and generalize model performance to new subjects. It is unclear how resampling should be performed in the presence of multiply imputed data.
The data were based on a cohort of Chronic Obstructive Pulmonary Disease patients. We constructed models to predict Chronic Respiratory Questionnaire dyspnea 6 months ahead. Optimism of the lasso model was investigated by comparing 4 approaches of handling multiply imputed data in the bootstrap procedure, using the study data and simulated data sets. In the first 3 approaches, data sets that had been completed via multiple imputation (MI) were resampled, while the fourth approach resampled the incomplete data set and then performed MI.
The discriminative model performance of the lasso was optimistic. There was suboptimal calibration due to over-shrinkage. The estimate of optimism was sensitive to the choice of handling imputed data in the bootstrap resampling procedure. Resampling the completed data sets underestimates optimism, especially if, within a bootstrap step, selected individuals differ over the imputed data sets. Incorporating the MI procedure in the validation yields estimates of optimism that are closer to the true value, albeit slightly too large.
Performance of prognostic models constructed using the lasso technique can be optimistic as well. Results of the internal validation are sensitive to how bootstrap resampling is performed.
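The bootstrap estimate of optimism referred to above can be sketched in a few lines. To stay self-contained, the sketch makes several simplifying assumptions that differ from the study: an ordinary least-squares line with one predictor replaces the lasso, performance is measured by mean squared error rather than discrimination or calibration, and the data are complete, so no multiple imputation step is involved. All names and the simulated data are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return my, 0.0
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def mse(model, xs, ys):
    """Mean squared prediction error of a fitted line on (xs, ys)."""
    b0, b1 = model
    return statistics.fmean([(y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys)])


def optimism_corrected_mse(xs, ys, n_boot=200, seed=0):
    """Return (apparent MSE, estimated optimism, optimism-corrected MSE)."""
    rng = random.Random(seed)
    n = len(xs)
    apparent = mse(fit_line(xs, ys), xs, ys)
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        model = fit_line(bx, by)
        # Error on the original data minus error on the bootstrap sample.
        gaps.append(mse(model, xs, ys) - mse(model, bx, by))
    optimism = statistics.fmean(gaps)
    return apparent, optimism, apparent + optimism


# Simulated complete data (assumed): y = 2x + noise, 20 subjects.
rng = random.Random(3)
xs = [rng.gauss(0, 1) for _ in range(20)]
ys = [2 * x + rng.gauss(0, 1) for x in xs]
apparent, optimism, corrected = optimism_corrected_mse(xs, ys)
```

With multiply imputed data, the open question raised above is where the resampling step sits relative to the imputation step. In this sketch that would correspond to either resampling already completed data sets or re-running imputation inside the bootstrap loop.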
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2288-14-116) contains supplementary material, which is available to authorized users.
Clinical prediction models; Model validation; Multiple imputation; Quality of life; Shrinkage
In eukaryotes, most DNA-binding proteins exert their action as members of large effector complexes. The presence of these complexes is revealed in high-throughput genome-wide assays by the co-occurrence of the binding sites of different complex components. Resampling tests are one route by which the statistical significance of apparent co-occurrence can be assessed.
We have investigated two resampling approaches for evaluating the statistical significance of binding-site co-occurrence. The permutation test approach was found to yield overly favourable p-values while the independent resampling approach had the opposite effect and is of little use in practical terms. We have developed a new, pragmatically-devised hybrid approach that, when applied to the experimental results of a Polycomb/Trithorax study, yielded p-values consistent with the findings of that study. We extended our investigations to the FL method developed by Haiminen et al., which derives its null distribution from all binding sites within a dataset, and show that the p-value computed for a pair of factors by this method can depend on which other factors are included in that dataset. Both our hybrid method and the FL method appeared to yield plausible estimates of the statistical significance of co-occurrences, although our hybrid method was more conservative when applied to the Polycomb/Trithorax dataset.
A high-performance parallelized implementation of the hybrid method is available.
We propose a new resampling-based co-occurrence significance test and demonstrate that it performs as well as or better than existing methods on a large experimentally-derived dataset. We believe it can be usefully applied to data from high-throughput genome-wide techniques such as ChIP-chip or DamID. The Cooccur package, which implements our approach, accompanies this paper.
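The general shape of a resampling test for co-occurrence can be shown with a deliberately naive sketch. It is not the hybrid method of the Cooccur package, nor either of the two approaches evaluated above: binding sites are represented as sets of genomic bin indices, and the null distribution simply places one factor's sites uniformly at random, an assumption that ignores genomic structure. The function name and toy data are hypothetical.

```python
import random


def cooccurrence_pvalue(sites_a, sites_b, n_bins, n_resamples=999, seed=0):
    """Monte Carlo p-value for the overlap between two sets of bound bins.

    Null model (an assumption): factor B's sites fall uniformly at random
    among n_bins bins, independently of factor A.
    """
    rng = random.Random(seed)
    observed = len(sites_a & sites_b)
    exceed = 0
    for _ in range(n_resamples):
        random_b = set(rng.sample(range(n_bins), len(sites_b)))
        if len(sites_a & random_b) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (exceed + 1) / (n_resamples + 1)


# Toy data (assumed): 1000 bins, two factors sharing 40 of their 50 sites.
sites_a = set(range(0, 50))
sites_b = set(range(10, 60))
observed, p_value = cooccurrence_pvalue(sites_a, sites_b, n_bins=1000)
```

The choice of null model drives the result: the more structure the resampling scheme preserves, the less extreme the observed overlap tends to look.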
Although they have become a widely used experimental technique for identifying differentially expressed (DE) genes, DNA microarrays are notorious for generating noisy data. A common strategy for mitigating the effects of noise is to perform many experimental replicates. This approach is often costly and sometimes impossible given limited resources; thus, analytical methods are needed which increase accuracy at no additional cost. One inexpensive source of microarray replicates comes from prior work: to date, data from hundreds of thousands of microarray experiments are in the public domain. Although these data assay a wide range of conditions, they cannot be used directly to inform any particular experiment and are thus ignored by most DE gene methods. We present the SVD Augmented Gene expression Analysis Tool (SAGAT), a mathematically principled, data-driven approach for identifying DE genes. SAGAT increases the power of a microarray experiment by using observed coexpression relationships from publicly available microarray datasets to reduce uncertainty in individual genes' expression measurements. We tested the method on three well-replicated human microarray datasets and demonstrate that use of SAGAT increased effective sample sizes by as many as 2.72 arrays. We applied SAGAT to unpublished data from a microarray study investigating transcriptional responses to insulin resistance, resulting in a 50% increase in the number of significant genes detected. We evaluated 11 (58%) of these genes experimentally using qPCR, confirming the directions of expression change for all 11 and statistical significance for three. Use of SAGAT revealed coherent biological changes in three pathways: inflammation, differentiation, and fatty acid synthesis, furthering our molecular understanding of a type 2 diabetes risk factor. 
We envision SAGAT as a means to maximize the potential for biological discovery from subtle transcriptional responses, and we provide it as a freely available software package that is immediately applicable to any human microarray study.
Though the use of microarrays to identify differentially expressed (DE) genes has become commonplace, it is still not a trivial task. Microarray data are notorious for being noisy, and current DE gene methods do not fully utilize pre-existing biological knowledge to help control this noise. One such source of knowledge is the vast number of publicly available microarray datasets. To leverage this information, we have developed the SVD Augmented Gene expression Analysis Tool (SAGAT) for identifying DE genes. SAGAT extracts transcriptional modules from publicly available microarray data and integrates this information with a dataset of interest. We explore SAGAT's ability to improve DE gene identification on simulated data, and we validate the method on three highly replicated biological datasets. Finally, we demonstrate SAGAT's effectiveness on a novel human dataset investigating the transcriptional response to insulin resistance. Use of SAGAT leads to an increased number of insulin resistant candidate genes, and we validate a subset of these with qPCR. We provide SAGAT as an open source R package that is applicable to any human microarray study.
Bovine tuberculosis (bTB) is a disease with major implications for animal welfare and productivity, as well as having the potential for zoonotic transmission. In Great Britain (GB) alone, controlling bTB costs in the region of £100 million annually, with the current control scheme seemingly unable to stop the inexorable spread of infection. One aspect that may be driving the epidemic is evolution of the causative pathogen, Mycobacterium bovis. To understand the underlying genetic changes that may be responsible for this evolution, we performed comprehensive genome-level analyses of 4 M. bovis strains that encompass the main molecular types of the pathogen circulating in GB.
We have used a combination of genome sequencing, transcriptome analyses, and recombinant DNA technology to define genetic differences across the major M. bovis lineages circulating in GB that may give rise to phenotypic differences of practical importance. The genomes of three M. bovis field isolates were sequenced using Illumina sequencing technology and strain specific differences in gene expression were measured during in vitro growth and in ex vivo bovine alveolar macrophages using a whole genome amplicon microarray and a whole genome tiled oligonucleotide microarray. SNP/small base pair insertion and deletions and gene expression data were overlaid onto the genomic sequence of the fully sequenced strain of M. bovis 2122/97 to link observed strain specific genomic differences with differences in RNA expression.
We show that while these strains show extensive similarities in their genetic make-up and gene expression profiles, they exhibit distinct expression of a subset of genes. We provide genomic, transcriptomic and functional data to show that synonymous point mutations (sSNPs) on the coding strand can lead to the expression of antisense transcripts on the opposing strand, a finding with implications for how we define a 'silent' nucleotide change. Furthermore, we show that transcriptomic data based solely on amplicon arrays can generate spurious results in terms of gene expression profiles due to hybridisation of antisense transcripts. Overall our data suggest that subtle genetic differences, such as sSNPs, may have important consequences for gene expression and subsequent phenotype.
Bovine tuberculosis; Mycobacterium bovis; Microarray; Transcript; SNP; Antisense; Macrophage
The most popular methods for significance analysis on microarray data are well suited to find genes differentially expressed across predefined categories. However, identification of features that correlate with continuous dependent variables is more difficult using these methods, and long lists of significant genes returned are not easily probed for co-regulations and dependencies. Dimension reduction methods are much used in the microarray literature for classification or for obtaining low-dimensional representations of data sets. These methods have an additional interpretation strength that is often not fully exploited when expression data are analysed. In addition, significance analysis may be performed directly on the model parameters to find genes that are important for any number of categorical or continuous responses. We introduce a general scheme for analysis of expression data that combines significance testing with the interpretative advantages of the dimension reduction methods. This approach is applicable both for explorative analysis and for classification and regression problems.
Three public data sets are analysed. One is used for classification, one contains spiked-in transcripts of known concentrations, and one represents a regression problem with several measured responses. Model-based significance analysis is performed using a modified version of Hotelling's T²-test, and a false discovery rate significance level is estimated by resampling. Our results show that underlying biological phenomena and unknown relationships in the data can be detected by a simple visual interpretation of the model parameters. It is also found that measured phenotypic responses may model the expression data more accurately than if the design-parameters are used as input. For the classification data, our method finds much the same genes as the standard methods, in addition to some extra which are shown to be biologically relevant. The list of spiked-in genes is also reproduced with high accuracy.
The dimension reduction methods are versatile tools that may also be used for significance testing. Visual inspection of model components is useful for interpretation, and the methodology is the same whether the goal is classification, prediction of responses, feature selection or exploration of a data set. The presented framework is conceptually and algorithmically simple, and a Matlab toolbox (Mathworks Inc, USA) is supplied.
The Significance Analysis of Microarrays (SAM) is a popular method for detecting significantly expressed genes and controlling the false discovery rate (FDR). Recently, it has been reported in the literature that the FDR is not well controlled by SAM. Due to the vast application of SAM in microarray data analysis, it is of great importance to have an extensive evaluation of SAM and its associated R-package (sam2.20).
Our study has identified several discrepancies between SAM and sam2.20. One major difference is that SAM and sam2.20 use different methods for estimating FDR. Such discrepancies may cause confusion among the researchers who are using SAM or are developing the SAM-like methods. We have also shown that SAM provides no meaningful estimates of FDR and this problem has been corrected in sam2.20 by using a different formula for estimating FDR. However, we have found that, even with the improvement sam2.20 has made over SAM, sam2.20 may still produce erroneous and even conflicting results under certain situations. Using an example, we show that the problem of sam2.20 is caused by its use of asymmetric cutoffs which are due to the large variability of null scores at both ends of the order statistics. An obvious approach without the complication of the order statistics is the conventional symmetric cutoff method. For this reason, we have carried out extensive simulations to compare the performance of sam2.20 and the symmetric cutoff method. Finally, a simple modification is proposed to improve the FDR estimation of sam2.20 and the symmetric cutoff method.
Our study shows that the most serious drawback of SAM is its poor estimation of FDR. Although this drawback has been corrected in sam2.20, the control of FDR by sam2.20 is still not satisfactory. The comparison between sam2.20 and the symmetric cutoff method reveals that the relative performance of sam2.20 to the symmetric cutoff method depends on the ratio of induced to repressed genes in a microarray dataset, and is also affected by the ratio of DE to EE genes and the distributions of induced and repressed genes. Numerical simulations show that the symmetric cutoff method has the biggest advantage over sam2.20 when there is an equal number of induced and repressed genes (i.e., the ratio of induced to repressed genes is 1). As the ratio of induced to repressed genes moves away from 1, the advantage of the symmetric cutoff method over sam2.20 gradually diminishes until eventually sam2.20 becomes significantly better than the symmetric cutoff method when the differentially expressed (DE) genes are either all induced or all repressed. Simulation results also show that our proposed simple modification provides improved control of FDR for both sam2.20 and the symmetric cutoff method.
Motivation: Resampling methods, such as permutation and bootstrap, have been widely used to generate an empirical distribution for assessing the statistical significance of a measurement. However, obtaining a very low P-value requires a very large number of resamples, at which point computing speed, memory and storage consumption become bottlenecks, and the computation sometimes becomes infeasible, even on a computer cluster.
Results: We have developed a multiple-stage P-value calculating program called FastPval that can efficiently calculate very low (as low as 10^-9) P-values from a large number of resampled measurements. With only two input files and a few parameter settings from the users, the program can compute P-values from an empirical distribution very efficiently, even on a personal computer. When tested on the order of 10^9 resampled data, our method uses only 52.94% of the time used by the conventional method, implemented by standard quicksort and binary search algorithms, and consumes only 0.11% of the memory and storage. Furthermore, our method can be applied to extra-large datasets that the conventional method fails to calculate. The accuracy of the method was tested on data generated from Normal, Poisson and Gumbel distributions and was found to be no different from the exact ranking approach.
Availability: The FastPval executable file, the java GUI and source code, and the java web start server with example data and introduction, are available at http://wanglab.hku.hk/pvalue
Supplementary information: Supplementary data are available at Bioinformatics online and http://wanglab.hku.hk/pvalue/.
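For orientation, the conventional baseline mentioned above (sort the resampled scores once, then locate each observed score by binary search) can be sketched as follows. This is only an illustrative sketch of that baseline, not the FastPval multi-stage algorithm; the function name `empirical_p_values`, the one-sided "greater or equal" convention, and the add-one correction are assumptions made for the example.

```python
from bisect import bisect_left


def empirical_p_values(null_scores, observed):
    """One-sided empirical P-values from resampled (null) scores.

    Assumption: larger scores are more extreme, and an add-one
    correction is used so that no P-value is exactly zero.
    """
    sorted_null = sorted(null_scores)  # sort once, O(n log n)
    n = len(sorted_null)
    p_values = []
    for x in observed:
        # bisect_left returns the number of null scores strictly below x,
        # so n minus that count is the number of null scores >= x.
        n_greater_equal = n - bisect_left(sorted_null, x)
        p_values.append((n_greater_equal + 1) / (n + 1))
    return p_values
```

With n resampled scores the smallest attainable value is 1/(n + 1), which is why P-values near 10^-9 need on the order of 10^9 resamples and why holding and sorting them all becomes the bottleneck.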
Various statistical and machine learning methods have been successfully applied to the classification of DNA microarray data. Simple instance-based classifiers such as nearest neighbor (NN) approaches perform remarkably well in comparison to more complex models, and are currently experiencing a renaissance in the analysis of data sets from biology and biotechnology. While binary classification of microarray data has been extensively investigated, studies involving multiclass data are rare. The question remains open whether there exists a significant difference in performance between NN approaches and more complex multiclass methods. Comparative studies in this field commonly assess different models based on their classification accuracy only; however, this approach lacks the rigor needed to draw reliable conclusions and is inadequate for testing the null hypothesis of equal performance. Comparing novel classification models to existing approaches requires focusing on the significance of differences in performance.
We investigated the performance of instance-based classifiers, including a NN classifier able to assign a degree of class membership to each sample. This model alleviates a major problem of conventional instance-based learners, namely the lack of confidence values for predictions. The model translates the distances to the nearest neighbors into 'confidence scores'; the higher the confidence score, the closer is the considered instance to a pre-defined class. We applied the models to three real gene expression data sets and compared them with state-of-the-art methods for classifying microarray data of multiple classes, assessing performance using a statistical significance test that took into account the data resampling strategy. Simple NN classifiers performed as well as, or significantly better than, their more intricate competitors.
Given its highly intuitive underlying principles – simplicity, ease-of-use, and robustness – the k-NN classifier complemented by a suitable distance-weighting regime constitutes an excellent alternative to more complex models for multiclass microarray data sets. Instance-based classifiers using weighted distances are not limited to microarray data sets, but are likely to perform competitively in classifications of high-dimensional biological data sets such as those generated by high-throughput mass spectrometry.
Microarray gene expression signatures hold great promise to improve diagnosis and prognosis of disease. However, current documentation standards of such signatures do not allow for an unambiguous application to study-external patients. This hinders independent evaluation, effectively delaying the use of signatures in clinical practice. Data from eight publicly available clinical microarray studies were analyzed and the consistency of study-internal with study-external diagnoses was evaluated. Study-external classifications were based on documented information only. Documenting a signature is conceptually different from reporting a list of genes. We show that even the exact quantitative specification of a classification rule alone does not define a signature unambiguously. We found that discrepancy between study-internal and study-external diagnoses can be as frequent as 30% (worst case) and 18% (median). By using the proposed documentation by value strategy, which documents quantitative preprocessing information, the median discrepancy was reduced to 1%. The process of evaluating microarray gene expression diagnostic signatures and bringing them to clinical practice can be substantially improved and made more reliable by better documentation of the signatures.
It has been shown that microarray based gene expression signatures have the potential to be powerful tools for patient stratification, diagnosis of disease, prognosis of survival, assessment of risk group, and selection of treatment. However, documentation standards in current publications do not allow for a signature's unambiguous application to study-external patients. This hinders independent evaluation, effectively delaying the use of signatures in clinical practice. Based on eight clinical microarray studies, we show that common documentation standards have the following shortcoming: when using the documented information only, the same patient might receive a diagnosis different from the one he would have received in the original study. To address the problem, we derive a documentation protocol that reduces the ambiguity of diagnoses to a minimum. The resulting gain in consistency of study-internal versus study-external diagnosis is validated by statistical resampling analysis: using the proposed documentation by value strategy, the median inconsistency dropped from 18% to 1%. Software implementing the proposed method, as well as practical guidelines for using it, are provided. We conclude that the process of evaluating microarray gene expression diagnostic signatures and bringing them to clinical practice can be substantially improved and made more reliable by better documentation.
High-density oligonucleotide microarray technology enables the discovery of genes that are transcriptionally modulated in different biological samples due to physiology, disease or intervention. Methods for the identification of these so-called "differentially expressed genes" (DEG) would largely benefit from a deeper knowledge of the intrinsic measurement variability. Though it is clear that variance of repeated measures is highly dependent on the average expression level of a given gene, there is still a lack of consensus on how signal reproducibility is linked to signal intensity. The aim of this study was to empirically model the variance versus mean dependence in microarray data to improve the performance of existing methods for identifying DEG.
In the present work we used data generated by our lab as well as publicly available data sets to show that dispersion of repeated measures depends on location of the measures themselves following a power law. This enables us to construct a power law global error model (PLGEM) that is applicable to various Affymetrix GeneChip data sets. A new DEG identification method is therefore proposed, consisting of a statistic designed to make explicit use of model-derived measurement spread estimates and a resampling-based hypothesis testing algorithm.
The new method provides a control of the false positive rate, a good sensitivity vs. specificity trade-off and consistent results with varying number of replicates and even using single samples.
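The power-law dependence of spread on location described above can be illustrated with a simple log-log fit. The sketch below is a simplification under stated assumptions, not the published PLGEM fitting procedure: it assumes the relation sd = c * mean^k, estimates k and ln(c) by ordinary least squares on per-gene log(mean) and log(sd), and uses hypothetical function names.

```python
import math
from statistics import mean, stdev


def fit_power_law(replicates):
    """Fit ln(sd) = slope * ln(mean) + intercept by least squares.

    replicates: list of per-gene lists of repeated positive measurements
    (at least two values per gene). Genes with zero spread are skipped.
    """
    xs, ys = [], []
    for values in replicates:
        m, s = mean(values), stdev(values)
        if m > 0 and s > 0:
            xs.append(math.log(m))
            ys.append(math.log(s))
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def modelled_sd(mean_expr, slope, intercept):
    """Spread predicted by the fitted power law at a given mean level."""
    return math.exp(intercept) * mean_expr ** slope
```

A model-derived spread of this kind can then stand in for the per-gene sample standard deviation in a test statistic, which is what makes comparisons with very few replicates feasible.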
Background. In the last decade, population pharmacokinetic (PopPK) modeling has spread its influence in the whole process of drug research and development. While targeting the construction of the dose-concentration of a drug based on a population of patients, it shows great flexibility in dealing with sparse samplings and unbalanced designs. The resampling approach has been considered an important statistical tool to assist in PopPK model validation by measuring the uncertainty of parameter estimates and evaluating the influence of individuals. Methods. The current work describes a graphical diagnostic approach for PopPK models by visualizing resampling statistics, such as case deletion and bootstrap. To examine resampling statistics, we adapted visual methods from multivariate analysis, parallel coordinate plots, and multidimensional scaling. Results. Multiple models were fitted, the information of parameter estimates and diagnostics were extracted, and the results were visualized. With careful scaling, the dependencies between different statistics are revealed. Using typical examples, the approach proved to have great capacity to identify influential outliers from the statistical perspective, which deserves special attention in a dosing regimen. Discussion. By combining static graphics with interactive graphics, we are able to explore the multidimensional data from an integrated and systematic perspective. Complementary to current approaches, our proposed method provides a new way for PopPK modeling analysis.
A new procedure is proposed to balance type I and II errors in significance testing for differential expression of individual genes. Suppose that a collection, ℱk, of k lists of selected genes is available, each of them approximating by their content the true set of differentially expressed genes. For example, such sets can be generated by a subsampling counterpart of the delete-d-jackknife method controlling the per-comparison error rate for each subsample. A final list of candidate genes, denoted by S*, is composed in such a way that its contents be closest in some sense to all the sets thus generated. To measure “closeness” of gene lists, we introduce an asymmetric distance between sets with its asymmetry arising from a generally unequal assignment of the relative costs of type I and type II errors committed in the course of gene selection. The optimal set S* is defined as a minimizer of the average asymmetric distance from an arbitrary set S to all sets in the collection ℱk. The minimization problem can be solved explicitly, leading to a frequency criterion for the inclusion of each gene in the final set. The proposed method is tested by resampling from real microarray gene expression data with artificially introduced shifts in expression levels of pre-defined genes, thereby mimicking their differential expression.
Multiple testing; error rates; resampling; microarray data
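The frequency criterion mentioned above can be made concrete under one assumed form of the asymmetric distance. Suppose, purely for illustration, that d(S, T) = a|S \ T| + b|T \ S|, where a is the cost of including a gene that list T omits (type I) and b is the cost of omitting a gene that T contains (type II). Averaged over the k lists, including a gene with selection frequency f costs a(1 - f) and excluding it costs bf, so the gene belongs in S* exactly when f > a / (a + b). The function name and cost parameters below are hypothetical.

```python
from collections import Counter


def consensus_gene_list(gene_lists, cost_false_positive=1.0, cost_false_negative=1.0):
    """Final gene set minimising the average assumed asymmetric distance.

    gene_lists: k collections of selected gene identifiers, one per subsample.
    A gene is kept when its selection frequency exceeds a / (a + b).
    """
    k = len(gene_lists)
    # Count each gene at most once per list.
    counts = Counter(g for genes in gene_lists for g in set(genes))
    threshold = cost_false_positive / (cost_false_positive + cost_false_negative)
    return {g for g, c in counts.items() if c / k > threshold}
```

With equal costs this reduces to a majority vote across the subsample lists; raising the cost of a missed gene lowers the threshold and lengthens the final list.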
Non-parametric bootstrapping is a widely-used statistical procedure for assessing confidence of model parameters based on the empirical distribution of the observed data and, as such, it has become a common method for assessing tree confidence in phylogenetics. Traditional non-parametric bootstrapping does not weigh each tree inferred from resampled (i.e., pseudo-replicated) sequences. Hence, the quality of these trees is not taken into account when computing bootstrap scores associated with the clades of the original phylogeny. As a consequence, traditionally, the trees with different bootstrap support or those providing a different fit to the corresponding pseudo-replicated sequences (the fit quality can be expressed through the LS, ML or parsimony score) contribute in the same way to the computation of the bootstrap support of the original phylogeny.
In this article, we discuss the idea of applying weighted bootstrapping to phylogenetic reconstruction by weighting each phylogeny inferred from resampled sequences. Tree weights can be based either on the least-squares (LS) tree estimate or on the average secondary bootstrap score (SBS) associated with each resampled tree. Secondary bootstrapping consists of the estimation of bootstrap scores of the trees inferred from resampled data. The LS and SBS-based bootstrapping procedures were designed to take into account the quality of each "pseudo-replicated" phylogeny in the final tree estimation. A simulation study was carried out to evaluate the performances of the five weighting strategies which are as follows: LS and SBS-based bootstrapping, LS and SBS-based bootstrapping with data normalization and the traditional unweighted bootstrapping.
The simulations conducted with two real data sets and the five weighting strategies suggest that the SBS-based bootstrapping with the data normalization usually exhibits larger bootstrap scores and a higher robustness compared to the four other competing strategies, including the traditional bootstrapping. The high robustness of the normalized SBS could be particularly useful in situations where observed sequences have been affected by noise or have undergone massive insertion or deletion events. The results provided by the four other strategies were very similar regardless of the noise level, thus also demonstrating the stability of the traditional bootstrapping method.
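The difference between traditional and weighted bootstrap support can be sketched in a few lines. This is an illustrative sketch only: trees are represented as sets of clades (each clade a frozenset of taxon names), the weights are supplied by the caller (for example derived from an LS or SBS score), and normalising by the sum of weights is an assumption of this example, not necessarily the normalization used in the study.

```python
def weighted_bootstrap_support(original_clades, replicate_trees, weights=None):
    """Support of each clade of the original tree across replicate trees.

    original_clades: iterable of frozensets of taxon names.
    replicate_trees: list of sets of clades, one set per resampled tree.
    weights: one non-negative weight per replicate tree; None reproduces
             the traditional unweighted bootstrap (all weights equal).
    """
    if weights is None:
        weights = [1.0] * len(replicate_trees)
    total = sum(weights)
    return {
        clade: sum(w for tree, w in zip(replicate_trees, weights) if clade in tree) / total
        for clade in original_clades
    }
```

With equal weights every replicate tree contributes identically; with unequal weights a clade recovered mainly by well-fitting replicate trees receives higher support than one recovered mainly by poorly fitting ones.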
Identification of differentially expressed genes (DEGs) under different experimental conditions is an important task in many microarray studies. However, choosing which method to use for a particular application is problematic because its performance depends on the evaluation metric, the dataset, and so on. In addition, when using the Affymetrix GeneChip® system, researchers must select a preprocessing algorithm from a number of competing algorithms such as MAS, RMA, and DFW, for obtaining expression-level measurements. To achieve optimal performance for detecting DEGs, a suitable combination of gene selection method and preprocessing algorithm needs to be selected for a given probe-level dataset.
We introduce a new fold-change (FC)-based method, the weighted average difference method (WAD), for ranking DEGs. It uses the average difference and relative average signal intensity so that highly expressed genes are highly ranked on the average for the different conditions. The idea is based on our observation that known or potential marker genes (or proteins) tend to have high expression levels. We compared WAD with seven other methods: average difference (AD), FC, rank products (RP), moderated t statistic (modT), significance analysis of microarrays (samT), shrinkage t statistic (shrinkT), and intensity-based moderated t statistic (ibmT). The evaluation was performed using a total of 38 different binary (two-class) probe-level datasets: two artificial "spike-in" datasets and 36 real experimental datasets. The results indicate that WAD outperforms the other methods when sensitivity and specificity are considered simultaneously: the area under the receiver operating characteristic curve for WAD was the highest on average for the 38 datasets. The gene ranking for WAD was also the most consistent when subsets of top-ranked genes produced from three different preprocessed data (MAS, RMA, and DFW) were compared. Overall, WAD performed the best for MAS-preprocessed data and the FC-based methods (AD, WAD, FC, or RP) performed well for RMA and DFW-preprocessed data.
WAD is a promising alternative to existing methods for ranking DEGs with two classes. Its high performance should increase researchers' confidence in microarray analyses.
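The idea of weighting the average difference by relative signal intensity can be sketched as follows. The exact weighting is an assumption of this example: each gene's average difference on the log scale is multiplied by its average signal rescaled to the range [0, 1] across all genes, and genes are ranked by the absolute value of the result. The function name and input layout are hypothetical.

```python
from statistics import mean


def wad_scores(group_a, group_b):
    """Weighted average difference per gene for two conditions.

    group_a, group_b: dicts mapping gene -> list of log-scale signals.
    Assumed weight: (average signal - min) / (max - min) over all genes.
    """
    avg_diff, avg_signal = {}, {}
    for gene in group_a:
        mean_a, mean_b = mean(group_a[gene]), mean(group_b[gene])
        avg_diff[gene] = mean_b - mean_a           # average difference (AD)
        avg_signal[gene] = (mean_a + mean_b) / 2   # average signal intensity
    lowest, highest = min(avg_signal.values()), max(avg_signal.values())
    span = highest - lowest
    return {
        gene: avg_diff[gene] * ((avg_signal[gene] - lowest) / span if span > 0 else 1.0)
        for gene in avg_diff
    }
```

Two genes with the same average difference are thus separated by expression level: the more highly expressed gene receives the larger absolute score and ranks higher, matching the stated preference for highly expressed genes.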