High-throughput profiling has generated massive amounts of data across basic, clinical and translational research fields. However, open source comprehensive web tools for analysing data obtained from different platforms and technologies are still lacking. To fill this gap and the unmet computational needs of ongoing research projects, we developed O-miner, a rapid, comprehensive, efficient web tool that covers all the steps required for the analysis of both transcriptomic and genomic data starting from raw image files through in-depth bioinformatics analysis and annotation to biological knowledge extraction. O-miner was developed from a biologist end-user perspective. Hence, it is as simple to use as possible within the confines of the complexity of the data being analysed. It provides a strong analytical suite able to overlay and harness large, complicated, raw and heterogeneous sets of profiles with biological/clinical data. Biologists can use O-miner to analyse and integrate different types of data and annotations to build knowledge of relevant altered mechanisms and pathways in order to identify and prioritize novel targets for further biological validation. Here we describe the analytical workflows currently available using O-miner and present examples of use. O-miner is freely available at www.o-miner.org.
High-throughput profiling platforms have produced a large amount of data, with public repositories such as the Gene Expression Omnibus (1) and ArrayExpress (2,3) already storing tens of thousands of profiles across different experimental conditions. The steady growth in the amount and diversity of profiling results creates challenges in data analysis and integration, and a strong need for novel, comprehensive online bioinformatics tools that are easy for biologists to use and able to process raw profiles in a single- or global-analysis manner.
Although many methods are now available for low- and high-level analysis of genomic and transcriptomic experiments (4–7), most require programming knowledge as well as bioinformatics expertise, and results can vary substantially between methods. Analysis of large data sets may require powerful computational resources as well as time and effort to set up the necessary infrastructure. For example, the use of aroma.affymetrix (4,8) for analysis of copy number data involves creating annotation files and organizing raw and processed data within a strict directory structure. Additionally, there is no analytical tool that can handle raw and/or partially processed genomics data and annotate/display results online in a user-friendly manner that would alleviate the need for bioinformatics expertise and allow researchers to process their in-house data in isolation or alongside the accumulated publicly available data in their area of research.
To overcome these problems, we have developed O-miner (http://www.o-miner.org), which can analyse the most popular and widely used Affymetrix genomics and transcriptomics array types on the fly, starting from the raw standard Affymetrix file format (CEL image files obtained from the scanner) or a partially processed format (normalized, segmented and/or binary), with minimal set-up effort. The analysis is performed on a dedicated server, removing memory or disk space requirements on end-user machines. All analytical pipelines are transparent, robust, well documented and based on well-established and recently developed statistical methods. Results can be viewed online as dynamic HTML reports for easy navigation through an interactive, friendly interface or downloaded as text, Excel or graphics files.
O-miner is comprehensive, robust, memory-efficient and can easily be extended with new methods and algorithms to cover additional chip types and platforms. In this article, we provide an overview of O-miner and discuss both transcriptomics and genomics workflows. We outline some examples of use to show how to perform low single-level as well as high global-level analysis and to illustrate how to navigate through obtained results. Finally, we discuss future updates of the software to accommodate and link additional data types.
O-miner provides a framework for automated analysis of different types of -omics data and currently covers the analysis and annotation of both genomic and transcriptomic data. The user must first upload the data files to be analysed as a zip archive to the O-miner server using the graphical interface or enter a valid GEO series number (GSE format). This alleviates the time-consuming and repetitive task of uploading one data file at a time. Once data transfer is completed, the ‘File Organizer’ window displays the individual files and can be used for the assignment of sample names and biological groups before specifying the analysis options. A unique project is created for each submitted analysis.
A general copy number analysis pipeline starts from probe level raw intensity .CEL data files obtained immediately after scanning, through background adjustment, normalization and summarization to derive raw copy number data (normalized log2 ratio sample/reference format) followed by segmentation and smoothing (segmented format) before thresholding and calling regions of copy number gain and loss (binary format). To the best of our knowledge, O-miner is the only freely available web tool that can accept data submission at any stage of this pipeline either as .CEL files or partially processed (normalized, segmented or binary) data files.
O-miner enables the two common scenarios of copy number analysis. The first is a paired analysis option where each sample is coupled with a specific unique reference (e.g. a cancer sample with its corresponding matched normal sample). The second is an unpaired analysis where each sample uses the same common reference, which is often the average of a pool of samples. Both options are possible on a wide variety of Affymetrix platforms including the widely used GeneChip® Human Mapping Arrays 10K, 100K Set (50K_Hind240 and 50K_Xba240) and 500K Set (250K_Nsp and 250K_Sty) as well as Genome-Wide Human SNP Arrays 5.0 and 6.0. We have processed and made available precompiled raw HapMap data (CEL files) from four human populations: African YRI (from Yoruba in Ibadan, Nigeria), Japanese JPT (from Tokyo, Japan), Han Chinese CHB (from Beijing, China) and European CEU (from Utah, USA with ancestry from northern and western Europe) to use as a baseline in an unpaired analysis scenario. After extracting the zip archive, O-miner displays the available .CEL files list in the ‘File Organizer’ for the user to create Sample/Reference attributes and define Samples and References lists. The first file in the Sample Files list is compared to the first file in the Reference Files list, the second files in both lists are compared to each other and so on. In the same manner, files from different enzyme sets for Human Mapping 100K/500K array sets can be paired to match and merge array sets originating with the same sample. The Sample/Reference attributes are not required for unpaired analysis. Data are categorized by entering a biological group attribute to define the biological source/state at the origin of each array (primary, metastasis, resistant, etc.). O-miner combines the results observed in the same biological source/state and performs group comparisons.
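The positional pairing rule described above (first sample against first reference, second against second, and so on) can be illustrated with a minimal Python sketch. The function name and the file names are hypothetical and chosen only for illustration; this is not O-miner's own code.

```python
def pair_by_position(sample_files, reference_files):
    """Pair the i-th sample file with the i-th reference file."""
    # A paired analysis only makes sense when both lists line up one-to-one.
    if len(sample_files) != len(reference_files):
        raise ValueError("sample and reference lists must have the same length")
    return list(zip(sample_files, reference_files))


# Hypothetical file names, for illustration only.
samples = ["tumour_01.CEL", "tumour_02.CEL"]
references = ["normal_01.CEL", "normal_02.CEL"]
pairs = pair_by_position(samples, references)
```

Because the pairing depends only on list order, reordering either list in the 'File Organizer' changes which reference each sample is compared against.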
O-miner reads CEL intensity files and automatically builds up the required directory structure and annotation files to run the methods implemented in the aroma.affymetrix framework (4,8). Briefly, O-miner performs initial quality control checks, background correction, allelic cross-talk calibration, nucleotide-position probe sequence effects normalization, probe-level summarization using robust average (for SNP 5.0 and 6.0 arrays) or log-additive model (for 10, 100 and 500K arrays), PCR fragment-length effects normalization and calculates raw copy number estimates (log2 ratios) relative to the chosen reference. These normalized estimates are used as input for segmentation methods to identify copy number regions and for subsequent analysis as explained below.
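As a rough illustration of the final step only, the log2 ratio of a sample relative to a reference can be sketched as below. The sketch assumes already normalized, strictly positive per-probe signals and, for the unpaired case, a reference formed as the per-probe mean of a pool of samples. The function names and numbers are hypothetical, and none of the preceding aroma.affymetrix normalization steps are reproduced here.

```python
from math import log2


def pooled_reference(pool):
    """Per-probe mean across a pool of reference samples (lists of equal length)."""
    n_samples = len(pool)
    return [sum(values) / n_samples for values in zip(*pool)]


def log2_ratios(sample, reference):
    """Per-probe log2(sample / reference); both inputs must be strictly positive."""
    return [log2(s / r) for s, r in zip(sample, reference)]


# Toy signals for two probes, for illustration only.
reference = pooled_reference([[100.0, 200.0], [300.0, 400.0]])
ratios = log2_ratios([400.0, 150.0], reference)
```

A ratio of 0 indicates no change relative to the reference, positive values suggest gain and negative values suggest loss.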
Normalized data text files as obtained from the raw CEL analysis described above or from other normalization methods and algorithms can be used. If uploaded as a new submission, the ‘File Organizer’ extracts the sample names from the column headings of the uploaded file and offers the option to enter a biological group attribute to define the biological source/state at the origin of each sample for further subgroup analysis. At this level, O-miner is ready to apply a segmentation analysis by offering 10 popular algorithms as implemented in the R package CGHweb (5). Briefly, these are BioHMM, CBS, FASeg, cghFLasso, CGHseg, GLAD, LOWESS, Wavelet smoothing, Quantile Smoothing and Running Average (9–17). The user selects the method(s) to be used to derive a consensus profile from multiple probes/samples. Added to the benefit of assessing segmented profiles from different algorithms, this also offers the user the possibility of checking whether a copy number alteration arose as an artefact of the specified segmentation method. The results are then ready to be processed to determine the regions of gains/losses according to user-defined cut-offs based on the log2 ratio threshold value, consecutive number of SNPs that form a copy number region (at least 15 SNPs by default) and frequency of samples where a copy number event was observed (at least 20% by default). O-miner offers an option to predict the log2 ratio threshold based on the quantile distribution of segmented raw copy numbers. Once a threshold is determined the data could be binary coded (0: no changes, 1: copy number gain, −1: copy number loss) for subsequent analysis. Similarly, users can start their data analysis from this level by submitting a binary coded data file.
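The thresholding and filtering criteria described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not O-miner's implementation: it assumes a single symmetric log2 ratio threshold for gains and losses, treats a region as a run of consecutive probes sharing the same non-zero code, and applies the sample frequency cut-off per probe. All function names are hypothetical.

```python
def code_probes(segmented_log2, threshold):
    """Binary-code segmented log2 ratios: 1 = gain, -1 = loss, 0 = no change.

    Assumes a symmetric threshold (gain at >= threshold, loss at <= -threshold).
    """
    return [1 if v >= threshold else -1 if v <= -threshold else 0
            for v in segmented_log2]


def call_regions(codes, min_probes=15):
    """Return (start, end, state) for runs of identical non-zero codes
    covering at least min_probes consecutive probes (end index inclusive)."""
    regions = []
    start = 0
    for i in range(1, len(codes) + 1):
        # Close the current run at the end of the data or when the code changes.
        if i == len(codes) or codes[i] != codes[start]:
            if codes[start] != 0 and i - start >= min_probes:
                regions.append((start, i - 1, codes[start]))
            start = i
    return regions


def recurrent_probes(coded_samples, state, min_frequency=0.2):
    """Indices of probes carrying `state` in at least min_frequency of samples."""
    n_samples = len(coded_samples)
    hits = []
    for idx in range(len(coded_samples[0])):
        count = sum(1 for sample in coded_samples if sample[idx] == state)
        if count / n_samples >= min_frequency:
            hits.append(idx)
    return hits
```

For example, `call_regions(code_probes(values, 0.3))` would keep only gains or losses spanning at least 15 probes, mirroring the default described in the text; the value 0.3 is an arbitrary placeholder for a threshold chosen by the user or predicted from the quantile distribution.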
Once regions of gains/losses have been determined, O-miner can provide physical and cytogenetic mapping information as well as related gene annotations from UCSC (18), NCBI RefSeq (19), Ensembl (20) and VEGA (21). O-miner also investigates regulatory elements, such as conserved Transcription Factor Binding Sites (22) and microRNA (23,24). As disease/critical genes are more likely to be located in copy number regions that are common/recurrent among samples, O-miner provides the analysis option of identifying recurrent regions of copy number alterations within the biological groups being investigated. These minimum common regions (MCR) can be calculated by using one of the three robust methods: CGHregions (25), RJaCGH (26) and MSA (27).
Expression profiling analysis starts from probe level raw intensity .CEL data files obtained immediately after scanning, through background correction, normalization and summarization to derive expression measurements data (normalized data matrix) followed by filtering to reduce data dimensionality and differential analysis to detect de-regulated genes. O-miner accepts data submission as .CEL files, normalized or filtered data matrix files. O-miner enables the analysis of paired samples/replicates.
A wide variety of Affymetrix platforms including the widely used GeneChip® Human Genome Arrays U95 Set (U95A, U95Av2, U95B, U95C, U95D, U95E), U133 Set (U133A and U133B), U133A 2.0 and U133 Plus 2.0 are available. After extracting the uploaded archive of array files, O-miner displays the available .CEL files list in the ‘File Organizer’ window for the user to define the samples list and biological source/state at the origin of each array. If performing a paired analysis, the user needs to arrange the samples in pairs in the related two group lists in the ‘File Organizer’. If the experiment contains technical replicates, the user must indicate the replicates in the additional ‘Replicate’ column that will appear in the ‘File Organizer’. O-miner combines the results observed in the same biological source/state and performs differential analysis between selected groups.
O-miner reads CEL intensity files and runs the quality control (QC) methods implemented in the R package ArrayMvout (28) to automatically exclude outliers from subsequent analysis. An additional manual check could be performed using ArrayQualityMetrics (29). This is followed by normalization using RMA (30), GCRMA (31) or tRMA (32). These normalized estimates are used as input for filtering and differential analysis methods to identify de-regulated expression and run further analyses as outlined below.
Normalized data text files as obtained from the raw CEL analysis described above or from other normalization methods can be used. If uploaded as a new submission, the ‘File Organizer’ window displays the sample names as extracted from the uploaded file and offers the option to enter a biological group attribute to define the biological source/state at the origin of each sample for further subgroup analysis. At this level, O-miner is ready to apply a filtering step to reduce the dimensionality of the data by offering three popular methods: interquartile range (IQR) (soft, intermediate, robust), intensity (25% or 50% of samples above 100) or standard deviation (top 10% or 5% most variable probes). Differential expression analysis is applied to the filtered matrix using LIMMA (33). O-miner will automatically refresh to display a ‘LIMMA comparison’ section with the list of biological groups allowing the user to define the contrast and design matrices required by LIMMA based on the user selection of the comparisons between the predefined biological groups. A number of statistics for differential expression are provided to refine the de-regulated genes list according to user-defined cut-offs based on log2 fold change values (2 by default) and P-values (0.05 by default) adjusted using Holm (34), Benjamini and Hochberg (BH) (also known as FDR) (35) or Benjamini and Yekutieli (BY) (36) multiple testing correction methods.
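To make the filtering and multiple testing steps more concrete, the sketch below shows a standard-deviation filter, the Benjamini and Hochberg adjustment, and a cut-off based selection. It is a minimal illustration, not O-miner or LIMMA code: the moderated statistics computed by LIMMA are not reproduced, the function names are hypothetical, and applying the fold change cut-off to the absolute log2 fold change is an assumption.

```python
import statistics


def top_variable_probes(matrix, fraction=0.05):
    """Keep the given fraction of probes with the largest standard deviation.

    matrix maps probe id -> list of expression values (at least two per probe).
    """
    ranked = sorted(matrix, key=lambda p: statistics.stdev(matrix[p]), reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_probes(log2fc, adj_p, fc_cutoff=2.0, p_cutoff=0.05):
    """Probes passing both cut-offs (assumption: |log2 fold change| is tested)."""
    return [p for p in log2fc
            if abs(log2fc[p]) >= fc_cutoff and adj_p[p] <= p_cutoff]
```

Holm and Benjamini and Yekutieli corrections follow the same pattern with different scaling of the ranked P-values.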
GOstats (37) can be used to assess the overrepresentation of GO terms among the GO annotations for the differentially expressed genes. Additional expression plots can also be generated from the results page allowing the user to examine the level/change in expression among the experimental datasets for a particular gene(s)/probe(s) of interest in the filtered data. A Venn diagram for up to four biological groups can be produced to show the common and specific differentially expressed probes (all, up- or down-regulated).
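Overrepresentation of a GO term among differentially expressed genes is commonly assessed with a hypergeometric test. The sketch below illustrates that calculation in plain Python; it is not GOstats code, the function and argument names are hypothetical, and it ignores refinements such as conditioning on the GO graph structure.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) for a hypergeometric draw.

    N: genes in the universe, K: genes annotated with the GO term,
    n: differentially expressed genes, k: those of the n carrying the term.
    """
    total = comb(N, n)
    upper = min(n, K)
    favourable = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return favourable / total


# Toy numbers, for illustration only: 3 of 10 genes carry the term,
# and 2 of the 4 differentially expressed genes are among them.
p_value = hypergeom_upper_tail(2, 4, 3, 10)
```

A small P-value suggests that the term occurs among the differentially expressed genes more often than expected by chance; in practice the P-values across all tested terms would also need multiple testing correction.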
O-miner provides comprehensive interactive web pages in a tabbed browsing format that guide the user through the key results of their analysis. All data are also available to download and view locally as text, Excel or image files.
Results are displayed as a tabbed view representing QC, clustering, MCR (if selected), sample and group information. Using the ‘Sample View’ it is possible to browse through the results obtained for each individual sample including log2ratio plots and annotated regions of gains and losses that can also be viewed as a track in the UCSC Genome Browser alongside a rich collection of annotations. The ‘Group View’ summarizes results based on the biological groups originally defined by the user including frequency plots and a gene-level view to summarize the gene content within copy number alterations.
To demonstrate the functionality of O-miner, we analysed 25 samples from mutated (KIT or PDGFRA) or wild-type gastrointestinal stromal tumours (GISTs) profiled using the Affymetrix Genome-Wide Human SNP 6.0 platform (GSE20709). We applied an unpaired analysis using the wild-type patients as baseline. We used the Picard, Fused Lasso and CBS algorithms for segmentation and required putative regions of genetic alteration to span at least 15 consecutive SNPs. The threshold for gains or losses was determined by O-miner based on the inspection of the quantile distribution of the segmented ratios. O-miner provides straightforward access to results for each biological group, and an easy way to drill down to individual results for a specified sample. The ‘Sample View’ and ‘Group View’ of the obtained results with related mining options are presented in Figures 1 and 2, respectively. One can navigate through putative regions of gains and losses and frequency plots for a specified sample, and automatically view this information within the UCSC Genome Browser, where one can zoom in to specific regions of interest. This allows the data to be mined and visualized alongside a large collection of annotation data tracks. Our results clearly show the hot spots for copy number loss on chromosomes 1, 14 and 22 as previously reported. The ‘Group View’ can easily be used to overlay and compare results from the two mutated sample sets (Figure 3).
To demonstrate further capabilities of O-miner, we analysed a panel of 12 primary effusion lymphoma (PEL) cell lines profiled with the Affymetrix GeneChip® Human Mapping Arrays 500K Set (GSE28684) (38) using an unpaired analysis with normal tonsil controls. Segmentation and thresholding methods were defined as in the previous example. We replicated the previously reported PEL-associated genomic amplifications in chromosomes 1q, 7, 8 and 12. Furthermore, as the majority of PEL are co-infected with Epstein–Barr virus (EBV), we segregated PEL samples into EBV-positive and EBV-negative subgroups and investigated the recurrent copy number alterations in each group using MSA. As shown in Figure 4A, one can quickly compare and visualize MCR plots at the genome or chromosome level for both biological groups. Results are available in HTML, Excel or BED formats. Results can also be viewed in the UCSC Genome Browser (Figure 4B), where we compared a detected MCR on chromosome 19p13.3 across the biological subgroups and investigated its gene content. In a few seconds, this visual inspection narrowed down an MCR of genetic gain specific to the EBV-negative subgroup. By displaying the RefSeq annotation track, we directly pointed to the RFX2, ACSBG2 and FUT3 genes reported in the original study to be altered only in the EBV-negative PEL subgroup. We also identified a few other important genes that map to this MCR and are relevant to the EBV-negative PEL subgroup.
In addition to analysing data from individual studies, O-miner provides a high-level analysis option. Data from multiple sources/formats can be merged at different levels (.CEL files, normalized, segmented or binary) and submitted to O-miner. This gives the user increased flexibility for carrying out a global analysis, depending on the data types available. For example, if .CEL files are not available, it is possible to submit merged, partially analysed data (normalized, segmented or binary coded format). This also provides a method of submitting larger datasets.
Results are displayed as a tabbed view representing QC, clustering, differential expression, gene ontology (if selected) and expression plots. As an example, we analysed six drug-resistant/parental MIA-PaCa-2 pancreatic cell lines profiled using Affymetrix GeneChip® Human Genome Arrays U133 Plus 2.0 (GSE16648) (39). After applying QC, normalization using GCRMA and filtering by standard deviation to select the top 5% of most variable probes, we performed a differential expression analysis using LIMMA to compare resistant to parental cell lines. A typical O-miner tabbed output includes QC information, differentially expressed genes, a cluster dendrogram, overrepresented gene ontology terms and an expression plot generator that could be used to produce expression plots on the fly to compare the expression level of a gene(s)/probe(s) of interest across the array data within the defined biological groups (Figure 5).
O-miner can also be used to run a rapid global analysis on transcriptomics data. For example, we analysed .CEL files from three prostate cell lines (LNCaP, DU145 and PC3) from three different studies in ArrayExpress/GEO (E-TABM-948, GSE32474 and E-GEOD-28846). Figure 6 demonstrates additional O-miner output capabilities and shows a Venn diagram indicating the overlap of differentially expressed probes between the different cell lines and clustering of expression data across the experimental groups.
O-miner is a useful and flexible tool, particularly for biologists to carry out routine data analysis without the need for a complex IT infrastructure or in-depth bioinformatics support. Future plans include the addition of further analysis pipelines, in particular for methylation, miRNA and downstream mining of next-generation sequencing data. In its current version, O-miner allows users to submit data by giving the GEO series number. For the moment this is limited to series with samples profiled on the same platform. We plan to develop this further in future releases. We are also planning to cover additional platforms/species such as Illumina and Affymetrix Whole-Transcript arrays and to make O-miner available as an R package.
Breast Cancer Campaign (to R.J.C.); Cancer Research UK (to A.S. and A.Z.D.U.). Funding for open access charge: Cancer Research UK [programme grant reference 15310].
Conflict of interest statement. None declared.
The authors thank their colleagues who have tested O-miner.