The explosion of available microarray data on human cancer increases the urgency of developing methods for effectively sharing these data among clinical cancer investigators. The lack of a smooth interface between the databases and statistical analysis tools limits the potential benefits of sharing the publicly available microarray data. To facilitate the efficient sharing and use of publicly available microarray data among cancer investigators, we have built a BRB-ArrayTools Data Archive including over one hundred human cancer microarray projects for 28 cancer types. Expression array data and clinical descriptors have been imported into BRB-ArrayTools and are stored as BRB-ArrayTools project folders on the archive. The data archive can be accessed from: http://linus.nci.nih.gov/~brb/DataArchive.html. Our BRB-ArrayTools data archive and GEO importer represent ongoing efforts to provide effective tools for efficiently sharing and utilizing human cancer microarray data.
Helicobacter pylori infection reprograms host gene expression and influences various cellular processes, which have been investigated by cDNA microarray using in vitro cultured cells and in vivo gastric biopsies from patients with chronic abdominal complaints. To further explore the effects of H. pylori infection on host gene expression, we collected gastric antral mucosa samples from 6 untreated patients with gastroscopic and pathologic confirmation of chronic superficial gastritis. Three of these patients were infected by H. pylori and the other three were not. The samples were analyzed on a microarray chip containing 14,112 cloned cDNAs, and the microarray data were analyzed with BRB ArrayTools software and the Ingenuity Pathways Analysis (IPA) website. The results showed that 34 of the 38 differentially expressed genes regulated by H. pylori infection had been annotated. The annotated genes were involved in protein metabolism, inflammatory and immunological reactions, signal transduction, gene transcription, trace element metabolism, and other processes. Eighty-two percent of these genes (28/34) were categorized in three molecular interaction networks involved in gene expression, cancer progression, antigen presentation and inflammatory response. The expression data from the array hybridization were confirmed by quantitative real-time PCR assays. Taken together, these data indicate that H. pylori infection could alter cellular gene expression processes, escape host defense mechanisms, increase inflammatory and immune responses, activate the NF-κB and Wnt/β-catenin signaling pathways, disturb metal ion homeostasis, and induce carcinogenesis. All of these might help to explain the pathogenic mechanism of H. pylori and the gastroduodenal pathogenesis induced by H. pylori infection.
The prognosis of patients with metastatic melanomas is extremely heterogeneous. Therefore, identifying high-risk subgroups by using innovative prediction models would help to improve selection of appropriate management options.
In this study, two datasets (GSE7929 and GSE7956) of mRNA expression microarrays in an animal melanoma model were normalized by frozen Robust Multi-Array Analysis and then combined by the distance-weighted discrimination method to identify time course-dependent metastasis-related gene signatures using Biometric Research Branch (BRB)-ArrayTools. Two further datasets (GSE8401 and GSE19234) of clinical melanoma samples with relevant clinical and survival data were then used to validate the prognostic signature.
A novel 192-gene set that varies significantly in parallel with increasing metastatic potential was identified in the animal melanoma model. This gene signature was then validated to correlate with poor prognosis of human metastatic melanomas, but not of primary melanomas, in two independent datasets. Furthermore, multivariate Cox proportional hazards regression analyses demonstrated that the prognostic value of the 192-gene set is independent of the TNM stage and that it has a higher area under the receiver operating characteristic curve than stage information in both validation datasets.
Our findings suggest that a time course-dependent metastasis-related gene expression signature is useful in predicting survival of malignant melanomas and might be useful in informing treatment decisions for these patients.
Electronic supplementary material
The online version of this article (doi:10.1186/s13000-014-0155-2) contains supplementary material, which is available to authorized users.
Melanomas; Metastasis; Prognosis; Prediction; Gene signature
Identification of gene expression profiles of cancer stem cells may have significant implications in the understanding of tumor biology and for the design of novel treatments targeted toward these cells. Here we report a potential ovarian cancer stem cell gene expression profile from the isolated side population of fresh ascites obtained from women with high-grade advanced stage papillary serous ovarian adenocarcinoma. Affymetrix U133 Plus 2.0 microarrays were used to interrogate the differentially expressed genes between side population (SP) and main population (MP), and the results were analyzed by paired t-test using BRB-ArrayTools. We identified 138 up-regulated and 302 down-regulated genes that were differentially expressed between all 10 SP/MP pairs. Microarray data were validated using qRT-PCR, and 17/19 (89.5%) genes showed robust correlations between microarray and qRT-PCR expression data. Pathway Studio analysis identified several genes involved in cell survival, differentiation, proliferation, and apoptosis that are unique to SP cells, and identified a mechanism for the activation of Notch signaling. To validate these findings, we identified and isolated SP cells enriched for cancer stem cells from human ovarian cancer cell lines. The SP populations had a higher colony-forming efficiency than their MP counterparts and were also capable of sustained expansion and differentiation into SP and MP phenotypes. An injection of 50,000 SP cells produced tumors in nude mice, whereas the same number of MP cells failed to produce any tumor by 8 weeks after injection. The SP cells demonstrated a dose-dependent sensitivity to specific γ-secretase inhibitors, implicating the Notch signaling pathway in SP cell survival. Further, the generated SP gene list was found to be enriched in recurrent ovarian cancer tumors.
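As a rough illustration of the paired comparison between SP and MP samples described above, the following Python sketch computes a paired t statistic for a single gene. It is a minimal sketch that assumes log-transformed expression values and one SP/MP pair per patient; the function name `paired_t` is hypothetical and this is not the BRB-ArrayTools implementation.

```python
import math
from statistics import mean, stdev


def paired_t(sp_values, mp_values):
    """Return (t statistic, degrees of freedom) for one gene.

    sp_values[i] and mp_values[i] are assumed to be log-expression
    values measured in the SP and MP fractions of the same patient.
    """
    if len(sp_values) != len(mp_values) or len(sp_values) < 2:
        raise ValueError("need at least two matched SP/MP pairs")
    # Work on within-patient differences so patient effects cancel out.
    diffs = [sp - mp for sp, mp in zip(sp_values, mp_values)]
    spread = stdev(diffs)
    if spread == 0:
        raise ValueError("all paired differences are identical")
    n = len(diffs)
    t_stat = mean(diffs) / (spread / math.sqrt(n))
    return t_stat, n - 1
```

A p-value would then be read from a t distribution with the returned degrees of freedom, followed by a multiple-testing adjustment across all genes.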
Krüppel-like factor KLF4 plays a crucial role in the development and maintenance of the mouse cornea. Here, we have compared the wild type (WT) and Klf4-conditional null (Klf4CN) corneal gene expression patterns to understand the molecular basis of the Klf4CN corneal phenotype.
Expression of more than 22,000 genes in 10 WT and Klf4CN corneas was compared by microarrays, analyzed using BRB ArrayTools and validated by Q-RT-PCR. Transient cotransfections were employed to test if KLF4 activates the aquaporin-3, Aldh3a1 and TKT promoters.
Scatter plot analysis identified 740 and 529 genes up- and down-regulated by more than 2-fold, respectively, in the Klf4CN corneas. Cell cycle activators were upregulated while the inhibitors were downregulated, consistent with the increased Klf4CN corneal epithelial cell proliferation. Desmosomal components were downregulated, consistent with the Klf4CN corneal epithelial fragility. Downregulation of aquaporin-3, detected by microarray, was confirmed by immunoblot and immunohistochemistry. Aquaporin-3 promoter activity was stimulated 7–10 fold by cotransfection with pCI-KLF4. Corneal crystallins Aldh3A1 and TKT were downregulated in the Klf4CN cornea and their respective promoter activities were upregulated 16- and 9-fold by pCI-KLF4 in co-transfections. Expression of epidermal keratinocyte differentiation markers was affected in the Klf4CN cornea. While the cornea specific keratin-12 was downregulated, most other keratins were upregulated, suggesting hyperkeratosis.
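The 2-fold criterion used above can be sketched in Python as follows. This is a minimal sketch that assumes each gene has one summarized log2 expression value per genotype; the function name and the dictionary layout are hypothetical and are not taken from the study.

```python
import math


def split_by_fold_change(log2_wt, log2_cn, threshold=2.0):
    """Return (up, down) gene lists for Klf4CN relative to WT.

    log2_wt and log2_cn map gene identifiers to log2 expression values.
    A gene is reported only if it changes by more than `threshold`-fold.
    """
    cutoff = math.log2(threshold)  # a 2-fold change is 1 unit on the log2 scale
    up, down = [], []
    for gene, wt_value in log2_wt.items():
        delta = log2_cn[gene] - wt_value
        if delta > cutoff:
            up.append(gene)
        elif delta < -cutoff:
            down.append(gene)
    return up, down
```

Working on the log2 scale keeps up- and down-regulation symmetric, so a 2-fold increase and a 2-fold decrease sit at +1 and -1 respectively.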
We have identified functionally diverse candidate KLF4 target genes, revealing the molecular basis of the diverse aspects of the Klf4CN corneal phenotype. These results establish KLF4 as an important node in the genetic network of transcription factors regulating the corneal homeostasis.
Cornea; Development; KLF4; Microarray
There have been relatively few publications using linear regression models to predict a continuous response based on microarray expression profiles. Standard linear regression methods are problematic when the number of predictor variables exceeds the number of cases. We have evaluated three linear regression algorithms that can be used for the prediction of a continuous response based on high dimensional gene expression data. The three algorithms are the least angle regression (LAR), the least absolute shrinkage and selection operator (LASSO), and the averaged linear regression method (ALM). All methods are tested using simulations based on a real gene expression dataset and analyses of two sets of real gene expression data and using an unbiased complete cross validation approach. Our results show that the LASSO algorithm often provides a model with somewhat lower prediction error than the LAR method, but both of them perform more efficiently than the ALM predictor. We have developed a plug-in for BRB-ArrayTools that implements the LAR and the LASSO algorithms with complete cross-validation.
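To make the idea of LASSO regression with complete cross-validation more concrete, here is a minimal pure-Python sketch. It assumes a coordinate-descent solver and leave-one-out splitting, neither of which is stated in the text above, and all function names are hypothetical. It is not the code of the BRB-ArrayTools plug-in. The point it illustrates is that every model-building step (centering and coefficient estimation) is repeated inside each fold, so the held-out case never influences the fitted model.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty (the LASSO proximal step)."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def fit_lasso(rows, y, penalty, sweeps=200):
    """Coordinate-descent LASSO; returns (intercept, coefficients)."""
    n, p = len(rows), len(rows[0])
    col_means = [sum(r[j] for r in rows) / n for j in range(p)]
    y_mean = sum(y) / n
    x = [[r[j] - col_means[j] for j in range(p)] for r in rows]
    resid = [v - y_mean for v in y]
    coefs = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            norm = sum(x[i][j] ** 2 for i in range(n)) / n
            if norm == 0.0:
                continue  # constant feature carries no information
            # Covariance of feature j with the residual that excludes feature j.
            rho = sum(x[i][j] * (resid[i] + coefs[j] * x[i][j]) for i in range(n)) / n
            new_coef = soft_threshold(rho, penalty) / norm
            delta = new_coef - coefs[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * x[i][j]
                coefs[j] = new_coef
    intercept = y_mean - sum(c * m for c, m in zip(coefs, col_means))
    return intercept, coefs


def predict(model, row):
    intercept, coefs = model
    return intercept + sum(c * v for c, v in zip(coefs, row))


def loo_cv_error(rows, y, penalty):
    """Mean squared leave-one-out error; the model is refit in every fold."""
    total = 0.0
    for held_out in range(len(rows)):
        train_rows = rows[:held_out] + rows[held_out + 1:]
        train_y = y[:held_out] + y[held_out + 1:]
        model = fit_lasso(train_rows, train_y, penalty)
        total += (y[held_out] - predict(model, rows[held_out])) ** 2
    return total / len(rows)
```

In practice the penalty would itself be chosen by an inner cross-validation loop within each training fold, which this sketch omits for brevity.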
regression model; gene expression; continuous outcome
Separation of the neurosensory retina from the retinal pigment epithelium (RPE) yields many morphologic and functional consequences, including death of the photoreceptor cells, Müller cell hypertrophy, and inner retinal rewiring. Many of these changes are due to the separation-induced activation of specific genes. In this work, we define the gene transcription profile within the retina as a function of time after detachment. We also define the early activation of kinases that might be responsible for the detachment-induced changes in gene transcription.
Separation of the retina from the RPE was induced in Brown-Norway rats by the injection of 1% hyaluronic acid into the subretinal space. Retinas were harvested at 1, 7, and 28 days after separation. Gene transcription profiles for each time point were determined using the Affymetrix Rat 230A gene microarray chip. Transcription levels in detached retinas were compared to those of nondetached retinas with the BRB-ArrayTools Version 3.6.0 using a random variance analysis of variance (ANOVA) model. Confirmation of the significant transcriptional changes for a subset of the genes was performed using microfluidic quantitative real-time polymerase chain reaction (qRT-PCR) assays. Kinase activation was explored using Western blot analysis to look for early phosphorylation of any of the 3 main families of mitogen-activated protein kinases (MAPK): the p38 family, the Janus kinase family, and the p42/p44 family.
Retinas separated from the RPE showed extensive alterations in their gene transcription profile. Many of these changes were initiated as early as 1 day after separation, with significant increases by 7 days. ANOVA analysis defined 144 genes that had significantly altered transcription levels as a function of time after separation when setting a false discovery rate at ≤0.1. Confirmatory RT-PCR was performed on 51 of these 144 genes. Differential transcription detected on the microarray chip was confirmed by qRT-PCR for all 51 genes. Western blot analysis showed that the p42/p44 family of MAPK was phosphorylated within 2 hours of retinal-RPE separation. This phosphorylation was detachment-induced and could be inhibited by specific inhibitors of MAPK phosphorylation.
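The text above reports a false discovery rate threshold of 0.1 but does not name the procedure used to control it. As an illustration only, the following Python sketch applies the Benjamini-Hochberg step-up procedure, which is one common way to select genes at a given FDR; treating it as the method used in this study is an assumption, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.1):
    """Return the sorted indices of hypotheses rejected at the given FDR."""
    m = len(p_values)
    # Rank the hypotheses from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value falls under its step-up threshold.
        if p_values[idx] <= fdr * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])
```

Each index in the returned list corresponds to one gene whose per-gene ANOVA p-value survives the adjustment.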
Separation of the retina from the RPE induces significant alteration in the gene transcription profile within the retina. These profiles are not static, but change as a function of time after detachment. These gene transcription changes are preceded by the activation of the p42/p44 family of MAPK. This altered transcription may serve as the basis for many of the morphologic, biochemical, and functional changes seen within the detached retina.
Clear cell ovarian cancer is an epithelial ovarian cancer histotype that is less responsive to chemotherapy and carries a poorer prognosis than the serous and endometrioid histotypes. Despite this, patients with these tumors are treated in a similar fashion to patients with all other ovarian cancers. Previous genomic analysis has suggested that clear cell cancers represent a unique tumor subtype. Here we generated the first whole-genome expression profile using the epithelial component of clear cell ovarian cancers and normal ovarian surface specimens isolated by laser capture microdissection. All the arrays were analyzed using BRB ArrayTools and PathwayStudio software to identify the signaling pathways. Identified pathways were validated using serous and clear cell cancer cell lines and RNAi technology. In vivo validations were carried out using an orthotopic mouse model and liposomal encapsulated siRNA. Patient-derived clear cell and serous ovarian tumors were grafted under the renal capsule of NOD-SCID mice to evaluate the therapeutic potential of the identified pathway. We identified major activated pathways in clear cell cancers involved in hypoxic cell growth, angiogenesis, and glucose metabolism that were not seen in other histotypes. Knockdown of key genes in these pathways sensitized clear cell ovarian cancer cell lines to hypoxia/glucose deprivation. In vivo experiments using patient-derived tumors demonstrated that clear cell tumors are exquisitely sensitive to antiangiogenesis therapy (i.e. sunitinib) compared with serous tumors. We generated a histotype-specific gene signature associated with clear cell ovarian cancer, which identifies important activated pathways critical for its clinicopathologic characteristics. These results provide a rational basis for a radically different treatment for ovarian clear cell patients.
This study evaluated the effects of black raspberries (BRBs) on biomarkers of tumor development in the human colon and rectum including methylation of relevant tumor suppressor genes, cell proliferation, apoptosis, angiogenesis and expression of Wnt pathway genes.
Biopsies of adjacent normal tissues and colorectal adenocarcinomas were taken from 20 patients before and after oral consumption of BRB powder (60g/day) for 1-to-9 wks. Methylation status of promoter regions of five tumor suppressor genes was quantified. Protein expression of DNA methyltransferase 1 (DNMT1) and genes associated with cell proliferation, apoptosis, angiogenesis, and Wnt signaling were measured.
The methylation of three Wnt inhibitors, SFRP2, SFRP5, and WIF1, upstream genes in Wnt pathway, and PAX6a, a developmental regulator, was modulated in a protective direction by BRBs in normal tissues and in colorectal tumors only in patients who received an average of 4 wks of BRB treatment, but not in all 20 patients with 1-to-9 wks of BRB treatment. This was associated with decreased expression of DNMT1. BRBs modulated expression of genes associated with Wnt pathway, proliferation, apoptosis and angiogenesis in a protective direction.
These data provide evidence of the ability of BRBs to demethylate tumor suppressor genes and to modulate other biomarkers of tumor development in the human colon and rectum. While demethylation of genes did not occur in colorectal tissues from all treated patients, the positive results with the secondary endpoints suggest that additional studies of BRBs for the prevention of colorectal cancer in humans now appear warranted.
Idiopathic Pulmonary Fibrosis (IPF) is characterized by profound changes in the lung phenotype including excessive extracellular matrix deposition, myofibroblast foci, alveolar epithelial cell hyperplasia and extensive remodeling. The role of epigenetic changes in determining the lung phenotype in IPF is unknown. In this study we determine whether IPF lungs exhibit an altered global methylation profile.
Immunoprecipitated methylated DNA from 12 IPF lungs, 10 lung adenocarcinomas and 10 normal histology lungs was hybridized to Agilent human CpG Islands Microarrays and data analysis was performed using BRB-Array Tools and DAVID Bioinformatics Resources software packages. Array results were validated using the EpiTYPER MassARRAY platform for 3 CpG islands. 625 CpG islands were differentially methylated between IPF and control lungs with an estimated False Discovery Rate less than 5%. The genes associated with the differentially methylated CpG islands are involved in regulation of apoptosis, morphogenesis and cellular biosynthetic processes. The expression of three genes (STK17B, STK3 and HIST1H2AH) with hypomethylated promoters was increased in IPF lungs. Comparison of IPF methylation patterns to lung cancer or control samples, revealed that IPF lungs display an intermediate methylation profile, partly similar to lung cancer and partly similar to control with 402 differentially methylated CpG islands overlapping between IPF and cancer. Despite their similarity to cancer, IPF lungs did not exhibit hypomethylation of long interspersed nuclear element 1 (LINE-1) retrotransposon while lung cancer samples did, suggesting that the global hypomethylation observed in cancer was not typical of IPF.
Our results provide evidence that epigenetic changes in IPF are widespread and potentially important. The partial similarity to cancer may signify similar pathogenetic mechanisms while the differences constitute IPF or cancer specific changes. Elucidating the role of these specific changes will potentially allow better understanding of the pathogenesis of IPF.
DAPfinder and DAPview are novel BRB-ArrayTools plug-ins to construct gene coexpression networks and identify significant differences in pairwise gene-gene coexpression between two phenotypes.
Each significant difference in gene-gene association represents a Differentially Associated Pair (DAP). Our tools include several choices of filtering methods, gene-gene association metrics, statistical testing methods and multiple comparison adjustments. Network results are easily displayed in Cytoscape. Analyses of glioma experiments and microarray simulations demonstrate the utility of these tools.
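One way to picture a Differentially Associated Pair is to compare the correlation of two genes in each phenotype. The Python sketch below uses Pearson correlation and a Fisher z-transform test as the association metric and statistical test. These are assumptions chosen for illustration, since the text above mentions several available metrics and tests without naming them, and the function names are hypothetical rather than part of DAPfinder.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long expression vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_z_difference(gene_a_1, gene_b_1, gene_a_2, gene_b_2):
    """Compare the gene A / gene B correlation between phenotypes 1 and 2.

    Returns (r1, r2, z, two-sided p-value). Needs more than 3 samples per
    phenotype; correlations are clipped so the z-transform stays finite.
    """
    limit = 1.0 - 1e-12
    r1 = max(-limit, min(limit, pearson(gene_a_1, gene_b_1)))
    r2 = max(-limit, min(limit, pearson(gene_a_2, gene_b_2)))
    n1, n2 = len(gene_a_1), len(gene_a_2)
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (math.atanh(r1) - math.atanh(r2)) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return r1, r2, z, p_value
```

Applied to every gene pair that passes filtering, the resulting p-values would still need a multiple comparison adjustment before pairs are declared significant.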
DAPfinder is a new user-friendly tool for reconstruction and comparison of biological networks.
Numerous microarray analysis programs have been created through the efforts of Open Source software development projects. Providing browser-based interfaces that allow these programs to be executed over the Internet enhances the applicability and utility of these analytic software tools.
Here we present ArrayQuest, a web-based DNA microarray analysis process controller. Key features of ArrayQuest are that (1) it is capable of executing numerous analysis programs such as those written in R, BioPerl and C++; (2) new analysis programs can be added to ArrayQuest Methods Library at the request of users or developers; (3) input DNA microarray data can be selected from public databases (i.e., the Medical University of South Carolina (MUSC) DNA Microarray Database or Gene Expression Omnibus (GEO)) or it can be uploaded to the ArrayQuest center-point web server into a password-protected area; and (4) analysis jobs are distributed across computers configured in a backend cluster. To demonstrate the utility of ArrayQuest we have populated the methods library with methods for analysis of Affymetrix DNA microarray data.
ArrayQuest enables browser-based implementation of DNA microarray data analysis programs that can be executed on a Linux-based platform. Importantly, ArrayQuest is a platform that will facilitate the distribution and implementation of new analysis algorithms and is therefore of use to both developers of analysis applications as well as users. ArrayQuest is freely available for use at .
The inner blood-retinal barrier (BRB) is a gliovascular unit in which macroglial cells surround capillary endothelial cells and regulate retinal capillaries by paracrine interactions. The purpose of the present study was to identify genes of retinal capillary endothelial cells whose expression is modulated by Müller glial cell-derived factors.
Conditionally immortalized rat retinal capillary endothelial (TR-iBRB2) and Müller (TR-MUL5) cell lines were chosen as an in vitro model. TR-iBRB2 cells were incubated with conditioned medium of TR-MUL5 (MUL-CM) for 24 h and subjected to microarray and quantitative real-time PCR analysis.
TR-MUL5 cell-derived factors increased alkaline phosphatase activity in TR-iBRB2 cells, indicating that paracrine interactions occurred between TR-iBRB2 and TR-MUL5 cells. Microarray analysis demonstrated that MUL-CM treatment leads to a modulation of several genes including an induction of plasminogen activator inhibitor 1 (PAI-1) and a suppression of an inhibitor of DNA binding 2 (Id2) in TR-iBRB2 cells. Treatment with TGF-β1, which is incorporated in MUL-CM, also resulted in an induction of PAI-1 and a suppression of Id2 in TR-iBRB2 cells.
In vitro inner BRB model study revealed that Müller glial cell-derived factors modulate endothelial cell functions including the induction of anti-angiogenic PAI-1 and the suppression of pro-angiogenic Id2. Therefore, Müller cells appear to be one of the modulators of retinal angiogenesis.
The high-density oligonucleotide microarray (GeneChip) is an important tool for molecular biological research aiming at large-scale detection of single nucleotide polymorphisms in DNA and genome-wide analysis of mRNA concentrations. Local array data management solutions are instrumental for efficient processing of the results and for subsequent uploading of data and annotations to a global certified data repository at the EBI (ArrayExpress) or the NCBI (Gene Expression Omnibus).
To facilitate and accelerate annotation of high-throughput expression profiling experiments, the Microarray Information Management and Annotation System (MIMAS) was developed. The system is fully compliant with the Minimal Information About a Microarray Experiment (MIAME) convention. MIMAS provides life scientists with a highly flexible and focused GeneChip data storage and annotation platform essential for subsequent analysis and interpretation of experimental results with clustering and mining tools. The system software can be downloaded for academic use upon request.
MIMAS implements a novel concept for nation-wide GeneChip data management whereby a network of facilities is centered on one data node directly connected to the European certified public microarray data repository located at the EBI. The solution proposed may serve as a prototype approach to array data management between research institutes organized in a consortium.
Cervical cancer is the most common cancer among Indian women. The current recommendations are to treat stages IIB, IIIA, IIIB and IVA with radical radiotherapy and weekly cisplatin-based chemotherapy. However, radiotherapy alone can help cure more than 60% of stage IIB and up to 40% of stage IIIB patients.
Archival RNA samples from 15 patients who had achieved complete remission and stayed disease free for more than 36 months (No Evidence of Disease or NED group) and 10 patients who had failed radical radiotherapy (Failed group) were included in the study. The RNA samples were amplified, labelled and hybridized to Stanford microarray chips and analyzed using BRB Array Tools software and Significance Analysis of Microarray (SAM) analysis. Twenty genes were selected for further validation using the Relative Quantitation (RQ) Taqman assay in a Taqman Low-Density Array (TLDA) format. The RQ value was calculated using each of the NED samples once as a calibrator. A scoring system was developed based on the RQ values for the genes.
Using a seven gene based scoring system, it was possible to distinguish between the tumours which were likely to respond to the radiotherapy and those likely to fail. The mean score ± 2 SE (standard error of mean) was used and at a cut-off score of greater than 5.60, the sensitivity, specificity, Positive predictive value (PPV) and Negative predictive value (NPV) were 0.64, 1.0, 1.0, 0.67, respectively, for the low risk group.
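The four performance measures reported above follow directly from a 2x2 table of predicted versus observed outcome. The Python sketch below shows the arithmetic. The counts in the usage example are hypothetical: they were chosen only because they round to the reported ratios, and the source does not give the underlying table.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, PPV and NPV from 2x2 table counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives among all actual positives
        "specificity": tn / (tn + fp),  # true negatives among all actual negatives
        "ppv": tp / (tp + fp),          # correct calls among positive calls
        "npv": tn / (tn + fn),          # correct calls among negative calls
    }


# Hypothetical counts for illustration only (not taken from the study).
example = diagnostic_metrics(tp=9, fp=0, tn=10, fn=5)
rounded = {name: round(value, 2) for name, value in example.items()}
```

A specificity and PPV of 1.0 both correspond to zero false positives, which is why the two values move together in the reported result.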
We have identified a 7 gene signature which could help identify patients with cervical cancer who can be treated with radiotherapy alone. However, this needs to be validated in a larger patient population.
Summary: Microarrays are commonly used to detect changes in gene expression between different biological samples. For this purpose, many analysis tools have been developed that offer visualization, statistical analysis and more sophisticated analysis methods. Most of these tools are designed specifically for messenger RNA microarrays. However, today, more and more different microarray platforms are available. Changes in DNA methylation, microRNA expression or even protein phosphorylation states can be detected with specialized arrays. For these microarray technologies, the number of available tools is small compared with mRNA analysis tools. Especially, a joint analysis of different microarray platforms that have been used on the same set of biological samples is hardly supported by most microarray analysis tools. Here, we present InCroMAP, a tool for the analysis and visualization of high-level microarray data from individual or multiple different platforms. Currently, InCroMAP supports mRNA, microRNA, DNA methylation and protein modification datasets. Several methods are offered that allow for an integrated analysis of data from those platforms. The available features of InCroMAP range from visualization of DNA methylation data over annotation of microRNA targets and integrated gene set enrichment analysis to a joint visualization of data from all platforms in the context of metabolic or signalling pathways.
Availability: InCroMAP is freely available as a Java™ application at www.cogsys.cs.uni-tuebingen.de/software/InCroMAP, including a comprehensive user’s guide and example files.
The web application D-Maps provides a user-friendly interface to researchers performing studies based on microarrays. The program was developed to manage and process one- or two-color microarray data obtained from several platforms (currently, GeneTAC, ScanArray, CodeLink, NimbleGen and Affymetrix). Although many algorithms and software programs for microarray analysis are available on the internet, these usually require sophisticated knowledge of mathematics, statistics and computation. D-Maps was developed to remove the need for high performance computers or programming experience. D-Maps performs raw data processing, normalization and statistical analysis, allowing access to the analyzed data in text or graphical format. An original feature of D-Maps is its GEO (Gene Expression Omnibus) submission format service. The D-Maps application has already been used for analysis of oligonucleotide microarrays and PCR-spotted arrays (one- and two-color, laser and light scanner). In conclusion, D-Maps is a valuable tool for the microarray research community, especially for groups without a bioinformatics core.
microarray; web service; software; affymetrix and nimblegen
DNA microarrays provide data for genome wide patterns of expression between observation classes. Microarray studies often have small samples sizes, however, due to cost constraints or specimen availability. This can lead to poor random error estimates and inaccurate statistical tests of differential expression. We compare the performance of the standard t-test, fold change, and four small n statistical test methods designed to circumvent these problems. We report results of various normalization methods for empirical microarray data and of various random error models for simulated data.
Three Empirical Bayes methods (CyberT, BRB, and limma t-statistics) were the most effective statistical tests across simulated and both 2-colour cDNA and Affymetrix experimental data. The CyberT regularized t-statistic in particular was able to maintain expected false positive rates with simulated data showing high variances at low gene intensities, although at the cost of low true positive rates. The Local Pooled Error (LPE) test introduced a bias that lowered false positive rates below theoretically expected values and had lower power relative to the top performers. The standard two-sample t-test and fold change were also found to be sub-optimal for detecting differentially expressed genes. The generalized log transformation was shown to be beneficial in improving results with certain data sets, in particular high variance cDNA data.
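The common idea behind the regularized tests compared above is to stabilize each gene's variance estimate by borrowing information from other genes. The Python sketch below shows a generic version of that idea: the pooled per-gene variance is shrunk toward a prior variance, weighted by degrees of freedom. It is a simplified illustration and not the exact formula of CyberT, BRB-ArrayTools or limma. The `prior_var` and `prior_df` arguments are hypothetical inputs that would in practice be estimated from all genes on the array.

```python
import math
from statistics import mean, variance


def moderated_t(group_a, group_b, prior_var, prior_df):
    """Two-sample t statistic with the gene variance shrunk toward a prior."""
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    # Ordinary pooled variance for this gene.
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    # Weighted average of the prior variance and the observed variance.
    shrunk = (prior_df * prior_var + df * pooled) / (prior_df + df)
    standard_error = math.sqrt(shrunk * (1.0 / na + 1.0 / nb))
    return (mean(group_a) - mean(group_b)) / standard_error
```

Setting `prior_df` to 0 recovers the standard two-sample t statistic, while a larger `prior_df` gives the prior more weight, which matters most when each class has only a few replicates.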
Pre-processing of data influences performance and the proper combination of pre-processing and statistical testing is necessary for obtaining the best results. All three Empirical Bayes methods assessed in our study are good choices for statistical tests for small n microarray studies for both Affymetrix and cDNA data. Choice of method for a particular study will depend on software and normalization preferences.
Though microarray experiments are very popular in life science research, managing and analyzing microarray data are still challenging tasks for many biologists. Most microarray programs require users to have sophisticated knowledge of mathematics, statistics and computer skills for usage. With accumulating microarray data deposited in public databases, easy-to-use programs to re-analyze previously published microarray data are in high demand.
EzArray is a web-based Affymetrix expression array data management and analysis system for researchers who need to organize microarray data efficiently and get data analyzed instantly. EzArray organizes microarray data into projects that can be analyzed online with predefined or custom procedures. EzArray performs data preprocessing and detection of differentially expressed genes with statistical methods. All analysis procedures are optimized and highly automated so that even novice users with limited pre-knowledge of microarray data analysis can complete initial analysis quickly. Since all input files, analysis parameters, and executed scripts can be downloaded, EzArray provides maximum reproducibility for each analysis. In addition, EzArray integrates with Gene Expression Omnibus (GEO) and allows instantaneous re-analysis of published array data.
EzArray is a novel Affymetrix expression array data analysis and sharing system. EzArray provides easy-to-use tools for re-analyzing published microarray data and will help both novice and experienced users perform initial analysis of their microarray data from the location of data storage. We believe EzArray will be a useful system for facilities with microarray services and laboratories with multiple members involved in microarray data analysis. EzArray is freely available from .
Microarray data are often used for patient classification and gene selection. An appropriate tool for end users and biomedical researchers should combine user friendliness with statistical rigor, including carefully avoiding selection biases and allowing analysis of multiple solutions, together with access to additional functional information of selected genes. Methodologically, such a tool would be of greater use if it incorporates state-of-the-art computational approaches and makes source code available.
We have developed GeneSrF, a web-based tool, and varSelRF, an R package, that implement, in the context of patient classification, a validated method for selecting very small sets of genes while preserving classification accuracy. Computation is parallelized, allowing to take advantage of multicore CPUs and clusters of workstations. Output includes bootstrapped estimates of prediction error rate, and assessments of the stability of the solutions. Clickable tables link to additional information for each gene (GO terms, PubMed citations, KEGG pathways), and output can be sent to PaLS for examination of PubMed references, GO terms, KEGG and and Reactome pathways characteristic of sets of genes selected for class prediction. The full source code is available, allowing to extend the software. The web-based application is available from . All source code is available from Bioinformatics.org or The Launchpad. The R package is also available from CRAN.
varSelRF and GeneSrF implement a validated method for gene selection including bootstrap estimates of classification error rate. They are valuable tools for applied biomedical researchers, especially for exploratory work with microarray data. Because of the underlying technology used (combination of parallelization with web-based application) they are also of methodological interest to bioinformaticians and biostatisticians.
The development of DNA microarrays has facilitated the generation of hundreds of thousands of transcriptomic datasets. The use of a common reference microarray design allows existing transcriptomic data to be readily compared and re-analysed in the light of new data, and the combination of this design with large datasets is ideal for 'systems'-level analyses. One issue is that these datasets are typically collected over many years and may be heterogeneous in nature, containing different microarray file formats and gene array layouts, dye-swaps, and showing varying scales of log2 ratios of expression between microarrays. Excellent software exists for the normalisation and analysis of microarray data, but many data have yet to be analysed because existing methods struggle with heterogeneous datasets; options include normalising microarrays on an individual or experimental group basis. Our solution was to develop the Batch Anti-Banana Algorithm in R (BABAR), an algorithm and software package that uses cyclic loess to normalise across the complete dataset. We have already used BABAR to analyse the function of Salmonella genes involved in the process of infection of mammalian cells.
The only input required by BABAR is unprocessed GenePix or BlueFuse microarray data files. BABAR provides a combination of 'within' and 'between' microarray normalisation steps and diagnostic boxplots. When applied to a real heterogeneous dataset, BABAR normalised the dataset to produce a comparable scaling between the microarrays, with the microarray data in excellent agreement with RT-PCR analysis. When applied to a real non-heterogeneous dataset and a simulated dataset, BABAR's performance in identifying differentially expressed genes showed some benefits over standard techniques.
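To make the notions of dye-swaps, differing log2 ratio scales, and 'within' versus 'between' microarray normalisation concrete, here is a minimal Python sketch. It is not BABAR and does not implement cyclic loess: as a deliberately simpler, labeled substitute it uses median centring within each array and rescaling to a common median absolute deviation (MAD) between arrays. Function names and intensities are hypothetical, and the sketch assumes positive intensities and a non-zero MAD.

```python
import math
from statistics import median


def log_ratios(red, green, dye_swap=False):
    """log2(red/green) per spot; the sign is flipped for dye-swapped arrays."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    return [-v for v in m] if dye_swap else m


def centre_within(m):
    """'Within' array step (simplified): subtract the array median."""
    med = median(m)
    return [v - med for v in m]


def mad(values):
    """Median absolute deviation around the median."""
    med = median(values)
    return median(abs(v - med) for v in values)


def scale_between(arrays):
    """'Between' array step (simplified): give every array the same MAD."""
    mads = [mad(a) for a in arrays]
    target = median(mads)
    return [[v * target / s for v in a] for a, s in zip(arrays, mads)]


# Array 1: sample in red, reference in green.
array1 = log_ratios([50, 100, 200, 400, 800], [100] * 5)
# Array 2: dye-swapped (sample in green) and on a wider scale.
array2 = log_ratios([100] * 5, [25, 100, 400, 1600, 6400], dye_swap=True)

normalised = scale_between([centre_within(array1), centre_within(array2)])
for row in normalised:
    print([round(v, 3) for v in row])
```

A global centring and scaling like this cannot remove intensity-dependent curvature in the log ratios, which is the kind of bias that loess-based methods are designed to address.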
BABAR is an easy-to-use software tool, simplifying the simultaneous normalisation of heterogeneous two-colour common reference design cDNA microarray-based transcriptomic datasets. We show BABAR transforms real and simulated datasets to allow for the correct interpretation of these data, and is the ideal tool to facilitate the identification of differentially expressed genes or network inference analysis from transcriptomic datasets.
Affymetrix microarrays are widely used to predict genome-wide gene expression and genome-wide genetic polymorphisms from RNA and genomic DNA hybridization experiments, respectively. It has recently been proposed to integrate the two predictions by use of RNA microarray data only. Although the ability to detect single feature polymorphisms (SFPs) from RNA microarray data has many practical implications for genome study in both sequenced and unsequenced species, it raises enormous challenges for statistical modelling and analysis of microarray gene expression data for this objective. Several methods have been proposed to predict SFPs from the gene expression profile. However, their performance is highly vulnerable to differential expression of genes, so the SFPs they predict often reflect differentially expressed genes rather than genuine sequence polymorphisms. To address the problem, we developed a novel statistical method that separates the binding affinity between a transcript and its targeting probe from the parameter measuring transcript abundance, using perfect-match hybridization values of Affymetrix gene expression data. We implemented a Bayesian approach to detect SFPs and to genotype a segregating population at the detected SFPs. Based on analysis of three Affymetrix microarray datasets, we demonstrated that the present method confers significantly improved robustness and accuracy in detecting the SFPs that carry genuine sequence polymorphisms when compared to its rivals in the literature. The method developed in this paper will provide experimental genomicists with advanced analytical tools for appropriate and efficient analysis of their microarray experiments and biostatisticians with insightful interpretation of Affymetrix microarray data.
One of the ultimate goals of genomics is to explore structural and functional variations of all genes in a genome. High-density oligo-microarray techniques enable prediction of genome-wide gene expression and genome-wide genetic polymorphisms using RNA and genomic DNA samples, respectively. A recent proposal to integrate the two predictions by use of RNA microarray data alone has great practical implications in genomics. However, it is essential but very challenging to develop an appropriate analytical method for detecting genetic polymorphisms, here single feature polymorphisms (SFPs), from RNA expression data, which are inherently coupled with various sources of biological and technical variation. This paper presents a novel statistical approach to detect SFPs from gene expression data. We demonstrated that the new method is significantly more robust to variation due to differential expression of genes, and calls SFPs that bear genuine sequence polymorphisms more reliably, than five other methods in the mainstream literature on SFP prediction from microarray data. The improved predictability of detecting SFPs not only confers accuracy in evaluating gene expression from microarray information, but also opens up an opportunity to integrate structural and functional analyses by using only one set of microarray data.
Regulation of gene expression is relevant to many areas of biology and medicine, in the study of treatments, diseases, and developmental stages. Microarrays can be used to measure the expression level of thousands of mRNAs at the same time, allowing insight into or comparison of different cellular conditions. The data derived from microarray experiments are high-dimensional and often noisy, and interpretation of the results can be intricate. Although programs for the statistical analysis of microarray data exist, most of them do not integrate analysis results with biological interpretation.
We have developed GEPAT, Genome Expression Pathway Analysis Tool, offering an analysis of gene expression data under genomic, proteomic and metabolic context. We provide an integration of statistical methods for data import and data analysis together with a biological interpretation for subsets of probes or single probes on the chip. GEPAT imports various types of oligonucleotide and cDNA array data formats. Different normalization methods can be applied to the data, after which data annotation is performed. After import, GEPAT offers various statistical data analysis methods, such as hierarchical, k-means and PCA clustering, a linear model based t-test, or chromosomal profile comparison. The results of the analysis can be interpreted by enrichment of biological terms, pathway analysis or interaction networks. Different biological databases are included to provide a range of information for each probe on the chip. GEPAT does not impose a linear workflow, but allows any subset of probes and samples to be used as the starting point for a new data analysis. GEPAT relies on established data analysis packages, offers a modular approach for easy extension, and can be run on a computer grid to serve a large number of users. It is freely available under the LGPL open source license for academic and commercial users at .
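Enrichment of biological terms is commonly assessed with a one-sided hypergeometric test (equivalent to a one-sided Fisher exact test). The Python sketch below illustrates that general technique only; the source does not state which test GEPAT uses, and the function name and the numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(n_universe, n_annotated, n_selected, n_overlap):
    """P(X >= n_overlap) for a hypergeometric X.

    n_universe  : probes on the chip
    n_annotated : probes on the chip carrying the term
    n_selected  : probes in the subset of interest
    n_overlap   : selected probes carrying the term
    """
    total = comb(n_universe, n_selected)
    upper = min(n_annotated, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 20 probes on the chip, 5 carry the term,
# 4 probes were selected and 3 of those carry the term.
p_value = enrichment_p_value(20, 5, 4, 3)
print(round(p_value, 4))
```

In practice many terms are tested at once, so the resulting p-values would need a multiple testing correction before interpretation.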
GEPAT is a modular, scalable and professional-grade software integrating analysis and interpretation of microarray gene expression data. An installation available for academic users can be found at .
Although they have become a widely used experimental technique for identifying differentially expressed (DE) genes, DNA microarrays are notorious for generating noisy data. A common strategy for mitigating the effects of noise is to perform many experimental replicates. This approach is often costly and sometimes impossible given limited resources; thus, analytical methods are needed which increase accuracy at no additional cost. One inexpensive source of microarray replicates comes from prior work: to date, data from hundreds of thousands of microarray experiments are in the public domain. Although these data assay a wide range of conditions, they cannot be used directly to inform any particular experiment and are thus ignored by most DE gene methods. We present the SVD Augmented Gene expression Analysis Tool (SAGAT), a mathematically principled, data-driven approach for identifying DE genes. SAGAT increases the power of a microarray experiment by using observed coexpression relationships from publicly available microarray datasets to reduce uncertainty in individual genes' expression measurements. We tested the method on three well-replicated human microarray datasets and demonstrate that use of SAGAT increased effective sample sizes by as many as 2.72 arrays. We applied SAGAT to unpublished data from a microarray study investigating transcriptional responses to insulin resistance, resulting in a 50% increase in the number of significant genes detected. We evaluated 11 (58%) of these genes experimentally using qPCR, confirming the directions of expression change for all 11 and statistical significance for three. Use of SAGAT revealed coherent biological changes in three pathways: inflammation, differentiation, and fatty acid synthesis, furthering our molecular understanding of a type 2 diabetes risk factor. 
We envision SAGAT as a means to maximize the potential for biological discovery from subtle transcriptional responses, and we provide it as a freely available software package that is immediately applicable to any human microarray study.
Though the use of microarrays to identify differentially expressed (DE) genes has become commonplace, it is still not a trivial task. Microarray data are notorious for being noisy, and current DE gene methods do not fully utilize pre-existing biological knowledge to help control this noise. One such source of knowledge is the vast number of publicly available microarray datasets. To leverage this information, we have developed the SVD Augmented Gene expression Analysis Tool (SAGAT) for identifying DE genes. SAGAT extracts transcriptional modules from publicly available microarray data and integrates this information with a dataset of interest. We explore SAGAT's ability to improve DE gene identification on simulated data, and we validate the method on three highly replicated biological datasets. Finally, we demonstrate SAGAT's effectiveness on a novel human dataset investigating the transcriptional response to insulin resistance. Use of SAGAT leads to an increased number of insulin resistant candidate genes, and we validate a subset of these with qPCR. We provide SAGAT as an open source R package that is applicable to any human microarray study.
Analysis of DNA microarray data takes as input spot intensity measurements from scanner software and returns differential expression of genes between two conditions, together with a statistical significance assessment. This process typically consists of two steps: data normalization and identification of differentially expressed genes through statistical analysis. The Expresso microarray experiment management system implements these steps with a two-stage, log-linear ANOVA mixed model technique, tailored to individual experimental designs. The complement of tools in TM4, on the other hand, is based on a number of preset design choices that limit its flexibility. In the TM4 microarray analysis suite, normalization, filter, and analysis methods form an analysis pipeline. TM4 computes integrated intensity values (IIV) from the average intensities and spot pixel counts returned by the scanner software as input to its normalization steps. By contrast, Expresso can use either IIV data or median intensity values (MIV). Here, we compare Expresso and TM4 analysis of two experiments and assess the results against qRT-PCR data.
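The difference between the two kinds of input can be sketched in a few lines of Python. The sketch assumes that an integrated intensity value is the product of a spot's average pixel intensity and its pixel count; the text above says only that IIV is computed from these two quantities, so the exact formula is an assumption, as are the function names, the toy pixel values, and the omission of background correction.

```python
import math
from statistics import mean, median


def integrated_intensity(pixels):
    """IIV under the stated assumption: average intensity times pixel count."""
    return mean(pixels) * len(pixels)


def median_intensity(pixels):
    """MIV: the median pixel intensity of the spot."""
    return median(pixels)


def log2_ratio(signal_a, signal_b):
    """log2 expression ratio between the two channels of one spot."""
    return math.log2(signal_a / signal_b)


# Toy spot: one very bright pixel (for example dust) in the first channel.
channel_a_pixels = [400, 400, 400, 400, 2400]
channel_b_pixels = [200, 200, 200, 200, 200]

iiv_ratio = log2_ratio(
    integrated_intensity(channel_a_pixels),
    integrated_intensity(channel_b_pixels),
)
miv_ratio = log2_ratio(
    median_intensity(channel_a_pixels),
    median_intensity(channel_b_pixels),
)
print(iiv_ratio, miv_ratio)
```

Because the median ignores a few extreme pixels while a mean-based integrated value does not, the two inputs can give different log ratios for the same spot, which is one way the choice of input could affect downstream normalization and testing.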
The Expresso analysis using MIV data consistently identifies more genes as differentially expressed, when compared to Expresso analysis with IIV data. The typical TM4 normalization and filtering pipeline corrects systematic intensity-specific bias on a per microarray basis. Subsequent statistical analysis with Expresso or a TM4 t-test can effectively identify differentially expressed genes. The best agreement with qRT-PCR data is obtained through the use of Expresso analysis and MIV data.
The results of this research are of practical value to biologists who analyze microarray data sets. The TM4 normalization and filtering pipeline corrects microarray-specific systematic bias and complements the normalization stage in Expresso analysis. The results of Expresso using MIV data have the best agreement with qRT-PCR results. In one experiment, MIV is a better choice than IIV as input to data normalization and statistical analysis methods, as it yields a greater number of statistically significant differentially expressed genes; TM4 does not support the choice of MIV input data. Overall, the more flexible and extensive statistical models of Expresso achieve more accurate analytical results, when judged by the yardstick of qRT-PCR data, in the context of an experimental design of modest complexity.