The Remote Analysis Computation for gene Expression data (RACE) suite is a collection of bioinformatics web tools designed for the analysis of DNA microarray data. RACE performs probe-level data preprocessing, extensive quality checks, data visualization and data normalization for Affymetrix GeneChips. In addition, it offers differential expression analysis on normalized expression levels from any array platform. RACE estimates the false discovery rates of lists of potentially regulated genes and provides a Gene Ontology-term analysis tool for GeneChip data to support the biological interpretation and annotation of results. The analysis is fully automated but can be customized by flexible parameter settings. To offer a convenient starting point for subsequent analyses, and to provide maximum transparency, the R scripts used to generate the results can be downloaded along with the output files. RACE is freely available for use at .
Genome-wide expression profiling is a powerful tool for implicating novel gene ensembles in cellular mechanisms of health and disease. The most popular platform for genome-wide expression profiling is the Affymetrix GeneChip. However, its selection of probes relied on earlier genome and transcriptome annotation which is significantly different from current knowledge. The resultant informatics problems have a profound impact on analysis and interpretation of the data. Here, we address these critical issues and offer a solution. We identified several classes of problems at the individual probe level in the existing annotation, under the assumption that current genome and transcriptome databases are more accurate than those used for GeneChip design. We then reorganized probes on more than a dozen popular GeneChips into gene-, transcript- and exon-specific probe sets in light of up-to-date genome, cDNA/EST clustering and single nucleotide polymorphism information. Comparing analysis results between the original and the redefined probe sets reveals ∼30–50% discrepancy in the genes previously identified as differentially expressed, regardless of analysis method. Our results demonstrate that the original Affymetrix probe set definitions are inaccurate, and many conclusions derived from past GeneChip analyses may be significantly flawed. It will be beneficial to re-analyze existing GeneChip data with updated probe set definitions.
DNA microarrays have become a nearly ubiquitous tool for the study of human disease, and nowhere is this more true than in cancer. With hundreds of studies and thousands of expression profiles representing the majority of human cancers completed and in public databases, the challenge has been effectively accessing and using this wealth of data.
To address this issue we have collected published human cancer gene expression datasets generated on the Affymetrix GeneChip platform, and carefully annotated those studies with a focus on providing accurate sample annotation. To facilitate comparison between datasets, we implemented a consistent data normalization and transformation protocol and then applied stringent quality control procedures to flag low-quality assays.
The resulting resource, the GeneChip Oncology Database, is available through a publicly accessible website that provides several query options and analytical tools through an intuitive interface.
Alternative splicing of pre-messenger RNA results in RNA variants with combinations of selected exons. It is one of the essential biological functions and regulatory components in higher eukaryotic cells. Some of these variants are detectable with the Affymetrix GeneChip® that uses multiple oligonucleotide probes (i.e. a probe set), since the target sequences for the multiple probes are adjacent within each gene. Hybridization intensity from a probe correlates with abundance of the corresponding transcript. Although the multiple-probe feature in the current GeneChip® was designed to assess expression values of individual genes, it also measures transcriptional abundance for a sub-region of a gene sequence. This additional capacity motivated us to develop a method to predict alternative splicing, taking advantage of extensive repositories of GeneChip® gene expression array data.
We developed a two-step approach to predict alternative splicing from GeneChip® data. First, we clustered the probes from a probe set into pseudo-exons based on similarity of probe intensities and physical adjacency. A pseudo-exon is defined as a sequence in the gene within which multiple probes have comparable probe intensity values. Second, for each pseudo-exon, we assessed the statistical significance of the difference in probe intensity between two groups of samples. Differentially expressed pseudo-exons are predicted to be alternatively spliced. We applied our method to empirical data generated from GeneChip® Hu6800 arrays, which include 7129 probe sets and twenty probes per probe set. The dataset consists of sixty-nine medulloblastoma (27 metastatic and 42 non-metastatic) samples and four cerebellum samples as normal controls. We predicted that 577 genes would be alternatively spliced when we compared normal cerebellum samples to medulloblastomas, and predicted that thirteen genes would be alternatively spliced when we compared metastatic medulloblastomas to non-metastatic ones. We checked the consistency of some of our findings with information in UCSC Human Genome Browser.
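The two steps can be illustrated with a minimal Python sketch. The abstract does not specify the clustering rule or the statistical test, so the greedy adjacency rule, the `tolerance` parameter and the Welch t statistic used here are assumptions chosen for illustration, not the authors' implementation.

```python
from math import sqrt
from statistics import mean, variance


def pseudo_exons(probe_means, tolerance=0.5):
    """Step 1 (assumed rule): walk the probes in their order along the gene
    and start a new pseudo-exon whenever a probe's log-intensity differs from
    the running mean of the current pseudo-exon by more than `tolerance`.
    Returns a list of lists of probe indices."""
    if not probe_means:
        return []
    segments = [[0]]
    for i in range(1, len(probe_means)):
        current = segments[-1]
        segment_mean = mean(probe_means[j] for j in current)
        if abs(probe_means[i] - segment_mean) <= tolerance:
            current.append(i)        # similar and adjacent: same pseudo-exon
        else:
            segments.append([i])     # intensity jump: new pseudo-exon
    return segments


def welch_t(group_a, group_b):
    """Step 2 (assumed test): Welch t statistic for the difference in
    pseudo-exon intensity between two groups of samples."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    if se == 0:
        raise ValueError("both groups have zero variance")
    return (mean(group_a) - mean(group_b)) / se


# Hypothetical log-intensities for six adjacent probes of one probe set.
print(pseudo_exons([8.0, 8.2, 7.9, 10.5, 10.7, 8.1]))
```

In this sketch a pseudo-exon with a large absolute t statistic between the two sample groups would be flagged as a candidate for alternative splicing; converting the statistic to a p-value and correcting for multiple testing are left out.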
The two-step approach described in this paper is capable of predicting some alternative splicing from multiple oligonucleotide-based gene expression array data with GeneChip® technology. Our method employs the extensive repositories of gene expression array data available and generates alternative splicing hypotheses, which can be further validated by experimental studies.
Short oligonucleotide arrays for transcript profiling have been available for several years. Generally, raw data from these arrays are analysed with the aid of the Microarray Analysis Suite or GeneChip Operating Software (MAS or GCOS) from Affymetrix. Recently, more methods to analyse the raw data have become available. Ideally all these methods should come up with more or less the same results. We set out to evaluate the different methods and include work on our own data set, in order to test which method gives the most reliable results.
Calculating gene expression with 6 different algorithms (MAS5, dChip PMMM, dChip PM, RMA, GC-RMA and PDNN) using the same (Arabidopsis) data, results in different calculated gene expression levels. Consequently, depending on the method used, different genes will be identified as differentially regulated. Surprisingly, there was only 27 to 36% overlap between the different methods. Furthermore, 47.5% of the genes/probe sets showed good correlation between the mismatch and perfect match intensities.
After comparing six algorithms, RMA gave the most reproducible results and showed the highest correlation coefficients with Real Time RT-PCR data on genes identified as differentially expressed by all methods. However, we were not able to verify, by Real Time RT-PCR, the microarray results for most genes that were solely calculated by RMA. Furthermore, we conclude that subtraction of the mismatch intensity from the perfect match intensity results most likely in a significant underestimation for at least 47.5% of the expression values. Not one algorithm produced significant expression values for genes present in quantities below 1 pmol. If the only purpose of the microarray experiment is to find new candidate genes, and too many genes are found, then mutual exclusion of the genes predicted by contrasting methods can be used to narrow down the list of new candidate genes by 64 to 73%.
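The overlap between methods and the narrowing of a candidate list can be sketched with plain set operations. The abstract does not state how overlap was computed, so the intersection-over-union definition here is an assumption, and the gene identifiers are hypothetical.

```python
def overlap_fraction(genes_a, genes_b):
    """Share of all reported genes that both methods agree on
    (intersection over union; an assumed definition of 'overlap')."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def consensus_genes(*gene_lists):
    """Genes called differentially expressed by every method supplied."""
    sets = [set(genes) for genes in gene_lists]
    return set.intersection(*sets) if sets else set()


# Hypothetical hit lists from two preprocessing methods.
rma_hits = ["g1", "g2", "g3", "g4"]
mas5_hits = ["g2", "g3", "g5"]

print(overlap_fraction(rma_hits, mas5_hits))        # 0.4
print(sorted(consensus_genes(rma_hits, mas5_hits)))  # ['g2', 'g3']
```

Keeping only the consensus set is one way to shorten a candidate list when too many genes are found, at the cost of discarding genes that a single method might have called correctly.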
Affymetrix GeneChip microarrays are popular platforms for expression profiling in two types of studies: detection of differential expression computed by p-values of t-test and estimation of fold change between analyzed groups. There are many different preprocessing algorithms for summarizing Affymetrix data. The main goal of these methods is to remove effects of non-specific hybridization, and to optimally combine information from multiple probes annotated to the same transcript. The methods are benchmarked by comparison with reference methods, such as quantitative reverse-transcription PCR (qRT-PCR).
We present a comprehensive analysis of agreement between Affymetrix GeneChip and qRT-PCR results. We analyzed the influence of filtering by fraction Present calls introduced by J.N. McClintick and H.J. Edenberg (2006) and 2 mapping procedures: updated probe sets definitions proposed by Dai et al. (2005) and our "naive mapping" method. Because of evolution of genome sequence annotations since the time when microarrays were designed, we also studied the effect of the annotation release date. These comparisons were prepared for 6 popular preprocessing algorithms (MAS5, PLIER, RMA, GC-RMA, MBEI, and MBEImm) in the 2 above-mentioned types of studies. We used data sets from 6 independent biological experiments. As a measure of reproducibility of microarray and qRT-PCR values, we used linear and rank correlation coefficients.
We show that filtering by fraction Present calls increased correlations for all 6 preprocessing algorithms. We observed the difference in performance of PM-MM and PM-only methods: using MM probes increased correlations in fold change studies, but PM-only methods proved to perform better in detection of differential expression. We recommend using GC-RMA for detection of differential expression and PLIER for estimation of fold change. The use of the more recent annotation improves the results in both types of studies, encouraging re-analysis of old data.
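Filtering by fraction of Present calls can be sketched as follows. This is a simplified illustration, assuming detection calls coded as "P", "M" and "A" and a single fraction computed across all arrays; the `min_fraction` threshold and the probe set names are hypothetical, and the published procedure may apply the fraction per experimental group.

```python
def fraction_present(calls):
    """Fraction of arrays on which a probe set received a Present ('P') call."""
    return sum(1 for call in calls if call == "P") / len(calls)


def filter_by_present(call_table, min_fraction=0.5):
    """Keep probe sets whose fraction of Present calls reaches min_fraction.
    call_table maps probe set ID -> list of detection calls, one per array."""
    return {
        probe_set: calls
        for probe_set, calls in call_table.items()
        if calls and fraction_present(calls) >= min_fraction
    }


# Hypothetical detection calls for three probe sets on four arrays.
call_table = {
    "ps1": ["P", "P", "A", "P"],
    "ps2": ["A", "A", "M", "P"],
    "ps3": ["P", "A", "P", "A"],
}
print(sorted(filter_by_present(call_table)))  # ['ps1', 'ps3']
```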
Affymetrix GeneChips and Illumina BeadArrays are the most widely used commercial single channel gene expression microarrays. Public data repositories are an extremely valuable resource, providing array-derived gene expression measurements from many thousands of experiments. Unfortunately many of these studies are underpowered and it is desirable to improve power by combining data from more than one study; we sought to determine whether platform-specific bias precludes direct integration of probe intensity signals for combined reanalysis.
Using Affymetrix and Illumina data from the microarray quality control project, from our own clinical samples, and from additional publicly available datasets, we evaluated several approaches to directly integrate intensity-level expression data from the two platforms. After mapping probe sequences to Ensembl genes, we demonstrate that ComBat and cross-platform normalisation (XPN) significantly outperform mean-centering and distance-weighted discrimination (DWD) in terms of minimising inter-platform variance. In particular, we observed that DWD, a popular method used in a number of previous studies, removed systematic bias at the expense of genuine biological variability, potentially reducing legitimate biological differences in integrated datasets.
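For orientation, the simplest of the compared approaches, mean-centering, can be sketched in a few lines. This illustrates only the baseline that ComBat and XPN were reported to outperform; the data layout (gene to list of log-expression values per platform) and the gene names are assumptions.

```python
from statistics import mean


def mean_center(batch):
    """Subtract each gene's platform-specific mean from its values.
    batch maps gene ID -> list of log-expression values on one platform."""
    centered = {}
    for gene, values in batch.items():
        gene_mean = mean(values)
        centered[gene] = [v - gene_mean for v in values]
    return centered


def combine_platforms(batch_a, batch_b):
    """Mean-center each platform separately, then concatenate the samples
    for genes measured on both platforms."""
    centered_a, centered_b = mean_center(batch_a), mean_center(batch_b)
    shared = centered_a.keys() & centered_b.keys()
    return {gene: centered_a[gene] + centered_b[gene] for gene in sorted(shared)}


# Hypothetical log2 expression values for two samples per platform.
affymetrix = {"G1": [8.0, 10.0], "G2": [5.0, 5.0]}
illumina = {"G1": [12.0, 14.0], "G3": [1.0, 2.0]}
print(combine_platforms(affymetrix, illumina))  # {'G1': [-1.0, 1.0, -1.0, 1.0]}
```

Mean-centering removes a per-gene offset between platforms but nothing else, which is consistent with it being a weaker correction than the model-based methods named above.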
Normalised and batch-corrected intensity-level data from Affymetrix and Illumina microarrays can be directly combined to generate biologically meaningful results with improved statistical power for robust, integrated reanalysis.
Microarrays have been a popular tool for gene expression profiling at genome-scale for over a decade due to the low cost, short turn-around time, excellent quantitative accuracy and ease of data generation. The Bioconductor package puma incorporates a suite of analysis methods for determining uncertainties from Affymetrix GeneChip data and propagating these uncertainties to downstream analysis. As isoform level expression profiling receives more and more interest within genomics in recent years, exon microarray technology offers an important tool to quantify expression level of the majority of exons and enables the possibility of measuring isoform level expression. However, puma does not include methods for the analysis of exon array data. Moreover, the current expression summarisation method for Affymetrix 3’ GeneChip data suffers from instability for low expression genes. For the downstream analysis, the method for differential expression detection is computationally intensive and the original expression clustering method does not consider the variance across the replicated technical and biological measurements. It is therefore necessary to develop improved uncertainty propagation methods for gene and transcript expression analysis.
We extend the previously developed Bioconductor package puma with a new method especially designed for GeneChip Exon arrays and a set of improved downstream approaches. The improvements include: (i) a new gamma model for exon arrays which calculates isoform and gene expression measurements and a level of uncertainty associated with the estimates, using the multi-mappings between probes, isoforms and genes, (ii) a variant of the existing approach for the probe-level analysis of Affymetrix 3’ GeneChip data to produce more stable gene expression estimates, (iii) an improved method for detecting differential expression which is computationally more efficient than the existing approach in the package and (iv) an improved method for robust model-based clustering of gene expression, which takes technical and biological replicate information into consideration.
With the extensions and improvements, the puma package is now applicable to the analysis of both Affymetrix 3’ GeneChips and Exon arrays for gene and isoform expression estimation. It propagates the uncertainty of expression measurements into more efficient and comprehensive downstream analysis at both gene and isoform level. Downstream methods are also applicable to other expression quantification platforms, such as RNA-Seq, when uncertainty information is available from expression measurements. puma is available through Bioconductor and can be found at http://www.bioconductor.org.
Interlaboratory comparison of microarray data, even when using the same platform, poses several challenges for scientists. RNA quality, RNA labeling efficiency, hybridization procedures and data-mining tools can all contribute variations in each laboratory. In Affymetrix GeneChips, about 11–20 different 25-mer oligonucleotides are used to measure the level of each transcript. Here, we report that ‘labeling extension values (LEVs)’, which are correlation coefficients between probe intensities and probe positions, are highly correlated with the gene expression levels (GEVs) on eukaryotic Affymetrix microarray data. By analyzing LEVs and GEVs in the publicly available 2414 CEL files of 20 Affymetrix microarray types covering 13 species, we found that correlations between LEVs and GEVs only exist in eukaryotic RNAs, but not in prokaryotic ones. Surprisingly, Affymetrix results of the same specimens that were analyzed in different laboratories could be clearly differentiated only by LEVs, leading to the identification of ‘laboratory signatures’. In the examined dataset, GSE10797, filtering out high-LEV genes did not compromise the discovery of biological processes that are constructed by differentially expressed genes. In conclusion, LEVs provide a new filtering parameter for microarray analysis of gene expression and may improve the inter- and intralaboratory comparability of Affymetrix GeneChip data.
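A LEV for a single probe set can be sketched as a correlation coefficient between probe positions and probe intensities. The abstract does not say which coefficient or which position coordinate was used, so the Pearson coefficient and the simple ordinal positions here are assumptions, and the intensities are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def labeling_extension_value(positions, intensities):
    """LEV of one probe set: correlation between probe position along the
    transcript and probe intensity (Pearson is an assumed choice)."""
    return pearson(positions, intensities)


# Hypothetical probe set: 11 probes whose signal falls steadily with position.
positions = list(range(1, 12))
intensities = [1000 - 50 * p for p in positions]
print(round(labeling_extension_value(positions, intensities), 6))  # -1.0
```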
Hyperoxia is specifically toxic to photoreceptors, and this toxicity may be important in the progress of retinal dystrophies. This study examines gene expression induced in the C57BL/6J mouse retina by hyperoxia over the 14-day period during which photoreceptors first resist, then succumb to, hyperoxia.
Young adult C57BL/6J mice were exposed to hyperoxia (75% oxygen) for up to 14 days. On day 0 (control), day 3, day 7, and day 14, retinal RNA was extracted and processed on Affymetrix GeneChip® Mouse Genome 430 2.0 arrays. Microarray data were analyzed using GCOS Version 1.4 and GeneSpring Version 7.3.1. For 15 genes, microarray data were confirmed using relative quantitative real-time reverse transcription polymerase chain reaction techniques.
The overall numbers of hyperoxia-regulated genes increased monotonically with exposure. Within that increase, however, a distinctive temporal pattern was apparent. At 3 days exposure, there was prominent upregulation of genes associated with neuroprotection. By day 14, these early-responsive genes were downregulated, and genes related to cell death were strongly expressed. At day 7, the regulation of these genes was mixed, indicating a possible “transition period” from stability at day 3 to degeneration at day 14. When functional groupings of genes were analyzed separately, there was significant regulation in genes responsive to stress, genes known to cause human photoreceptor dystrophies and genes associated with apoptosis.
Microarray analysis of the response of the retina to prolonged hyperoxia demonstrated a temporal pattern involving early neuroprotection and later cell death, and provided insight into the mechanisms involved in the two phases of response. As hyperoxia is a consistent feature of the late stages of photoreceptor degenerations, understanding the mechanisms of oxygen toxicity may be important therapeutically.
Complex microarray gene expression datasets can be used for many independent analyses and are particularly interesting for the validation of potential biomarkers and multi-gene classifiers. This article presents a novel method to perform correlations between microarray gene expression data and clinico-pathological data through a combination of available and newly developed processing tools.
We developed Survival Online (available at ), a Web-based system that allows for the analysis of Affymetrix GeneChip microarrays by using a parallel version of dChip. The user is first enabled to select pre-loaded datasets or single samples thereof, as well as single genes or lists of genes. Expression values of selected genes are then correlated with sample annotation data by uni- or multi-variate Cox regression and survival analyses. The system was tested using publicly available breast cancer datasets and GO (Gene Ontology) derived gene lists or single genes for survival analyses.
The system can be used by bio-medical researchers without specific computation skills to validate potential biomarkers or multi-gene classifiers. The design of the service, the parallelization of pre-processing tasks and the implementation on an HPC (High Performance Computing) environment make this system a useful tool for validation on several independent datasets.
NASC operates an Affymetrix ‘GeneChip’ (microarray) service for the Arabidopsis thaliana community. All data produced by the service are publicly available through our microarray database ‘NASCArrays’ published at http://affymetrix.arabidopsis.info. The data are accessible through text searching and a series of data mining tools. All data are annotated with sample preparation details, and the original Affymetrix data are available for download. The database aims to be MIAME supportive and provide a coordinated resource for researchers interested in the transcriptome of Arabidopsis. Using this database, data produced will be shared with other databases worldwide.
Affymetrix GeneChip microarrays are the most widely used high-throughput technology to measure gene expression, and a wide variety of preprocessing methods have been developed to transform probe intensities reported by a microarray scanner into gene expression estimates. There have been numerous comparisons of these preprocessing methods, focusing on the most common analyses—detection of differential expression and gene or sample clustering. Recently, more complex multivariate analyses, such as gene co-expression, differential co-expression, gene set analysis and network modeling, are becoming more common; however, the same preprocessing methods are typically applied. In this article, we examine the effect of preprocessing methods on some of these multivariate analyses and provide guidance to the user as to which methods are most appropriate.
microarray; preprocessing; gene expression; multivariate analysis
Common bean (Phaseolus vulgaris L.) and soybean (Glycine max) both belong to the Phaseoleae tribe and share significant coding sequence homology. This suggests that the GeneChip® Soybean Genome Array (soybean GeneChip) may be used for gene expression studies using common bean.
To evaluate the utility of the soybean GeneChip for transcript profiling of common bean, we hybridized cRNAs purified from nodule, leaf, and root of common bean and soybean in triplicate to the soybean GeneChip. Initial data analysis showed a decreased sensitivity and accuracy of measuring differential gene expression in common bean cross-species hybridization (CSH) GeneChip data compared to that of soybean. We employed a method that masked putative probes targeting inter-species variable (ISV) regions between common bean and soybean. A masking signal intensity threshold was selected that optimized both sensitivity and accuracy of measuring differential gene expression. After masking for ISV regions, the number of differentially-expressed genes identified in common bean was increased by 2.8-fold reflecting increased sensitivity. Quantitative RT-PCR (qRT-PCR) analysis of 20 randomly selected genes and purine-ureide pathway genes demonstrated an increased accuracy of measuring differential gene expression after masking for ISV regions. We also evaluated masked probe frequency per probe set to gain insight into the sequence divergence pattern between common bean and soybean. The sequence divergence pattern analysis suggested that the genes for basic cellular functions and metabolism were highly conserved between soybean and common bean. Additionally, our results show that some classes of genes, particularly those associated with environmental adaptation, are highly divergent.
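The masking step can be sketched as a simple intensity filter. The abstract states only that probes were masked using a signal intensity threshold, so the rule used here (mask a probe when its mean intensity across the common bean arrays falls below the threshold), the threshold value and the probe names are assumptions for illustration.

```python
from statistics import mean


def mask_probes(probe_intensities, threshold):
    """Keep probes whose mean cross-species hybridization intensity reaches
    the threshold; the rest are treated as targeting ISV regions.
    probe_intensities maps probe ID -> intensities across common bean arrays."""
    return {
        probe: values
        for probe, values in probe_intensities.items()
        if mean(values) >= threshold
    }


def masked_fraction(probe_intensities, threshold):
    """Fraction of probes in a probe set removed by masking."""
    kept = mask_probes(probe_intensities, threshold)
    return 1 - len(kept) / len(probe_intensities)


# Hypothetical intensities for four probes of one probe set on two arrays.
probe_set = {
    "p1": [200, 220],
    "p2": [40, 60],
    "p3": [150, 130],
    "p4": [30, 20],
}
print(sorted(mask_probes(probe_set, threshold=100)))  # ['p1', 'p3']
print(masked_fraction(probe_set, threshold=100))      # 0.5
```

The masked fraction per probe set corresponds to the masked probe frequency that the abstract uses as a proxy for sequence divergence between the two species.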
The soybean GeneChip is a suitable cross-species platform for transcript profiling in common bean when used in combination with the masking protocol described. In addition to transcript profiling, CSH of the GeneChip in combination with masking probes in the ISV regions can be used for comparative ecological and/or evolutionary genomics studies.
DNA microarrays are a powerful tool for monitoring the expression of tens of thousands of genes simultaneously. With the advance of microarray technology, the challenge becomes how to analyze a large amount of microarray data and make biological sense of them. Affymetrix GeneChips are widely used microarrays, where a variety of statistical algorithms have been explored and used for detecting significant genes in the experiment. These methods rely solely on the quantitative data, i.e., signal intensity; however, qualitative data are also important parameters in detecting differentially expressed genes.
AffyMiner is a tool developed for detecting differentially expressed genes in Affymetrix GeneChip microarray data and for associating gene annotation and gene ontology information with the genes detected. AffyMiner consists of the functional modules, GeneFinder for detecting significant genes in a treatment versus control experiment and GOTree for mapping genes of interest onto the Gene Ontology (GO) space; and interfaces to run Cluster, a program for clustering analysis, and GenMAPP, a program for pathway analysis. AffyMiner has been used for analyzing the GeneChip data and the results were presented in several publications.
AffyMiner fills an important gap in finding differentially expressed genes in Affymetrix GeneChip microarray data. AffyMiner effectively deals with multiple replicates in the experiment and takes into account both quantitative and qualitative data in identifying significant genes. AffyMiner reduces the time and effort needed to compare data from multiple arrays and to interpret the possible biological implications associated with significant changes in a gene's expression.
Several preprocessing methods are available for the analysis of Affymetrix GeneChip arrays. The most popular algorithms analyze the measured fluorescence intensities with statistical methods. Here we focus on a novel algorithm, AffyILM, available from Bioconductor, which relies on inputs from hybridization thermodynamics and uses an extended Langmuir isotherm model to compute transcript concentrations. These concentrations are then employed in the statistical analysis. We compared the performance of AffyILM and other traditional methods both in the old and in the newest generation of GeneChips.
Tissue mixture and Latin Square datasets (provided by Affymetrix) were used to assess the performance of the differential expression analysis depending on the preprocessing strategy. A correlation analysis conducted on the tissue mixture data reveals that the median-polish algorithm best summarizes AffyILM concentrations computed at the probe level. Those correlation results are equivalent to the best correlations observed using popular preprocessing methods relying on intensity values. The performance of each tested preprocessing algorithm was quantified using the Latin Square HG-U133A dataset by comparing the differential analysis results with the list of spiked genes. The figures of merit generated illustrate that the performance of AffyILM(medianpolish), inferred from the present statistical analysis, is comparable to that of the best performing strategies previously reported.
By converting probe intensities to estimates of target concentrations prior to the statistical analysis, AffyILM(medianpolish) is one of the best performing strategies currently available. Using hybridization theory, probe-level estimates of target concentrations should be identically distributed. In the future, a probe-level multivariate analysis of the concentrations should be compared to the univariate analysis of probe-set summarized expression data.
Over the past decade, gene expression microarray studies have greatly expanded our knowledge of genetic mechanisms of human diseases. Meta-analysis of substantial amounts of accumulated data, by integrating valuable information from multiple studies, is becoming more important in microarray research. However, collecting data of special interest from public microarray repositories often presents major practical problems. Moreover, including low-quality data may significantly reduce meta-analysis efficiency.
M2DB is a human curated microarray database designed for easy querying, based on clinical information and for interactive retrieval of either raw or uniformly pre-processed data, along with a set of quality-control metrics. The database contains more than 10,000 previously published Affymetrix GeneChip arrays, performed using human clinical specimens. M2DB allows online querying according to a flexible combination of five clinical annotations describing disease state and sampling location. These annotations were manually curated by controlled vocabularies, based on information obtained from GEO, ArrayExpress, and published papers. For array-based assessment control, the online query provides sets of QC metrics, generated using three available QC algorithms. Arrays with poor data quality can easily be excluded from the query interface. The query provides values from two algorithms for gene-based filtering, and raw data and three kinds of pre-processed data for downloading.
M2DB utilizes a user-friendly interface for QC parameters, sample clinical annotations, and data formats to help users obtain clinical metadata. This database provides a lower entry threshold and an integrated process of meta-analysis. We hope that this research will promote further evolution of microarray meta-analysis.
The original spotted array technology, with competitive hybridization of two experimental samples and measurement of relative expression levels, is increasingly displaced by more accurate platforms that allow determining absolute expression values for a single sample (for example, Affymetrix GeneChip and Illumina BeadChip). Unfortunately, cross-platform comparisons show a disappointingly low concordance of lists of regulated genes between the latter two platforms.
Whereas expression values determined with a single Affymetrix GeneChip represent single measurements, the expression results obtained with Illumina BeadChip are essentially statistical means from several dozens of identical probes. In the case of multiple technical replicates, the data require, therefore, different statistical treatment depending on the platform. The key is the computation of the squared standard deviation within replicates in the case of the Illumina data as a weighted mean of the square of the standard deviations of the individual experiments. With an Illumina spike experiment, we demonstrate dramatically improved significance of spiked genes over all relevant concentration ranges. The re-evaluation of two published Illumina datasets (membrane type-1 matrix metalloproteinase expression in mammary epithelial cells by Golubkov et al. Cancer Research (2006) 66, 10460; spermatogenesis in normal and teratozoospermic men, Platts et al. Human Molecular Genetics (2007) 16, 763) identified significantly more biologically relevant genes as transcriptionally regulated targets and, thus, additional biological pathways involved.
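The weighted-mean computation can be sketched as a classical pooled variance. The abstract does not give the weights, so the degrees-of-freedom weights (bead count minus one) are an assumption, as is the premise that each BeadChip reports a standard error and a bead count per probe; the function name is hypothetical.

```python
from math import sqrt


def pooled_sd(std_errors, bead_counts):
    """Within-replicate standard deviation for one probe across technical
    replicates, pooled from per-BeadChip standard errors.
    Each chip's variance is recovered as sd^2 = se^2 * n, then the variances
    are averaged with weights (n - 1), an assumed weighting scheme."""
    if len(std_errors) != len(bead_counts) or not std_errors:
        raise ValueError("need one standard error per bead count")
    weighted_sum = 0.0
    total_weight = 0
    for se, n in zip(std_errors, bead_counts):
        if n < 2:
            raise ValueError("each chip needs at least two beads")
        chip_variance = se ** 2 * n          # undo the division by sqrt(n)
        weighted_sum += (n - 1) * chip_variance
        total_weight += n - 1
    return sqrt(weighted_sum / total_weight)


# Hypothetical probe measured on two replicate BeadChips with 25 beads each.
print(pooled_sd([2.0, 2.0], [25, 25]))  # 10.0
```

This sketch covers only the within-replicate component described in the abstract; how that term enters the final test statistic is not specified there and is not modelled here.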
The results in this work show that it is important to process Illumina BeadChip data in a modified statistical procedure and to compute the standard deviation in experiments with technical replicates from the standard errors of individual BeadChips. This change leads also to an improved concordance with Affymetrix GeneChip results as the spermatogenesis dataset re-evaluation demonstrates.
This article was reviewed by I. King Jordan, Mark J. Dunning and Shamil Sunyaev.
Improvements in genome sequence annotation revealed discrepancies in the original probeset/gene assignment in Affymetrix microarray and the existence of differences between annotations and effective alignments of probes and transcription products. In the current generation of Affymetrix human GeneChips, most probesets include probes matching transcripts from more than one gene and probes which do not match any transcribed sequence.
We developed a novel set of custom Chip Definition Files (CDF) and the corresponding Bioconductor libraries for Affymetrix human GeneChips, based on the information contained in the GeneAnnot database. GeneAnnot-based CDFs are composed of unique custom-probesets, including only probes matching a single gene.
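The core selection rule, keeping only probes that match a single gene and grouping them into one probeset per gene, can be sketched as follows. The input mapping from probe to matched genes is assumed to come from an external alignment or annotation step that is not shown, and all identifiers are hypothetical.

```python
from collections import defaultdict


def build_gene_probesets(probe_to_genes):
    """Group probes into one custom probeset per gene, discarding probes that
    match no gene or more than one gene.
    probe_to_genes maps probe ID -> list of gene IDs the probe matches."""
    probesets = defaultdict(list)
    for probe, genes in probe_to_genes.items():
        unique_genes = set(genes)
        if len(unique_genes) == 1:           # unambiguous probe
            probesets[next(iter(unique_genes))].append(probe)
    return {gene: sorted(probes) for gene, probes in probesets.items()}


# Hypothetical probe-to-gene matches.
probe_to_genes = {
    "pr1": ["GENE_A"],
    "pr2": ["GENE_A", "GENE_B"],  # cross-matching probe, dropped
    "pr3": [],                    # matches no transcript, dropped
    "pr4": ["GENE_B"],
    "pr5": ["GENE_A"],
}
print(build_gene_probesets(probe_to_genes))
# {'GENE_A': ['pr1', 'pr5'], 'GENE_B': ['pr4']}
```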
GeneAnnot-based custom CDFs solve the problem of a reliable reconstruction of expression levels and eliminate the existence of more than one probeset per gene, which often leads to discordant expression signals for the same transcript when gene differential expression is the focus of the analysis. GeneAnnot CDFs are freely distributed and fully compliant with Affymetrix standards and all available software for gene expression analysis. The CDF libraries are available from , along with supplementary information (CDF libraries, installation guidelines and R code, CDF statistics, and analysis results).
The MIPS Fusarium graminearum Genome Database (FGDB) is a comprehensive genome database on one of the most devastating fungal plant pathogens of wheat and barley. FGDB provides information on two gene sets independently derived by automated annotation of the F.graminearum genome sequence. A complete manually revised gene set will be completed within the near future. The initial results of systematic manual correction of gene calls are already part of the current gene set. The database can be accessed to retrieve information from bioinformatics analyses and functional classifications of the proteins. The data are also organized in the well established MIPS catalogs and novel query techniques are available to search the data. The comprehensive set of gene calls was also used for the design of an Affymetrix GeneChip. The resource is accessible on .
CARMAweb (Comprehensive R-based Microarray Analysis web service) is a web application designed for the analysis of microarray data. CARMAweb performs data preprocessing (background correction, quality control and normalization), detection of differentially expressed genes, cluster analysis, dimension reduction and visualization, classification, and Gene Ontology-term analysis. This web application accepts raw data from a variety of imaging software tools for the most widely used microarray platforms: Affymetrix GeneChips, spotted two-color microarrays and Applied Biosystems (ABI) microarrays. R and packages from the Bioconductor project are used as the analytical engine in combination with the R function Sweave, which allows automatic generation of analysis reports. These report files contain all R commands used to perform the analysis and therefore guarantee maximum transparency and reproducibility for each analysis. The web application is implemented in Java based on the latest J2EE (Java 2 Enterprise Edition) software technology. CARMAweb is freely available at .
Affymetrix GeneChip Array and Massively Parallel Signature Sequencing (MPSS) are two high throughput methodologies used to profile transcriptomes. Each method has certain strengths and weaknesses; however, no comparison has been made between the data derived from Affymetrix arrays and MPSS. In this study, two lineage-related prostate cancer cell lines, LNCaP and C4-2, were used for transcriptome analysis with the aim of identifying genes associated with prostate cancer progression.
Affymetrix GeneChip array and MPSS analyses were performed. Data were analyzed with GeneSpring 6.2 and in-house Perl scripts. Expression array results were verified with RT-PCR.
Comparison of the data revealed that both technologies detected genes the other did not. In LNCaP, 3,180 genes were only detected by Affymetrix and 1,169 genes were only detected by MPSS. Similarly, in C4-2, 4,121 genes were only detected by Affymetrix and 1,014 genes were only detected by MPSS. Analysis of the combined transcriptomes identified 66 genes unique to LNCaP cells and 33 genes unique to C4-2 cells. Expression analysis of these genes in prostate cancer specimens showed CA1 to be highly expressed in bone metastasis but not expressed in primary tumor and EPHA7 to be expressed in normal prostate and primary tumor but not bone metastasis.
Our data indicate that transcriptome profiling with a single methodology will not fully assess the expression of all genes in a cell line. A combination of transcription profiling technologies such as DNA array and MPSS provides a more robust means to assess the expression profile of an RNA sample. Finally, genes that were differentially expressed in cell lines were also differentially expressed in primary prostate cancer and its metastases.
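The platform comparison reported above amounts to set operations on lists of detected genes. The following Python sketch shows one way such counts could be derived. It is an illustration under stated assumptions, not the authors' GeneSpring or Perl workflow: the function `compare_platforms` and the gene identifiers `g1` to `g6` are hypothetical toy values, not data from the study.

```python
def compare_platforms(detected_array, detected_mpss):
    """Split detected genes into platform-specific, shared and combined sets."""
    array_genes, mpss_genes = set(detected_array), set(detected_mpss)
    return {
        "only_array": array_genes - mpss_genes,  # detected by the array only
        "only_mpss": mpss_genes - array_genes,   # detected by MPSS only
        "both": array_genes & mpss_genes,        # detected by both platforms
        "combined": array_genes | mpss_genes,    # combined transcriptome
    }


# Toy gene lists for two cell lines (hypothetical identifiers).
line_1 = compare_platforms({"g1", "g2", "g3"}, {"g3", "g4"})
line_2 = compare_platforms({"g2", "g3", "g5"}, {"g3", "g6"})

# Genes present in the combined transcriptome of one line but not the other.
unique_to_line_1 = line_1["combined"] - line_2["combined"]
unique_to_line_2 = line_2["combined"] - line_1["combined"]

print(sorted(line_1["only_array"]), sorted(line_1["only_mpss"]))
print(sorted(unique_to_line_1), sorted(unique_to_line_2))
```

Taking the union across platforms before comparing cell lines reflects the idea that a gene missed by one technology should not be counted as absent from the cell line.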
Gene microarray analyses represent a potentially effective means for high-throughput gene expression profiling in nonhuman primates. In the companion article we emphasize effective experimental design based on the in vivo physiology of the rhesus macaque, whereas this article emphasizes considerations for gene annotation and data interpretation using gene microarray platforms from Affymetrix®. Initial annotation of the rhesus genome array was based on Affymetrix® human GeneChips®. However, annotation revisions improve the precision with which rhesus transcripts are identified. Annotation of the rhesus GeneChip® is under continuous revision, with large percentages of probesets under multiple annotation systems having undergone multiple reassignments between March 2007 and November 2008. It is also important to consider that quantitation and comparison of gene expression levels across multiple chips require appropriate normalization. External corroboration of microarray results using PCR-based methodology also requires validation of appropriate internal reference genes for normalization of expression values. Many tools are now freely available to aid investigators with microarray normalization and selection of internal reference genes to be used for independent corroboration of microarray results.
Macaca mulatta; Microarray; GeneChip® Rhesus Macaque Genome Array
The application of microarray hybridization theory to Affymetrix GeneChip data has been a recent focus for data analysts. It has been shown that the hyperbolic Langmuir isotherm captures the shape of the signal response of Affymetrix GeneChips to target concentration. We demonstrate that existing linear fit methods for extracting gene expression measures are not well adapted to the saturation effect resulting from surface adsorption processes. In contrast to the most popular methods, we fit background and concentration parameters within a single global fitting routine instead of estimating the background before obtaining gene expression measures. We describe a non-linear multi-chip model of the perfect match signal that effectively allows for the separation of specific and non-specific components of the microarray signal and avoids saturation bias in the high-intensity range. Multimodel inference, incorporated within the fitting routine, allows a quantitative selection of the model that best describes the observed data. The performance of this method is evaluated on publicly available datasets, and comparisons to popular algorithms are presented.
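To make the saturation argument concrete, the sketch below evaluates a hyperbolic Langmuir response of the common form signal = background + amplitude × c / (c + K) and inverts it. This is only a single-probe illustration of the isotherm's shape, not the multi-chip model or fitting routine described above; the parameter names and the numerical values are assumptions made for the example.

```python
def langmuir_signal(conc, amplitude, k_half, background):
    """Hyperbolic Langmuir response: background plus a saturating specific term."""
    return background + amplitude * conc / (conc + k_half)


def concentration_from_signal(signal, amplitude, k_half, background):
    """Invert the isotherm to recover concentration from an observed signal."""
    specific = signal - background
    if not 0 <= specific < amplitude:
        raise ValueError("signal lies outside the invertible range of the isotherm")
    return k_half * specific / (amplitude - specific)


# Hypothetical parameters: saturation amplitude, half-saturation constant, background.
params = dict(amplitude=10000.0, k_half=100.0, background=50.0)

low = langmuir_signal(100.0, **params)
high = langmuir_signal(1000.0, **params)

# A true 10-fold concentration change appears compressed in background-corrected signal.
observed_fold = (high - params["background"]) / (low - params["background"])
recovered_fold = (
    concentration_from_signal(high, **params)
    / concentration_from_signal(low, **params)
)
print(f"observed fold change: {observed_fold:.2f}, recovered: {recovered_fold:.2f}")
```

A method that treats signal as proportional to concentration would report the compressed fold change, whereas inverting the isotherm recovers the true ratio when the parameters are known.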
With an abundance of microarray gene expression data sets available through public repositories, new possibilities lie in combining multiple existing data sets. In this new context, analysis itself is no longer the problem; instead, retrieving and consistently integrating all these data before delivering them to the wide variety of existing analysis tools becomes the new bottleneck.
We present the newly released inSilicoMerging R/Bioconductor package which, together with the earlier released inSilicoDb R/Bioconductor package, allows consistent retrieval, integration and analysis of publicly available microarray gene expression data sets. The inSilicoMerging package also provides a set of five visual and six quantitative validation measures.
By providing (i) access to uniformly curated and preprocessed data, (ii) a collection of techniques to remove the batch effects between data sets from different sources, and (iii) several validation tools enabling the inspection of the integration process, these packages enable researchers to fully explore the potential of combining gene expression data for downstream analysis. The power of using both packages is demonstrated by programmatically retrieving and integrating gene expression studies from the InSilico DB repository [https://insilicodb.org/app/].
Batch effect removal; Data integration; Gene expression; Microarray repositories; InSilico DB; Reproducibility
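As an illustration of what removing batch effects between data sets can mean in its simplest form, the Python sketch below applies per-gene batch mean-centering: within each batch, each gene's mean is subtracted from its values. This is a generic technique shown under stated assumptions and is not the API of the inSilicoMerging or inSilicoDb R packages; the function name `batch_mean_center`, the gene name and the expression values are hypothetical.

```python
from statistics import mean


def batch_mean_center(expression, batches):
    """Center each gene within each batch.

    expression maps a gene ID to a list of values, one per sample.
    batches lists the batch label of each sample, in the same order.
    """
    corrected = {}
    for gene, values in expression.items():
        # Mean expression of this gene within each batch.
        batch_means = {
            label: mean(v for v, b in zip(values, batches) if b == label)
            for label in set(batches)
        }
        corrected[gene] = [v - batch_means[b] for v, b in zip(values, batches)]
    return corrected


# Toy data: four samples from two studies with a constant offset between them.
expression = {"geneX": [5.0, 7.0, 9.0, 11.0]}
batches = ["study_A", "study_A", "study_B", "study_B"]

merged = batch_mean_center(expression, batches)
print(merged)
```

Mean-centering removes only additive per-gene shifts between batches; more elaborate methods also adjust variance or model covariates, which is why validation measures for inspecting the merged result are useful.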