We have carried out an extensive assessment of the performance of three gene expression profiling platforms designed for two mammalian genomes and provide information on the analytical methods that are best suited to processing data from specific types of arrays. Even though data from multiple microarray technologies have been extensively compared [7, 21], including in advanced high density array systems such as Illumina and Affymetrix [11], we focused more specifically on the impact of normalisation methods on multiple concordance criteria (raw intensity, expression ratio, statistical significance), which remain only partly addressed in the literature, to assess the extent of cross-platform data consistency and divergence. Overall, using genome-wide gene expression profiling data of rat and mouse genomes, we provide confirmatory evidence of the extremely high agreement between platforms as previously suggested [8], and of the particularly high consistency of results at the gene level between Illumina and Affymetrix [11]. Both platforms agree less well with Operon-generated data, and this technology also correlates less well with independent qRT-PCR data. Both Illumina and Affymetrix had outstanding correlation with qRT-PCR results, indicating that they produce highly reliable fold change results at the gene level, even for a modest number of biological replicates.
Multiple strain comparisons were used and often had large impacts on the agreement of results from the platforms, due mainly to genetic differences. The comparison between inbred rat strains for hepatic and renal gene expression, which was repeated on Affymetrix, Illumina and Operon microarrays, showed that the kidney appeared to produce less reproducible data, perhaps due to the morphological heterogeneity of this organ. However, the correlation between the fold changes generated by qRT-PCR, a highly reliable method of measuring transcription [20, 22], and those found by Affymetrix and Illumina was outstanding, with Pearson correlation coefficients exceeding 0.976.
Perhaps surprisingly, correlations achieved between these microarray systems and the independent qRT-PCR technique were higher than the inter-platform comparisons using sequence or Ensembl Gene matching techniques, and even higher than between distinct normalisations on the same platform. However, those methods attempted to match oligonucleotides on a genome-wide level, while qRT-PCR comparisons use a small number of known genes, specifically chosen for their biological role and/or high differential expression in the microarray experiments.
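The qRT-PCR comparison described above amounts to computing a Pearson correlation between log fold changes for a small set of genes measured by both techniques. A minimal sketch in Python, assuming hypothetical gene names and log2 fold change values (none of these numbers come from the study):

```python
import math


def pearson(x, y):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical log2 fold changes between two strains, for illustration only.
qpcr_lfc = {"gene1": 2.1, "gene2": -1.4, "gene3": 0.3, "gene4": 3.0}
array_lfc = {"gene1": 1.8, "gene2": -1.1, "gene3": 0.2, "gene4": 2.6, "gene5": 0.9}

# Only genes measured by both techniques can be compared.
shared = sorted(set(qpcr_lfc) & set(array_lfc))
r = pearson([qpcr_lfc[g] for g in shared], [array_lfc[g] for g in shared])
print(f"{len(shared)} shared genes, Pearson r = {r:.3f}")
```

Because such a gene panel is small and typically selected for strong differential expression, a high coefficient here is not directly comparable with a genome-wide inter-platform correlation.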
Data from our cross-platform comparisons were improved by methods for probe alignment, which were based on sequence identity and assignment to Ensembl genes. The majority of published cross-platform analyses have used methods based on identifiers (e.g. gene names and accession numbers), which present significant challenges due to the existence of synonyms and evolving or inconsistent annotations. Our inter-platform correlations in log fold changes and agreements in the most affected genes obtained using "target" sequence identity were often surpassed by those obtained by aligning the oligonucleotides to Ensembl gene sequences. This is surprising, as the precise matching of oligonucleotide sequences within a gene provided by "target matching" would be expected to be the most reproducible, because the effects of complicating factors, such as alternative splicing, are reduced. However, removing poorly annotated probes from a probe set and combining information across entire genes create more reproducible results, and demonstrate the power of this novel alignment tool.
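The gene-level matching idea can be illustrated with a short Python sketch. It assumes that probe-level log fold changes are combined by simple averaging within each gene; the source does not specify how information was combined, so this choice, the function name `collapse_to_genes`, the probe names, and the placeholder gene identifiers are all hypothetical.

```python
from collections import defaultdict


def collapse_to_genes(probe_lfc, probe_to_gene):
    """Average probe-level log fold changes per gene, dropping unannotated probes."""
    per_gene = defaultdict(list)
    for probe, lfc in probe_lfc.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            # Probes with no gene assignment are removed (assumed filtering rule).
            continue
        per_gene[gene].append(lfc)
    return {gene: sum(values) / len(values) for gene, values in per_gene.items()}


# Hypothetical probe-level log2 fold changes on two platforms.
platform_a = {"a_1": 1.0, "a_2": 1.4, "a_3": -0.5, "a_4": 2.0}
platform_b = {"b_1": 1.1, "b_2": -0.7, "b_3": 0.2}

# Hypothetical probe-to-gene maps; "a_4" has no gene assignment.
map_a = {"a_1": "GENE_X", "a_2": "GENE_X", "a_3": "GENE_Y"}
map_b = {"b_1": "GENE_X", "b_2": "GENE_Y", "b_3": "GENE_Z"}

genes_a = collapse_to_genes(platform_a, map_a)
genes_b = collapse_to_genes(platform_b, map_b)

# Cross-platform comparison is restricted to genes represented on both arrays.
shared = sorted(set(genes_a) & set(genes_b))
pairs = [(g, genes_a[g], genes_b[g]) for g in shared]
for gene, lfc_a, lfc_b in pairs:
    print(gene, round(lfc_a, 2), round(lfc_b, 2))
```

Matching on a shared gene identifier rather than on names or accession numbers sidesteps the synonym problem, at the cost of merging probes that may target different transcripts of the same gene.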
The Affymetrix signal extractions showed a much larger heterogeneity than the background corrections and normalisations used on the other platforms. This is largely due to the way the methods treat the more complicated design of the Affymetrix platform (distinct probe sequences in a probe set and MM probes).
RMA was the method that performed consistently well in all comparisons. It produced high correlation in log fold change of gene expression between platforms, regardless of filtering, annotation or biological comparison, and high agreement in gene lists for both statistical significance and fold change magnitude. However, the fold change can be underestimated by RMA [23]. This was suggested by the mouse study, in comparison with Illumina, and by the rat kidney comparison with qRT-PCR data, although fold changes of gene expression in the rat datasets were comparable with those from Illumina and Operon arrays. The related method GC-RMA also performed very well in a large number of comparisons [24], showing the highest correlation in terms of intensity, high correlation in gene expression fold changes, high agreement in most gene lists, and gene expression fold changes comparable with qRT-PCR. However, it produced lower correlations with Operon and lower p value agreement for the rat liver dataset.
Affymetrix's own method, MAS 5.0, and the popular Li-Wong method showed very mixed results, which were especially extreme for MAS 5.0. These methods showed poor agreement with other platforms when all genes were used, but occasionally produced excellent results when intensity-based filtering was applied. They often showed extremely high agreement in top fold change and (especially Li-Wong) p value lists. These two methods also produced very high correlations with the qRT-PCR data, and the MAS 5.0 fold changes were almost identical to those from qRT-PCR. These findings suggest that these methods may be useful when searching for large effect sizes in highly expressed known genes, but are poor for whole genome studies or detection of effects of small magnitude. Other Affymetrix normalisations, which used the PM values without correction, often showed intermediate results. This implies that using either the mismatch values or a statistical framework to calculate and remove background effects is likely to improve the accuracy, sensitivity and reproducibility of transcriptomic data generated using the Affymetrix platform.
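The effect of intensity-based filtering on agreement between top fold change lists can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names, the threshold, the log2 intensity scale, and all values are hypothetical, and the data were constructed so that a low-intensity gene with a large apparent fold change disrupts the unfiltered list.

```python
def filter_by_intensity(lfc, mean_intensity, threshold):
    """Keep genes whose mean intensity is at or above the threshold."""
    return {
        gene: value
        for gene, value in lfc.items()
        if mean_intensity.get(gene, float("-inf")) >= threshold
    }


def top_n(lfc, n):
    """Return the set of n genes with the largest absolute log fold change."""
    ranked = sorted(lfc, key=lambda gene: abs(lfc[gene]), reverse=True)
    return set(ranked[:n])


def top_n_agreement(lfc_a, lfc_b, n):
    """Fraction of top-n genes shared by two platforms, over their common genes."""
    common = set(lfc_a) & set(lfc_b)
    if n <= 0 or n > len(common):
        raise ValueError("n must be between 1 and the number of common genes")
    a = {gene: lfc_a[gene] for gene in common}
    b = {gene: lfc_b[gene] for gene in common}
    return len(top_n(a, n) & top_n(b, n)) / n


# Hypothetical log2 fold changes on two platforms.
lfc_a = {"g1": 3.0, "g2": -2.5, "g3": 0.4, "g4": 1.2, "g5": -0.1}
lfc_b = {"g1": 2.7, "g2": -2.2, "g3": 0.5, "g4": 1.5, "g5": 2.4}

# Hypothetical mean log2 intensities on platform B; g5 is weakly expressed.
intensity_b = {"g1": 11.0, "g2": 9.5, "g3": 8.0, "g4": 10.2, "g5": 5.1}

unfiltered = top_n_agreement(lfc_a, lfc_b, 2)
filtered = top_n_agreement(lfc_a, filter_by_intensity(lfc_b, intensity_b, 7.0), 2)
print(f"top-2 agreement: unfiltered {unfiltered:.2f}, filtered {filtered:.2f}")
```

In this constructed example, removing the weakly expressed gene restores agreement between the two top lists, which mirrors the pattern described for MAS 5.0 and Li-Wong; real data need not behave this cleanly.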