Intra-platform technical reproducibility
The intra-platform technical reproducibility can and should be high, but appears to be low in Tan's study [
11], particularly for the Affymetrix platform. Specifically, intensity correlation of technical replicates for the Affymetrix data is low compared to data from others researchers [
13,
16,
23] and our collaborators. A direct consequence of low LIr
2 (log intensity correlation squared) is very low LRr
2 (log ratio correlation squared): an average of 0.11 and 0.54 for before and after data filtering, respectively, corresponding to an average POG (percentage of overlapping genes) of 13% and 51% (based on the gene selection method of fold-change ranking), respectively (Tables and ). That is, when all 2009 genes are considered, only about 13% of the genes are expected to be in common between any two pairs of Affymetrix technical replicates, if 100 genes (50 up and 50 down) are selected from each replicate. In contrast, the percentage of commonly identified genes from two pairs of technical replicates is expected to be around 51% when the analysis is limited to the subset of 537 highly expressed genes. Figure gives typical scatter plots showing the correlation of log intensity (Figures and ) and log ratio (Figures and ) data from the Affymetrix platform that indicate a low intra-platform consistency, especially before data filtering. The low intra-platform consistency is much more apparent for data in the log ratio space (Figures and ). Since a primary purpose of a microarray gene expression study is to detect the difference in expression levels (
i.e., fold-change or ratio), it is important to assess data consistency in the log ratio space (Figures and ) in addition to in the log intensity space (Figures and ).
| Table 1Data consistency for the dataset of 2009 genes (before data filtering). The pair-wise log ratio correlation squared (LRr2, lower triangle) and the percentage of overlapping genes (POG, upper triangle) are listed. T1, T2, and T3 are technical replicates; (more ...) |
| Table 2Data consistency for the dataset of 537 genes (after data filtering). The pair-wise log ratio correlation squared (LRr2, lower triangle) and the percentage of overlapping genes (POG, upper triangle) are listed. T1, T2, and T3 are technical replicates; (more ...) |
Technical reproducibility appears to be reasonable on the Amersham platform: average LRr2 is 0.77 and 0.94 for the three pairs of technical replicates before and after data filtering, corresponding to a POG of 76% and 89%, respectively. For the Agilent platform, technical replicate pairs T1 and T2 appear to be very similar, but markedly different from T3 (Figure ). It is notable that the Cy5 intensities for a subset of spots with lower intensities for one hybridization of the dye-swap pair of T3 are significantly different from those of T1 and T2 (data not shown). The difference between T3 and T1 or T2 is much reduced after data filtering (Figure ), largely owing to the removal of the outlying lower intensity spots in T3. Overall, average LRr2 on the Agilent platform is 0.70 and 0.94 for the three pairs of technical replicates before and after data filtering, corresponding to a POG of 62% and 84%, respectively.
It is evident from Figure that intra-platform consistency of the Affymetrix data from Tan's study is much lower than that of the Amersham and Agilent platforms. A thorough evaluation of experimental procedures would be needed to better understand such poor performance of the Affymetrix platform from Tan's study.
Intra-platform biological reproducibility
The intra-platform biological reproducibility appears to be low (Figures and , and Tables and ) for all three platforms. Biological replicate pairs B2 and B3 appear to be quite similar in the Agilent platform (with LRr2 of 0.85 and 0.95, and POG of 73% and 85%, respectively, for before and after data filtering). B1, however, which is represented by the average of the three pairs of technical replicates (T1, T2, and T3), appears to be quite different from B2 and B3, with an average LRr2 of 0.41 and 0.52, and POG of 37% and 49%, respectively, for before and after data filtering. The difference between B1 and B2 or B3 on the Amersham platform is also noticeable: with average LRr2 of 0.49 and 0.61, and POG of 44% and 54%, respectively, for before and after data filtering; whereas B2 and B3 shows a higher LRr2 of 0.53 and 0.78, and POG of 49% and 71% for before and after data filtering, respectively. Because of the low technical reproducibility of the Affymetrix data, it is not surprising that the biological reproducibility from the Affymetrix platform is low: with average LRr2 of 0.10 and 0.45, and POG of 14% and 45% for before and after data filtering, respectively (Tables and ). One possible cause of the observed low biological reproducibility could be large experimental variations during the processes of cell culture and/or RNA sample preparation.
Impact of data (noise) filtering
All 2009 genes, regardless of their signal reliability, are used in Tan's original analysis [
11]. After adopting Barczak
et al.'s data filtering procedure [
9] by excluding 50% of the genes with the lowest average intensity on each platform, a subset of 537 genes having more reliable intensity measurement is obtained. As expected, a significant increase in both technical and biological reproducibility is observed (Figures and ; notice the different scales shown in the distance metric). The impact of data filtering on data reproducibility is more apparent from Figures and when log ratios from technical replicate pairs T1 and T2 on the Affymetrix platform are compared. This simple data filtering procedure appears justifiable for cross-platform comparability studies, assuming that genes tiled on a microarray represent a random sampling of all the genes coded by a genome, and that only a (small) portion of the genes coded by the genome are expected to be expressed in a single cell type under any given biological condition; such is the case for the PANC-1 cells investigated in Tan's study [
11].
Another subset consisting of 1472 genes that showed intensity above the median on at least one platform was subjected to the same analyses discussed for the datasets of 2009 and 537 genes. Gene identification was also conducted individually on each platform using the 50% of genes above the median average intensity, and the concordance was then compared using the three significant gene lists. In both cases, the identified cross-platform concordance was somewhere between that of the 2009-gene and 537-gene datasets (data not shown).
Cross-platform comparability
For each platform, the LR values of the three pairs of biological replicates (B1, B2, and B3) were averaged gene-wise and rank-ordered, and a list of 100 genes (50 up- and down-regulated) was identified. Without data filtering, 20 genes were identified to be in common by SAM (Figure ). With data filtering, 51 to 58 genes were found in common between any two platforms (Table ), and 39 genes were in common to the three platforms, which identified a total of 172 unique genes (Figure ). While the overlap of 39 out of 172 is still low, the cross-platform concordance is some 10-fold higher than suggested by Tan's analysis (Figure ). The higher concordance reported here is a direct consequence of the data analysis procedure that incorporates filtering out genes of less reliability, selecting genes based on fold-change ranking rather than by a p-value cutoff, and selecting gene lists of equal length for each platform and for each regulation direction.
Impact of gene selection methods on cross-platform comparability
As increasingly advanced statistical methods have been proposed for identifying differentially expressed genes, the validity and reliability of the more simple and "conventional" gene selection method by fold-change cutoff have been frequently questioned [
24,
25]. To compare the aforementioned results based on fold-change ranking with more
statistically "valid" methods, we also applied SAM [
17] and
p-value ranking to the filtered subset of 537 genes to select 100 genes (50 up and 50 down-regulated) from the three pairs of biological replicates on each platform. For SAM, the POG between any two platforms ranged from 48% (Amersham-Agilent) to 58% (Affymetrix-Agilent), and 34 genes were found in common to the three platforms (Table ). Of the 34 genes, 31 (91%) also appeared in the list of 39 genes selected solely based on fold-change ranking. Furthermore, 100 genes were also selected from each platform solely based on
p-value ranking of the t-tests on the three pairs of biological replicate pairs, and 19 of them were found in common to the three platforms. Among the 19 genes, 11 (58%) appeared in the list of 39 genes selected by fold-change ranking.
| Table 3Percentage of overlapping genes (POG) determined by three gene selection methods. For each gene selection method, different percentages of genes (P) are selected from each platform. |
However, when the three gene selection methods (i.e., p-value ranking, fold-change ranking, and SAM) were applied to the dataset of 2009 genes to select 100 genes from each platform (50 up and 50 down), much lower cross-platform concordance was obtained (Table ): only 6, 14, and 20 genes were found in common to the three platforms by using p-value ranking, fold-change ranking, and SAM, respectively. The results indicate the importance of data (noise) filtering in microarray data analysis and the larger impact of the choice of gene selection methods on cross-platform concordance when the noise level is higher.
It is important to note that in both cases (2009-gene dataset and 537-gene dataset), p-value ranking yielded the lowest cross-platform concordance (Table ). One explanation is that the p-value ranking method selected many genes with outstanding "statistical" significance but a very small fold change. Such a small fold change from one platform may be by chance or due to platform-dependent systematic noise structures (e.g., hybridization patterns). Thus, such a small fold change is unlikely to be reliably detectable on other platforms, leading to low cross-platform concordance. For example, the gene (ID#1623) ranked as the most significant in up-regulation from the Affymetrix platform, exhibited a very "reproducible" log ratio measurement for the three biological replicate pairs (0.1620, 0.1624, and 0.1580, with a mean of 0.1608 and standard deviation of 0.002465). The p-value of the two-tailed Student t-test was 0.000078, representing the most statistically significant gene from the Affymetrix platform. However, the average log ratio of 0.1608 corresponds to a fold change of merely 1.12 (i.e., 12% increase in mRNA level). Such a small fold change is generally regarded as questionable by microarray technology currently available. On the Amersham platform, log ratios for the three replicates were -0.3648, 0.01624, and 0.04559, with a mean of -0.1010 (a fold change of -0.93, i.e., down-regulation by 7%), standard deviation of 0.2289, and p = 0.52. On the Agilent platform, log ratios for the three replicates were -0.1865, 0.2698, and 0.05786, with a mean of 0.04705 (a fold change of 1.03, i.e., up-regulation by 3%), standard deviation of 0.2283, and p = 0.75. In terms of p-value, this gene (ID#1623) was ranked as #1621 and #1785 out of 2009 genes on the Amersham and Agilent platforms, respectively; neither of these two platforms selected this gene as significant. When fold-change and SAM were applied for ranking genes based on the same Affymetrix data, the ranking of this gene was very low (ranked around #900 out of 2009 genes). Obviously, this gene was not selected by fold-change ranking owing to its small fold change (1.12).
Although fold-change ranking showed reasonable performance in terms of cross-platform concordance when applied to the subset of 537 genes, it is susceptible to selecting genes with a large fold change and large variability when the dataset is of low reproducibility, as is the case for the dataset with all 2009 genes. For example, one gene (ID#1245) was ranked as the 11th largest fold change in up-regulation on the Affymetrix platform, but was only ranked in the top 500 and 120 by p-value ranking and SAM, respectively. The reason is that although this gene exhibited an average log ratio of 2.3432 (5.07-fold up-regulation), there was a large variability in the three biological replicate pairs (2.8986, 0.07195, and 4.0589), with a standard deviation of 2.058 and p = 0.19. The detected log ratios on the Amersham and Agilent platforms were 0.2955 (a fold change of 1.2273, p = 0.25) and 0.7566 (a fold change of 1.6895, p = 0.17), respectively, leading to a low ranking by both platforms either with fold-change ranking or p-value ranking.
SAM ranks genes based on a modified statistic similar to t-test:
delta =
u/(
s+
s0), where
u stands for mean log ratio,
s is defined as sqrt(
sd2/
n), and
n is the number of replicates. By incorporating a fudge factor
s0 in the denominator, in the calculation of
delta, hence the ranking of genes, SAM effectively ranks genes relatively low in situations where either both
u and
sd are small, or when
u and
sd are both large [
17]. Genes falling into these two situations will be ranked high by
p-value ranking and fold-change ranking, respectively. Intuitively, SAM finds a tradeoff between fold-change and
p-value, and should be regarded as a preferred gene selection method over pure
p-value ranking or pure fold-change ranking.
It should be noted that many combinations of thorough statistical analyses and fold-change cutoff were conducted in Tan
et al.'s original study [
11]. However, the results that were emphasized and shown in the Venn diagram [
5,
11] (Figure ) are obtained from gene selection solely based on a statistical significance cutoff regardless of fold-change or signal reliability. Furthermore, because of the use of the same statistical significance cutoff, Tan's analysis resulted in an unequal number of selected genes from the three platforms and the two regulation directions. Therefore, the calculation of concordance becomes ambiguous and can underestimate cross-platform concordance.
Results with different numbers of genes selected as significant
In addition to selecting 100 genes (50 up and 50 down) from each platform (Table ), different numbers of genes were selected by applying the three gene selection methods to both the 2009-gene and 537-gene datasets. The results are shown in Figure and agree with the general conclusions discussed above when 100 genes were selected, i.e., data filtering increased cross-platform concordance and p-value ranking resulted in the lowest cross-platform concordance. Within the same dataset, the difference in POG by different gene selection methods diminishes as the percentage of selected genes increases. However, the POG difference due to gene selection methods is much more significant when the percentage of selected genes is small. The POG by p-value ranking is consistently lower than that by fold-change ranking or SAM. The extremely low POG when only a small percentage of genes are selected as significant indicates the danger of using p-value alone as the gene selection method.
Considering the large technical and biological variations identified in Tan's study, we conclude that the level of cross-platform concordance with the subset of 537 genes and by fold-change ranking or SAM is reasonable. Importantly, we observed no statistical difference between cross-platform LRr2 and intra-platform biological LRr2 after data filtering when all three platforms were considered (Table ). However, it should be pointed out that the cross-platform LRr2 was based on the correlation of the averaged log ratios over the three pairs of biological replicates from each platform as represented as Aff (Affymetrix), Ame (Amersham), and Agi (Agilent) in the right-bottom of Table .
Relationship between LRr2 and POG
From hundreds of pair-wise LRr2 versus POG comparisons made on Tan's dataset (Tables and ), a strong positive correlation (r2 = 0.963) between LRr2 and POG (Figure ) was observed. Therefore, it is essential to reach high log ratio correlation in order to achieve high concordance in cross-platform or intra-platform replicates comparisons.
POG by chance
It should be noted that, in addition to cross-platform LRr2, POG also depends on the percentage P (between 0 and 1) of the total number of candidate genes selected as "significant". As an illustration, Figure shows simulated POG results from random data of normal distribution of N(0,1), where there is no correlation between replicates or platforms (i.e., LRr2 = 0). For the comparison of two replicates or platforms, a POG of 100*(P/2) is expected by chance and the other 100*(P/2) is expected to be dis-concordant in the directionality of regulation. For example, if all genes (P = 100%) are "selected" as significant (50% up and the other 50% down) for both replicates or platforms, by chance one would expect 50% of the total number of selected genes to be concordant in regulation direction (the other 50% of selected genes will be in opposite directions). For the comparison of three replicates or platforms, the percentage of genes expected to be concordant by chance is 100*(P/2)2; therefore, 25% of genes are expected to be concordant if all genes are "selected". For the comparison of k platforms (or replicates), the POG expected by chance would be 100*(P/2)k-1. The POG by chance is independent on the choice of gene selection methods.