The oral cancer gene expression data set was obtained from OSCC tissue samples as reported by Mendez et al. [6]. The experimental design included tumor samples from patients diagnosed with squamous cell carcinoma and normal tissue samples obtained from healthy patients who were scheduled for an oral surgical procedure not related to cancer. The collected samples were prepared for RNA isolation, linearly amplified, labeled, and hybridized to Affymetrix HuGeneFL microarrays, which contained 7,070 genes.
For the analysis, we had expression profiles from a total of 36 patients (28 cancer and 8 normal samples). The data were pre-processed through multiple filtering steps:
- Expression values below 50 were set to 50; values below this threshold can be considered unreliable noise. Since this was an older-generation Affymetrix array, many negative values were present, and these were replaced in the same way.
- About a thousand genes that had uniformly negative or very low values for most samples in both conditions were removed.
- Genes with low overall variance across all samples were eliminated, since they are of limited interest.
Without filtering, many top-scoring pairs had high correlations that were due only to the negative outlier values. Eliminating these cases through proper filtering reduced the amount of noise in interpreting the data.
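The filtering steps above can be sketched as follows. This is a minimal illustration, not the original MATLAB implementation: the cutoffs `max_floor_frac` and `min_variance`, the gene names, and the toy values are assumptions chosen for the example, and the sketch pools all samples rather than checking the two conditions separately.

```python
from statistics import pvariance

FLOOR = 50.0  # threshold stated in the text


def floor_values(row, floor=FLOOR):
    """Set every value below `floor` (including negatives) to `floor`."""
    return [max(v, floor) for v in row]


def filter_genes(expr, max_floor_frac=0.8, min_variance=100.0):
    """Apply the three filtering steps to {gene: [value per sample]}.

    max_floor_frac and min_variance are hypothetical cutoffs.
    """
    kept = {}
    for gene, row in expr.items():
        # Step 1: replace low and negative values by the floor.
        floored = floor_values(row)
        # Step 2: drop genes that sit at the floor in most samples.
        floor_frac = sum(v == FLOOR for v in floored) / len(floored)
        if floor_frac > max_floor_frac:
            continue
        # Step 3: drop genes with low overall variance across all samples.
        if pvariance(floored) < min_variance:
            continue
        kept[gene] = floored
    return kept


# Toy data: three hypothetical genes measured in five samples.
expr = {
    "GENE_A": [-20.0, 10.0, 35.0, 49.0, -5.0],      # noise only
    "GENE_B": [200.0, 203.0, 198.0, 201.0, 199.0],  # nearly constant
    "GENE_C": [-30.0, 400.0, 1200.0, 80.0, 950.0],  # informative
}
filtered = filter_genes(expr)
print(sorted(filtered))
```

In this toy example only `GENE_C` survives: `GENE_A` is at the floor in every sample, and `GENE_B` varies too little to be informative.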
For each group of samples, the pairwise Pearson correlations were calculated and represented in a correlation matrix (Figure 1). Fisher’s Z-transformation was then applied to the correlation coefficient r of each pair of genes. The transformation is given by the equation,
$$ z = \frac{1}{2} \ln\left(\frac{1 + r}{1 - r}\right) $$
This transformation changes the range of the variable and makes it possible to derive (or apply) a statistical test. For any given set of observations, the range of the correlation is −1 ≤ r ≤ 1, but after transformation the range of the new variable is −∞ < z < ∞. More importantly, Fisher’s transformation improves the distributional properties of the variable and allows a statistical test to be devised under normality assumptions.
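As an illustration of this step, the sketch below computes pairwise Pearson correlations for one group of samples and applies Fisher's transformation to each coefficient. The function names, gene labels, and toy values are assumptions made for the example. The transformation is the inverse hyperbolic tangent, so it is undefined for correlations of exactly ±1.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_z(r):
    """Fisher's Z-transformation, 0.5 * ln((1 + r) / (1 - r))."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


def correlation_matrix(expr):
    """Upper triangle of the pairwise correlation matrix.

    expr maps each gene to its expression values across samples.
    """
    genes = sorted(expr)
    return {(g, h): pearson(expr[g], expr[h])
            for g in genes for h in genes if g < h}


# Toy data for one group: three hypothetical genes, four samples.
group = {
    "G1": [1.0, 2.0, 3.0, 4.0],
    "G2": [2.0, 4.0, 6.0, 9.0],
    "G3": [4.0, 3.0, 1.0, 2.0],
}
corr = correlation_matrix(group)
z_values = {pair: fisher_z(r) for pair, r in corr.items()}
for pair in corr:
    print(pair, round(corr[pair], 3), round(z_values[pair], 3))
```

Correlations close to ±1 are stretched the most by the transformation, which is what removes the bounded range of r.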
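Using the transformed values, a statistical test was then applied to determine whether the change in correlation for a gene pair between the two groups is statistically significant. Larger absolute Z-scores indicate more significant changes in the gene-pair relationship. With z_1 and z_2 denoting the transformed correlations of a gene pair in the two groups, the Z-score is given by the equation,
$$ Z = \frac{z_1 - z_2}{\sqrt{\frac{1}{N_1 - 3} + \frac{1}{N_2 - 3}}} $$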
where N1 and N2 are the sample sizes for the two groups. The Pearson correlation coefficient assumes that the underlying relationship between the two variables is linear; if the relationship is not linear, the coefficient does not provide a valid measure of their association. More importantly, it can be severely distorted by outliers, and thus may not accurately describe the underlying behavior of the genes. Outliers can significantly change the results, especially for Affymetrix data, where extremely high values may be present. To reduce the effects of outliers, one could remove them and recalculate the correlations; alternatively, one could use a non-parametric correlation measure. For the dataset we studied, we applied both parametric and non-parametric correlations, and we observed that the results generally improved when the non-parametric Spearman rank correlation was applied to the data.
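The test and the effect of outliers can be illustrated with a short sketch. It assumes the standard form of the test for comparing two correlations, in which each Fisher-transformed value has variance 1/(N − 3); the function names and the toy values are hypothetical, and the two-sided p-value from the normal distribution is an addition for illustration. Applying the same test to Spearman coefficients is an approximation.

```python
import math


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def diff_corr_z(r1, n1, r2, n2):
    """Z-score for the difference between two correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)  # Fisher's transformation
    return (z1 - z2) / math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))


def two_sided_p(z):
    """Two-sided p-value under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# A decreasing relationship with one extreme value in the last sample.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1000.0]
print("Pearson :", round(pearson(x, y), 3))   # pulled positive by the outlier
print("Spearman:", round(spearman(x, y), 3))  # stays negative

# Group sizes from the text (28 cancer, 8 normal); correlations are made up.
z = diff_corr_z(0.8, 28, -0.3, 8)
print("Z =", round(z, 3), "p =", round(two_sided_p(z), 4))
```

In the toy data a single extreme value is enough to flip the sign of the Pearson coefficient, while the rank-based coefficient still reflects the decreasing trend in the remaining samples.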
The significant gene pairs we obtained were entered into a literature cluster analysis program called PubGene [7]. We used the PubGene MeSH and literature network tools. All other analyses were performed using MATLAB (MathWorks, Natick, MA).