GOAL enables a statistical approach to GO analysis. It allows evaluation of P-values and FDR, both important parameters for establishing the practical usefulness of the selected genes and functions. There are two alternative data-entry procedures: one computes statistics on an expression table, and the other uses the output from an external statistical package, such as Bioconductor (11), SAM (12) or GeneSpring (http://www.genespring.com).
In the first procedure, for example, a two-class comparison can be performed by applying a t-test coupled to permutation analysis on an expression table. As a special case, a one-class analysis can be performed when the common reference used in a two-dye experiment is itself one of the two classes, e.g. disease sample versus control. To produce efficient experiment-wise estimates of significance levels in the subsequent GO term analysis, a set of 10 score columns is generated by balanced permutations and saved for further analysis.
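The first procedure can be illustrated with a minimal sketch. It is not GOAL's own implementation: the Welch form of the t-statistic, the function names and the toy expression values are all assumptions made for illustration, and "balanced" is taken to mean that each permuted class draws about half of its samples from each original class.

```python
import math
import random
import statistics


def t_score(a, b):
    """Welch t-statistic for two lists of expression values (assumed variant)."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    denom = math.sqrt(va + vb)
    return (statistics.mean(a) - statistics.mean(b)) / denom if denom > 0 else 0.0


def balanced_labels(n_a, n_b, rng):
    """Relabel samples so that each pseudo-class takes about half of its
    members from each original class (assumed meaning of 'balanced')."""
    idx_a = list(range(n_a))
    idx_b = list(range(n_a, n_a + n_b))
    rng.shuffle(idx_a)
    rng.shuffle(idx_b)
    k = min(n_a, n_b) // 2  # number of samples exchanged between classes
    return idx_a[k:] + idx_b[:k], idx_b[k:] + idx_a[:k]


def score_table(expr, n_a, n_perm_columns=10, seed=0):
    """expr maps gene ID -> expression values, class A samples first.
    Returns gene ID -> [observed score, permuted score 1, ..., permuted score n]."""
    rng = random.Random(seed)
    n_total = len(next(iter(expr.values())))
    relabelings = [balanced_labels(n_a, n_total - n_a, rng)
                   for _ in range(n_perm_columns)]
    table = {}
    for gene, values in expr.items():
        row = [t_score(values[:n_a], values[n_a:])]
        for new_a, new_b in relabelings:
            row.append(t_score([values[i] for i in new_a],
                               [values[i] for i in new_b]))
        table[gene] = row
    return table


# Hypothetical toy data: four class A samples followed by four class B samples.
expr = {
    "geneA": [5.1, 5.3, 4.9, 5.2, 7.0, 7.2, 6.9, 7.1],
    "geneB": [3.0, 3.2, 2.9, 3.1, 3.1, 2.9, 3.0, 3.2],
}
table = score_table(expr, n_a=4)
```

Each row holds the observed score followed by 10 permuted scores, which mirrors the set of score columns saved for the later GO term analysis.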
In the alternative procedure, any external statistical application can be used; the GOAL input is then a table containing the score associated with each gene on the chip, e.g. a Fourier transform. The submitted table contains at least two columns: the gene ID and the statistical test score. Extra columns holding the results of balanced permutations can be added to the table to allow GOAL to evaluate FDR efficiently.
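As an illustration of the table layout, the sketch below parses such an input. The tab delimiter, the absence of a header row, the function name and the gene IDs are assumptions; the text only specifies a gene ID column, a score column and optional extra score columns.

```python
import csv
import io


def read_score_table(handle):
    """Parse a tab-delimited table: gene ID, score, then optional
    permutation-score columns. Returns gene ID -> list of floats."""
    scores = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if len(row) < 2:
            raise ValueError(f"expected at least 2 columns, got {len(row)}: {row}")
        scores[row[0]] = [float(x) for x in row[1:]]
    return scores


# Hypothetical example: one observed score and two permutation-derived scores per gene.
example = "Hs.1001\t2.31\t0.12\t-0.40\nHs.1002\t-1.75\t0.33\t0.05\n"
scores = read_score_table(io.StringIO(example))
```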
After data upload, a web page is generated listing all GO terms differentially regulated in the experiment, according to the scoring procedure applied. To provide a P-value for each GO term and the FDR, a permutation analysis (13) is performed on the submitted dataset by permuting the rows within the score column or, if present, within each extra score column calculated from balanced permutations. To assess the number of permutations needed for robust generation of P-values and determination of FDR, a number of datasets were used and a range of permutations was performed within each dataset. By varying the number of cycles, and depending on the dimensions of the expression tables, we generated permutations with up to 2 × 10^6 total data points. It appears that ~5 × 10^5 total scores are sufficient for optimal FDR evaluation, since increasing the number of cycles, and hence the amount of data generated, does not seem to affect the shape of the test distribution (Figure ).
Figure 3. Evaluation of Gene Ontology terms significance. (A) t-score distribution and the relative test distribution generated by permutation analysis for K = 4. In order to compute the P-values, 100 permutations were performed by using each of 10 t-scores ...
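The permutation step for GO terms can be sketched as follows. Several choices here are assumptions rather than GOAL's documented method: the term statistic is taken to be the mean gene score, the P-value is two-sided, and the FDR is approximated with the Benjamini-Hochberg adjustment, which is swapped in for whatever permutation-based estimate GOAL uses. Gene and term names are hypothetical.

```python
import random
import statistics


def term_scores(gene_scores, annotations):
    """Mean gene score per GO term (assumed term statistic)."""
    return {term: statistics.mean(gene_scores[g] for g in genes)
            for term, genes in annotations.items()}


def permutation_p_values(gene_scores, annotations, n_perm=1000, seed=0):
    """Empirical two-sided P-value per term from shuffled gene scores."""
    rng = random.Random(seed)
    genes = list(gene_scores)
    values = [gene_scores[g] for g in genes]
    observed = term_scores(gene_scores, annotations)
    hits = {term: 0 for term in annotations}
    for _ in range(n_perm):
        rng.shuffle(values)  # break the gene-to-score association
        null = term_scores(dict(zip(genes, values)), annotations)
        for term in annotations:
            if abs(null[term]) >= abs(observed[term]):
                hits[term] += 1
    # The add-one correction keeps the empirical P-value above zero.
    return {term: (hits[term] + 1) / (n_perm + 1) for term in annotations}


def benjamini_hochberg(p_values):
    """Step-up FDR adjustment; returns term -> adjusted P-value."""
    ranked = sorted(p_values.items(), key=lambda kv: kv[1], reverse=True)
    m = len(ranked)
    adjusted = {}
    running_min = 1.0
    for offset, (term, p) in enumerate(ranked):
        rank = m - offset  # rank of this P-value in ascending order
        running_min = min(running_min, p * m / rank)
        adjusted[term] = running_min
    return adjusted


# Hypothetical data: 40 genes, four of which are strongly upregulated.
gene_scores = {f"g{i}": 0.1 * ((i % 5) - 2) for i in range(40)}
gene_scores.update({"g0": 3.0, "g1": 3.2, "g2": 2.8, "g3": 3.1})
annotations = {
    "GO:hypothetical_up": ["g0", "g1", "g2", "g3"],
    "GO:hypothetical_flat": ["g10", "g11", "g12", "g13", "g14"],
}
p = permutation_p_values(gene_scores, annotations)
fdr = benjamini_hochberg(p)
```

In this toy setting the term made of the four high-scoring genes receives a small P-value, while the term made of unremarkable genes does not.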
An important side effect of using GOAL is the automated conversion of ESTs/oligonucleotides to Unigene clusters. The vast majority of packages for expression profile analysis use a single probe/target approach, i.e. they select the differentially expressed cDNA clones or oligonucleotides. GOAL, by contrast, uses the latest Unigene build to compute the mean score of all the ESTs/oligonucleotides belonging to each Unigene cluster. This procedure, which is necessary to associate GO terms with genes, also reduces the complexity of the dataset.
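The collapsing of probe-level scores into cluster-level scores amounts to a grouped mean. The sketch below assumes a simple probe-to-cluster lookup; the probe names, cluster IDs and the choice to drop unmapped probes are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def collapse_to_clusters(probe_scores, probe_to_cluster):
    """Average probe-level scores within each Unigene cluster.
    Probes without a cluster assignment are dropped (an assumption)."""
    grouped = defaultdict(list)
    for probe, score in probe_scores.items():
        cluster = probe_to_cluster.get(probe)
        if cluster is not None:
            grouped[cluster].append(score)
    return {cluster: mean(values) for cluster, values in grouped.items()}


# Hypothetical probes: two ESTs map to one cluster, one oligo to another,
# and one EST has no cluster assignment.
probe_scores = {"EST_a": 2.0, "EST_b": 3.0, "oligo_c": -1.0, "EST_d": 0.5}
probe_to_cluster = {"EST_a": "Hs.0001", "EST_b": "Hs.0001", "oligo_c": "Hs.0002"}
cluster_scores = collapse_to_clusters(probe_scores, probe_to_cluster)
```

Four probe-level scores become two cluster-level scores, which shows how the step reduces the size of the dataset.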
Examples of GOAL application to selected published datasets, namely healthy blood variation (14), diffuse large B-cell lymphoma transformation (15), renal cancer (16), soft tumors (17), lung adenocarcinoma (18) and breast cancer (19), are outlined in the Web supplement (http://microarrays.unife.it/GOAL/). SAGE (20) datasets were used by entering into the expression table the TPM values as retrieved from GEO.
Transcriptome-wide and restricted Gene Ontology analysis
Besides the two data-entry procedures, alternative routes to GO analysis can be followed by the user. In one instance, only those genes which are differentially expressed within the experiment can be used to infer GO results. This method, which we call ‘restricted’ because it takes into consideration only the subset of genes which are regulated, allows faster analysis and the evaluation of those functions solely related to the pool of regulated genes. This algorithm is similar to that used by most GO applications but, by being restricted to the subset of differentially expressed genes, might miss a fraction of the cell-wide regulated functions and processes. For example, an upregulated process might result from the coordinated upregulation of a number of genes, even though all of them have scores slightly below the significance threshold.
In the second route, all the genes in an expression profile are considered for GO analysis. In this way, information is gathered from all the mRNAs measured in the experiment, not only from the differentially regulated ones. This method, which we call ‘transcriptome-wide’, is slower and might yield somewhat different results from the ‘restricted’ approach. For example, invariant genes might affect whether a cellular process is called differentially regulated, even when other genes related to the same process are differentially transcribed. Or, as explained above, all the genes representing a molecular function might be upregulated, but just below the significance threshold used for gene selection; a transcriptome-wide analysis will nevertheless select the corresponding GO term as significantly upregulated.
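The difference between the two routes can be made concrete with a toy example in which every gene of a term scores just below the per-gene cut-off. The threshold of 2.0, the simulated scores, the gene names and the use of the mean score as the term statistic are all assumptions for illustration, not GOAL's actual parameters.

```python
import random
from statistics import mean


def restricted_hits(gene_scores, term_genes, threshold):
    """Genes of the term that pass the per-gene significance cut-off."""
    return [g for g in term_genes if abs(gene_scores[g]) >= threshold]


def transcriptome_wide_p(gene_scores, term_genes, n_perm=2000, seed=1):
    """Empirical P-value of the term's mean score against random gene sets
    of the same size drawn from every gene on the chip."""
    rng = random.Random(seed)
    all_scores = list(gene_scores.values())
    observed = abs(mean(gene_scores[g] for g in term_genes))
    hits = sum(
        abs(mean(rng.sample(all_scores, len(term_genes)))) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)


# Hypothetical chip: 200 background genes plus six genes of one term,
# all scoring 1.8, i.e. just below the assumed cut-off of 2.0.
data_rng = random.Random(0)
gene_scores = {f"bg{i}": data_rng.gauss(0.0, 1.0) for i in range(200)}
term_genes = [f"t{i}" for i in range(6)]
gene_scores.update({g: 1.8 for g in term_genes})

hits = restricted_hits(gene_scores, term_genes, threshold=2.0)
p_value = transcriptome_wide_p(gene_scores, term_genes)
```

Here the restricted route finds no gene of the term among the selected genes, so the term cannot be reported, whereas the transcriptome-wide route assigns it a small P-value because six coordinated scores of 1.8 are unlikely to arise from a random gene set.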