With GeneTrail we present a web-based application facilitating the statistical evaluation of high-throughput genomic or proteomic data sets. The statistical analysis takes into account the identification of a variety of functional categories that are ‘enriched’. The analysis is based on two different statistical approaches, namely ORA and GSEA.
The selection of the analysis method depends on the performed experiment. GSEA is only applicable for sorted gene sets, whereas ORA can be applied to the detection of over- or under-representations of categories in any data set compared to a reference set. In contrast to comparable methods, GSEA represents a threshold free approach. Frequently in ORA of microrarray data, the user determines which genes are considered as upregulated by choosing an expression threshold X. In contrast, GSEA does not rely on a chosen parameter but considers the entire sorted list of transcripts.
The relation of the data and the reference set is crucial. If the data set is a subset of the reference set, the hypergeometric distribution is used to compute P-values. For disjoint data and reference sets, Fisher's exact test is applied instead. In all other cases, GeneTrail offers the possibility to download an appropriate reference set disjoint to the data set.
One important feature of GeneTrail is a novel algorithm for computing the exact P
-value in GSEA. So far, the statistical significance of GSEA is approximated by so-called permutation tests that usually consider only a small amount of all possible permutations. For example, regarding a microarray containing 20
000 genes and a category containing about 2000 genes, the number of possible arrangements is given as
, approximately 4·102821
. Even if thousands of permutations are computed, this will probably not yield a good estimation of the P
-value. However, with our approach, we are able to compute the correct P
The offered statistical approaches treat the genes to work individually. This of course does not reflect the reality, where genes act together. The significance of findings, especially their sensitivity, may be improved by integrating additional information concerning the genes and their interactions, especially the topology of biological pathways. However, we have to carefully select the a priori biological information to be included. Since this information could be related to biological coherences we want to detect, we would introduce a bias into the data set.
A common problem of biological data management is the usage of appropriate identifier for genes or proteins. External identifiers need to be mapped to the identifiers used internally. In our case, the NCBI Gene IDs are the internal identifiers. If a provided data set does not contain NCBI Gene IDs, GeneTrail needs to convert these IDs. Therefore, we recommend the usage of NCBI Gene IDs to avoid possible mismatches.
The interactive graph visualization offers the possibility to grasp the interactions between genes of a data set and computed significant categories. However, large gene sets lead to complex graphs that are hard to visualize and to interpret. Currently, we are developing improved methods for online visualization of large graphs that we plan to integrate into future versions of GeneTrail.
GeneTrail offers the possibility to extract information from complex proteome data, microarray data or data generated by other high-throughput methods with minimal effort. Two examples of GeneTrail analyses can be found in the Supplementary Data. In conclusion, GeneTrail complements the conventional evaluation of experimental data and offers new starting points for further experimental investigation.