SiPaGene provides rapid GeneChip analysis based on MAS5.0/GCOS statistics, with a standardized workflow that generates various statistical parameters for optimized yet flexible selection conditions. An important feature is access management: administrative tools define, for each comparison, absolute privacy, different levels of user-specific sharing, or full public access. Thus, the SiPaGene database combines the functions of a repository for gene expression data with tools for flexible and optimized primary analysis as well as meta-analysis with single and multiple group comparisons. Increasing numbers of public and private data sets, along with the sharing options, enable validation, improve interpretation and encourage controlled exchange of array data.
Other algorithms such as RMA have replaced MAS5.0/GCOS pairwise comparison statistics. Previous reports demonstrated that signal normalization by these newer algorithms improves the results of comparative analysis [29, 32]. However, these tools have been compared with MAS5.0/GCOS on the basis of probe set signal calculation. Here, using the data of the Latin Square spiking experiment as an example, we could show that a consistent application of the MAS5.0/GCOS pairwise comparison statistics provides more robust results than analyses with dChip or with RMA and SAM.
Many GeneChip array experiments in GEO or ArrayExpress were performed with a limited number of hybridizations (usually fewer than 5 arrays per group). Two principal factors influence statistical power in these experiments: biological and technical variability. While the assessment of biological variability cannot be improved except by increasing the number of array hybridizations, technical variability can be assessed using the GeneChip information from the individual oligonucleotides of each probe set, not only for signal calculation but also for statistical testing of differential expression between two arrays. This extended information is provided by the change call statistics in MAS5.0/GCOS for pairwise comparisons and is summarized in the derivative parameters calculated in SiPaGene for each group comparison. The number of pairwise comparisons grows quadratically with the number of arrays per group, which is certainly a disadvantage and a limit of this approach. For example, comparing two groups of 10 arrays each requires calculation of 190 pairwise comparisons, and two groups of 50 arrays each already requires 4950 pairwise comparisons. However, this is a rare problem, and calculating and importing up to 5000 pairwise comparisons for such a particular experiment is not out of reach. Furthermore, these larger groups are often clustered in subgroups, which can conveniently be investigated further by calculating new group comparisons of these subgroups, because all relevant pairwise comparisons are already imported in SiPaGene for the full group comparison.
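The figures quoted above are consistent with counting every unordered pair among the pooled arrays of both groups, that is n(n - 1)/2 for n arrays in total. A minimal sketch of this arithmetic follows; the function name is hypothetical and is not part of SiPaGene.

```python
from math import comb


def pairwise_comparisons(n_group_a: int, n_group_b: int) -> int:
    """Count all unordered array pairs among the pooled arrays of two groups.

    Assumption: every array is compared once with every other array,
    within and between groups, which reproduces the numbers in the text.
    """
    return comb(n_group_a + n_group_b, 2)


for size in (3, 10, 50):
    print(size, pairwise_comparisons(size, size))
```

Because the count covers all pairs, any subgroup comparison draws on pairs that have already been calculated for the full group comparison.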
Concerning query performance, the first selection step on the table storing all probe set information of a group comparison uses the index on the name of the group comparison. All other conditions are not indexed and are subsequently evaluated over all probe sets of that comparison. Thus, the performance of the database will be challenged when the number of group comparisons is substantially higher than the number of probe sets per array.
The main objective of SiPaGene is immediate access to the results of MAS5.0 and GCOS statistics. Raw data and experimental metadata are maintained in other excellent platforms and are therefore linked for all public data, mainly from GEO. This concept was favored to harmonize information and to avoid duplicating work on information that already exists and is constantly curated.
Considering that for the majority of experiments only a small subset of transcripts is changing, the global normalization method implemented in the MAS5.0/GCOS software was applied to scale all arrays to a constant overall intensity. This makes it possible to expand the number of arrays continuously without renormalization. Currently favored algorithms like RMA require renormalization of the whole set of arrays with each additional array to allow comparison between all arrays. Therefore, it has been suggested to normalize each array to a set of reference arrays [33]. This seems to overcome the limitation that existed when we started to set up SiPaGene and thus may allow the integration of other algorithms like RMA in the future.
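The practical consequence of scaling each array separately can be illustrated with a minimal sketch. It is not the MAS5.0/GCOS implementation: the function name, the target intensity of 500 and the 2% trimming fraction are assumptions chosen for illustration only.

```python
def scale_to_target(signals, target=500.0, trim=0.02):
    """Scale one array so that its trimmed mean signal equals `target`.

    Assumptions (not taken from the text): the target value and the
    fraction trimmed from each end of the sorted signals.
    """
    if not signals:
        raise ValueError("signals must not be empty")
    ordered = sorted(signals)
    k = int(len(ordered) * trim)          # values dropped at each end
    kept = ordered[k:len(ordered) - k] if k else ordered
    factor = target / (sum(kept) / len(kept))
    return [s * factor for s in signals]


# Each array is scaled on its own, so arrays added later
# never change the values of arrays that are already stored.
array_1 = scale_to_target([100.0, 200.0, 300.0], target=400.0, trim=0.0)
array_2 = scale_to_target([50.0, 150.0], target=400.0, trim=0.0)
print(array_1, array_2)
```

Since the scaling factor depends only on the array itself, no stored result has to be recomputed when the collection grows, which is the property the paragraph above relies on.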
SiPaGene was set up as a database that combines high-quality retrieval options, not only for specialists in bioinformatics, with storage of the growing number of microarray experiments for meta-analyses. It was developed for the Affymetrix GeneChip platform technology and allows rapid and automated calculation for experiments with many different group or subgroup comparisons. The quality of optimized queries was tested using the Latin Square experiment provided by Affymetrix. This data set has frequently been used to optimize bioinformatic tools for microarray data analysis [32]. We could demonstrate that the MAS5.0/GCOS primary signal and pairwise comparison analysis provide a solid basis to identify the relevant candidate genes. Sensitivity decreased only when spike concentrations were very low (cf. charts 2 – 6 in figure ), but this was also observed with dChip or with RMA and SAM. In particular, all comparisons of experiments two to six, which affected the highest number of interfering probe sets (n = 15 out of 23, spike group 14) with the lowest spike concentrations for group 14 (0 – 1.0 pmol), revealed the lowest recall rates for true positives. Nevertheless, the optimized filter strategy of SiPaGene could outperform standard tools like SAM [6] and dChip [34]. It revealed an excellent recall rate for true positives and the lowest rate for false positives, even for the small replicate number of three arrays per group. This demonstrates that MAS5.0/GCOS algorithms, with normalization of each array separately and statistics based on many different oligonucleotide probes per probe set, are highly effective in identifying the relevant differences.
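The two measures used in this evaluation can be made explicit with a short sketch. It assumes that a filter returns a list of probe set identifiers and that the spiked probe sets are known in advance, as in a spiking experiment; the function name and the identifiers are hypothetical and carry no data from the study.

```python
def recall_and_false_positives(selected, spiked):
    """Return (recall of true positives, number of false positives).

    selected: probe sets returned by a filter strategy
    spiked:   probe sets known to be truly changed
    """
    selected, spiked = set(selected), set(spiked)
    true_positives = selected & spiked
    false_positives = selected - spiked
    recall = len(true_positives) / len(spiked) if spiked else 0.0
    return recall, len(false_positives)


# Hypothetical identifiers, for illustration only.
spiked = ["spike_1", "spike_2", "spike_3", "spike_4"]
selected = ["spike_1", "spike_2", "spike_3", "other_9"]
print(recall_and_false_positives(selected, spiked))
```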
Another important option in the SiPaGene database is the possibility to restrict access and to enable user- and comparison-specific sharing of data. Many expression studies have been published without submission of the related raw data to any of the public repositories. This indicates that scientists are very conservative about sharing their data freely. One important reason seems to be the limited ability to interpret the biological processes despite having a genome-wide transcription profile in hand. There is hope that appropriate tools to elucidate the function behind these data will improve constantly and give much better insight within the next few years. In our experience, functional interpretation improves with the number of comparisons performed with different and, if possible, defined reference signatures, which are therefore a cornerstone for future array analysis [35]. Such signatures depend on high-quality experiments and will be the last ones to be shared publicly. Therefore, tools are needed that encourage collaborative exchange and thereby enable the development of new and better tools for the interpretation of expression data.
Based on the tools for detailed group analysis of individual GeneChip experiments, options to analyze multiple group comparisons were integrated. These are indispensable for performing meta-analyses. With a growing set of reference signatures, it will be possible to define the degree of specificity of individual genes for a defined biological function and to develop signature-based functional annotation tools. These are important and complementary to existing annotation and interpretation software based on literature information about individual genes, gene interactions and biological functions [12, 36, 37]. Annotations generated from such meta-analyses can be traced immediately to the underlying experimental data, whereas literature-based annotations depend on the quality of assignment and are often laborious and difficult to trace back.
Upcoming improvements, which are currently in preparation, will expand functionality with tools for visualization (clustering), upload and administration of gene lists for comparative retrieval with predefined candidate genes, selection and storage of marker genes for quantification of cell-type- and stimulus-specific signatures, and options that enable users to define expression-based annotations.