Post-genomic technologies such as transcriptomics and proteomics continue to pose new challenges for the biological community. All new technologies require new skills for interpretation, but the 'omic technologies additionally demand expertise in data management and in the exploration of multivariate data. All too often, researchers are unable to extract the full benefit from their investment in research because of difficulties in applying best bioinformatics practice to their experiments. This problem is set to grow as more researchers adopt high-throughput methodologies, particularly with the move towards integrative and systems biology, where datasets are becoming larger and require extremely careful quality control at every step before cross-experiment data mining is meaningful, or indeed possible.
Additional issues hampering the bench scientist are software usability and implementation in a fast-developing research area. Array technologies are under continual development (e.g. exon arrays, ChIP-on-chip, SNP chips), and data are now coming online from non-array high-throughput technologies such as ChIP-Seq and, more recently, high-throughput transcriptomic sequencing (RNA-Seq).
Added to this is the wealth of statistical methods development, leading to a multiplicity of new algorithms and tools, often fast-tracked to the public domain with rudimentary user interfaces or available only as scripts for command-line statistical packages such as R [1]. This leads to a time-lag in adoption for many scientists, who may not have the expertise or time to learn how to install and maintain multiple programs (often requiring multiple additional libraries), or to run a number of individual tools from the command line or via custom scripts. The problem becomes more acute when we consider that the compute requirements of some analyses are becoming too large for comfortable analysis on individual desktop machines, owing to a combination of dataset size and algorithm choice. For instance, a microarray experiment may well require normalisation across a hundred Affymetrix exon arrays, each with up to 5 million data points, using a pre-processing algorithm such as RMA; this can exceed the hardware resources available on a desktop computer. The situation can be addressed by harnessing distributed computing resources, but again the start-up time investment is beyond the scope of most individual end-users.
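A rough back-of-envelope calculation illustrates the scale involved. The array count and data-point figure come from the example above; the 8-bytes-per-value assumption (double-precision floats, as typically held in memory by R during an RMA-style normalisation that must see all arrays at once) is illustrative:

```python
# Illustrative memory estimate for the scenario described above:
# 100 exon arrays x 5 million data points each, stored as
# double-precision floats (8 bytes per value, an assumption).
n_arrays = 100
points_per_array = 5_000_000
bytes_per_value = 8

raw_gb = n_arrays * points_per_array * bytes_per_value / 1024**3
print(f"Raw intensity matrix alone: ~{raw_gb:.1f} GB")
```

The raw matrix alone approaches 4 GB, before any working copies made during background correction or quantile normalisation, which is why such jobs can overwhelm a typical desktop machine of the period.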
To facilitate microarray data analysis and management we have developed EMAAS (Extensible MicroArray Analysis System), a multi-user Rich Internet Application (RIA), utilising a distributed computing back-end. The EMAAS infrastructure supports:
Data transfer between specialised sites and data repositories
Single point access using a bespoke portal to a range of tools and packages for microarray analysis with seamless data flow between the various tools
Fast-track, easy access for biological researchers via the portal to new models and algorithms developed in-house and externally (e.g. by statisticians or computer scientists), exposed through modular wrapper implementations
Automated detailed tracking of all the analysis steps performed
Storage of raw data, analysis steps and analysed data in an underlying relational database
Access to live online expertise in data analysis from local support services, including remote audio-visual interaction between researchers and support staff on different sites (e.g. shared live screen views and audio-video). This can be extended to collaborating experts in other specialities, e.g. statisticians.
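The "modular wrapper" idea in the list above can be sketched as a minimal plugin interface. All names here (ToolWrapper, run, the registry) are hypothetical illustrations of the pattern, not the actual EMAAS API:

```python
from abc import ABC, abstractmethod

class ToolWrapper(ABC):
    """Hypothetical plugin interface: each analysis tool (local script,
    R package, or remote web service) sits behind one method, so the
    portal can chain tools without knowing their internals."""

    name: str

    @abstractmethod
    def run(self, dataset_id: str, params: dict) -> str:
        """Run the tool on a stored dataset and return the identifier
        of the result dataset, so the next tool in the workflow can
        pick it up."""

class LogTransform(ToolWrapper):
    """Toy example wrapper for a log2 transform step."""
    name = "log2-transform"

    def run(self, dataset_id, params):
        # A real system would fetch the data from the underlying
        # database, apply the transform, store the result, and record
        # the step for provenance tracking. Here we only simulate the
        # bookkeeping by deriving a new dataset identifier.
        return f"{dataset_id}:{self.name}"

# New tools are made available simply by registering another wrapper.
registry = {cls.name: cls() for cls in (LogTransform,)}
result = registry["log2-transform"].run("expt42", {})
```

Under this kind of design, adding a newly published algorithm means writing one wrapper class rather than modifying the portal itself, which is what allows new methods to be fast-tracked to end users.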
The system builds upon several open-source technologies already available for microarray data analysis, combining them to form a fully integrated user-friendly system. This allows the user to perform data management and analysis tasks through a single web interface.
Numerous microarray tools are already available for various stages of the microarray data analysis workflow, including several client-server based tools such as the commercial packages GeneSpring GX Workgroup [2] (licensed by Agilent) and Resolver [3] (licensed by Rosetta), and freely available packages such as ExpressionProfiler [4] and Gecko [6]. Each has its own advantages and disadvantages with respect to parameters such as cost, ability to handle concurrent users in a multi-user environment, scalability, and ease of use of the user interface.
The aim of this project was not to rewrite these tools within a static, closed system, but to build a modular, flexible framework that allows single-point access to existing tools and specialist websites running both on the local server and remotely, and that enables new algorithms, methods and web services to be added as and when they are developed. This allows a user to perform their analysis from start to finish through a single user interface, using the most appropriate data handling and analysis tools, without the need to continually install and update multiple programs on their own desktop machine. The system was designed specifically to support microarray analyses and was optimised with this in mind.