|Home | About | Journals | Submit | Contact Us | Français|
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Though microarray experiments are very popular in life science research, managing and analyzing microarray data are still challenging tasks for many biologists. Most microarray programs require users to have sophisticated knowledge of mathematics, statistics and computer skills for usage. With accumulating microarray data deposited in public databases, easy-to-use programs to re-analyze previously published microarray data are in high demand.
EzArray is a web-based Affymetrix expression array data management and analysis system for researchers who need to organize microarray data efficiently and get data analyzed instantly. EzArray organizes microarray data into projects that can be analyzed online with predefined or custom procedures. EzArray performs data preprocessing and detection of differentially expressed genes with statistical methods. All analysis procedures are optimized and highly automated so that even novice users with limited pre-knowledge of microarray data analysis can complete initial analysis quickly. Since all input files, analysis parameters, and executed scripts can be downloaded, EzArray provides maximum reproducibility for each analysis. In addition, EzArray integrates with Gene Expression Omnibus (GEO) and allows instantaneous re-analysis of published array data.
EzArray is a novel Affymetrix expression array data analysis and sharing system. EzArray provides easy-to-use tools for re-analyzing published microarray data and will help both novice and experienced users perform initial analysis of their microarray data from the location of data storage. We believe EzArray will be a useful system for facilities with microarray services and laboratories with multiple members involved in microarray data analysis. EzArray is freely available from http://www.ezarray.com/.
One of the major problems that life science researchers have to cope with is the management of huge amounts of data, which is ever increasing with advances in robotics and microarray technologies. More and more laboratories have begun adopting Structured Query Language (SQL)-based relational databases, such as Oracle and MySQL, to solve life sciences data management problems. Another major problem in life sciences is secure and efficient data sharing, especially when the data is in large scale. A common temporary solution is using shared folders on the internet or intranet; however, this option provides minimal data security and is accompanied by difficulties in associating information with files. Therefore, novel, easy-to-use, and powerful data management and sharing systems are needed in the life sciences.
Though microarray-based experiments are becoming popular in life science research, microarray data management and analysis are still challenging tasks for many biologists. Researchers often use manual or custom developed systems for microarray data management and analysis, which are limited by developers' knowledge of computer sciences, biostatistics, and life sciences. Commercial products for microarray data analysis are being released, such as GeneSpring GX from Agilent Technlogies, Santa Clara, CA 95051 and GeneSifter from VizX Labs, Seattle, WA 98119. However, these products are often complicated, expensive, and/or lack data sharing capabilities. For biostatisticians and experienced analysts, R language  and Bioconductor  are the main microarray data analysis tools. R is a widely used open source language for statistical computing and graphics. Bioconductor, which is primarily based on the R programming language, is an open development software project for the analysis and comprehension of genomic data. Currently, hundreds of Bioconductor packages have been developed, providing comprehensive functionalities for all aspects of microarray data analysis. For example, affy is often used for low-level analysis of Affymetrix GeneChip data, while multtest is useful in detecting differentially expressed genes. Currently, web-based systems based on R and Bioconductor packages are being developed, such as CARMAweb , MAGMA , GEPAS , Asterias , ArrayPipe , MIDAW , RACE , WebArray , and Expression Profiler . While these systems have made microarray data analysis much easier for experienced users, much improvement is needed to further automate the common data analysis processes so that they can be readily accessible to novice users.
In order to provide an easy-to-use microarray system for researchers with little pre-knowledge of microarray data analysis as well as experienced analysts, we have designed a web-based system, EzArray, based on the most recent web technologies, R, and Bioconductor. EzArray is intended to provide: 1) a centralized location to store original microarray data with security; 2) an easy and secure way to share raw data and analyze results among team members; 3) a highly automated data analysis system for instant on-line data analysis; 4) an expandable system to integrate new data management and analysis tools.
On EzArray server, PHP scripts deal with communication between users and the server, dynamically generate R scripts based on user input, execute R scripts in the background, and parse R output and present results to end users as HTML webpages. User information, data files, project information and analysis results are stored in database and server file system. EzArray comes with a web-based file management tool (My Files) and a request job management tool (Job List). On the client end, users logically follow these steps: register, logon, create or join a user group, create projects, import sample information and upload microarray data, submit analysis requests and browse results. The analysis tools (PreQ, ProS, and RepA) can be used in orders as shown in Figure Figure1.1. Users can perform each type of analysis multiple times with modified parameters.
Figure Figure2A2A shows a screenshot of the EzArray (version 2) homepage, Figure Figure2B2B shows a screenshot of the integrated file management tool, and Figure Figure2C2C shows a screenshot of the project browsing and searching tools.
We propose that an ideal microarray system should be easy-to-use for all levels of users, have minimal software and hardware requirements for installation and usage, have data privacy for each user but also allow data sharing with others, be flexible to integrate custom tools, and provide maximum data analysis reproducibility. Based on these ideas, we developed EzArray (Figure (Figure11 and Figure Figure2)2) in the open source application development environment LAMP that consists of Linux operating system, Apache web server, MySQL database, and PHP programming language. The combination of these technologies has become popular because of its low acquisition cost and because of the ubiquity of its components. We release EzArray under the GNU General Public License, providing end users maximal freedom in taking advantage of the EzArray source codes.
EzArray is a multi-user system with web interfaces. All users must first register, and their accounts will become active upon approval by system administrators. User login information is stored in MySQL database with encryption, making EzArray a highly secure system.
In EzArray, data are either stored in MySQL database or on the server as files. Microarray data are organized by projects, and a user can create unlimited projects. Currently, only minimal project information is required, including Affymetrix array chip type, a brief project description, and optional project details. Collecting minimal project information allows users to get started quickly. In each project, only one array chip type is allowed. Besides Affymetrix chip type names such as hug133plus2, users can also enter Gene Expression Omnibus (GEO)  platform names such as GPL570. The array sample information and the CEL files generated from Affymetrix GeneChip Operating System can be added one-by-one or imported into a project in batch. In EzArray data analysis can be performed with all samples in a project or just a few selected samples.
EzArray projects, samples, and analysis results entered by a user may be shared with group members in read-only mode (Figure (Figure2C2C for project management). However, EzArray administrators can adjust data sharing methods by changing the settings in the configuration file.
Microarray data analysis in EzArray is highly automated such that even novice users can perform initial analysis and get results instantly. Experienced users can also use the system with full control of the analysis procedures. In general, microarray data analysis is performed step-by-step with these programs: PreQ – preprocessing, normalization, and quality control plots; ProS – statistical procedures for detecting differentially expressed genes; and RepA – gene annotation and linking to public databases (Figure (Figure11).
PreQ reads the raw Affymetrix expression array data files (CEL files) and completes all necessary data preprocessing, including data background correction, data normalization, correction for non-specific binding, and summarization where the measured probe intensities are averaged to one expression value per probe set (Table (Table1).1). There are four pre-defined data preprocessing methods (RMA, MAS5, dChip, and GCRMA) and one custom method. RMA method uses robust multichip average (rma) algorithm [13,14] for background correction. MAS5 method adopts the Affymetrix MAS5 algorithm [15-18]. dChip uses a special Li-Wong summarization algorithm [16,19] that is a model-based approach, allowing pooling of information across multiple arrays and automatic probe selection to handle cross-hybridization and image contamination. GCRMA uses the background correction method gcrma  that takes probe sequence information into calculation. The custom method in PreQ allows users to select specific preprocessing algorithms for each type of processes. Table Table11 shows a summary of algorithms used in each PreQ method. Detailed comparison of the different Affymetrix preprocessing algorithms can be found in reference . The affy package from Bioconductor is used in PreQ for most data preprocessing tasks. In addition to data preprocessing, PreQ is a convenient tool to assess data quality since it generates many quality assessment plots. The current EzArray version includes histoplots of the intensity data, affyPLM plots that fit probe level models to array data, RNA digestion plots in which ordered probes are used to detect possible RNA degradation, simpleaffy QC plots that provide access to many of the standard QC functions recommended for Affymetrix arrays, MA plots that are widely used to compare two intensity measurements, scatter plots that show the correlation of cell intensity across arrays, and boxplots that help users evaluate the differences in the distributions of intensities across arrays. Reference  studied microarray data quality assessments and provides a good summary of quality assessment plots.
The ProS program in EzArray processes array data and detects differentially expressed genes with different methods based on the number of sample groups and replicates in each sample group (Table (Table2).2). When there are only two sample groups, e.g. experimental condition verse control condition, and a small number of biological replicates (less than 3) in each sample group, ProS simply uses fold changes to detect differentially expressed genes. When a sufficient number of arrays are used in microarray experiments, e.g. three or more replicates in each sample group, various statistical tests can be used to detect differentially expressed genes. Available statistical tests in the current EzArray version include two-sample Welch t-test, two-sample t-test, standardized rank sum Wilcoxon test, paired t-test, F-test, Block F-test, and more. EzArray allows the users to select a number of multiple hypothesis testing methods to control error rate. References [22,23] provide summaries of multiple hypothesis testing methods and their applications in microarray experiments. The main Bioconductor package used in ProS program is multtest.
The RepA program is a web interface of the Bioconductor package annaffy that produces compact HTML and text reports including experimental data and hyperlinks to many online databases. RepA makes the results more understandable for biologists. While the RepA program is tightly linked to the ProS program, EzArray also provides two annaffy-based standalone tools that allow users to annotate genes from a list of probe names or search for probes based on gene annotation information.
Though these three analysis programs are tightly connected and are normally used in sequential order, experienced users can use them individually or combine them with their own analysis scripts. For example, users can first use the highly automated PreQ program to complete data preprocessing and generate necessary quality assessment plots. Then, based on the initial results, users can download and revise the analysis scripts for further analysis.
In order to make our microarray data analysis system practically useful for users with less knowledge of biostatistics and also make it a convenient system for experienced users, we further optimized and integrated our three-step analysis programs into a one-step data analysis program called Express Analysis. With Express Analysis, if users select the optimized settings (Table (Table2),2), they can obtain a list of differentially expressed genes with annotation in just a single click. Figure Figure33 shows the screenshots of EzArray Express Analysis from searching GEO databases (Figure (Figure3A),3A), selecting samples (Figure (Figure3B),3B), and finally, obtaining analysis results (Figure (Figure3C).3C). Again, experienced users can select "custom" methods to tune data analysis parameters and select desired analysis algorithms. Express Analysis in EzArray represents the most automated microarray data analysis program currently released.
EzArray can be used to re-analyze previously published Affymetrix expression array data that were deposited in GEO, the main gene expression/molecular abundance repository. Due to rapid advances of microarray data analysis technologies and specialized foci of previous microarray researchers, it will be of significance to re-analyze some published microarray data. In addition, with increasingly accumulated microarray data in GEO, it becomes possible to study some new research projects using deposited microarray data from different laboratories. For these purposes, we have added a very convenient tool in EzArray allowing users to search GEO microarray data, automatically download array data, and create new projects with automatically populated sample information (Figure (Figure33).
EzArray contains tools to retrieve the descriptive information of all GEO records, including Platforms, Series, Samples, and DataSets. EzArray also contains a simple search form allowing users to search for DataSets (Figure (Figure3A).3A). The search results include a list of DataSets with links to GEO website. Users can select one or more DataSets from the list and create projects for data re-analysis. EzArray downloads the corresponding DataSet files automatically from the GEO website.
Analysis of GEO microarray data may be started from raw microarray data (e.g. .CEL files) or pre-processed expression value data originally submitted by previous authors. In the first case, if authors have submitted .CEL files to GEO, EzArray will automatically download these supplementary files and link them to the newly created projects. Then users can perform data analysis with EzArray programs as previously stated. In the second case, regardless of whether authors submitted raw microarray data or not, EzArray retrieves curated datasets from GEO, extracts the dataset information, and stores them in the project, which can be readily used to perform data re-analysis with program ProS directly without performing data pre-processing with PreQ.
EzArray is a web-based Affymetrix expression array data management and analysis system implemented in an open source environment. Since the same technologies are often used to build database-powered websites, EzArray can be easily integrated with users' existing websites. In summary, EzArray takes advantages of modern web technologies, provides multiple user support, has group-based data sharing capabilities, contains tools for highly automated data analysis, and has user-friendly interfaces. These features distinguish EzArray from most other standalone and web-based microarray programs.
Most microarray data analysis tools have been implemented as Bioconductor R packages that run from the command line or have simple point-and-click graphic interfaces. Both R packages limma  and affy offer R users a command-line interface to state-of-the-art microarray data analysis techniques. The R packages affylmGUI  and webbioc offer simple point-and-click interfaces to many of the limma and affy functions. It seems these programs simply analyze data instead of providing comprehensive data management capabilities.
Recently, more and more web-based microarray systems have been developed. MAGMA  is a Java-based web application that provides a simple and intuitive interface to identify differentially expressed genes from two-channel microarray data. MAGMA does not support databases, and the results are file-based. Though MAGMA provides for each user a separate workspace for storing and analyzing microarray data, MAGMA lacks tools for data sharing among users. Similar to EzArray, MAGMA automatically generates R-scripts that document the entire data processing steps. However, EzArray takes it further by allowing the users to download all input and output files together with R-scripts. This guarantees the user to regenerate all results in his local R installation. In terms of data analysis, MAGMA does not contain the gene annotation step and the results are tab-delimited text files and graphic plot files. The RepA program in EzArray generates HTML webpages with hyperlinks to public life science databases. In addition, compared to EzArray, MAGMA does not include algorithms to automatically select data analysis methods and parameters, and therefore, the analysis process is less automated. GEPAS  has been designed to provide an intuitive web-based interface that offers diverse analysis options from data preprocessing to gene selection, gene clustering, gene annotation, and more. Instead of taking advantages of existing R and Bioconductor packages, GEPAS has incorporated many newly developed programs written in 'C' languages. The web interfaces of GEPAS are Perl CGIs. The most recent version of GEPAS (v4.0) has included very simple tools for user registration as well as data file browsing. In addition, due to the abundance of novel programs and low level of automation in data analysis, using GEPAS requires in-depth knowledge of the system and many microarray data analysis algorithms. Asterias  is an open source and web-based suite for the analysis of gene expression and aCGH data. Asterias is the only web-based application that uses parallel computing. Asterias also takes advantages of many R and Bioconductor packages including limma. The web interfaces of Asterias are mostly written in Python. Though a few applications in Asterias support MySQL database, Asterias does not contain any tools for user or data management. The input data to all applications are plain text files that are uploaded "on the fly" during analysis. The web application CARMAweb  was implemented in Java based on J2EE (Java 2 Enterprise Edition) software technology. It supports Affymetrix GeneChips, spotted two-color microarrays and Applied Biosystems (ABI) microarrays. CARMAweb has a simple user management tool that guarantees password protected access to the user's data and analysis results. All user data are stored as files in the user data directory. Currently, CARMAweb does not support databases and group-based data sharing. WebArray  is another microarray system implemented with technologies similar to those used in EzArray (WebArray used Python instead of PHP programming language). WebArray provides a user-friendly interface for accessing a wide range of key functions of limma and other Bioconductor packages. WebArray is an excellent free open source software system for microarray analysis that can be used by an average biologist after moderate training. Nevertheless, WebArray has limited capabilities in data management and data sharing. WebArray is not project-oriented and all data are stored as files in one user data directory. Though WebArray allows users to download output files (tab-delimited text files and graphic plots), it does not allow downloading of executed R scripts. When compared to these web microarray systems, EzArray features much more intuitive user interfaces, more powerful data management capabilities, and significantly higher levels of automation in the analysis processes.
EzArray was designed to be operating system-independent due to the cross-platform features of Apache and PHP. EzArray is also expected to be database platform-independent due to the adoption of a database abstraction library ADOdb  that supports most SQL-based databases. This provides the flexibility for end users to select convenient operating systems and database servers. So far, we have fully tested EzArray on the Linux operating system (Fedora 7) with MySQL database, and we are planning to test EzArray on other operating systems with various databases.
The current version of EzArray stores only minimal experimental information. We are planning to develop new database tables and corresponding web interfaces for storing MIAME -compliant microarray data.
Due to the modular structures and open source features of EzArray, extensions or new functionalities can be rapidly implemented on top of EzArray. We have already started designing web-based tools for analyzing Agilent and Nimblegen microarray data. Even with Affymetrix expression data, our analysis procedures can be further improved. For example, for data with two sample groups and just a few replicates per group, the current version of EzArray simply uses Fold Changes to select differentially expression genes. In next EzArray version, we plan to enhance the data analysis procedures with more established algorithms and programs, such as limma, SAM [28-30], and EBArrays [31,32].
EzArray is an Affymetrix expression array data management, analysis, and sharing system. Besides tools for users to organize their own microarray data online and perform instant data analysis, EzArray contains tools for re-analyzing previously published microarray data deposited in GEO. EzArray can not only help novice users perform initial analysis of their microarray data, but also allow experienced users to perform custom analysis from the location of data storage. In summary, EzArray will be a useful system for facilities with microarray services and laboratories with multiple members involved in microarray projects.
EzArray is released under General Public License and can be freely used at website http://www.ezarray.com/. To install EzArray locally, users need to set up a Linux server running Apache and MySQL. Experienced users may be able to install EzArray on Mac or Windows operating systems with different database servers. Recent versions of R and Bioconductor should be pre-installed and properly configured.
YuerongZ developed the main idea, coded the majority of EzArray, and drafted the manuscript. YuelinZ was involved in system and statistics design, coded the majority of R scripts, performed overall program debugging and testing, and was involved in revising the manuscript. WX provided helpful discussion and performed software tests. She was also involved in revising the manuscript. All authors read and approved the final manuscript.
Wei Xu was supported in part by Susan Komen Breast Cancer Foundation grant number BCTR95306. We would like to thank Emily Powell for the critical review the manuscript. Finally, we would like to thank the developers and maintainers of R and Bioconductor packages that are extensively used in EzArray, and anonymous reviewers for helpful comments and suggestions.