Gene Ontology (GO), the de facto standard in gene functionality description, is used widely in functional annotation and enrichment analysis. Here, we introduce agriGO, an integrated web-based GO analysis toolkit for the agricultural community, using the advantages of our previous GO enrichment tool (EasyGO), to meet analysis demands from new technologies and research objectives. EasyGO is valuable for its proficiency, and has proved useful in uncovering biological knowledge in massive data sets from high-throughput experiments. For agriGO, the system architecture and website interface were redesigned to improve performance and accessibility. The supported organisms and gene identifiers were substantially expanded (including 38 agricultural species composed of 274 data types). The requirement on user input is more flexible, in that user-defined reference and annotation are accepted. Moreover, a new analysis approach using Gene Set Enrichment Analysis strategy and customizable features is provided. Four tools, SEA (Singular enrichment analysis), PAGE (Parametric Analysis of Gene set Enrichment), BLAST4ID (Transfer IDs by BLAST) and SEACOMPARE (Cross comparison of SEA), are integrated as a toolkit to meet different demands. We also provide a cross-comparison service so that different data sets can be compared and explored in a visualized way. Lastly, agriGO functions as a GO data repository with search and download functions; agriGO is publicly accessible at http://bioinfo.cau.edu.cn/agriGO/.
The availability of high-throughput techniques allows biologists to monitor changes and regulation at a genome-wide level under certain conditions. Such experiments normally generate huge data sets of gene expression values under different treatments. Analyzing and interpreting these data sets is challenging, and one promising strategy is gene-annotation enrichment analysis. The bioinformatics community has developed multiple enrichment tools, which were compared and summarized by Huang et al. (1). The majority of these tools (2–12) employ Gene Ontology (GO) (3) as their annotation resource, since GO is a controlled vocabulary system with rich content for gene function description at a molecular level and is supported by many consortia focusing on different organisms. Unfortunately, most GO enrichment tools have limited support for agricultural species. Recently, four applications enabling analysis of agricultural species data were evaluated by Berg et al. (13). Among these four tools, only EasyGO (12) is designed specifically to serve the agricultural community. Since its release, this tool has processed >20 000 analysis requests from all around the world and is referenced by 20 publications. After 3 years of continued maintenance, we developed the successor of EasyGO, a web-based toolkit named agriGO with enhanced and novel functionalities.
Retaining the advanced features of EasyGO, agriGO also continues to focus on agricultural species. The enrichment analysis approach used in EasyGO is categorized as SEA (Singular enrichment analysis) in Huang’s survey (1). We kept this method because although SEA is the most traditional strategy, it is still very efficient and such continuity will not reduce its accessibility to past users. However, new features were added to meet current complex demands. First, new tools including PAGE (Parametric Analysis of Gene set Enrichment), BLAST4ID (Transfer IDs by BLAST) and SEACOMPARE (Cross comparison of SEA) were developed. The arrival of these tools provides users with possibilities for data mining and systematic result exploration and will allow better data analysis and interpretation. Second, the exploratory capability and result visualization are enhanced. Results are provided in different formats: HTML tables, tabulated text files, hierarchical tree graphs, and flash bar graphs. Third, in agriGO, PAGE and SEACOMPARE can be used to carry out cross-comparisons of results derived from different data sets, which is very important when studying multiple groups of experiments, such as in time-course research. Furthermore, we integrated comprehensive annotations like gene description and protein domain annotation into agriGO, and the information is searchable and downloadable. Technically, working on a more powerful server, agriGO is completely reengineered providing a faster, more robust and flexible tool. Flash technology (http://teethgrinder.co.uk/open-flash-chart-2) is used to generate the result bar graphs. Lastly, this new toolkit is user-friendly with an interactive help system and flexible input requirements.
Huang et al. (1) classified enrichment tools into three categories: SEA, GSEA (Gene Set Enrichment analysis) and MEA (modular enrichment analysis). EasyGO (12) is classed as SEA. In agriGO, the enrichment analysis strategy in EasyGO is kept and improved, and named as ‘SEA’.
SEA analysis computes GO term enrichment in one set of genes by comparing it to another set, named the target and reference lists, respectively. As for EasyGO, a default reference list with pre-computed GO term mappings is provided for each data type.
For each supported species, we collected currently popular gene nomenclatures and probe (set) names from different microarray platforms, and computed background GO term mappings. With rice for example, available background includes TIGR (14) and Gramene (15) genes, KOME (16) full-length cDNAs and microarray probe (set) IDs from Affymetrix (www.affymetrix.com), Agilent (www.agilent.com), BGI (www.genomics.org.cn) and other platforms. As a new feature, a custom list with user-defined GO annotation can be uploaded as either the target list or the reference list. agriGO allows arbitrary combination of target and reference lists, to address the data deficiency issue for species that do not yet have a sequenced genome (GO backgrounds from most related species can be used when analyzing gene sets from such species). Such cross-data type combination and interpretation should be conducted with care.
For advanced options, three statistical methods can be selected: hypergeometric, Fisher’s exact and χ² tests. When the target list comprises a subset of the reference list, the hypergeometric test or Fisher’s exact test should be applied. If the target list has few or no intersections with the reference list and its size is large, the χ² test is appropriate.
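The hypergeometric test mentioned above can be illustrated with a minimal sketch. It assumes the target list is a subset of the reference list and computes the one-sided over-representation P-value for a single GO term. The function name and argument names are hypothetical and are not taken from agriGO.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """Upper-tail hypergeometric P-value P(X >= k) for one GO term.

    N: number of genes in the reference list
    K: number of reference genes annotated with the term
    n: number of genes in the target list
    k: number of target genes annotated with the term
    (argument names are illustrative assumptions)
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(n, K)):
        raise ValueError("inconsistent counts")
    total = comb(N, n)
    # Sum the probabilities of observing k, k+1, ..., min(n, K) annotated genes.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total
```

For example, `hypergeom_enrichment_p(2, 2, 2, 4)` asks how likely it is that both genes of a two-gene target list carry a term that annotates two of four reference genes.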
The multi-testing problem seems inevitable when a large number of GO terms are subjected to statistical calculation. Therefore, SEA performs the Benjamini–Yekutieli method (17) to do the multiple comparison correction by default, while others, such as Benjamini–Hochberg (18), Storey q-value (19) and Holm (20) methods, are also available. The same choices for adjustment methods are provided for the PAGE analysis, as described below.
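To make the correction step concrete, the following is a minimal sketch of the Benjamini–Hochberg and Benjamini–Yekutieli step-up adjustments in plain Python. It is not agriGO's own code; the function name and the `method` labels are assumptions made for illustration. The Storey q-value and Holm methods are not covered here.

```python
def adjust_pvalues(pvalues, method="BY"):
    """Return adjusted P-values in the same order as the input.

    method="BH": Benjamini-Hochberg; method="BY": Benjamini-Yekutieli,
    which additionally multiplies by c(m) = 1 + 1/2 + ... + 1/m.
    """
    m = len(pvalues)
    if m == 0:
        return []
    if method == "BY":
        c = sum(1.0 / i for i in range(1, m + 1))
    elif method == "BH":
        c = 1.0
    else:
        raise ValueError("method must be 'BH' or 'BY'")
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity and a cap of 1.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m * c / rank)
        adjusted[idx] = running_min
    return adjusted
```

Because c(m) grows with the number of tested terms, the BY adjustment is never less conservative than BH on the same input.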
GSEA is a popular way to do enrichment analysis, since it reduces the arbitrary factors in the gene selection step of SEA and can utilize more information such as gene expression values. Different strategies for GSEA have been introduced already; we chose PAGE, which was first proposed by Kim and Volsky (21), because it is relatively straightforward, and accuracy is preserved while computation load is lower. PAGE is based on the Central Limit Theorem (CLT) (22), and according to the CLT, the distribution of the average of randomly sampled n observations tends to follow a normal distribution as n gets larger, whether the parent distribution is normal or not. Here, assuming the parent has mean µ and variance σ², the sample mean will follow a normal distribution with the same mean µ and variance σ²/n. In this context, the parent can be seen as a set of fold change (FC) values between two experimental groups, and the random sample is the GO term, where n is the number of genes mapped to the term. Thus, for each term having a sufficient number of genes mapped to it, a Z-score value, which is used to infer the statistical significance, can be calculated using the following z-test formula:
$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$
where $\bar{x}$ is the mean of the sample of size n, i.e. the average of FC of all genes associated with the GO term. As a z-test is two-tailed, a Z-score can be positive or negative. Using R software (23), a Z-score is converted to a P-value, and the P-value is subjected to multiple test correction. The adjusted P-value generated by the correction is one criterion to estimate whether the term is significant. Apart from the adjusted P-value, an additional criterion is applied in PAGE. Either the term has a positive Z-score and the mean of FC of all genes associated with it is ≥1 (upregulated), or the term has a negative Z-score and the FC mean is ≤1 (downregulated).
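The Z-score calculation described above can be sketched as follows. This is an illustrative Python version, not the R-based implementation used by agriGO. The function name, the `min_genes` cutoff of 10 and the use of the population standard deviation for σ are assumptions. The two-tailed P-value is obtained from the standard normal distribution through the complementary error function.

```python
from math import erfc, sqrt
from statistics import fmean, pstdev


def page_z_scores(fold_changes, term_to_genes, min_genes=10):
    """Compute (Z, two-tailed P) per GO term from gene-level FC values.

    fold_changes: dict mapping gene ID -> FC value (the parent distribution)
    term_to_genes: dict mapping GO term -> iterable of gene IDs
    min_genes: assumed minimum term size for the normal approximation
    """
    values = list(fold_changes.values())
    mu = fmean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    results = {}
    for term, genes in term_to_genes.items():
        sample = [fold_changes[g] for g in genes if g in fold_changes]
        n = len(sample)
        if n < min_genes or sigma == 0:
            continue  # too few genes, or no variation in the parent
        z = (fmean(sample) - mu) / (sigma / sqrt(n))
        p = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
        results[term] = (z, p)
    return results
```

The resulting P-values would then be passed to a multiple test correction, and the sign of Z combined with the FC mean criterion to label a term as upregulated or downregulated.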
Generally, PAGE is more objective than SEA. SEA accepts a user-selected target list and uses the adjusted P-value as a single criterion to decide GO term enrichment. In the case of an inappropriately prepared target list, a misleading result might be generated. In contrast, PAGE accepts an arbitrarily large input-list with FC, and identifies significant GO terms associated with groups of genes with significantly deviated change patterns with respect to all the genes. However, PAGE is only applicable for sample comparisons with quantitative measurements (e.g. mRNA abundance and DNA methylation) and factors including precision of measurement and data normalization will influence the PAGE result. In their application, the two approaches serve for different situations and need special attention to the issues mentioned above.
The ability to present the analysis results in a clear and accessible manner is important in the interpretation step. In EasyGO, a hierarchical tree graph is used to aid the user in checking the results. We expanded this type of output with more content and functionalities, as described below. A cross-comparison function was developed, to enable users to simultaneously compare multiple data sets.
Elaborate graphical output can help users explore biological meaning in an intuitive way. The directed acyclic graph or tree structure graph based on the nature of GO can indicate which terms are over- or under-represented and the inter-relationships between terms. Such a graph is available in EasyGO and improved in agriGO. We adopted the test case from EasyGO (12), which comprised 168 probe sets from the Arabidopsis ATH1 GeneChip, all showing upregulated expression in shoot tissue during cold treatment, with data from the AtGenExpress project (24). We used SEA in agriGO and EasyGO to do the analysis, and both generated a tree structure graph (see Figure 1 and Supplementary Figure S1, respectively). GO terms are represented as boxes containing detailed description, organized and connected based on their relationship (Figure 1). The detailed pages containing further information, such as gene description and protein domain annotation, are also available. In addition, font and rank direction of the tree are customizable.
As a new feature, we now support another graph format: the flash bar chart. All terms in the three categories of GO can be freely selected for comparison using this functionality. By default, all detectable child-terms (secondary level terms) of three root terms (GO:0008150 biological process, GO:0003674 molecular function and GO:0005575 cellular component) and significantly over/under-regulated child-terms of secondary level terms (if any) are selected to construct a flash bar chart. Parameters for chart setting are customizable (e.g. legend content, font and rotation, and bar style). For example, the bar graph is resizable by simple dragging of the border, and the color of bars is controllable by users. Appropriate adjustments, such as term selection or parameter settings, can generate customized, artistic outputs, which allow users to make graphs and figures suitable for publication. To demonstrate, we selected all significant terms in the analysis results of Figure 1 to generate a flash bar chart, and further adjusted the size and color of the chart (Figure 2). Although the results are displayed in a new way, a similar biological conclusion, that cold- and stress-related terms are overrepresented, can be drawn from the flash bar chart. The text tree mode is another unique way available for result inspection in agriGO (Supplementary Figure S2A). Furthermore, we developed a flexible way for users to freely select terms to create custom outputs (Figure 2 and Supplementary Figure S2B and C). These methods provide users with a comprehensive way to explore the analysis results and multiple choices for generating images suitable for publication.
Cross-comparison is essential for interpreting results obtained from experiments involving multiple samples, such as time-series experiments, and this novel functionality is enabled for both the SEA and PAGE approaches. Through the SEACOMPARE tool, users can submit multiple SEA job identifiers, and the analysis results will be combined for cross-comparison. When using PAGE, users can submit a list of genes with multiple numeric values, each obtained from a separate experiment.
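The idea of combining several SEA results for cross-comparison can be illustrated with a small sketch. It does not reproduce SEACOMPARE itself. The function name, the input layout (one dictionary of adjusted P-values per job) and the 0.05 cutoff are assumptions, and the job labels in the usage note are invented.

```python
def cross_compare(results_by_job, alpha=0.05):
    """Merge per-job results into one table keyed by GO term.

    results_by_job: dict mapping job label -> {GO term: adjusted P-value}
    Returns {GO term: {job label: adjusted P-value or None}} for every term
    that is significant (<= alpha) in at least one job; None marks a term
    that was not reported for that job.
    """
    jobs = list(results_by_job)
    terms = sorted({t for res in results_by_job.values() for t in res})
    table = {}
    for term in terms:
        row = {job: results_by_job[job].get(term) for job in jobs}
        if any(p is not None and p <= alpha for p in row.values()):
            table[term] = row
    return table
```

Calling `cross_compare({"6h": {...}, "12h": {...}})` with one dictionary per time point would yield one row per GO term, which maps naturally onto an HTML or tab-delimited comparison table.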
As a test case, we selected a group of 1921 Arabidopsis ATH1 probe sets through hierarchical clustering analysis of the cold-treatment microarray data from the AtGenExpress project (24) (Figure 3A is a heat map representing the clustering result of the 1921 probe sets). The log2 cold/control ratio of these probe sets at six time-points was used as input for PAGE. The results are represented in HTML table mode. For clarity, we selected certain GO terms using the ‘suppress GO number’ functionality and trimmed out the numerical parts in the image (see the complete snapshot in Supplementary Figure S3). The stress and stimulus-related terms were upregulated and strengthened over time at the three later time points (i.e. 6, 12 and 24 h). The transcription factor terms (i.e. GO:0030528 and GO:0003700) appeared at a relatively early stage (6 h) and were most overrepresented at 12 h, but there were no such activities at the last time point (24 h). Interestingly, two GO terms concerning ‘response to stimulus’ (GO:0042221 and GO:0050896) were even downregulated at a very early stage (0.5 h). We conclude that cross-comparison offers users the possibility to quickly and efficiently gather important biological knowledge.
The tree graph and flash bar chart can also be used to do the comparisons. Unfortunately, agriGO supports only pairwise comparisons with the tree graph (see Supplementary Figure S4), because when more than two data sets are compared, a much more complex color system would be needed to display the terms’ changes among different experiments, which is inconvenient for investigation.
In agriGO, the number of supported organisms and identifiers is substantially increased compared with EasyGO (12). We collected 38 agricultural species including 274 types of corresponding identifiers. Mapping users’ input IDs to GO annotation is more efficient because of the extensive support for different identifier types. Recently released genome sequence data for which GO annotations are not available in public databases (e.g. tomato and cucumber) are collected and annotated locally, since genome-wide data sets can provide a global perspective of GO distribution.
The GO annotations in agriGO are either obtained from public databases or produced by computational prediction. We run BLAST (25), Pfam (26) or InterProScan (27) to generate GO annotation for those publicly unsupported identifiers with sequences. Annotations for model organisms are downloaded from public databases like TAIR (28), Gramene (15) and TIGR (14), or from GO repository servers including GOA (29), B2G-FAR (30) and AgBase (31) (see Supplementary Table S1 for details).
Though GO annotations are widely available on the Internet, this is not true for most agricultural species. A GO annotation repository concerning agriculture, AgBase (31), has been established; however, non-model and newly sequenced organisms only have limited support. Therefore, we provide free download and search functions for our annotation data sets as a GO annotation resource for the agricultural research community. Furthermore, we developed a tool called BLAST4ID (Transfer IDs by BLAST) providing a BLAST service, which can be used to do ID mapping for unknown/unannotated identifiers. It can also work as a connection between unidentified IDs and analysis tools in agriGO. For example, users can apply BLAST4ID to generate a GO-annotated gene list, and upload the list to do the analysis. However, such automatic matching is likely to generate false positives, and thus caution is required when using BLAST4ID.
The web interface and usability of agriGO have been totally re-engineered. The interactive help system makes agriGO more user-friendly. For example, once the user selects one species, the supported identifier types will be displayed to help users judge whether their identifiers can be submitted directly or need to be transformed using BLAST4ID. In addition, the identifiers are recognized automatically, so that different types of identifiers from one species can be used in one analysis.
We constructed and configured agriGO upon a typical LAMP (Linux + Apache + MySQL + PHP) platform. The data sets were stored in MySQL 5.0 (www.mysql.com), and the web interface was built with PHP scripts (www.php.net) on Red Hat Linux, powered by an Apache server (www.apache.org). Server-side scripts were developed using Python (www.python.org). The hierarchical tree images were generated using Graphviz software (www.graphviz.org) and the flash bar charts were generated with Open Flash Chart software (http://teethgrinder.co.uk/open-flash-chart-2). The tool is web-based, and no software or plug-in installation effort is required to use it.
We perform regular updates and maintenance to agriGO. As most of the annotation and sequence data is obtained from publicly available databases, manual effort and Python scripts are used to semi-automatically monitor and download source files, to ensure agriGO provides the most up-to-date data. New agricultural species and functions can be added upon request.
One goal of developing agriGO is to provide EasyGO users with better service, and the consistency of analysis conclusions is important. We tested agriGO and EasyGO with the same data set (Figure 1 and Supplementary Figure S1), and the conclusions were similar, with slight differences caused by updated annotation in agriGO. However, because of different software architectures, EasyGO and agriGO are not compatible in results exploration, i.e. users cannot inspect the results generated by EasyGO in agriGO, and vice versa.
One issue is that a lot of GO terms may be detected in the analysis results, which will cause inconvenience in exploration and explanation of the graphical results. To avoid this issue, ‘GO slim limitation’ in the advanced options can be selected. Alternatively, users can produce custom graph results using custom settings of GO terms (Supplementary Figure S2).
GO annotation coverage is another critical question for GO functional analysis tools including agriGO, as discussed by Berg et al. (13). Except for well-studied model organisms like Arabidopsis, GO annotations are mainly generated by computational prediction. Such prediction may lead to two issues: reduced quality of annotation and low annotation coverage. Poor-quality annotation can directly affect the GO distribution and, if not prepared cautiously, can generate biased or misleading analysis results. One issue is that a single BLAST search, even with high BLAST scores, does not guarantee that sequences will be annotated correctly; thus users should be alert when using BLAST4ID. Effective annotation will be hampered by some sequences that have neither high similarity to already known sequences (for BLAST search) nor sequence signatures (for tools based on pattern recognition). Empirically, automatic annotation methods, e.g. the combination of BLAST (E-value ≤ 1e-30 and Coverage ≥ 0.7) and InterProScan (E-value ≤ 1e-3), can only assign GO terms to ~60% of all protein sequences predicted from one newly sequenced genome. One promising way to overcome these problems is to use a similar annotation strategy to Meng et al. (32) by performing comprehensive manual curation. However, such a great workload seems unrealistic for most GO enrichment tools as they may maintain dozens of species. A good approach or resource to generate high-quality GO annotation data for non-model organisms is greatly needed.
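As an illustration of the BLAST thresholds quoted above (E-value ≤ 1e-30 and Coverage ≥ 0.7), the following sketch filters candidate hits before GO terms would be transferred. The text does not define how coverage is computed, so the definition used here (aligned length divided by query length) is an assumption, as are the function and field names.

```python
def passes_annotation_filter(evalue, aligned_length, query_length,
                             max_evalue=1e-30, min_coverage=0.7):
    """Return True if a BLAST hit meets both thresholds.

    Coverage is assumed to be aligned_length / query_length; the source
    text does not specify its exact definition.
    """
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    coverage = aligned_length / query_length
    return evalue <= max_evalue and coverage >= min_coverage


def filter_hits(hits):
    """Keep only hits (dicts with hypothetical keys) that pass the filter."""
    return [
        h for h in hits
        if passes_annotation_filter(h["evalue"], h["aligned_length"], h["query_length"])
    ]
```

Sequences left without any passing hit would remain unannotated, which is one source of the incomplete coverage discussed above.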
Compared to EasyGO, agriGO offers vast improvements; the functionalities have been carefully tested and it has completed >4800 analysis requests since its release. We believe that agriGO will facilitate researchers in the agricultural community to extract biological meanings from data of high-throughput experiments in an easy and systematic way. This new application is freely accessible now at http://bioinfo.cau.edu.cn/agriGO/.
Supplementary Data are available at NAR Online.
Funding for open access: Ministry of Science and Technology of China (90817006 and 2006CB100105).
The authors thank Ms Wenying Xu for discussions and critical suggestions. The authors thank Yan Zhang for discussion on logo design. The authors thank anonymous reviewers for their valuable contributions and comments on an earlier version of this article.