We evaluated the newly developed biological module-centric tools (see Additional data file 8 for a graphical tutorial on using the tools) on two published microarray datasets. To avoid potential bias, the datasets of the case studies are different from those used during algorithm development. For the first microarray dataset [35], G1 response genes were identified by microarray experiments after introducing the G1 cyclin Cln3p into cln- yeast cells that had previously been arrested with cdc34-2. For comparison, the dataset was also analyzed with tools based on very different algorithms, namely DAVID Tools [26], GoMiner [16], Ontologizer [33], GOStat [3], ermineJ [36], ADGO [37] and GENECODIS [38]. All tools were able to highlight the major terms (for example, cell cycle, DNA repair, DNA replication, and budding), consistent with previously published observations. However, the DAVID methods were more sensitive to a couple of additional important terms (for example, cyclin-dependent kinase activity and mating) that were not found among the top terms in the output from the other tools. For more detailed results, comparisons and discussion, see Additional data file 14.
The following detailed discussion focuses mainly on the second microarray dataset [39], whose gene list is available as demo list 2 on our tool entry page. In this example dataset, the authors treated freshly isolated peripheral blood mononuclear cells (PBMCs) with an HIV envelope protein (gp120) and then measured genome-wide gene expression changes using Affymetrix U95A chips [40]. This study provides a global view of the complex interaction between viral and cellular factors, which is an essential mechanism for HIV replication in resting or suboptimally activated PBMCs. A functionally significant annotation of approximately 400 genes (Additional data file 1) derived from the microarray experiment was classified by the authors into five major functional categories: cytokines, chemokines, transcription factors, kinases, and membrane fusion [39]. While the cytokine and chemokine categories were systematically highlighted by EASE (a GO enrichment analysis based on the Fisher Exact Test) [2], the other annotation categories reported in the publication were discovered through semi-manual analysis by bioinformatics experts with an advanced level of knowledge of both biology and computer tools.
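The kind of gene-term enrichment test mentioned above can be illustrated with a one-sided Fisher exact test on list versus background counts. The following is a minimal sketch under stated assumptions, not the EASE implementation: the function name `enrichment_p_value` and all counts in the usage line are hypothetical and chosen only for illustration.

```python
from math import comb


def enrichment_p_value(list_hits, list_size, background_hits, background_size):
    """One-sided Fisher exact (hypergeometric upper-tail) p-value.

    Probability of seeing at least `list_hits` genes annotated with a term
    in a gene list of `list_size` genes, when `background_hits` of the
    `background_size` background genes carry that term.
    """
    max_hits = min(list_size, background_hits)
    total = comb(background_size, list_size)
    # Sum the hypergeometric probabilities for k = list_hits .. max_hits.
    tail = sum(
        comb(background_hits, k)
        * comb(background_size - background_hits, list_size - k)
        for k in range(list_hits, max_hits + 1)
    )
    return tail / total


# Hypothetical counts (not taken from the study): 20 of 400 list genes carry
# the term, against 100 of 12,000 background genes.
p = enrichment_p_value(list_hits=20, list_size=400,
                       background_hits=100, background_size=12000)
print(f"enrichment p-value: {p:.3g}")
```

Exact integer arithmetic is used here for clarity; for large-scale use a statistics library routine would normally be preferred.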
The same data re-analyzed by typical functional annotation tools
Given the continuous addition of gene annotations and the refinement of gene-term enrichment algorithms in the years since the above study [39] was published, it is interesting to see how the systematic results from current functional annotation tools compare with those reported in that publication. Several popular functional annotation tools, namely the DAVID Gene Functional Annotation Tool, GOStat, GoMiner, topGO, Ontologizer, ADGO and GENECODIS [1], were chosen to identify major biological terms from the same gene list. To reflect the design spirit of each tool as faithfully as possible while keeping the results comparable, we left all default parameters of the tools unchanged, except for synchronizing the data coverage scope to all GO levels (by default, DAVID covers multiple data sources and GOStat covers GO level 3 or above). Although all of the tested tools are based on similar gene-term enrichment algorithms, their sensitivity and specificity can differ because of different updates of GO data content, different background gene lists, different scoring systems, different gene ID mapping schemes, and so on. After hundreds of annotation terms had been obtained from each of the above tools, the terms, particularly those at the top of the results, were compared with each other (Table ). Approximately 30% of the top terms overlapped between at least two of the tools, for example, cytokine/chemokine activity and inflammatory response. Some reported terms, for example, kinase, were not ranked at the top by any of the tools (rank positions: GoMiner, 49; DAVID, 24; GOStat, 82; topGO, 76; Ontologizer, 111).
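A comparison of top terms across tools can be sketched as follows. This is only an illustration, assuming that overlap is measured as the fraction of distinct top terms reported by at least two tools and that terms are matched by case-insensitive name; the function name and the toy term lists are hypothetical and are not the study's data.

```python
from collections import Counter


def shared_term_fraction(top_terms_by_tool, min_tools=2):
    """Fraction of distinct top terms reported by at least `min_tools` tools."""
    counts = Counter()
    for terms in top_terms_by_tool.values():
        # Normalize names and count each term at most once per tool.
        counts.update({term.strip().lower() for term in terms})
    if not counts:
        return 0.0
    shared = sum(1 for n in counts.values() if n >= min_tools)
    return shared / len(counts)


# Toy input: tool names are real, the term lists are made up.
toy_top_terms = {
    "DAVID": ["Cytokine activity", "inflammatory response"],
    "GoMiner": ["cytokine activity", "response to stress"],
    "GOStat": ["chemotaxis"],
}
print(shared_term_fraction(toy_top_terms))
```

Matching by name is a simplification; matching by GO identifier would be more robust when tools use different versions of the GO data.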
The top 20 enriched terms for demo list 2 by various traditional functional annotation tools
Even though the results from the tools all point in the same biological direction, there are four obvious problems. First, redundant/similar/hierarchical terms appear at different (significance) positions within the reports (for example, response to stress, response to wounding, response to pathogenic bacteria, response to other organisms, response to external biotic stimulus, inflammatory response, and so on), which makes it difficult for the user to gain or maintain a clear view of the whole biological picture. It is not easy for users to comprehensively pool all genes related to the same key biology without manually summarizing all related redundant terms. Second, the redundant/similar/hierarchical terms can largely dilute the focus on other key biology that has few or no redundancies (for example, only one term covers establishment of cellular localization). If several redundant/similar/hierarchical terms are represented at the top of the list, less redundant terms may be pushed down the list, possibly decreasing the chance of discovery; for example, a transcription regulation term, reported in the original publication, was not listed in the top 20 by any of the tools. Third, in contrast, because annotation levels differ between sources, redundant/similar/hierarchical terms may themselves be diluted: a single term alone may not be at the top of the list, but in combination with its redundant/similar/hierarchical terms the biological function may be very significant. Fourth, current tools do not emphasize the inter-relationships between key biological terms (for example, relationships between chemokine/cytokine and signal transduction).
In conclusion, the recent improvement of functional annotation tools provides a powerful means for users to systematically identify key biological functions associated with a gene list. However, because of the weaknesses discussed above, refinement of current gene-term enrichment algorithms and improvement of software usability alone may not address all the issues. The development of novel alternative algorithms as a complement therefore remains necessary.
The same data analyzed by the DAVID Gene Functional Classification Tool
The same gene list (Additional data file 1) was submitted to our newly developed DAVID Gene Functional Classification Tool described previously (Additional data file 8). The tool can efficiently handle up to 3,000 genes at a time, within a few seconds. It classified the approximately 400 genes into 16 functional groups (Table and Additional data file 2). The result is much more focused, simpler, and of a more manageable size for investigators to interpret than the few hundred terms, many of them redundant, produced by the traditional tools discussed in the previous section. More importantly, all five reported annotation categories are covered by the 16 functional groups (Table ). In addition, the tool lists another 11 interesting gene groups not reported in the original publication. For example, group 13 (tubulin genes) plays a critical role in the nucleation of microtubule assembly. Some studies suggest that HIV infection leads to enteric microtubule depolymerization of infected cells, resulting in increases in HIV permeability [41]. This tool focuses on the overall major common annotation terms associated with a gene group rather than on one term or one gene at a time, thereby producing clearer, more concise results that allow a better focus on the major biology of an experiment. The tool simplifies the results by condensing the redundant terms and summarizing inter-relationships. This analytical logic and presentation format closely mimics how the human brain works and the results better represent the nature of biology.
Sixteen total gene functional groups identified by the Functional Classification Tool
The DAVID Gene Functional Classification Tool allows users to further explore a given biological module/gene group in depth. For example, the 'enriched terms' button '2-D View' can list all related terms and genes for the kinase group. Thus, a user who is not familiar with kinases can explore the terms kinase activity, transferase activity, ATP-binding, nucleotide binding, protein metabolism, tyrosine specificity, serine/threonine specificity, regulation of G protein signaling, signal transduction, and so on in one view at the same time (Figure ). The user can therefore quickly learn the biology of the kinase group from the above related terms in a single view and also identify the fine differences among them. For example, there are two G protein-coupled receptor kinases, three protein tyrosine kinases and six kinases involved in cell surface receptor-linked signal transduction among the 23 kinases within the group (Figure ). Such fine details may be very important for pinpointing the key biology associated with a study.
Furthermore, the DAVID Gene Functional Classification Tool allows one gene to be present in more than one functional group, which closely reflects the nature of biology, whereby one gene can play multiple roles in different processes. This fuzziness feature improves the chances of discovery by maximally preserving all of the true relationships. For example, general transcription factor II H (TFIIH) was assigned to group 2 (transcription regulation group) and group 5 (DNA damage/repair group) (Additional data file 2). Some studies suggest that TFIIH increases polymerase processivity in HIV infection [42]. Currently, there are few reports about the TFIIH DNA repair mechanism being involved in HIV infection, although this DNA repair mechanism could be essential in HIV integration. Hence, the fuzzy capability allows users not only to focus on the transcription regulation role of TFIIH but also to consider its possible role in HIV integration through the DNA repair mechanism. As another example, ring finger protein 40 (RNF40) is in group 2 (transcription regulation group) and group 10 (chromosome assembly) (Additional data file 2). Although the biological significance of the ring finger protein in HIV infection is still largely unclear, the annotation result points to two potential areas for further exploration: first, the ring finger protein regulates the tumor necrosis factor-related transcriptional pathway, which is critical to many aspects of HIV transcription; and second, it plays some role in DNA packaging and chromosome integration. Thus, compared with exclusive methods, the fuzziness capability is a powerful feature for maximally preserving biological patterns and discovering fine differences for a given gene.
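The multiple-membership idea described above can be made concrete with a small sketch that inverts a group-to-genes mapping and reports the genes assigned to more than one group. The function names, group numbers and gene placeholders below are hypothetical and serve only to illustrate the data structure, not the tool's actual output format.

```python
from collections import defaultdict


def groups_per_gene(groups):
    """Invert {group_id: genes} into {gene: set of group_ids}."""
    membership = defaultdict(set)
    for group_id, genes in groups.items():
        for gene in genes:
            membership[gene].add(group_id)
    return dict(membership)


def multi_group_genes(groups):
    """Genes that belong to more than one functional group."""
    return {
        gene: ids
        for gene, ids in groups_per_gene(groups).items()
        if len(ids) > 1
    }


# Placeholder data: group IDs and gene names are made up.
toy_groups = {
    2: {"geneA", "geneB"},
    5: {"geneA"},
    10: {"geneB", "geneC"},
}
for gene, ids in sorted(multi_group_genes(toy_groups).items()):
    print(gene, sorted(ids))
```

An exclusive classification would force each gene into a single group, so the second role of `geneA` and `geneB` in this toy example would be lost.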
The sensitivity of the Functional Classification Tool can vary with different datasets and stringency criteria. If the running criteria are not suitable to a particular dataset, the output can be distorted. In such cases, some exploration of different running stringencies is necessary in order to obtain the optimal results to meet the expectation of the study.
The same data analyzed by the Functional Annotation Clustering Tool
Because of the redundancy/hierarchy problems in the results obtained from traditional annotation tools (Table ), a Functional Annotation Clustering Tool was also developed to organize the highly redundant annotation term results into a simplified and clustered format. This new format allows investigators to focus on the annotation group level by quickly skipping many redundant/similar/hierarchical terms within the group. Compared to the 222 individual terms reported by the DAVID Functional Annotation Tool, a traditional term-centric enrichment method, the new tool was able to organize them into 65 annotation clusters (Additional data file 3). For example, annotation cluster 3 (immune-response group) consists of 11 redundant/similar/hierarchical terms, such as response to stress, inflammatory response, response to external stimulus, and response to pest, pathogen or parasite. These similar terms are spread throughout the traditional term-centric enrichment report list of 222 terms. Most importantly, the top 20 annotation clusters with a group enrichment score less than or equal to 0.05 (Table and Additional data file 3) contain all annotation categories reported by the original publication, as well as interesting groups not identified there. The highly organized and simplified annotation results allow users to quickly focus on the major biology at the annotation cluster level instead of trying to reach the same conclusions by putting together pieces that are scattered throughout a list of hundreds of terms. In addition, the annotation cluster is helpful in comprehensively pooling all related genes associated with an annotation cluster consisting of many related terms. For example, each of the 11 terms within cluster 3 (immune-response cluster) is associated with different genes. A pooled gene list brought together by cluster 3 regarding immune response could be much more comprehensive than the genes selected from one or a few individual terms.
Moreover, the tool may bring up terms that do not pass the minimum enrichment threshold but are highly related to other terms with significant enrichment scores. In conclusion, the clustered result condenses the data into smaller, much more organized biological term modules, which allows investigators to quickly and comprehensively focus on the key biology of interest.
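The gene pooling described for an annotation cluster amounts to taking the union of the gene sets of its member terms. The following minimal sketch shows this under the assumption that term-to-gene associations are available as a simple mapping; the function name, term names and gene placeholders are hypothetical and do not reproduce cluster 3.

```python
def pooled_genes(cluster_terms, term_to_genes):
    """Union of the genes associated with every term in one annotation cluster."""
    pooled = set()
    for term in cluster_terms:
        # Terms without a gene set contribute nothing to the pool.
        pooled |= set(term_to_genes.get(term, ()))
    return pooled


# Placeholder associations, not the study's data.
toy_term_to_genes = {
    "response to stress": {"g1", "g2"},
    "inflammatory response": {"g2", "g3"},
    "response to external stimulus": {"g4"},
}
toy_cluster = ["response to stress", "inflammatory response"]

pool = pooled_genes(toy_cluster, toy_term_to_genes)
largest_single = max(len(toy_term_to_genes[t]) for t in toy_cluster)
print(f"pooled: {len(pool)} genes; largest single term: {largest_single} genes")
```

In this toy case the pooled list contains more genes than any single member term, which is the effect described above for the immune-response cluster.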
The top 20 annotation clusters identified by the DAVID Functional Annotation Clustering Tool