Data from two cancer biology studies that aimed to define the phenotype of specific populations of neoplastically transformed cells were used to demonstrate the utility of GOModeler. The GOModeler results from the first study, involving a set of nine differentially-expressed cytokine genes, are shown in Figure . Results for the second study, involving ten differentially-expressed genes, are given in Additional file 1. For the GOModeler analysis, the species selected was chicken, the gene input type was gene name, and the other parameters selected were “positive” as the default effect for unsigned GO terms and “positives override” as the option for conflict resolution.
Table shows a comparison of the qualitative effects obtained using GOModeler and the results obtained by manual analysis by a PhD-level immunologist. The direction of the net effects obtained by GOModeler agrees with 75% of the results obtained by the manual analysis. The results from GOModeler and the manual analysis differ for the hypothesis terms apoptosis and antigen presentation. Inspection of the results in edit mode reveals that two of the entries (IL-6 and IL-10) had conflicting effects (−1 and +1) for apoptosis. Because we had selected “positives override” as the conflict resolution mechanism, the tool selected +1 over −1. The user can change the effect for individual cells in edit mode.
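The conflict-resolution behaviour described above can be illustrated with a minimal Python sketch. This is not GOModeler's actual code: the function name `resolve_effect`, the data layout, and the “negatives override” counterpart option are assumptions made for illustration only. The example inputs are the conflicting apoptosis effects (−1 and +1) reported for IL-6 and IL-10.

```python
from typing import Iterable


def resolve_effect(effects: Iterable[int], policy: str = "positives override") -> int:
    """Collapse the signed effects found for one gene and one hypothesis term.

    Hypothetical helper: +1 means a pro effect, -1 an anti effect.
    """
    values = list(effects)
    has_pos = any(e > 0 for e in values)
    has_neg = any(e < 0 for e in values)
    if has_pos and has_neg:
        # Conflict: both pro and anti effects are annotated for this gene.
        if policy == "positives override":
            return 1
        if policy == "negatives override":  # assumed counterpart option
            return -1
        raise ValueError(f"unknown policy: {policy}")
    if has_pos:
        return 1
    if has_neg:
        return -1
    return 0  # no signed annotation found


# Conflicting apoptosis effects for the two genes named in the text.
apoptosis = {"IL-6": [-1, +1], "IL-10": [-1, +1]}
resolved = {gene: resolve_effect(vals) for gene, vals in apoptosis.items()}
net = sum(resolved.values())  # both genes resolve to +1 under this policy
print(resolved, net)
```

Under this sketch, switching the policy flips both entries, which is why the choice of conflict resolution can change the direction of the net effect for a hypothesis term.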
Comparison of GOModeler results for the test cytokine dataset with manual analysis results. Columns with the heading G were generated by GOModeler and those with the heading of M were obtained by manual annotation by an immunologist.
Other differences between the manual scoring and GOModeler can be attributed to incompleteness of the GO [18]; that is, published data exist and GO terms exist, but the two have not been linked by GO biocurators. For example, our immunology domain expert has clearly brought substantial knowledge to bear on the hypothesis term antigen presentation that is not yet annotated in the GO databases for these genes. In some cases, such as the effect of IL-6 on chemotaxis, GOModeler and the domain expert found opposite effects. Manual inspection of the GO annotation of IL-6 confirms that GOModeler is obtaining the correct effect based on the information available in the GO annotation of IL-6 for rat, mouse, human and chicken. This is a specific example where the GO is incomplete and where the effect is context dependent, so the gene product effect can be positive in some cases and negative in others.
In general, GOModeler tends to identify more positive effects than negative effects. This can occur when genes have conflicting positive and negative (pro and anti) effects specified by the GO (indicated by “(+1/−1)” in the qualitative table); selecting “positives override” for conflict resolution then introduces a bias toward positive effects. Users must be aware of this, and it is reasonable to assume that they will be. However, more complicated factors also contribute. For example, most GO terms are “unsigned”, i.e. they do not indicate positive or negative. We have opted to use “positive” as the default effect for unsigned GO terms based on colloquial usage and our experience that “positive effects” are often implicit. An example is the physical manifestation of programmed cell death known as apoptosis. The terms “positive regulation of apoptosis” (GO:0043065) and “negative regulation of apoptosis” (GO:0043066) can be used by annotators to indicate pro and anti effects. However, authors can imply that a gene positively regulates apoptosis without explicitly saying so. In such cases the GO term “apoptosis” (GO:0006915) is used for annotation, and yet domain-specific experts will know that this is a positive regulation. Finally, scientists tend to publish their positive data and to make hypotheses in a positive sense.
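The treatment of signed and unsigned terms can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: the function `term_sign` and the idea of reading the sign from the term-name prefix are hypothetical, while the three GO identifiers and names are the ones quoted above.

```python
def term_sign(term_name: str, default_unsigned: int = 1) -> int:
    """Return +1, -1, or the default for an unsigned GO term name.

    Assumption: the sign is read from the term-name prefix.
    """
    name = term_name.strip().lower()
    if name.startswith("positive regulation of"):
        return 1
    if name.startswith("negative regulation of"):
        return -1
    # Unsigned term: fall back to the user-selected default effect.
    return default_unsigned


terms = {
    "GO:0043065": "positive regulation of apoptosis",
    "GO:0043066": "negative regulation of apoptosis",
    "GO:0006915": "apoptosis",  # unsigned
}

# With "positive" as the default, the unsigned term is scored as +1.
signs = {go_id: term_sign(name) for go_id, name in terms.items()}
print(signs)
```

Because most GO terms are unsigned, the default passed as `default_unsigned` affects far more annotations than the explicit prefixes do, which is one source of the positive bias discussed above.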
Unlike most GO-based discovery tools [6] that focus on the under- or over-representation of GO categories, GOModeler supports hypothesis testing using the GO. Although modern high throughput methods support discovery-based science, hypothesis-driven science remains the approach used by most molecular biologists and required for funding from many agencies (e.g. NIH). GOEA tools can be used to generate an initial list of hypotheses, but they are of limited value for hypothesis testing. GOEA tools can typically only identify hypotheses for GO terms that are over- or under-represented in a dataset. These statistical approaches are not applicable to the analysis of biological processes that involve only a few genes, because the GO terms involved occur in such small numbers that they will never be identified as “over- or under-represented” by statistical analysis. By contrast, GOModeler can identify such effects. In addition, most GOEA tools limit their analysis to a specific and arbitrary level of the GO DAG (i.e. GO Slim categories). These categories are often so general that they can be of little use in hypothesis-driven research. Some GOEA tools allow the user to “drill down” from the GO Slim categories, but as the GO terms become more specific, there will be fewer genes with these annotations, making it highly unlikely that they will pass the statistical tests for under- or over-representation. Selection of an arbitrary level of detail also falsely assumes that all terms at the same level in the GO DAG hierarchy are at the same conceptual level [18], and some parts of the GO are much better developed than others. Finally, tools that focus on over- or under-represented GO terms provide no information about the direction of effect of the dataset on the categories. Although GOModeler is a GO-based tool, the issues addressed by GOModeler and GOEA tools are fundamentally different and each has its own uses.
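The difficulty that enrichment statistics have with sparsely annotated terms can be illustrated with a small hypergeometric calculation. All numbers below are hypothetical and chosen only for illustration: a background of 20,000 genes, a term annotated to six genes, a study set of nine genes with a single hit, and 1,000 GO terms tested with a Bonferroni correction. Real GOEA tools differ in their test statistics, corrections and minimum-count filters.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int, sample: int, hits: int) -> float:
    """Upper-tail hypergeometric probability P(X >= hits)."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical values, not taken from the study.
BACKGROUND_GENES = 20_000
GENES_WITH_TERM = 6
STUDY_SET_SIZE = 9
STUDY_HITS = 1
TERMS_TESTED = 1_000

p_raw = enrichment_p_value(BACKGROUND_GENES, GENES_WITH_TERM, STUDY_SET_SIZE, STUDY_HITS)
# Bonferroni correction, capped at 1.0.
p_adjusted = min(1.0, p_raw * TERMS_TESTED)
print(f"raw p = {p_raw:.4g}, adjusted p = {p_adjusted:.4g}")
```

With these assumed values the single hit does not survive the multiple-testing correction, so the term would not be reported, even though the gene may be directly relevant to the hypothesis being tested.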
To illustrate the differences between GOModeler and GOEA tools, we have used the dataset in Figure with three popular GOEA tools: AgriGO, GOStat and DAVID. GOStat and DAVID reported only high-level terms (e.g. immune response, extracellular region, abiotic response to stress) and did not “discover” any of the hypothesis terms in our dataset. With our small set of genes, AgriGO generates the message “Sorry, less than 10 entries can be mapped with GO. Analysis Failed.” GOModeler, on the other hand, does not rely on statistical over- or under-representation and allows users to control the level of specificity appropriate for testing their hypotheses.
One limitation that GOModeler shares not only with GOEA tools but also with network and pathway tools such as Ingenuity Pathway Analysis (http://www.ingenuity.com/) is that the computed quantitative effects provide a simplified view of gene effects. All of these methods ignore complexities introduced by differential gene effects on gene pathways, biological processes and molecular functions (though they can take cell location into account). Adding to this complexity is the contextual relative effect of a gene product. This view does, however, allow us to show the direct effects of relative expression between two comparable systems (control versus treatment); i.e. genes that are much more highly expressed in the treatment system will have higher quantitative effects compared with the control system. We have already successfully applied this approach in several published papers that used a preliminary version of GOModeler with substantial user input [19]. GOModeler is not a gene expression analysis tool, and an essential underlying assumption of GOModeler is that appropriate statistical analysis of differential gene product expression has been done. This is completely compatible with reductionist approaches, and GOModeler’s utility is to quickly survey the GO to assign terms from one of the three ontologies based on the user’s hypothesis terms at the most appropriate and granular level of the GO. For example, there are currently only six genes annotated in the GO to be involved in angiogenesis.
Reductionist biologists could test a hypothesis about genetic regulation of angiogenesis by, for example, quantitative PCR of these six genes. Although we often think of HT methods as associated with discovery-based science, an HT functional genomics experiment (such as RNAseq) would also measure the same six mRNAs and could be used for hypothesis-driven research. As demonstrated by our examples, GOModeler can be used for the reductionist approach [20] or for an HT functional genomics approach [19]. HT functional genomics experiments, however, allow many other genes to be tested for other hypotheses using GOModeler.