Figure 1 shows the output of a typical keyword analysis with Martini. In this case, Martini was given two input sets of genes: the first set contained 269 Arabidopsis genes known to be associated with disease resistance mechanisms; the second set consisted of 514 genes with no clear link to disease. Martini found 60 keywords that were significantly over-represented in either of the two input sets. Manually checking each keyword, we considered the majority (48 out of 60) to be true positives, i.e. to be clearly related to disease resistance mechanisms in plants. For example, Pseudomonas is a common plant pathogen, and salicylic acid is a phytohormone that plants use to trigger the defense-signaling pathway.
Figure 1. Martini keyword output for the Arabidopsis dataset. All significantly enhanced keywords are shown first as a ‘keyword cloud’, where the size of each keyword is proportional to its statistical significance. The keywords assigned to input ...
The 12 keywords that were not true positives were: access, allele, cause, cognate, cross, enzyme, experiment, gene product, nucleotide, selected, situation and ursus sp. We considered that none of these satisfied the criteria for false positives (see ‘Methods’ section), hence we classified them as ‘uninformative’. Most of these 12 are too generic to be properly considered as ‘keywords’, and in future versions of Martini we plan to automatically blacklist such uninformative terms.
For comparison, the Arabidopsis datasets were also analyzed using FatiGO, Marmite and ProfCom, and in each case exactly zero terms were found.
shows the time taken for Martini keyword enhancement. Generally, the time taken scales better than linearly with input size; however, datasets involving many well-studied genes will be slower than this estimate.
We next tested the keyword enhancement feature of Martini on a set of 600 human cell-cycle-regulated genes (20). The human cell cycle is relatively well-studied and understood, and many of the genes in this data set are well-characterized (98% are linked to Medline abstracts describing their function and 86% have GO annotations at levels 3–9 in the GO ontology). Thus, we may expect not only that methods such as Martini should perform well with these data, but also that this set may be a good benchmark, since it should be straightforward to assess the accuracy of the resulting keywords and GO terms.
Each of these 600 genes has been assigned to a specific time point within the cell cycle at which the gene is maximally expressed (20). These time points are given as a percentage of cell-cycle progress rather than hours, since the cycle duration varies between growth conditions. To construct pairs of gene sets, we used a sliding window spanning 10% of the cell cycle, and we compared all genes within the window with the remaining cell-cycle genes. Sliding the window in 1% steps, we generated 100 Martini keyword analyses spanning the entire cell cycle.
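The window construction described above can be sketched in a few lines of Python. This is a minimal illustration, not Martini's own code: the function name `sliding_window_sets`, the gene-to-percentage dictionary, the half-open window `[start, start + window)` and the wrap-around at the end of the cycle are all assumptions made for the example.

```python
def sliding_window_sets(peak_percent, window=10, step=1, cycle=100):
    """Yield (start, inside, outside) for every window position.

    peak_percent: dict mapping gene name -> time of peak expression,
    given as a percentage of cell-cycle progress (hypothetical input format).
    The window is treated as half-open and wraps around the end of the
    cycle; both choices are assumptions of this sketch.
    """
    all_genes = set(peak_percent)
    for start in range(0, cycle, step):
        # Distance from the window start, measured around the cycle.
        inside = {
            gene
            for gene, t in peak_percent.items()
            if (t - start) % cycle < window
        }
        yield start, inside, all_genes - inside


# Toy data, invented for illustration only.
toy_peaks = {"geneA": 2, "geneB": 95, "geneC": 50}
windows = list(sliding_window_sets(toy_peaks))
start, inside, outside = windows[95]
```

With a 1% step this produces 100 pairs of gene sets, one per window position, each of which would then be passed to a keyword analysis as the two input sets.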
In Figure 2, these results are arranged in a cyclic layout (see ‘Methods’ section), where each keyword has been placed to show the exact region of the cell cycle where the keyword is significantly over-represented. The keywords cluster into three distinct groups: (i) a pre-replication phase (late G1, corresponding to cell-cycle progress from 41 to 52% in Figure 2) defined by keywords that describe the initiation of DNA replication and the checkpoints that can prevent initiation from taking place; (ii) S-phase (53–63%), defined by keywords that describe the proteins, complexes and processes associated with the replication machinery; (iii) M-phase (79–100%), which has no keywords for proteins or complexes, but has keywords that describe the cell division sub-processes. In G1 and G2 phase (1–40% and 64–78%, respectively) no enhanced keywords are seen, consistent with the generally accepted belief that relatively few processes are specific to these ‘gap’ phases.
Figure 2. Keywords found by Martini from cell-cycle genes. The figure shows all keywords found by Martini using 600 cell-cycle-regulated genes that have been experimentally assigned to specific time points within the human cell cycle. Percentage numbering indicates ...
Assessed qualitatively, Figure 2 shows a surprisingly accurate and precise match to the events and entities known to occur at different stages of the cell cycle. Of the 72 total keywords found by Martini, we considered 67 to be ‘true positives’, i.e. to occur at the correct position in the cell cycle. The remaining five keywords (‘874 Amino Acids’, ‘Extractable’, ‘Femtomole’, ‘Tungsten’ and ‘20 specific protein’) we classified as ‘uninformative’ rather than ‘false positives’, since these keywords do not imply incorrect processes or entities.
To quantitate the accuracy and precision of the keywords and terms, we divided the 600 genes into four groups corresponding to the classic phases G1 (cell cycle progress from 1 to 40% in , giving 113 genes), S (41–63%, 154 genes), G2 (64–78%, 82 genes) and M (79–100%, 251 genes). These gene sets were then used to perform a much simpler four-step analysis, shown in , where we compared the genes in each phase with those in the other three phases (e.g. G1 versus S+G2+M, etc.). For each of the tools, we then manually classified each term found as either true positive, false positive or uninformative using the following criteria: True positives are keywords that have definitely been assigned to the correct cell-cycle phase, i.e. they match to processes or entities known to occur specifically within that phase. False positives are keywords that match to cell-cycle processes, but have definitely been assigned to the incorrect phase, e.g. FatiGO finds the term ‘M phase’ associated with G1 genes. Since the dataset was defined as genes specific to the mitotic cell cycle, we considered any meiosis-specific keywords to be false positives. Finally, uninformative keywords are those that are not clearly right (true positive) and not clearly wrong (false positive).
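The four-phase, one-versus-rest design above can be written down compactly. The following is a minimal sketch using the phase boundaries quoted in the text; the names `PHASE_BOUNDS`, `phase_of` and `one_vs_rest`, and the assumption that peak times are whole percentages from 1 to 100, are illustrative and not part of any of the tools compared.

```python
# Phase boundaries as percentages of cell-cycle progress, taken from the text.
PHASE_BOUNDS = [("G1", 1, 40), ("S", 41, 63), ("G2", 64, 78), ("M", 79, 100)]


def phase_of(percent):
    """Return the phase whose (inclusive) range contains `percent`."""
    for name, low, high in PHASE_BOUNDS:
        if low <= percent <= high:
            return name
    raise ValueError(f"percentage outside 1-100: {percent}")


def one_vs_rest(peak_percent):
    """Map each phase to (genes in phase, genes in the other three phases)."""
    groups = {name: set() for name, _, _ in PHASE_BOUNDS}
    for gene, percent in peak_percent.items():
        groups[phase_of(percent)].add(gene)
    all_genes = set(peak_percent)
    return {name: (members, all_genes - members) for name, members in groups.items()}


# Toy data, invented for illustration only.
toy_peaks = {"geneA": 10, "geneB": 45, "geneC": 70, "geneD": 90, "geneE": 100}
comparisons = one_vs_rest(toy_peaks)
```

Each entry of `comparisons` corresponds to one of the four analyses (e.g. G1 versus S+G2+M), with the first set used as the query list and the second as the comparison list.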
Cell-cycle keywords and GO terms
CoPub cannot compare two lists, and the results shown were generated effectively by comparing each of the four gene subsets against the background of all other human genes. As expected, CoPub gives less precise results with more false positives. In fact due to space limitations in , we show only ‘biological processes’ from CoPub; including the other CoPub categories (‘drug’, ‘pathway’, ‘disease’ and ‘liver pathology’) gives nearly twice as many keywords with a similar pattern of true and false positives.
Some of the keywords we classified as uninformative could arguably be regarded as false positives. For example, CoPub finds ‘G2 checkpoint’ and ‘G2/M checkpoint’ associated with M-phase genes; however, since these terms describe a process happening between two phases, in this simple four-state analysis we considered such terms to be neither clearly right nor clearly wrong. Similarly, the Rb:E2F-1:DP-1 transcription factor found by FatiGO belongs to the switch from G1 to S phase. Terms such as ‘cell cycle’, ‘cell cycle checkpoint’ and ‘hydrolase’ are not incorrect, but since they refer to processes throughout the entire cycle, it is also not correct to assign them to a single cell-cycle phase. Another borderline case is ‘DNA damage’, which is a key feature of S-phase but is also present in other phases; hence we regarded it as a true positive if it occurs in S-phase, but as uninformative for other phases. CoPub also finds terms such as ‘lung development’ that appear to be incorrect, given how the gene set was defined; however, since this term does not clearly match any specific cell-cycle process, we categorized it as ‘uninformative’.
To calculate a recall score, we created a benchmark or ‘score card’ that defines 20 main phases, sub-processes and key components in the human cell cycle (). Each true positive in was then mapped onto one row of , allowing us to count non-redundant true positives (tp), and also to count false negatives (fn, i.e. rows in for which a tool has no matching keywords). The recall was then calculated as tp/(tp + fn), and the precision calculated as tp/(tp + fp), where fp stands for the number of false positives in . Note that the number of false positives has no clear limit, hence the precision score used here is an estimate of the ‘true’ precision.
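The two scores defined above are simple ratios, shown here as a short Python sketch. The function names and the representation of the score card as a set of row identifiers are assumptions for illustration. The example counts (12 matched rows out of 20, no false positives) are chosen only because they reproduce a 60% recall and 100% precision; they are not taken from the tables.

```python
def recall_precision(tp, fn, fp):
    """Recall = tp / (tp + fn); precision = tp / (tp + fp)."""
    recall = tp / (tp + fn)
    # Precision is undefined when a tool reports no true or false positives.
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    return recall, precision


def score_tool(benchmark_rows, matched_rows, false_positives):
    """Score one tool against a score card.

    benchmark_rows: set of row identifiers in the score card.
    matched_rows: rows for which the tool has at least one true positive,
        so redundant keywords mapping to the same row are counted once.
    false_positives: number of keywords classified as false positives.
    """
    tp = len(benchmark_rows & matched_rows)
    fn = len(benchmark_rows - matched_rows)
    return recall_precision(tp, fn, false_positives)


# Illustrative numbers only: a 20-row score card with 12 rows matched.
rows = set(range(1, 21))
matched = set(range(1, 13))
recall, precision = score_tool(rows, matched, false_positives=0)
```

Counting matched rows rather than individual keywords is what makes the true positives non-redundant: several keywords that map to the same row of the score card contribute a single true positive.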
Cell-cycle benchmark and score-card
Of the five tools tested against this benchmark, Martini clearly gave the best performance, with 60% recall and 100% precision. CoPub found many more keywords and had similarly good recall (60%), but only 17% precision (i.e. many false positives). FatiGO also found more keywords than Martini, but had lower recall (25%) and lower precision (45%). Marmite found zero terms in all of the phases, while ProfCom found only the single term ‘hydrolase activity’ that we judged to be uninformative.
We next tested keyword enhancement using two gene sets, one associated with primary melanoma and another with metastatic melanoma (4). In contrast to the very specific comparisons of cell-cycle phases in , comparing these two types of melanoma corresponds to asking a more general question. We considered the melanoma dataset to be not a good candidate as a benchmark, unlike the cell-cycle dataset, but probably a more realistic or typical case.
compares the output of FatiGO, Marmite, Martini and ProfCom with these data. We manually classified each keyword found as either mitosis-related, uninformative, or ‘not mitosis-related’ using the following criteria (different to the cell-cycle criteria). Mitosis-related keywords have a clear relation to the major mitosis-specific processes. Since mitotic cell division is what we would expect to see associated with metastatic cancer, we considered these keywords to be true positives. Uninformative keywords were either too generic (e.g. ‘assemblies’ or ‘biogenesis’), or related to experimental techniques (e.g. ‘co-immunoprecipitation’), or related to model organisms (‘cerevisiae’ or ‘sporulation’). Any remaining keywords were classified as ‘not mitosis-related’. Keywords in this final category are the most interesting, as their connection to melanoma and metastasis is, in many cases, not immediately obvious. In contrast to Arabidopsis and the human cell cycle, where many of us have extensive experience, we had little previous experience with the melanoma literature, and hence we were less confident in deciding true and false positives.
Keywords for metastatic versus primary melanoma
Martini found 264 significantly enhanced keywords, a much larger number than the other methods. Of the keywords found by Martini, 109 were mitosis-related and 79 were uninformative. This left 76 keywords assigned as ‘not mitosis-related’; for each of these we manually checked the literature for evidence of a connection to melanoma or metastasis. For some keywords, this connection was straightforward, e.g. skin, cornea, lymphoid, HeLa cells, desmosome, intermediate filaments, involucrin, calcium, as well as several skin diseases. For other keywords, the connection was less obvious, but was supported by the literature: e.g. polyploidy (24), cornification and bone marrow cells (25), heat-shock/chaperone proteins (26), cystic fibrosis (27), ATM kinases (28), CHK1 (29) and neurites (30). Perhaps the most interesting keywords found by Martini are the names of several members of the MAGE (melanoma-associated genes) gene family. These genes are normally expressed only in developing sperm, where they play a role in meiotic cell division. However, these genes are also expressed in melanoma (31).
FatiGO found 4 transcription factors and 47 GO terms, of which 36 were classified as not mitosis-related. As with Martini, all the terms in the ‘not mitosis-related’ category seemed to have a general connection to melanoma or metastasis, hence none were obviously false positives. Interestingly, FatiGO does not find the link to spermatogenesis.
Comparing Martini and FatiGO qualitatively, both seemed to have similar precision with this dataset, i.e. all terms and keywords found were either uninformative or, as best as we could judge, true positives that correctly indicate a connection to melanoma or metastasis. Martini found many more keywords, more-specific keywords and also more uninformative keywords. Martini also found many processes related to melanoma and metastasis that were not found by FatiGO. Thus, we conclude that Martini had qualitatively a somewhat higher recall; however, unlike for the human cell cycle, we cannot quantify this, since we do not have the background to construct a benchmark covering all the major processes and components involved. Marmite and ProfCom did not perform well with this dataset, finding almost no terms.
Ovarian cancer dataset
As a final test of keyword enhancement, we used FatiGO, Marmite, Martini and ProfCom to compare a set of 160 genes associated with clear-cell ovarian cancer (i.e. cells that are clear when viewed through a microscope), and a second set of 105 genes associated with non-clear-cell ovarian cancer. For this comparison, each of the tools found exactly zero significantly enhanced keywords or GO terms.