Identification of species–word associations
We first identified the distribution of phenotypic traits across species by analyzing the co-mentioning of species and trait-descriptive words across MEDLINE abstracts. Namely, nouns that preferentially co-occur with a subset of species are likely to be trait-descriptive (e.g., the words “flagellum” and “motility” are enriched in abstracts dealing with motile species; see ). We focused on nouns because these presumably carry a larger proportion of the relevant information represented in the scientific literature than verbs and adjectives [28]. Nouns were extracted from MEDLINE abstracts using a part-of-speech tagger (i.e., Tree Tagger [29]). Words of five characters or fewer were excluded from the analysis, as many of those are gene names and other noninformative words that increase noise. Species names were taken from the corresponding MeSH terms associated with the abstracts, that is, from the MeSH B category corresponding to “organisms”. Applying the controlled MeSH vocabulary reduces errors in species name recognition (for example, in the case of synonym usage); on the other hand, using all nouns in MEDLINE abstracts for identifying trait-descriptive words allows searching a variety of traits not accessible via a controlled vocabulary. Some species were not represented in MeSH and were thus mapped to their genus. A total of 255,249 MEDLINE abstracts connected with any of the 92 species analyzed were considered in the analysis. We considered the occurrence of distinct species in abstracts. Frequencies of words within abstracts were not taken into account (single and multiple occurrences were equally treated as “word presence”). Given the set of n1 words and n2 species associated with an abstract, we counted all possible species–word pairs (n1 × n2).
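The counting step can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline used in the study: the function name `count_pairs` and the input layout (one `(species, nouns)` tuple per abstract) are assumptions made for the example, and part-of-speech tagging and MeSH lookup are taken as already done.

```python
from collections import Counter
from itertools import product


def count_pairs(abstracts):
    """Count species, word, and species-word occurrences over abstracts.

    `abstracts` is an iterable of (species, nouns) tuples, one per abstract
    (hypothetical input layout). Each species or word is counted at most
    once per abstract, i.e. presence only.
    """
    n_s, n_w, n_sw = Counter(), Counter(), Counter()
    for species, nouns in abstracts:
        species_set = set(species)
        # Words of five characters or fewer are excluded.
        word_set = {w for w in nouns if len(w) > 5}
        n_s.update(species_set)
        n_w.update(word_set)
        # All n1 x n2 species-word pairs of this abstract.
        n_sw.update(product(species_set, word_set))
    return n_s, n_w, n_sw
```

Using sets for both species and words is what makes repeated mentions within one abstract count only once.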
For each species–word pair, a species–word association score ssw was determined using a regularized log-odds score:
where ns is the number of abstracts mentioning a particular species, nw refers to the number of abstracts mentioning a particular word, nsw is the number of abstracts that co-mention species and word, and N is the sum of all nsw. The log-odds framework quantifies correlation strengths and, in particular, facilitates the handling of species or words for which only sparse scientific literature exists. To allow the handling of sparse data, the standard log-odds formula was augmented with pseudocounts, p = 1. The resulting score, ssw, yields positive values for enriched species–word pairs and negative values for underrepresented pairs. The magnitude of the score provides a measure of the strength of the association, indicating its potential relevance in describing a characteristic trait. To record overrepresentation, the species–word association score requires frequently used words and species (such as “flagellum” and Escherichia coli) to be co-mentioned more often than infrequent ones (e.g., “oligosaccharide” and Ralstonia solanacearum).
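As a rough illustration of how such a score behaves, the sketch below computes a pseudocount-regularized log-odds value in Python. The placement of the pseudocount used here (added to both the observed and the expected co-mention count) is an assumption, as are the function and parameter names; the published score may differ in detail.

```python
from math import log


def association_score(n_sw, n_s, n_w, n_total, p=1.0):
    """Regularized log-odds of a species-word pair (assumed form).

    n_sw    : abstracts co-mentioning species and word
    n_s     : abstracts mentioning the species
    n_w     : abstracts mentioning the word
    n_total : N, the sum of all n_sw
    p       : pseudocount (p = 1 in the text)
    """
    # Co-mentions expected if species and word were independent.
    expected = n_s * n_w / n_total
    # The pseudocount damps scores that rest on very few abstracts.
    return log((n_sw + p) / (expected + p))
```

Under this assumed form, the score is positive when a pair is co-mentioned more often than expected, negative when it is co-mentioned less often, and zero when observation and expectation agree. Because the expectation grows with ns and nw, a frequently mentioned species and word need more co-mentions to reach a positive score than a rarely mentioned pair.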
Associations were calculated for each species–word pair, and a species association vector was subsequently constructed for each word, representing its association scores with each of the 92 prokaryotic species studied.
Mapping genes to phenotypes and vice versa
Before mapping genes and traits, further filtering was applied to diminish the contribution of rarely occurring OGs and words. Only OGs occurring in at least four distantly related species clades were considered; similarly, we focused on words yielding positive species–word associations in at least four distantly related species clades (thereby utilizing clades of closely related species from STRING [8]; see Table S1). OGs encoding phage-associated proteins (i.e., those with description lines including the terms “phage”, “transposase”, and “integrase”) were regarded as a source of “contamination” within the genomes of analyzed species and thus ignored. Finally, we eliminated words that did not display sufficiently strong association with any of the species studied; this was done by removing all but the 1,000 longest transformed species–word vectors (these were considered to be the most informative vectors). The remaining vectors were normalized to a length of 1; similarly, all species–OG vectors were normalized. Subsequently, the pair-wise similarity of each word–OG pair was computed, that is, the word–OG association score, defined as the inner product of normalized species–OG and species–word vectors (the highest word–OG association score obtained is 0.812; see ). Furthermore, the similarity score for pairs of OGs and for pairs of words was computed in the same way as described for the word–OG associations. Using means linkage clustering analysis as implemented in OC [31] (“similarity mode”; cutoff = 0.45), sets of words and OGs were independently generated. Finally, combined clusters were constructed by combining word sets with OG sets if these were linked by at least one significant word–OG association. Note that this “loose” clustering procedure allows word sets to be combined with several OG sets, and vice versa (i.e., words or OGs may in principle be part of several clusters).
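The vector filtering and scoring steps described above can be illustrated with a small Python sketch. It is a simplified reading of the text, not the original implementation: the helper names (`keep_longest`, `normalize`, `association`) are hypothetical, vectors are plain lists with one entry per species, and the transformation applied to the species–word vectors before ranking is not reproduced.

```python
from math import sqrt


def length(vector):
    """Euclidean length of a vector."""
    return sqrt(sum(x * x for x in vector))


def keep_longest(vectors, k=1000):
    """Keep the k longest vectors from a {name: vector} mapping."""
    ranked = sorted(vectors, key=lambda name: length(vectors[name]), reverse=True)
    return {name: vectors[name] for name in ranked[:k]}


def normalize(vector):
    """Scale a vector to length 1 (all-zero vectors are returned unchanged)."""
    norm = length(vector)
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]


def association(word_vector, og_vector):
    """Inner product of the two normalized vectors."""
    pairs = zip(normalize(word_vector), normalize(og_vector))
    return sum(a * b for a, b in pairs)
```

Because both vectors are scaled to unit length, the inner product is a cosine similarity and cannot exceed 1, which gives context for the reported maximum of 0.812 and the clustering cutoff of 0.45.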
Assessment of prediction quality
We reasoned that the quality of the predictions may be examined using an orthogonal strategy: comparing the predicted word–OG associations to previously established trait–gene relationships, which can be extracted from MEDLINE when focusing on significantly associated word–gene pairs. Namely, previously established relations were extracted from scientific abstracts by detecting significant co-mentioning of trait-descriptive words and gene names, using the hypergeometric distribution [32]. Thereby, gene names were associated with species, considering MeSH terms and organism names occurring in abstracts, and including gene synonyms retrieved from http://www.bork.embl-heidelberg.de/synonyms. (For instance, the word “flagellum” co-occurs significantly with fliR from S. typhimurium, p = 0.00063, consistent with a gene function in motility.) We assume confirmation of a predicted word–OG association (“true positive”) if any gene within an OG significantly co-occurs with a word (i.e., when p , roughly corresponding to p ≤ 0.03). demonstrates the enrichment of true positives among the highest scoring predictions. Words not significantly associated with any gene, or OGs lacking genes significantly linked to any word, were ignored. Furthermore, we conservatively estimated expected fractions of true positives by shuffling both OGs and the 1,000 most informative words, and subsequently repeating the assessment with previously established trait–gene relationships on these randomized associations. By comparing to expected scores, we estimated two significance thresholds: word–OG association scores ≥0.5675 (true positives are 5-fold enriched over expectation) are regarded as significant; scores ≥0.6125 (7.5-fold enrichment) indicate “high-confidence”. OGs and words discussed in detail (see, e.g., , , and 4 and and ) all contribute with at least one high-confidence association to the respective clusters.
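The co-mention test can be sketched as a one-sided hypergeometric tail probability. The Python example below is an illustration under stated assumptions: the function name is hypothetical, and the choice of background (here, `n_total` abstracts in which `n_word` mention the word and `n_gene` mention the gene) is assumed, since the text does not spell it out.

```python
from math import comb


def comention_pvalue(k, n_word, n_gene, n_total):
    """Probability of at least k co-mentions arising by chance.

    k       : abstracts mentioning both word and gene
    n_word  : abstracts mentioning the word
    n_gene  : abstracts mentioning the gene
    n_total : abstracts in the assumed background set
    """
    largest = min(n_word, n_gene)
    # Ways to draw n_gene abstracts from the background.
    total_draws = comb(n_total, n_gene)
    # Sum the hypergeometric probabilities for i = k .. largest.
    favourable = sum(
        comb(n_word, i) * comb(n_total - n_word, n_gene - i)
        for i in range(k, largest + 1)
    )
    return favourable / total_draws
```

A small p-value indicates that the word and the gene name are co-mentioned more often than expected if they were distributed independently over the background abstracts.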
The following species were analyzed:
Aeropyrum pernix, Agrobacterium tumefaciens (Cereon), A. tumefaciens (Wash.), Aquifex aeolicus, Archaeoglobus fulgidus, Bacillus halodurans, B. subtilis, Bifidobacterium longum, Borrelia burgdorferi, Bradyrhizobium japonicum, Brucella melitensis, Buchnera aphidicola, B. aphidicola Schiz, Campylobacter jejuni, Caulobacter crescentus, Chlamydia muridarum, C. trachomatis, Chlamydophila pneumoniae AR39, C. pneumoniae CWL029, C. pneumoniae J138, Clostridium acetobutylicum, C. perfringens, Corynebacterium efficiens, C. glutamicum, Deinococcus radiodurans, Escherichia coli K12, E. coli O157:H7, E. coli O157:H7 EDL933, E. coli O6, Fusobacterium nucleatum, Haemophilus influenzae, Halobacterium sp. NRC-1, Helicobacter pylori 26695, Lactococcus lactis, Leptospira interrogans, Listeria innocua, L. monocytogenes, Mesorhizobium loti, Methanococcus jannaschii, Methanosarcina acetivorans, M. mazei, Mycobacterium leprae, M. tuberculosis CDC1551, M. tuberculosis H37Rv, Mycoplasma genitalium, M. pneumoniae, M. pulmonis, Neisseria meningitidis A, N. meningitidis B, Nostoc sp. PCC 7120, Pasteurella multocida, Pseudomonas aeruginosa, P. putida, Pyrobaculum aerophilum, Pyrococcus abyssi, P. furiosus, P. horikoshii, Ralstonia solanacearum, Rickettsia conorii, R. prowazekii, Salmonella typhi, S. typhimurium, Shewanella oneidensis, Shigella flexneri, Sinorhizobium meliloti, Streptococcus agalactiae, S. mutans, S. pneumoniae R6, S. pneumoniae TIGR4, S. pyogenes, S. pyogenes M3, S. pyogenes MGAS8232, Staphylococcus aureus Mu50, S. aureus MW2, S. aureus N315, S. epidermidis, Streptomyces coelicolor, Sulfolobus solfataricus, S. tokodaii, Synechococcus elongatus, Synechocystis sp. PCC 6803, Thermoplasma acidophilum, T. volcanium, Thermotoga maritima, Treponema pallidum, Ureaplasma parvum, Vibrio cholerae, Xanthomonas axonopodis, X. campestris, Xylella fastidiosa, Yersinia pestis, and Y. pestis KIM.