Our results show that mirrortree-based methodologies perform considerably differently depending on the set of organisms used for building the trees. They also show that, contrary to what was previously assumed, it is not always better to use as many genomes as are available. Most of these results have plausible explanations when the type of interaction and the taxonomic distribution of the organisms are taken into account.
Although the goal of this work is to compare organism sets rather than methods, our results on the performance of the different mirrortree variants agree with previous studies [13]. The lower performance of the baseline MT method compared with PC and CM had already been reported, and is related to the fact that these two improved methodologies use the information in genome-wide co-evolutionary networks to better detect real co-evolution and to implicitly correct phylogenetic biases [13].
The fact that, in general, all methods work better as more genomes are used is not surprising, as more co-evolutionary information is available to them. Nevertheless, it is important to take into account the issues related to phylogenetic distances and redundancy discussed below. PC and CM correct, to some extent, tree similarities artificially increased by the introduction of redundant genomes (strains, etc.) [13]. That is not the case for MT, and hence this methodology is especially sensitive to this and other phylogenetic biases, some of which can be corrected explicitly [10]. The implicit correction of all these phylogenetic biases in PC and CM allows them to benefit consistently from using more organisms.
The fact that all methodologies render better results for permanent interactions (macromolecular complexes) had already been reported [13]. Actually, for MT and PC, the results for the binary and pathways datasets, in spite of being clearly significant and different from random, might not be of practical use in certain prediction scenarios (e.g. if high precision is required). The explanation for the better predictions of complexes could be that the evolutionary pressure to co-evolve is expected to be higher in proteins forced to interact permanently than in those with occasional associations. According to these observations, macromolecular complexes seem to act as "co-evolutionary units" [13].
Another feature of these macromolecular complexes is that, in general, they represent ancient interactions compared with transient interactions and functional associations. For this reason, the interaction is expected to occur for all orthologs (interlogs), and hence its associated co-evolutionary signal is expected to be spread through the whole taxonomy. That would explain the observation that better results are obtained for this kind of interaction when distant organisms are included in the datasets.
Functional associations and transient interactions are intuitively less prone to yield strong co-evolution, which would explain the globally lower performance associated with them. Another characteristic of these associations is that, in general, they are "newer" than the macromolecular complexes. It is known that "rewiring" transient interactions is easy and relatively fast in evolutionary terms [25]. For this reason, it may happen that the orthologs of two proteins participating in a transient interaction in a given organism do not interact in a relatively distant one (they are not true "interlogs") [26]. If that is the case, including these "orthologs", which do not interact and hence are not subject to co-evolution, would "dilute" the co-evolutionary signal. This would explain the fact that, for these types of interactions and associations, better results are obtained when using only close organisms, since the interaction is expected to be conserved in them, while it might be absent in taxonomically distant organisms. In other words, many of the E. coli pathways and transient interactions we are evaluating might be new and hence specific to this microorganism and its close neighbors, so that any co-evolution associated with them would be apparent only in these particular genomes. Interestingly, a similar relationship between the "age" of the interactions, their conservation across the taxonomy, and the resulting optimal set of organisms has been reported for the "phylogenetic profiling" method [15].
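The "dilution" argument above can be illustrated with a toy calculation. The following is only a minimal sketch, assuming the baseline MT score is the Pearson correlation between the inter-ortholog distances of two protein families; the organism labels, the distance values and the helper names (`pair_distances`, `pearson`, `mirrortree_score`) are hypothetical and are not taken from the data or software used in this work. The two families are given perfectly proportional distances among three "close" organisms and unrelated distances for every pair involving a "distant" organism.

```python
from itertools import combinations
from math import sqrt

def pair_distances(dist, organisms):
    """Distances for every unordered pair of the given organisms."""
    return [dist[frozenset(p)] for p in combinations(organisms, 2)]

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("constant distances: correlation is undefined")
    return cov / sqrt(vx * vy)

def mirrortree_score(dist_a, dist_b, organisms):
    """Baseline MT-style score restricted to a chosen organism set."""
    return pearson(pair_distances(dist_a, organisms),
                   pair_distances(dist_b, organisms))

def make(table):
    """Build a distance lookup keyed by unordered organism pair."""
    return {frozenset(k): v for k, v in table.items()}

close = ["c1", "c2", "c3"]    # hypothetical close relatives
distant = ["d1", "d2"]        # hypothetical distant organisms

# Toy distances for protein family A.
dist_a = make({
    ("c1", "c2"): 0.10, ("c1", "c3"): 0.30, ("c2", "c3"): 0.35,
    ("c1", "d1"): 1.00, ("c2", "d1"): 1.05, ("c3", "d1"): 1.10,
    ("c1", "d2"): 1.40, ("c2", "d2"): 1.45, ("c3", "d2"): 1.50,
    ("d1", "d2"): 1.20,
})
# Family B: proportional to A among close organisms (interaction conserved),
# arbitrary elsewhere (interaction assumed absent, no co-evolution).
dist_b = make({
    ("c1", "c2"): 0.20, ("c1", "c3"): 0.60, ("c2", "c3"): 0.70,
    ("c1", "d1"): 0.90, ("c2", "d1"): 0.30, ("c3", "d1"): 1.50,
    ("c1", "d2"): 0.40, ("c2", "d2"): 1.60, ("c3", "d2"): 0.50,
    ("d1", "d2"): 0.20,
})

score_close = mirrortree_score(dist_a, dist_b, close)
score_all = mirrortree_score(dist_a, dist_b, close + distant)
print(f"close organisms only: {score_close:.2f}")
print(f"all organisms:        {score_all:.2f}")
```

In this constructed example the score computed on the close organisms alone is higher than the score computed on all five organisms, because the pairs involving distant organisms carry no shared signal. Real distance matrices also share a background similarity due to the underlying species tree, which this toy example deliberately ignores.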
In some cases it is difficult to disentangle the factors contributing to a given result, for example the number of organisms vs. the taxonomic criteria used for selecting them. Moreover, it is difficult to quantify and numerically assess the differences between the ROC curves we use for evaluating performance. For that reason, these curves are evaluated qualitatively, and the conclusions presented are based on general trends observed across many curves rather than on particular cases.
A future study aimed at obtaining more insight into the relationship between organism sets and performance should include samplings according to other taxonomic criteria (as well as combinations of them, e.g. combining "nearest"+"level"), and a detailed study of the particular interactions detected and not detected in each experiment (their functional classes, etc.).
In the next section, we propose some recipes for the users of these methodologies derived from these results. We plan to implement some of the recipes obtained for the MT method in its recently developed web server [9].