Reviewer 1: Dr. Igor Zhulin (Oak Ridge National Laboratory, USA)
I have conflicting views on this paper. On one hand, I have read the Introduction, the beginning of Results & Discussion (the authors lost me halfway through this section, though, as it became very descriptive and I had a hard time connecting the pieces), and the Conclusions with great interest. The topic is fascinating and the amount of work that has been done is unbelievable. The authors analyzed an enormous amount of data, both published and results of their computational research, and presented not only a catalog of proteinaceous toxin systems, but a multi-scale picture of their roles in various biological processes. On the other hand, it all came at the high price of lacking necessary details regarding computational analyses and focus. I perfectly understand that presenting such a huge amount of information requires sacrifices in some areas, but I do not think that it should be in describing “experimental procedures”. It is a generally accepted policy in science that procedures must be presented in sufficient detail, so that experiments can be independently reproduced. This paper, in my opinion, does not fulfill this requirement. The section “Search strategy to identify new toxins and immunity proteins”, which serves the purpose of providing such details, gives only a very general description.
Authors’ response: We have altered the Material and Methods to provide more extensive details regarding the procedures we followed with respect to sequence and structure analysis. We do not agree with the referee’s statement that experimental procedures have been sacrificed. In essence, all the sequence and structure analysis was performed using publicly available programs, which have been published and are well-known in the computational biology community, if not more widely. In the current version of the Material and Methods we describe these without omission and any reader with access to appropriate computer resources can use the same. We also disagree with the referee’s allegation of the lack of sufficient information for independent reproducibility – see below for further details in this regard.
Finally, the length and overall organization of this paper make it very difficult to follow, and the lack of page numbers is inexcusable for a manuscript that has 130 of them. Nearly every one of the 38 subchapters of this paper has its own introduction and reads as a separate story. As a result, we do have an encyclopedia of polymorphic toxin systems, but its true scientific quality is hard to estimate.
Personally, I would rather see much smaller pieces of this work presented in a concise way with all details of searches and analyses clearly shown. The global view that the authors aimed at presenting is much better suited for review papers. Here we have a lot of original work mixed up with a review of literature: the number of references in this paper is higher than in many comprehensive reviews on similar topics. I think the quality of both original work and review suffers from this mix.
The bottom line is that to me this is a paper that reaches very interesting conclusions, but which is very difficult to comprehend in its entirety and some (if not many) of its results cannot be verified (or are very difficult to verify) independently.
Authors’ response: We regret the inconvenience caused by the lack of page numbers, which stems from our use of a PDF reader that provides the page numbers, as against a print version. The referee raises three basic issues, which we address below:
(i) Length of the article – single long versus multiple short papers: Short articles are useful when a single domain or computational observation needs to be succinctly presented. Indeed, upon our initial discovery of these systems we published two shorter articles outlining just the details of specific aspects of them. However, upon further investigation it became clear that neither those two works nor subsequent experimental studies on these systems really do justice to the magnitude of domain diversity seen in these systems. Unlike many other systems, despite these proteins being around and accumulating in the non-redundant protein database for more than a decade now, there has been hardly any comprehensive study on them. This is testified by the rather rudimentary annotation borne by most of them in protein databases. This being the first such treatment of a long-neglected class of highly represented proteins meant a particularly long paper. Furthermore, the practical aspects of publication meant it was quite infeasible to prepare numerous separate small papers and submit each for peer-review. We realized in the course of our study that splitting the individual discoveries into multiple manuscripts would dilute the big picture emerging from these systems. With respect to shorter works being easier to read than a comprehensive manuscript such as this, we opine that it is largely a matter of taste. It may be noted that referee two, despite finding the length remarkable, commented on its easy readability. The apparent self-sufficiency of the sub-sections is primarily to help readers who might be more interested in one or a few of the toxin or immunity domain families, but the text has been edited to minimize redundancy. Hence there is no repetition of material between sections.
(ii) Review versus original paper admixture: We disagree with the referee in saying that it is a mixture of review and original research. The “review” aspect is limited to the introduction and general conclusions, as is typical of any research paper. It should be kept in mind that any kind of computational analysis based on sequence/structure analysis needs to place newly identified domains in the context of what is already known in order to make new functional predictions. This is exactly what we do – this necessitates the mention of previous studies and also precedence of biochemical activities for functional inference. We do not see this as being a mixture of review with new results but merely an aspect of building a functional argument. While there are several domains and ideas presented in this study, we were particular in only emphasizing those that are novel and discovered in this study. In our calculation, ~85% of our dataset (that has about 250 toxin and immunity domains) is not found in any domain database. Those that are already present in protein domain databases like PFAM are typically listed as domains of unknown function (DUFs) and are in need of functional annotation.
(iii) Reproducibility: As noted above, we do not accept the claim that our results are not reproducible. Of course, the ease of reproducibility depends entirely on the time available to one attempting it. We should emphasize that all the computational discoveries reported here use standard sequence/structure analysis techniques laid out in the Material and Methods, as is typical of a paper in this field. For those cases involving more difficult detections, we explicitly mention in the paper the program used and the statistical support for the particular relationship, or the Z score cutoffs used by DALIlite for structural relationships. Since we have provided Genbank identifiers (gis) for the prototypical proteins of every group, all the remaining relationships can be reproduced by running profile searches with PSI-BLAST, HMMsearch3, JACKHmmer or HHpred on the Web or locally, either in a unidirectional or transitive fashion. Most importantly, we have provided one of the most extensive supplements for a sequence/structure analysis paper: alignments for each toxin and immunity domain have been provided; hence, obtaining starting points for reproducing searches should not pose any difficulty. The gis of all proteins under consideration are also provided along with an appropriate classification. This allows for independent verification of architectures and operonic associations. In addition to the extensive tables in the body of the article which provide details regarding active sites and phyletic patterns, the data is also provided in the supplement as searchable tables, where readers can browse the data by species, domain, operons, and pathway of secretion. We fear the referee did not peruse the extensive supplement that provides all the material for reproducing the presented analysis. In the revised version we have further improved the presentation of the supplement to improve ease of access to the alignments.
We will also upload all the new alignments to protein databases such as Pfam making the material available upon publication to facilitate easy reproduction and use of the presented results.
Reviewer’s response to above:
I am not persuaded by the authors’ arguments regarding their description of “experimental procedures”.
Let me consider just the first paragraph of Materials and Methods, which is shown below (in italics) in its entirety and is fragmented only by my interjections.
As described in the search strategy, protein sequences corresponding to predicted toxins, trafficking, presentation, processing and immunity domains were isolated using diagnostic domain architectures and gene-neighborhood templates, that were initially identified in previous studies (Figure ). These domains were then used as seeds in iterative profile searches with the PSI-BLAST and JACKHMMER programs that run against the non-redundant (NR) protein database of National Center for Biotechnology Information (NCBI), to identify further homologs.
This is a very general statement, which provides very little detail. Clearly, each PSI-BLAST and JACKHMMER search is carried out not with “domains”, but with one concrete protein sequence, which has a name and coordinates of the region that was used as a query.
Authors’ response: We concede that the word domain in this context might be confusing for some readers. However, it should be noted that in this context we obviously imply the amino acid sequence corresponding to a given domain. This point has been emended.
A search is performed against a specific database of a certain size and content. The size of the NR database has doubled in less than 3 years and is changing every day. Thus, it is important either to work with a fixed version of NR or to report which version was used in a given search. Here is an excerpt from the authors’ own work, which provides a good example of how an “experimental procedure” should be described:
“A PSI-BLAST search was initiated with the conserved N-terminal extension of the SGC (human SGC1β, gi: 4504215, region 1–360), using an inclusion threshold of .01, and compositional bias based statistics to eliminate false positives arising due to peculiarities of sequence composition. Both the N- and the C-terminal parts of this extension gave several distinct hits to different bacterial proteins, supporting the presence of two distinct globular domains in this extension. Based on these hits we divided the extension into N- and C-terminal parts and initiated separate PSI-BLAST searches with them. Searches with the N-terminal part of the extension gave significant hits to bacterial proteins of the length 180–195 residues within the first 3 iterations (eg. Mdge1313 from Microbulbifer degradans is detected with an expect-value (e) of 10^-4 in the first iteration)…” (LM Iyer, V Anantharaman and L Aravind, BMC Genomics 2003, 4:5). Although some details are still lacking and the NR version was not specified (not that critical for the year 2003), this description is thorough enough to reproduce the steps that were taken during the domain identification process. I regret that ten years later the authors think that providing search details is no longer necessary. Once again, I understand the reason for not providing details for the numerous searches that they have carried out, and once again I disagree with this position.
Authors’ response: We appreciate the referee quoting from a former work of ours. Obviously we have neither forgotten nor changed our philosophy of domain discovery or analysis in the past 8 years. We note that the referee states that he understands why we do not give these details in the same manner as is done when reporting the discovery of a single domain or a few domains. We should reiterate that when such an analysis is scaled up to hundreds of domains, providing descriptions such as that pasted by the referee would result in an extraordinary and tedious prolixity for most readers (users) of the article. Hence, the report in the actual manuscript focuses on the points of biochemical/biological interest with only a general description of the search strategy for most cases. This does not mean that the issues raised by the referee are inaccessible. They are simply provided in the supplementary material. Herein a reader might find a collection of the actual saved PSI-BLAST searches for all the notable domains described herein. The same files should supply the specifics of the NR database at the point of the run. Furthermore, another file in the supplement provides the query gi with sequence coordinates of all seeds used for the domain-specific searches. Yet another file provides the searches with all the profiles, which we created for this work (either PSI-BLAST or HMM) against the NR database from May 23rd, 2012. The links have been made explicit in the additional file.
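The details debated in this exchange (query identifier, query region, inclusion threshold, database version) can be captured as one small record per search. The following is a minimal sketch in Python, filled in with the values from the 2003 excerpt quoted above. The class and field names are hypothetical illustrations and do not describe the format of the authors' supplementary files.

```python
from dataclasses import dataclass, asdict
from typing import Optional, Tuple
import json


@dataclass(frozen=True)
class SearchRecord:
    """Hypothetical record of one profile search (not the authors' format)."""
    program: str
    query_gi: int
    region: Tuple[int, int]            # 1-based, inclusive residue range
    inclusion_threshold: float
    database: str
    database_version: Optional[str]    # None when the version was not reported

    def describe(self) -> str:
        start, end = self.region
        version = self.database_version or "version not recorded"
        return (f"{self.program} seeded with gi:{self.query_gi} "
                f"(residues {start}-{end}), inclusion threshold "
                f"{self.inclusion_threshold}, against {self.database} ({version})")


# Values taken from the quoted 2003 excerpt; the NR version was not given there.
record = SearchRecord("PSI-BLAST", 4504215, (1, 360), 0.01, "NR", None)
print(record.describe())
print(json.dumps(asdict(record)))
```

Serialising such records alongside the saved search outputs would keep the query coordinates and database snapshot attached to each reported relationship.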
Referee’s comment resumes: For most searches in which were used to report the relationships presented in this work a cut-off e-value of .01 was used to assess significance.
Let us leave alone the fact that something is missing from this sentence (what were used?) and focus on the main point. This statement means that for some searches a cut-off E value other than 0.01 was used.
Authors’ response: This sentence had a typo, which we have now corrected; we appreciate the referee pointing it out.
FOR WHICH ONES? WHY? No details provided. Furthermore, 0.01 is already a “dangerous” level when it comes to false positives. The description provided by the authors leaves open the possibility that some searches were carried out with an even worse E value. It does not automatically mean the results are incorrect, but it does mean that special care must be taken when considering such relationships and a description must be provided.
Authors’ response: The .01 cutoff is dangerous only in the hands of the untrained sequence analyst. Obviously we took special care to manually examine every iteration of searches with every domain reported in this study. Thus, we ensured that the new sequences being included are unlikely to be false positives.
Referee’s comment resumes: This was further confirmed with other aids such as secondary structure prediction and superposition on known structures, if available. For each toxin or immunity gene, the gene neighborhood was also comprehensively analyzed using a custom Perl script of the inhouse TASS package. The process was carried out iteratively and exhaustively and resulted in the identification of over 250 toxin and immunity domains.
I am guessing that the first sentence refers to assessing the validity of multiple sequence alignments (which is described in the next paragraph). This indeed is a common technical element, which requires no further description. However, the next sentence makes quite a difference. What is meant by “comprehensive analysis of the gene neighborhood”? How many genes in the vicinity of the gene of interest were analyzed? How were they analyzed: by their RefSeq annotation? COGs? Best BLAST hit? Gene neighborhood analysis is a very important element of computational genomics of prokaryotes; however, there is no publicly available, published program or even a single, commonly accepted approach on how to do this analysis. Thus, it is important to provide details.
Authors’ response: The Material and Methods have been emended to include further details on neighborhood analysis.
“The process was carried out iteratively and exhaustively…” Which process? The entire process of domain identification or only the PSI-BLAST searches? I understand how the latter can be done iteratively and exhaustively, but I can only guess what it means with respect to the entire process, and certainly cannot distinguish between these possibilities.
Authors’ response: The Material and Methods have been emended to remove the potential confusion arising from this statement.
In response to my original critique the authors replied that they “do not agree with the referee’s statement that experimental procedures have been sacrificed. In essence all the sequence and structure analysis was performed using publicly available programs, which have been published and are well-known in the computational biology community, if not more widely”. In essence, yes, but in some cases, obviously, no: a custom Perl script of the in-house package… Custom scripts execute specific actions. We do not need to know what the script is, but we certainly do need to know what the action was. “Comprehensive analysis of gene neighborhoods” to me is a prototypical example of sacrificing the description of “experimental procedures”. Even when it comes to publicly available and published tools, procedure details should be provided. In experimental biology, it is not enough to state that PCR was used to amplify a given gene – exact primers must be provided. Perhaps this is not the best analogy, but it illustrates the point.
Authors’ response: The Material and Methods have been emended to describe the action of the script which in essence provides the details pertaining to the gene-neighborhood analysis raised above.
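To make concrete the kind of parameters the referee asks about (how many flanking genes, and when a neighborhood ends), here is a minimal sketch in Python. It is not the in-house TASS script, whose behaviour is not described here. The function name, the `flank` and `max_gap` parameters, and the toy coordinates are all assumptions chosen for illustration.

```python
from typing import List, NamedTuple, Optional


class Gene(NamedTuple):
    gene_id: str
    start: int
    end: int
    strand: str  # "+" or "-"


def neighborhood(genes: List[Gene], target_id: str, flank: int = 3,
                 max_gap: Optional[int] = None) -> List[Gene]:
    """Return the target gene plus up to `flank` genes on each side (one contig).

    If `max_gap` is given, extension in a direction stops at the first
    intergenic distance larger than `max_gap` nucleotides.
    """
    ordered = sorted(genes, key=lambda g: g.start)
    idx = next(i for i, g in enumerate(ordered) if g.gene_id == target_id)

    left: List[Gene] = []
    for i in range(idx - 1, max(idx - flank, 0) - 1, -1):
        if max_gap is not None and ordered[i + 1].start - ordered[i].end > max_gap:
            break
        left.append(ordered[i])
    left.reverse()  # restore coordinate order

    right: List[Gene] = []
    for i in range(idx + 1, min(idx + flank, len(ordered) - 1) + 1):
        if max_gap is not None and ordered[i].start - ordered[i - 1].end > max_gap:
            break
        right.append(ordered[i])

    return left + [ordered[idx]] + right


# Toy contig with made-up coordinates.
genes = [
    Gene("a", 1, 300, "+"),
    Gene("b", 350, 900, "+"),
    Gene("toxin", 950, 2000, "+"),
    Gene("imm", 2010, 2400, "+"),
    Gene("far", 9000, 9500, "-"),
]
window = neighborhood(genes, "toxin", flank=2, max_gap=500)
print([g.gene_id for g in window])
```

Stating the values of such parameters (here `flank` and `max_gap`), together with how each neighboring gene was annotated, is the sort of description that lets a neighborhood analysis be repeated.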
On a final note, I would like to emphasize that I have the utmost respect for the authors, who have been leaders in the field for many years now, and who produced a series of groundbreaking papers in computational genomics. Without doubt, their results and conclusions are both correct and important. Furthermore, I applaud their decision to submit all domain models to the public repository (Pfam). However, I do disagree with their position on attention to detail in describing “experimental procedures”. I can expand on this point substantially; however, this is not the place for such a debate.
Authors’ response: We too believe that this is not the place for a general debate on methodology.
Reviewer 2: Dr. Arcady Mushegian (Stowers Institute for Medical Research, USA)
The manuscript by Zhang et al. is a magisterial treatment of a large and heterogeneous group of bacterial complex toxin proteins as well as the immunity proteins that countervail the action of these toxins. It is a comprehensive collection of old and new protein families, genome contexts and phyletic distributions of these important functional modules in prokaryotes, which also crosses over to partially analyze the sequence relationships of secretion systems in bacteria. I have no concerns about the quality of sequence comparison, domain definition and genome context analysis. This is a catalog of novel predicted functions, which can guide the work of experimentalists for years to come. I do have, however, several small concerns about data presentation and some comments that have to do with the broader discussion of bacterial evolution. More specifically:
Authors’ response: We thank the reviewer for his positive comments and suggestions.
p. 21–22: a few homologs of multidomain polymorphic bacterial toxins are purported to be present in eukaryotes (e.g. gi 321474287 in Daphnia and Tox-REase-8 in a subset of insects), and it is surmised that they have been horizontally transferred from bacteria. How do we know that these genes are indeed found in the genomes of these eukaryotes, and do not represent endosymbiont DNA or other contamination? Have the genomic contigs been assembled, do these genes display eukaryotic features - e.g., introns?
Authors’ response: In our analysis, we were particularly careful in eliminating false assignments of lateral transfer to eukaryotes and used several parameters to decide if the laterally transferred genes were indeed encoded by the eukaryotic species. In the simplest scenario, the presence of introns was indicative of their eukaryotic presence. For example, the gene for gi 321474287 in Daphnia contains 11 introns, whereas most Tox-REase-8 genes in insects contain at least one intron, eliminating the possibility of these genes being contaminants. Other parameters that were considered include: 1) Elimination of sequences that were identical or almost identical to bacterial sequences. In our dataset, none of the proteins assigned as laterally transferred showed any identities or near identities to bacterial sequences; 2) Most proteins assigned as laterally transferred to eukaryotes also showed a presence in more than one eukaryotic species, which further helps in eliminating false lateral transfer assignments. For example, Tox-REase-8 is present in crustaceans, insects and placozoans. Similarly, Tox-GHH domains are present in five major lineages of bacteria, while in the eukaryotes they are only found in multiple metazoan species (TCAP domains of teneurins). In response to this comment and to that made by Reviewer 3, we have explained this procedure in more detail in the Materials and Methods.
p. 44–45. The gene neighborhood network shown in Figure : I am not sure what it is supposed to visualize. The authors state that the direction of the edges is important, i.e., it shows the 5' to 3' order of genes or protein domains; but the arrowheads are barely visible even in the pdf at magnification 250%, and will not be seen online. In any case, the edge density is so high that the main message seems to be 'anything can link to anything'. The graphs become more sparse when clade-specific connections are shown - this is more interesting, but perhaps visualization would be better if the density of connections is modeled by the edges of different thickness.
Authors’ response: We agree with the reviewer that the full view of the domain architectural network was too dense for a detailed view. We have now added a simplified graph next to the central graph that further combines all nodes into metanodes based on their functional type. This simplified graph gives a better view of the follow on connectivities across all toxin polypeptides. For example, it clearly shows that toxin domains detected in this study are almost always at the C-terminus of the protein.
The next several comments have to do with somewhat superficial and inconsistent discussion of relative plausibility of various evolutionary scenarios.
p. 46 "The phyletic pattern of this system suggests that it might have emerged in the proteobacteria-bacteroidetes assemblage (members of the group I bacterial division [183]) followed by transfer to a subset of group II lineages such as negativicutes and fusobacteria." --- Why not the other direction, or ancestral origin followed by gene losses (especially given that these scenarios are discussed later for essentially the same phyletic vectors)?
Authors’ response: The above argument is based on parsimony. In this study, we notice a strict correlation between the occurrence of T5SS and the presence of an outer membrane. Most lineages from Group I bacteria (including all proteobacteria and bacteroidetes) contain an outer membrane and also components of T5SS. In contrast, most lineages of Group II bacteria contain only one membrane layer around the cell further encapsulated by a cell wall. Some exceptions include the negativicutes, which are a subset of firmicutes that have an outer membrane. Since the ancestral state of the Group I and Group II bacteria can be generally reconstructed as possessing an outer membrane in the former and containing a single membrane layer in the latter, we propose that the T5SS were laterally transferred to the negativicutes and fusobacteria. We have added additional remarks in this regard in the revised manuscript.
Referee’s further response: The explanation is fine in this case, but compare it to the following point-counterpoint.
p. 52–53: "This general rarity of the polymorphic toxin systems is in striking contrast to the general prevalence of the toxin-antitoxin systems across archaea [22]. This distribution, with a dominant presence in most major clades of both group-I and group-II bacteria, suggests that polymorphic toxin systems could have been present in the ancestral bacterium." --- First, what is meant by "this distribution"? My understanding is that "this distribution" includes "general rarity" of polymorphic toxins in archaea. How can rarity of a system in archaea suggest its presence in bacterial stem, as opposed to later invention in bacteria? I suspect that this is mostly unfortunate wording that should be edited. In contrast, my second concern is more fundamental: essentially, any phyletic distribution may be interpreted as 1. ancestral presence of a gene followed by gene losses, or 2. later invention in one clade followed by horizontal transfers to the other clades; or else 3. some combination of ancestral presence, losses and HGT. To turn these scenarios from mere hand waving to something supported by the evidence, one has to specify the model of gene gain and gene loss more explicitly, or to bring in some auxiliary evidence that favors one of the explanations. I do not see much of this here.
Authors’ response: We agree that this section was a bit unclear and we have now revised it. Similar to the previous point, the polymorphic toxin systems that we report in this study are present in all major lineages of bacteria. While there is no denial that extensive lateral transfer of these systems occurs, the presence in the ancestral bacterium with divergence mirroring the evolution of different secretion systems within the bacterial superkingdom is a parsimonious argument. In contrast, only a few archaeal “species” contain these systems, suggesting that they were probably not present in the ancestral archaeon. Parsimoniously, this suggests that the few archaeal polymorphic toxin systems were acquired from bacterial versions, because the alternative would require a large number of gene losses in different archaeal lineages.
Referee’s further response: In the previous exchange, the presence of a gene at the root of group I only, but not at the root of group II nor at joint root of I+II, was called “parsimonious”. Now, presence at the root of all bacteria is believed to be parsimonious, when the same set of taxa is examined. What kind of parsimony is invoked in each case? (I think I can discern the answer from the next two sentences, but please correct me if I am wrong). The authors appear to understand parsimony as the explanation that requires the smaller number of events. I cannot accept this as an always-preferable explanation, when it does not matter what these events are and how they are counted; in a moderate form, however, we can use parsimony as a criterion of selecting the null hypothesis, i.e., “choose the scenario with the smallest number of events, unless the additional evidence suggests that a more complex scenario has to be considered”. I think that, in this case, however, precisely such additional evidence is available in the form of evolutionary estimates of the relative rate of gene gain and gene loss: almost every estimate suggests that on average gene losses are moderately to highly more frequent than gene gains. So, unweighted parsimony will not work in these cases – a scenario with 1:1 gain-to-loss ratio will be actually making an additional assumption of a relative loss rate that is constrained to be lower than what is observed in nature. Everything is then hanging on the word “large” – how large the excess of losses in archaea is, so that this makes the scenario so unlikely?
Authors’ response: We agree that the general frequencies of gene loss tend to exceed those of gains. However, with respect to the toxin systems in archaea we are dealing with the following situation: The non-redundant database has representatives from over 225 completely sequenced archaeal genomes. Classical polymorphic toxin-like systems are found only in about 15 of them. Thus, there are approximately 15 times as many archaeal genomes that lack these systems as those that have them. More than approximately 1/3rd of the bacterial genomes have at least one such system. Hence, although the referee is right in pointing to the rates of loss exceeding those of gain, we believe our original reasoning based on the parsimony argument is a valid one.
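The disagreement over weighted versus unweighted parsimony can be illustrated with the genome counts given in this response. The sketch below, in Python, compares two crude scenarios under adjustable per-event costs. It assumes that every genome is an independent event and ignores tree topology, so both event counts are upper bounds rather than reconstructions. The function names and cost values are hypothetical.

```python
def scenario_cost(gains: int, losses: int,
                  gain_cost: float, loss_cost: float) -> float:
    """Total weighted cost of a scenario with the given event counts."""
    return gains * gain_cost + losses * loss_cost


def compare(n_total: int, n_with: int, gain_cost: float, loss_cost: float):
    """Compare late acquisition (one gain per carrier genome) with
    ancestral presence (one gain at the root, one loss per non-carrier)."""
    late = scenario_cost(n_with, 0, gain_cost, loss_cost)
    ancestral = scenario_cost(1, n_total - n_with, gain_cost, loss_cost)
    preferred = "late acquisition" if late < ancestral else "ancestral presence"
    return preferred, late, ancestral


# About 15 of roughly 225 archaeal genomes carry such systems (figures from the text).
for gain_cost in (1.0, 5.0, 20.0):
    preferred, late, ancestral = compare(225, 15, gain_cost, loss_cost=1.0)
    print(f"gain:loss cost {gain_cost:g}:1 -> {preferred} "
          f"(late={late:g}, ancestral={ancestral:g})")
```

Under these simplifying assumptions, late acquisition stays cheaper until a single gain is weighted more than 15 times a single loss, since 15g < g + 210l reduces to g/l < 15. A tree-aware count would merge events shared by related genomes and could shift this threshold.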
Referee’s further response:
This is also supported in phylogenetic trees, where the archaeal toxins or immunity domains group with particular bacterial versions.
Is this true for the trees of all families, or only some?
Authors’ response: Barring the barnases, where the relationship is difficult to ascertain one way or another, the other toxin domains consistently show the archaeal branches embedded within the bacterial radiation.
p. 53, the following sentence: "However, it should be noted that these genes and cassettes are highly prone to lateral transfer as suggested by the sporadic phyletic distribution of both toxin domains and immunity proteins [17]. Hence, the distribution of these systems might also reflect in part the secondary dispersion of such systems across diverse bacteria by lateral transfer." --- Essentially, this is the same as to say that inheritance of any genetic element may be either vertical or horizontal. So?
Authors’ response: While the sentence might on the surface appear trivial, it needs to be seen in light of the earlier comment on the polymorphic toxins being inferred present in the stem of the bacterial superkingdom. While that inference can be made based on the distribution of the toxins and their corresponding secretion systems, we intended to provide a more realistic picture (the above sentences), lest it be taken that their evolutionary history was predominantly vertical since their emergence early in bacterial evolution.
Referee’s further response: Once again, in the exchange regarding the statement on p. 46, the inference was that a certain toxin was present in the stem of proteobacteria+Bacteroidetes, but not in the stem of all bacteria. I suppose the scenarios are really different for different toxins – can this be made more explicit?
Authors’ response: The toxin distributions in bacteria are certainly affected by lateral transfer, so we cannot be certain of the inference of a particular toxin in the common ancestor. Nevertheless, based on the differential distributions, we can tentatively propose that some of the widespread versions, such as the barnase, HNH and deaminase domain toxins, might have been present in the stems of the major bacterial clades such as those uniting the group-I bacteria or group-II bacteria.
p. 53: "Certain patterns of distribution of polymorphic toxin systems appear to transcend phyletic boundaries… 1) the hyperthermophiles, which are often chemoautotrophs, from both bacteria and archaea show a strong tendency to lack such systems." --- this seems to be the case of multiple losses in bacteria, possibly favored by similarity in the habitats, and possibly ancestral absence in archaea. Ecological adaptations like this 'transcend phyletic boundaries' more or less by definition - is this the point?
Authors’ response: While adaptations directly related to an ecological niche are indeed obvious in terms of transcending phyletic boundaries, this is not necessarily the case with inter-organismal conflict systems, which do not directly relate to the ecological niche. Since we nevertheless found correlations between these systems and ecology, we felt it would be useful to point them out. This would help understanding the more subtle effects of ecology of a species on their interactions with conspecifics and other organisms.
Referee’s further response: The correlation has been observed between hyperthermophily and lack of polymorphic toxins. As the authors imply, this may in fact be the correlation between chemoautotrophy and lack of toxins – or is it? Which effects here are gross, and which are subtle? Could it be, for example, that hyperthermophily is generally correlated with reduced repertoire of all kinds of secreted proteins, which would be more easily destabilized and inactivated by adverse environment outside the cell?
Authors’ response: We agree that the point raised by the referee regarding temperature affecting protein stability and thereby placing a selective constraint on the number of toxins could be in principle a valid alternative explanation. However, beyond certain compositional and length distribution differences, the total number of secreted and membrane proteins in hyperthermophiles does not appear to be significantly different from other organisms (e.g. Nilson et al. Proteins. 2005 Sep 1;60(4):606–16.) Hence, we are not certain if this explanation might be more relevant than autotrophy, which additionally accounts for the comparable situation in photosynthetic autotrophs.
p. 56: in the case of oral microbiomes, I am not sure how some species were assigned to 'biofilm-forming' category and others to 'cheaters' - I think that at least some species in the latter category are biofilm-forming in their own right.
Authors’ response: As pure cultures, all these species are likely to form biofilms, but the oral environment is a mixed population of diverse bacterial species, and it is well known that oral biofilms are comprised of mixed bacterial species (Paster BJ et al. Bacterial diversity in human subgingival plaque, ref 198). In this context, we hypothesize that the number of toxin and immunity domains predicts how a species will interact with another one during the formation of a mixed biofilm.
Reviewer 3: Dr Frank Eisenhaber (Bioinformatics Institute, Singapore)
I agreed to be a reviewer when reading the author list, only to find out that the MS is by far the longest that I have ever seen as a reviewer in my life. Despite the initial horror and the impressive length, the text is a fine read - both as a research paper and as a review of this specific field. One would not think to shorten it by a page. The thoughts and results are plausible (there is no hope to repeat the calculations even partially). There is considerable care for detail throughout the text, figures and additional files (except for very minor things such as ref. 144 appearing incomplete).
I find the generous addition of supplementary information especially notable.
Possibly, this will be of greatest benefit for people creating annotation pipelines and sequence databases. For practical purposes, the authors might think to add archives with all the individual alignments in single files and domain models in several formats such as HMMER2, HMMER3, etc., ready made.
I think that the work is a welcome addition to the scientific literature.
Authors’ response: We thank the reviewer for his positive comments and suggestions. A more user-friendly supplementary file is now provided with the alignments of the toxins and immunity domains as separate files in a zipped format. We will additionally upload all alignments to protein domain databases such as Pfam, so that researchers can access them more easily. Ref. 144 has been updated in the revision.