Given the problems described above, how can the numerous comparisons with other data presented by Lu and co-workers all suggest that their lists are better than existing ones? The answer lies in the subtle but important distinction between 'cell-cycle function' and 'cell-cycle regulation'. Figure 1 of this Correspondence exemplifies the difference: whereas all six genes are involved in the cell cycle, only four of them (including Plo1 and DBF2) are transcriptionally regulated during the cell cycle. Many of the tests performed by Lu et al. to support the validity of their proposed cycling genes do not assess cycling expression per se. Datasets from conditions such as stationary-phase budding yeast, non-proliferating human tissues, developmentally arrested A. thaliana and nitrogen-starved fission yeast measure downregulation in non-proliferating cells, which does not necessarily correlate with cyclic expression. The problem is that any gene involved in the cell cycle should be downregulated under these conditions, whether it is expressed in a phase-specific manner or not. The authors also analyze the enrichment for essential genes and genes annotated with relevant Gene Ontology terms; however, no statistical analysis can change the fact that these are inherently related to the phenotype or function of a gene rather than to its regulation. The vast majority of the comparisons by Lu et al. only show that their set of conserved cycling genes is enriched for genes with cell-cycle function, not that these genes are subject to transcriptional cell-cycle regulation. Indeed, we have previously observed that methods with good performance on a benchmark set based on functional evidence often perform poorly on more reliable benchmark sets based on regulatory evidence [22].
Lu and co-workers [9] also compared their list of cycling genes from budding yeast with the targets of nine cell-cycle transcription factors [23]. This is, in our view, a much better gold standard as it is based on experimental evidence that is directly linked to cell-cycle regulation and not to cell-cycle function. However, this benchmark showed that the list proposed by Lu et al. [9] and the original list proposed by Spellman et al. are equally enriched for targets of cell-cycle transcription factors. Similar benchmarks based on regulatory evidence from the three other organisms even suggest that transfer of evidence between homologous genes leads to a decrease in performance [17].
In summary, homology-based transfer of expression data and other experimental evidence is a powerful strategy for function prediction [25], as protein function is often conserved over long evolutionary distances [20]. However, several studies have shown that the regulation of genes and proteins changes much more rapidly during evolution than their function [4]. We have previously shown that, despite the lack of conserved regulation at the single-gene level, the organisms regulate the same protein complexes, but do so via different subunits [14]. By transferring cell-cycle expression data between distantly related genes, Lu et al. were thus able to identify genes with cell-cycle function that cannot be identified as such on the basis of the expression of the genes themselves (for example, fission yeast mcm3; Figure 1). Selecting the correct evolutionary timescale for the property in question, be that function or regulation, is the key to success for any homology-based method.
Yong Lu, Shaun Mahony, Panayiotis V Benos, Roni Rosenfeld, Itamar Simon, Linda L Breeden and Ziv Bar-Joseph respond:
Despite claims to the contrary from Jensen et al., previous analyses of cell-cycle expression data resulted in opposing views regarding the conservation of expression between different species. While some investigators have concluded that this conservation is surprisingly low [4], others have determined that it is rather high. For example, Oliva et al. found that more than 30% of top cycling genes in budding and fission yeast are cycling and conserved in both species, and Ota et al. identified more than 15% of cycling human genes as cycling and conserved in plants and yeast. The major reason for this discrepancy seems to be the use of strict thresholding for determining whether a gene is cycling or not. Such an analysis on a species-by-species basis may lead to inconsistencies in cell-cycle assignments. Figure 2 of this Correspondence exemplifies this difficulty. While only the expression of the human Mcm6 gene was determined to be cycling by Jensen et al., Figure 2 shows that its curated homologs in budding and fission yeast (which were annotated as non-cycling by Jensen et al.) actually display strong cyclic expression patterns. This is a general problem with cell-cycle analysis. As Figure 3 shows, while some orthologs of cycling budding-yeast genes may fall just below the fission-yeast threshold, they are still (at least weakly) cycling, significantly more than expected by chance, indicating that expression is conserved at a higher rate than the rate determined by thresholding. To address these issues, we have developed a new method for combining expression data from multiple species [9]. Using our method we concluded that cell-cycle expression is conserved at much higher rates than those claimed by Jensen et al.
Figure 2 Expression values for MCM6 in humans, budding yeast, and fission yeast. Values are log ratios between synchronized and unsynchronized cells. (a, b) Expression profiles of budding yeast MCM6 under different cell-cycle arrest methods [2,3]. (c, d) Expression ...
Figure 3 Score distributions for fission-yeast genes that are ranked below the cycling score threshold. The red curve is the distribution of 350 fission-yeast orthologs of cycling budding-yeast genes. The black curve is the distribution of all the other 3,641 ...
The central claim Jensen et al. raise in this Correspondence is that our method is circular. We believe that they confuse assumptions with circularity. Any computational method relies on specific assumptions and, if these assumptions are wrong, the conclusions of that method may be wrong as well. For example, sequence alignment relies heavily on assumptions regarding the parameters used for match, mismatch and gaps. As Dewey et al. nicely show, these parameters can have a big impact on the results of aligning non-coding regions. Nonetheless, researchers have been using these methods for a long time with specific parameter choices and have arrived at very specific biological conclusions. As with our method, these findings depend, at least in part, on the choice of match parameters, which are directly related to the conclusions drawn. Yet they have proved both useful and accurate when validated with independent data.
This is exactly the case for our method. It does not rely on circular logic; rather, it uses very specific and widely accepted assumptions. We assume that if two genes have very similar sequences, they are more likely to perform a similar function. This is the assumption researchers make when using BLAST. When applied to our problem this translates to an increased likelihood that genes with a similar sequence share the same cyclic status (either cycling or non-cycling). Note that this assumption is not binding and is only secondary to the actual observed expression values, as we show in Figure 4. Still, as with any other method, we need to decouple our results from our assumptions to demonstrate that our findings are indeed correct. We highlight below the supporting evidence in which we were very careful to control for sequence similarity.
Figure 4 Comparison of expression score ranks and posterior ranks. (a) The expression score rank and posterior rank for fission-yeast genes. The x-axis is the expression score rank (the lower the rank the more cyclic the gene is determined to be by the scoring ...
One of the major difficulties in identifying genes whose cell-cycle-regulated transcription is conserved across evolution is that cell-cycle microarray data are noisy and often contradictory. Jensen et al. identified the top 300 periodic transcripts from each of four human datasets and found only 63 transcripts in common to all four. With only a 20% overlap between the most periodic 300 transcripts in four datasets from the same organism, there is little doubt that a comparison across four highly diverged species is problematic. The approach of Jensen et al. was to use thresholds that are "more conservative than those originally proposed" and to analyze a smaller, more reliable subset of cyclic transcripts. Our goal was not to exclude, but to capture as many cyclic transcripts as possible, with the view that interesting candidates could be subjected to further verification.
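The overlap figure quoted above is simple to check. The following Python sketch is illustrative only: the function name `shared_fraction` and the toy gene identifiers are hypothetical, and it is not a reconstruction of how the four human datasets were actually ranked or compared.

```python
def shared_fraction(ranked_lists, top_n):
    """Fraction of the top_n entries that appear in the top_n of every list."""
    top_sets = [set(ranking[:top_n]) for ranking in ranked_lists]
    shared = set.intersection(*top_sets)
    return len(shared) / top_n


# Hypothetical toy rankings; the identifiers are placeholders, not real genes.
toy_datasets = [
    ["g1", "g2", "g3", "g4", "g5"],
    ["g2", "g1", "g6", "g3", "g7"],
    ["g3", "g2", "g1", "g8", "g9"],
    ["g1", "g9", "g2", "g3", "g4"],
]
toy_overlap = shared_fraction(toy_datasets, top_n=4)
print(toy_overlap)

# The numbers quoted in the text: 63 shared transcripts out of the top 300.
quoted_overlap = 63 / 300
print(f"{quoted_overlap:.0%}")
```

With the quoted numbers the shared fraction is 63/300, or about 21%, which matches the roughly 20% overlap stated in the text.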
Our approach was motivated by the plot in Figure 3, which shows that fission-yeast orthologs of cycling budding-yeast genes fall just below the fission-yeast threshold for periodicity far more often than expected by chance (p-value < 0.01 using the Wilcoxon rank-sum test, p-value < 0.03 using the two-sided Kolmogorov-Smirnov test). We have attempted to capture these borderline genes by lowering the threshold for a borderline gene if its homologs in other species are cyclic and raising it if they are not. This strategy will certainly lead to more false assignments, but it has also allowed us to identify hundreds of promising candidates for further investigation. Still, almost all the genes that are elevated to a cyclic status by our method have a rather high cyclic expression score to begin with. Figure 4 shows the difference between the initial score (based on expression alone) and the posterior score from our method. As can be seen in the plot, the ranks for most genes do not change much.
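As a rough illustration of the kind of two-sample comparison reported above, the following Python sketch implements a two-sided Wilcoxon rank-sum test with a normal approximation. It is a minimal sketch under stated assumptions, not the analysis behind the reported p-values: the score values are invented, the function names are hypothetical, and no tie correction is applied to the variance.

```python
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def rank_sum_test(sample_a, sample_b):
    """Two-sided rank-sum test (normal approximation, no tie correction)."""
    n_a, n_b = len(sample_a), len(sample_b)
    ranks = average_ranks(list(sample_a) + list(sample_b))
    rank_sum_a = sum(ranks[:n_a])
    expected = n_a * (n_a + n_b + 1) / 2
    std_dev = (n_a * n_b * (n_a + n_b + 1) / 12) ** 0.5
    z = (rank_sum_a - expected) / std_dev
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Invented sub-threshold scores: orthologs of cycling genes versus all others.
ortholog_scores = [6.0, 7.0, 8.0, 9.0, 10.0]
other_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
z_score, p_value = rank_sum_test(ortholog_scores, other_scores)
print(z_score, p_value)
```

A positive z with a small p-value would indicate that the first group tends to score higher than the second. For samples as small as the toy data here, an exact or permutation-based test would be preferable to the normal approximation.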
Jensen et al. also question the complementary datasets we used to validate the CCC sets identified by our algorithm. They claim that these datasets only point to cell-cycle function rather than cell-cycle regulation. However, the 'functional rather than regulatory identification' claim does not explain how our algorithm was able to identify these 'functional' cell-cycle genes. In our analysis we used controls for both types of data (expression and sequence). Specifically, for the essentiality analysis we show that only 16% of cycling yeast genes are essential. If one uses sequence data, so that only genes with conserved homologs in other species are retained, this percentage increases to 27%. If what we find is indeed a functional rather than a regulatory signal, cyclic expression in other species would not have been a factor and the only advantage we would have would come from using sequence data. However, when we use both sequence conservation and conserved cyclic expression, as determined by our method, the percentage rises to 46%, a more than 70% increase over sequence alone. Similar results were obtained for the human conserved set. We have repeated this type of positive control for the other types of complementary analysis and have shown that expression conservation leads to much stronger cell-cycle characteristics.
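The relative increase quoted above follows directly from the three percentages. The short Python sketch below only reproduces that arithmetic; the function and variable names are hypothetical.

```python
def relative_increase(baseline_pct, new_pct):
    """Relative change from baseline_pct to new_pct, as a fraction of the baseline."""
    return (new_pct - baseline_pct) / baseline_pct


# Percentages of essential genes quoted in the text.
all_cycling_pct = 16.0              # all cycling yeast genes
sequence_only_pct = 27.0            # cycling genes with conserved homologs
sequence_and_expression_pct = 46.0  # conserved sequence and conserved cyclic expression

gain_over_sequence = relative_increase(sequence_only_pct, sequence_and_expression_pct)
print(f"{gain_over_sequence:.1%}")
```

The increase from 27% to 46% is 19 percentage points, or 19/27 of the sequence-only baseline, which is just over 70%.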
We have also carried out direct regulatory analysis. Table 1 in our original paper [9] presents the results of motif search methods for genes in CCC2, the set of cycling genes conserved between the two yeasts. We show that these genes have a remarkably well conserved motif for G1 and some of the S-phase transcription factors. In sharp contrast, non-cycling homologs of genes in CCC2 do not have these motifs conserved. The fact that motif conservation agrees with our expression conservation findings is strong support for the CCC2 set assignment.
The other major issue raised here by Jensen et al. relates to the problem of identifying conserved periodic genes whose products carry out the same function in all four of these highly divergent species. Jensen et al. used a combination of sequence similarity and manual curation to identify orthologous groups. In most cases, it cannot be determined whether these groups are really functionally equivalent or whether all such groups have been identified. Nevertheless, on the basis of these assignments, only a quarter of all the cycling genes they studied had orthologs in all four species and these form the basis for their comparison. Of the 60 cycling genes in Arabidopsis with orthologs in all four species, one-third of their orthologs also cycle in pairwise comparisons with each of the other three species, but only five cycle in all four species. All five of these orthology groups represent well studied genes and nothing new was identified.
We purposely avoided restricting our analysis only to genes with clear orthologs across species. Rather, we used BLAST analysis followed by a Markov cluster algorithm [34], which leads to the identification of multi-domain homologous proteins. This difference between the definitions of homologs affects the conclusions reached by us and by Jensen et al. Our method results in large families that show high homology overall but cannot be parsed into one-to-one orthologous pairs across species. In our original paper [9], we presented an analysis of the results of this procedure for the CCC2 set of conserved cycling genes. We found that 82% of budding yeast genes in CCC2 are indeed curated homologs of the fission yeast CCC2 genes [35], a very high rate that indicates the accuracy of the resulting CCC2 set.
As we compare the genes from more divergent species, we are much less likely to be able to ascribe functional equivalence to any given pair. This is especially true for signaling and regulatory proteins that often arise from duplicated genes, and which cannot be forced into functionally equivalent orthology groups until we have a complete understanding of what they do in every species. Jensen et al. are correct that there is no cyclin E ortholog in yeast. There is also no cyclin E in Arabidopsis. However, all four species encode related cyclin genes carrying out functions in late G1 that are important for the transition to S phase, and most of these cyclins are cell-cycle-regulated at the transcriptional level. These are the very types of gene products that we are most interested in identifying.
Towards this end we used an objective and comprehensive strategy for identifying multi-domain sequence homologies across all four genomes. In so doing, we have identified groups of genes that share some truly remarkable properties. The 72 conserved cyclic budding-yeast genes that are also conserved in fission yeast and humans (CCC3) are eight times more likely to be targets of cyclin-dependent kinases than those tested at random, and six times more likely to be involved in protein-protein interactions. Some of these genes encode unexpected proteins (for example, alkaline phosphatase and metal transporters) and there are others about which nothing is known. To further study this set we carried out new experiments [37] to identify the set of cycling genes in primary human cells (our previous analysis, as well as that of Jensen et al., is based on expression data from transformed (HeLa) cells). As we discuss in [37], the set of genes cycling in primary cells is significantly more enriched than the HeLa set for orthologs of cycling genes in budding and fission yeast. We hope that our study will spur the collection of more cell-cycle data and the development of new strategies for identifying conserved periodically transcribed genes.
Correspondence should be sent to Ziv Bar-Joseph: Department of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA. Email: zivbj/at/cs.cmu.edu