RNA editing site prediction accuracy in Didymium
The first issue we address is to what extent the recently achieved complete knowledge of the edited Physarum
improves computational prediction of genes and editing sites in the Didymium
mitochondrial genome. Computational prediction of insertional RNA editing in the absence
of a reference transcriptome has been presented before 
. The method uses a position specific scoring matrix (PSSM) built from protein sequences from other organisms. It finds the editing sites for a given genomic sequence that translate to the putative protein with maximum similarity to the protein family described by the PSSM. With this method, we prepared predictions of editing sites in Didymium
using the PSSMs made before the Physarum
transcriptome was known 
. Since the edited Physarum
transcriptome is now known 
, new predictions utilizing that information were created and compared to these baseline PSSM predictions. Given that Physarum
both exhibit RNA editing and are closely related, the purpose of this process was to see how much the new predictions, which include transcriptome information from Physarum
, improve when compared to the baseline.
In order to be able to evaluate and compare the prediction quality we applied our prediction methods to mitochondrial genes in Didymium
for which the editing sites had already been determined experimentally 
. In total we created five predictions of editing sites for every gene. The first was the baseline prediction using the PSSM developed before the Physarum
transcriptome was known as described above. The second was a Physarum
based PSSM. The Physarum
based PSSMs were created from the NCBI website by starting a PSI-BLAST search 
with the homologous Physarum
protein sequence for each gene rather than with a homologous protein sequence from a more distant organism as in the creation of the original PSSMs. Three iterations of a protein PSI-BLAST search were run for each of the sixteen genes for which the Didymium
editing sites are known (see Methods
section). During this process we manually excluded the Didymium
protein for which the prediction was being made from the model building. Since the first round of PSI-BLAST did not find any homologs when starting from the Physarum
protein sequence for atp8, we could not create a Physarum
based PSSM for atp8 and excluded it from further analysis. The remaining three predictions did not use a PSSM summarizing the properties of a whole family of homologs. Instead, the plausibility of a putative Didymium
protein sequence (generated by inserting Cs into the Didymium
genomic sequence and translating the result) was quantified by aligning the putative Didymium
protein directly to the known Physarum
homolog. Since alignment scores depend on the scoring matrix and different matrices are tuned toward different evolutionary distances, we prepared one prediction each using the BLOSUM62, BLOSUM75, and BLOSUM90 matrices.
We scored the accuracy of a prediction by counting the number of correct and incorrect predictions made by each prediction method (see Methods
). We report the results as a percentage of editing sites in each category relative to the number of predictions the method included overall. The percentage of predicted editing sites is a better indicator than the absolute number since the computational model often does not make predictions for editing sites near the ends of genes. The results for each of the five prediction methods among all fifteen considered genes are shown in .
Accuracy of different prediction methods of insertional RNA editing sites in Didymium.
indicates that the PSSMs created without using Physarum generate the worst predictions showing that the inclusion of the Physarum genome does increase the accuracy of the prediction; a finding that is expected due to the similarities between the organisms and their shared RNA editing. Interestingly, the results show that predictions using the Physarum protein alone outperform the predictions using either of the two PSSMs. Among the predictions that use only the Physarum protein the prediction accuracy increases as the BLOSUM matrices are tuned toward more closely related organisms. This result implies that the organisms are so similar that the inclusion of the genetic information of other organisms into the PSSMs actually decreases the accuracy of the editing site predictions.
Since the prediction method relies on sequence homology we wanted to determine the influence of sequence similarity on the prediction quality. To this end we separated the fifteen genes into a more conserved and a less conserved group (see Table S1
) based on the nucleic acid conservation of the second codon position between the known Physarum
mRNA sequences. The prediction accuracy for all five methods in each of these two groups is shown in .
The overall trend in prediction accuracy is the same for the more highly conserved and the more diverged set of genes. However, although one might have expected that using the Physarum protein works better for genes where the two organisms have diverged less from each other, the improvement in prediction accuracy by using the Physarum protein sequences is actually bigger for the more diverged set of genes than for the more conserved set of genes. We rationalize this by arguing that genes with less conservation between Physarum and Didymium are generally under less evolutionary pressure and thus will also have diverged more between the Myxomycetes in general and the other organisms used to build the baseline PSSMs. Thus, including proteins from other organisms in the prediction hurts the prediction accuracy more for less conserved genes.
Prediction of new Didymium genes
Encouraged by the quality of the predictions on the already known Didymium
genes, we proceeded to use the method to search for other genes in the Didymium
mitochondrial genome and to predict the editing sites that are present in those genes. In order to do this, the genes first had to be located within the partial Didymium
mitochondrial genome available to us. This was accomplished by scoring segments of the partial genome against the corresponding proteins from Physarum
as described before 
. We used only the Physarum
sequence and the BLOSUM90 matrix since this approach performed best on the known genes as described above. Once the location of the genes were identified as the segments with the highest score, the editing sites were predicted. The results are shown in graphical form for each of the eight genes we identified in . The predicted mRNA sequences with C insertions indicated as upper case C's are given in Table S2
. We note that the predictions for nad2 only include part of the gene; a region at the 5′ end is not present as it is missing from the partial genomic DNA sequence available to us.
Graphical representation of the positions of predicted editing sites.
The two genes that stand out by their very low number of editing sites are nad3 and rpS11. Indeed, nad3 is already known to be unedited 
. Our prediction resulted in a single editing site in nad3 toward the end of the gene. While this addition of a single predicted editing site was not expected, it is understandable since the prediction of editing sites becomes more challenging toward the ends of the gene. Six editing sites were found in rpS11; one was found near the 5′ end, three in close proximity of each other in the middle, and two were at the 3′ end. Because of the low number of editing sites and the striking pattern of the predicted editing sites we hypothesize that rpS11 is also unedited just like nad3. The additional predicted editing sites at the end are easily understood based on the overall low prediction accuracy at the end of genes. Since the three predicted editing sites in the middle of the gene are close to each other they can also be a prediction artifact; omitting them would only change the protein sequence over the range of 5 amino acids. We verified that omitting the three editing sites would not create an in frame stop codon in the middle of the protein. Thus, is is plausible that the edited rpS11 mRNA could have been reverse transcribed and inserted into the genome of Didymium
as it has been hypothesized for nad3 
. We note, however, that while nad3 also has a very much reduced number of editing sites in Physarum
(around one every
nucleotides rather than the usual one every
nucleotides), rpS11 shows the normal level of editing in Physarum
While the graphs presented in the preceeding section convey the successes of the various computational methods for predicting the editing sites in the genes of Didymium
with known editing sites, there can be no similar comparison of the successes of the predictions in the new genes shown here as their exact editing sites are not known. However, sequence alignment of the new genes does show that of these genes, rpS4 and rpS11 fall into the less conserved group while cox3, nad1, nad2, nad4, and nad5 (as well as nad3) all fall into the more conserved category. Using this information, an estimate about the success of the BLOSUM90 computational method can be made for the new genes based the known results of the BLOSUM90 prediction method for the sequenced genes. This estimate results in Figure S1
which show the expected number of correct sites, the expected number of sites one or two sites from the actual editing site, and the expected number of more wrongly predicted sites with the associated errors. A true assessment of the success of these predictions will of course have to wait until these RNAs are fully sequenced and their editing sites are known.
Sequence conservation in the vicinity of editing sites
As indicated in the introduction, one of the major questions to be resolved is how the RNA editing machinery knows which sites to edit. Previous studies 
have looked for sequence patterns in Physarum
alone. One property within the mitochondrial genome of Physarum
is that editing sites have a strong preference to occur after a combination of a purine and a pyrimidine 
. However, many editing sites do not follow this pattern and many purine-pyrimidines are not followed by an editing site. Thus, this pattern alone cannot explain the extremely reliable recognition of editing sites and the problem of editing site recognition remains unsolved.
The absence of discernable sequence patterns among the Physarum
editing sites might suggest that every site (or small groups of them) are recognized individually. Such mechanisms exist in the kinetoplastids in the form of guide RNAs 
and in plant chloroplasts and mitochondria in the form of PPR proteins 
. If every editing site is recognized individually, no sequence pattern will emerge when comparing the sequences surrounding all the editing sites in one organism consistent with previous studies in Physarum
. However, when comparing organisms
at sequences surrounding their shared
editing sites, sequence positions that play a role in site recognition are under increased evolutionary pressure and should thus show more conservation across species than sequence positions not involved in editing site recognition. Thus, instead of looking at one organism at a time, we here used the edited genes of two related organisms with insertional RNA editing, Physarum
, and examined the patterns of sequence conservation
between the two organisms. We looked at the nucleotide identities at fixed positions relative to the editing site and investigated whether these nucleotides were conserved between the two organisms or not. In this analysis, we tried to identify positions relative to the editing site with statistically significantly increased degree of conservation from the background, which would indicate functional importance.
We studied the sixteen genes for which the editing sites are known in both organisms as described in the Methods
section. Comparing the complete mRNA sequences of the two organisms, we determined the overall degree of conservation for the first, second and third codon position. This yielded the background frequencies or the “expected” frequencies at the first, second, and third codon position. presents these background frequencies for conservation between Physarum
. The degree of conservation in these genes is relatively high and may not leave enough room to be significantly increased. Thus, we also studied as another group the subset of the 8 less conserved among the 16 genes (i.e., the genes the background frequency at the 2nd codon position of which is less than 85%, see the Methods
section). It can be seen in that the two groups share similarities in their background levels of conservation that are to be expected: the second codon position has the highest conservation, while the third codon position is the least conserved. However, there is a clear difference in the amount of conservation between the two groups as expected by construction of the less conserved group.
Background frequencies for conservation between Physarum and Didymium.
In order to study the vicinity of the editing sites shared by the two organisms, we first identified those C insertional editing sites that are shared and unambiguous (i.e., at least in one of the two organisms the neighboring nucleotides are not Cs). Because of the variations of background frequencies among different codon positions, these editing sites were separated by codon position. shows the number of these shared editing sites for each codon position.
Number of C insertional editing sites shared by Physarum and Didymium.
A previous study demonstrated that the DNA necessary for C insertion is contained within 9 or maybe 10 base pairs on either side of the editing site 
. Thus, we first determined the conservation information of the flanking sequences within a window of 9 positions upstream and downstream of each of the shared editing sites. Then we examined the difference between the observed sequence conservation in the vicinity of the shared insertional editing sites (at positions −9 to +9 relative to editing sites) and the background conservation.
shows the observed and the expected degree of conservation for positions −9 to +9 (relative to the shared editing site) for all genes separately for the shared editing sites at the first and third codon position (we do not show the data for the second codon position because of the small number of these editing sites which results in very low statistical significance). Results for the less conserved group are similar to the results for all genes (see Figure S2
). From these figures, we can see that both the observed frequency and the background frequency are position dependent with codon position being the dominant factor. The observed frequency is higher than the background frequency at some positions, while at other positions the observed frequency is lower.
Comparison of observed and expected conservation.
In order to see whether these variations are statistically significant, we calculated the probabilities for observing increased or decreased sequence conservation in the vicinity of the shared insertional editing sites based on the binomial distribution (see details in the Methods
section). shows those probabilities for positions −9 to +9 for all genes. No significant
-values are obtained (to take into account multiple testing, we use
-value cut off), which implies that there are no statistically significant variations between the observed frequencies and the background frequencies. In spite of the larger room for increased conservation, the results for the less conserved group are similar to the results for all genes (see Figure S3
Figure 4 -values for the differences between the observed and the background conservation.
We also extended our study to positions that are further away from the editing site than 9 nucleotides. For these positions, the analysis is complicated by the fact that additional editing sites can occur between the position to be studied and the editing site of interest thereby mixing different codon positions at the same position relative to the editing site of interest 
. We circumvent this problem by eliminating all primary editing sites from the analysis that have an additional editing site between the primary site and the position we are interested in. The disadvantage of this approach is that as one studies positions that are further and further away from the primary editing site, there are less and less sequences that contribute to the analysis and thus the statistical power decreases. In practice, we reached a limit of
contributing sequences at a distance of
nucleotides for editing sites at the third codon position and at a distance of
nucleotides for editing sites at the first codon position. However, even for these distances no statistically significant increase of sequence conservation was found (data not shown). This suggests that the information on editing site location is not contained within the sequence in the immediate vicinity of the editing site at least at the level of statistical significance set by our sample size.
This leaves us with the conundrum that on the one hand Rhee et al.
demonstrate experimentally that only 9 nucleotides of DNA on either side of the editing site are required for editing and on the other hand our results suggest that there is no statistically significant pattern of conservation within 9 (or even more) base pairs on either side of the editing site. Thus, we propose several possible explanations for the discrepancy between Rhee et al.
's findings and our results.
- Given that we only analyze 16 genes, our sample size might not be large enough to obtain results with high level of statistical significance.
- The RNA editing machinery does not recognize the same positions relative to the editing site at all of the sites. In this case, the increase in conservation in any fixed position may be too small to observe.
- The templates for editing site recognition in the two organisms are their own mRNA molecules. In this case, there is no need for increased conservation in the vicinity of the editing sites (or anywhere else). However, it is hard to envision an actual mechanism that uses the mRNA molecules themselves and that is compatible with the known co-transcriptional editing in Physarum
- Editing site recognition contains two steps: First, the RNA editing machinery recognizes the recognition sites for editing that are located far away from editing sites (at least outside the base pair range studied here) before the gene is transcribed, and the machinery will then put a marker (used for directing RNA polymerase) into the region within 9 base pairs of editing sites. At the time of transcription, the marker will then direct RNA polymerase to initiate the editing event. In this case, the editing information is located far away from the editing sites. But once the marker is set, it is the DNA within 9 bases of the editing sites that carries the marker and thus directs the editing as observed by Rhee et al. We call this hypothesis the “marker model”.
The first and second explanation lead us to consider how much increase in conservation for recognition sites of editing events we should see given the size of the mitochondrial genome of Physarum (62862 bp). Due to the extreme precision of RNA editing in Physarum, the recognition site of each editing event should be unique in the mitochondrial genome of Physarum. According to Rhee et al., the 9 nucleotides immediately upstream DNA and 9–10 nucleotides immediately downstream DNA of the editing sites are necessary and sufficient for editing site recognition. If the actual information on the editing site position is stored within these nucleotides this implies that the pattern recognized within the set of 18–19 nucleotides should occur at the rate of at most 1/60000 in a random DNA sequence.
Based on our calculations (see Methods
section), the lowest conservation for a set of 19 nucleotides that still allows specification of a site within the genome is 80.7%, i.e., at least we should see 80.7% conservation in a 19 nucleotide region responsible for editing site recognition. To test whether a conservation of 80.7% or more would show up as a significant difference between the expected frequencies and the background frequencies at our sample size, we set 80.7% (the lowest expected conservation) as the “observed frequency” for positions −9 to +9 (i.e., we used 80.7% to replace the real observed frequencies). Then we calculated the
-values for observing increased sequence conservation relative to the background frequencies in . For putative motif positions at the third codon position in the vicinity of editing sites at the third codon position we found a highly significant (compared to the cutoff of
. Thus, according to this analysis, we should have seen statistically significant variations between the actual observed frequencies and the background frequencies at our sample size even if the editing machinery does not recognize the same positions relative to the editing site at all of the sites. We thus conclude that the observed degree of conservation is significantly lower than what is to be expected when only the 9 nucleotides upstream and 10 nucleotides downstream of the editing sites contain the information for editing site recognition even if different sites use different combinations of the 19 nucleotides to specify the editing site location. These studies therefore suggest that the first and second of the hypotheses above can be ruled out.
As another test of which aspect of sequences around editing sites could determine the editing position, we tested the specificity of sequences around editing sites. In practice, we started by looking only at sequences immediately downstream of editing sites and examined the uniqueness of these sequences in Physarum
, that is, we tested for every sequence of
-mer) downstream of an unambiguous C insertion site in the known transcriptome of Physarum
-mer only occurs downstream of C insertional editing sites, but does not occur following non-edited sites of the sequences. This analysis is especially powerful since the full transcriptome has recently been determined by a high throughput sequencing experiment 
thereby giving complete access to all editing and non-editing sites for this analysis that compares all editing sites to all non-editing sites in all transcripts.
Since it is unknown if editing site recognition occurs at the DNA or RNA level, we tested the
-mers in both the unedited sequences and the edited sequences. We asked which is the largest
for which we can still find a
-mer that occurs at least once immediately downstream of an unambiguous C insertion site and at least once in a position that is definitely not preceded by an editing site. Both on the RNA and on the DNA level the largest
-mer we found was a
-mer. Thus, we conclude that a mechanism that uses only the downstream sequence of an editing site to specify the editing event, even if it is a different mechanism for every editing site, must use at least
nucleotides downstream of the editing site. Similarly, we found one
-mer combination that is not unique for unambiguous C insertional editing sites when testing the unedited sequences and one
-mer when testing the edited sequences when studying only sequences immediately upstream of the unambiguous editing sites.
Given that Rhee et al.
found that the 9 or maybe 10 nucleotides of DNA both downstream and upstream of the editing sites are responsible for the editing event, we also investigated the specificity of all the possible nucleotide combinations upstream and downstream of the unambiguous C insertional editing sites within 9 nucleotides on either side (describing them as
is the number of nucleotides upstream of the editing site and
is the number of nucleotides downstream of the editing site with
). Since in this case the unambiguous C is inserted inside the motif, testing the uniqueness of these motifs is different between the unedited sequences and the edited sequences. It is the same as before for the unedited sequences. For the edited sequences, we asked if the motif without
the inserted C occurs anywhere in the transcriptome (in addition to the at least one occurence with
the inserted C). We found possible combinations that are not unique for the unambiguous C insertional editing sites for both unedited sequences and edited sequences up to 9+5, 8+7, 7+8 and 6+9 nucleotide combinations. Therefore, the recognition region (if it exists) includes at least 9+6, 8+8, or 7+9 positions in agreement with Rhee et al.
's finding 
that the whole 9+9 nucleotides are required for editing. It is important to note that since this particular analysis does not rely on the comparison of two organisms but uses only Physarum
data it even in the case of hypothesis 3 (that the mRNA itself templates the editing sites) implies that the recognition has to involve at least the 9+6, 8+8, or 7+9 nucleotides surrounding an editing site. We would like to conclude by noting that while we did not find a non unique 9+9-mer which would rule out the “9+9” (or any larger) model (the identity of the 9 nucleotides both downstream and upstream of the editing sites carry the information for the editing event), we would not have expected to find one on statistical grounds alone since the probability for an 18-mer to occur in the mitochondrion of Physarum
is extremely small even if there is no biological reason (uniqueness of the recognition sequence) that prevents it from occurring. We want to emphasize again, that the conclusions in this section do not rely on a common mechanism that simultaneously recognizes all editing sites in one organism but apply even to mechanisms that recognize every editing site individually such as guide RNAs or site recognition proteins. Of course, all these considerations only exclude the information
for the editing site positions to be stored within the identities of the 9+5, 8+7, 7+8, or 6+9 nucleotides surrounding the editing site - it is still possible that the DNA in the immediate vicinity of the edited site carries a “marker” that is placed based on information elsewhere in the genome.
For editing sites within the coding regions, a significant codon bias is known. It has been found that in the mitochondrial genome of Physarum
, the third codon position has the largest number of C insertional editing sites, while the second codon position has the lowest number 
. As shown in the codon bias is also significant for the shared C insertional editing sites in both groups of genes.
A previous study proposed an evolutionary model which explains this codon position bias 
. The general idea of this model is the following. During the proliferation of Physarum
, nucleotide mutations (including substitutions, insertions, and deletions) occur at random positions in the mitochondrial DNA sequence. In the case of random deletions, the offspring can not survive because of the incorrect protein sequence since the mutated DNA sequence is out of frame. However, the editing machinery sometimes may insert back nucleotides to the positions of deletions and preserve the correct reading frame. In this case, the offspring can survive and proliferate. This idea of random creation of new editing sites is also consistent with phylogenetic data 
The net effect of a nucleotide deletion followed by the creation of a new insertional editing site is that the original nucleotide will be replaced by a C. The genetic code is organized such that the third codon position is the most irrelevant for the identity of the amino acid while the second codon position is the most relevant. Therefore, the third codon position is the least sensitive to nucleotide changes to C generated in editing events while the second codon position is the most. Thus, random deletions at the third codon position will have the highest survival rate and the lowest for the second codon position.
According to this model, the codon position bias in Physarum
is mainly a consequence of random mutations with selection at the protein level 
. This implies that genes that are under stronger selection should have a stronger codon position bias in their editing sites as well. Since for this study we have two organisms, we can directly determine the strength of selection on each gene from the sequence conservation. Thus, to test the theory proposed in 
, we examined the relationship between the strength of the codon bias and the overall sequence conservation in both Physarum
As described in the Methods
section, the 16 genes were divided into several groups according to their overall sequence conservation at the second (most conserved) codon position between Physarum
. The detailed group information is shown in Table S1
(since the conservation at the second position of the 16 genes ranges from 60% to 100%, we separated them into four groups by splitting the range from 60% to 100% into four intervals of equal length). We used the ratio of the number of third codon position editing sites
and the number of second codon position editing sites
as a measure of codon bias, and used the overall sequence conservation at the second codon position as a measure of the conservation within different genes. We examined all unambiguous C insertional editing sites in Physarum
, i.e., shared editing sites as well as editing sites specific for either of the organisms. The solid black squares in illustrate the relationship between the codon bias
and the conservation at the second codon position. In order to reduce statistical fluctuations, we also considered a grouping of the genes into only two groups by combining data of all genes with conservation between 60% and 80% at the second codon position into one group and the genes with conservation between 80% and 100% into the other group. shows that, whether the 16 known genes were separated into four groups or into two groups based on their conservation at the second codon position, genes with higher conservation at the second codon position have a higher ratio of
. This difference is significant even when statistical errors within the ratios are taken into account. This demonstrates that genes under stronger selection (or with higher conservation) should have a stronger codon position bias in their editing sites; thus reinforcing the theory that codon bias is a consequence of evolutionary pressure on the protein sequence.
Figure 5 Relationship between codon bias () and the conservation at the second codon position.
In order to increase the statistical significance it would be beneficial to include more genes in the study. Given our work presented above, the seven newly predicted Didymium genes and nad3 are likely candidates to add to the study. The problem with this idea is that the predicted mRNA sequences most likely deviate from the (unknown) true mRNA sequences, which might affect the accuracy for both the overall sequence conservation and the codon bias. Since we have the predicted Didymium mRNA sequences for all the 16 genes for which the actual mRNA sequences are known, we can test how much the estimates of overall sequence conservation and the codon bias differ between the predicted sequences and the true sequences.
To this end, we aligned the predicted Didymium
mRNA sequences and the real Physarum
mRNA sequences (see Table S3
for accession numbers) to obtain the conservation at the second codon position. Then we plotted the conservation data between the predicted Didymium
mRNA and the real Physarum
mRNA versus the conservation data between the real Didymium
mRNA and the real Physarum
mRNA. (see ). We found, that the overall conservation at the second codon position for the predicted sequences (predicted Didymium
mRNA and real Physarum
mRNA) and the real sequences (real Didymium
mRNA and real Physarum
mRNA) are very close to each other except for possibly two genes – atp8 and atp9 – which are much shorter than the other genes. This implies that estimating the overall sequence conservation from the predicted Didymium
mRNA sequences is a valid procedure since the difference in conservation by using the predicted sequences and the real sequences is small.
Comparison of (a) overall conservation and (b) codon bias for real and predicted mRNA sequences.
In the same way, we compared the codon bias between the predicted sequences and the real sequences. We treated the predicted Didymium sequences in the same way as the real Didymium sequences before (we examined all unambiguous C insertional editing sites and used four and two groups). As can be seen from , the codon biases for the predicted sequences and the real sequences are equal within the error bars. This suggests that codon biases calculated from the predicted Didymium sequences can reasonably be used in lieu of exactly known codon biases. However, the agreement between predicted and true sequences is not as strong for the codon bias as it is for the conservation at the second codon position.
Since the overall sequence conservation and the codon bias show only small deviations between the predicted sequences and the true sequences, we can add the seven newly predicted Didymium genes and nad3 to the codon bias analysis. In the same way as described for the 16 known genes, we analyzed the codon bias of these eight genes using the predicted Didymium sequences and real Physarum sequences. Since the determination of conservation at the second codon position from predicted sequences is more robust with respect to prediction errors than the determination of codon bias (see ) we performed this analysis twice. First, we only used the (known) codon bias in Physarum for the eight predicted genes (indicated in as 16 known genes + 8 genes in Physarum), thus only using the predicted Didymium mRNA sequences to determine the overall conservation for group division of each gene, but not using the codon bias in the predicted Didymium mRNA sequences which is less robust with respect to prediction errors than that for overall conservation. Second, we also included the predicted editing sites in Didymium for the eight additional genes in the analysis (indicated in as 16 known genes + 8 genes in Physarum and Didymium). In this case, all unambiguous C insertional editing sites in Physarum and Didymium for all 24 genes are counted.
illustrates the relationship between codon bias and the overall conservation for the 16 known genes (already described above), 16 known genes + 8 genes with editing sites in Physarum
and 16 known genes + 8 genes with editing sites in Physarum
with 2 and 4 groups, respectively. As can be seen from this figure, the strength of codon bias of the 24 genes (including the known genes and predicted genes) is not as strong as in the 16 known genes. However, given the reduced error bars the dependence of codon bias on selection pressure remains statistically significant. We have thus shown that more conserved genes have more significant codon bias in all unambiguous C insertional editing sites in Physarum
as suggested by the previous theory