In this study, our primary goals were to put forth a novel statistical background model to calculate k-mer enrichment and then to apply this model to the genomes of a set of related yeast species to gain new biological insights.
A suitable statistical background model to calculate “expected frequencies” of k-mers is essential for detecting over- or underrepresented DNA words 
. A number of such null models exist against which over- and underrepresentation of k-mers can be assessed. Most notably, Markov models of different orders (l<k, where ‘l’ represents the order of the Markov model) have been used as background models in several studies 
. Hampson et al. introduced another background model, dubbed the C0/C1 method, which uses frequencies of all the single base mismatches of a k-mer to arrive at an estimated count for that k-mer 
. In that paper, the authors compared their method against Markov models of varying orders and concluded that the C0/C1 model and Markov models of orders in the vicinity of k/2 (l
k/2) can serve as optimal models to identify over- and underrepresented k-mers. However, as noted in the paper itself, and as is the case with Markov models, it suffers from some amount of AT-rich compositional bias.
Our novel background model, i.e., the Ak-1 method, introduced here can serve as a useful null model against which k-mers can be assessed for their fold enrichment. Comparison of the top 20 k-mers arising from the average (k-1) mer method to the ones from the C0/C1 method and the Markov models shows that there is a significant amount of redundancy between them in their ranking of k-mers. Thus, some of the overrepresented k-mers detected by existing ‘optimal’ background models were also identified to be top ranking by our method. However, our method identifies additional over-represented k-mers not detected by any of the other methods. For example (), there are 10 k-mers in the top 20 list from the Ak-1 method that are not found in the top 20 lists from any of the other models. Also, our method compares favorably alongside the Markov model order 4, Z-score sorted protocol in reporting the least number of repeat-rich elements among the top ranking k-mers.
The same comparisons show that the Ak-1 method detects more GC-rich DNA strings as compared to the other methods. While a potential drawback in our method might be that it identifies too many such GC-rich strings, it nevertheless completely eliminates the AT-rich bias given that the genomic sequence from which the k-mers were derived had an AT content of up to 61.2%. Thus, our method may be the optimal one to use in scenarios that require the identification of repeat-poor, GC-rich strings that are over-represented in AT-rich input sequences and vice versa.
Our method outperforms existing over-representation detection methods in identifying known biologically relevant k-mers. This is supported by the comparative analysis of top 20 8-mers derived from the six different methods. Notably, four of the nine 8-mers uniquely detected by the Ak-1 method represent eight TF binding motifs, some of which are overlapping. Moreover, our method includes more TF binding motifs in the enrichment score based top 20 rank list as compared to other methods.
We also report for the first time the distribution of k-mer frequencies and of the fraction of over-represented k-mers across a biological meaningful range of values of k (4≤k≤16). Two important observations should be highlighted: (a) the fraction of genomic k-mers that are statistically overrepresented in all four yeast species analyzed decreases almost exponentially for k values between 9 and 12. Interestingly, a small, but significant, number of k-mers remain (up to k
16) over-represented irrespective of the method used for fold enrichment score calculation. Since our method captures fewer low complexity DNA containing strings, we hypothesize that a majority of these may represent longer elements that are functionally important. While some of these larger elements may have arisen due to recent chromosomal duplications, it seems likely that at least a subset of such elements serve some regulatory or structural function. (b) The overrepresentation calculation from our Ak-1 method identified almost all of the observed genomic k-mers as overrepresented up to and including k
9. This artifact could be due to the fact that the over-representation calculation for any given k-mer by our method is based on the number of occurrences of its corresponding (k-1) mers. And also because the “space” of novel k-mers explored for each increment of k increases approximately by a factor of 10, up to and including k
9, when the theoretical and observed k-mer counts begin to markedly deviate.
Thus, while our method can be used to rank k-mers based on enrichment for all values of k, we note a potential shortcoming of the same: our average (k-1) mer method may not be the optimal one to detect over-represented k-mers until a value of k is reached where the observed number of k-mers begins to diverge from the theoretically calculated ones.
Both theoretical and empirical distributions of k-mer frequencies or k-mer spectra in genomic sequences have been previously studied. In these studies, the authors have plotted the number of k-mers (for a given value of k) against the frequencies of their occurrence in the genomic region to draw the k-mer spectra. Csuros et al. analyzed the k-mer spectra from sixty diverse genomes and suggested that a Double Pareto-lognormal distribution (DPLN) best captures the multimodal spectra and also accounts for the heavy right tail of the graph associated with over-abundant k-mers 
. In a subsequent independent study, Chor et al. analyzed k-mer spectra for more than 100 genomes and reported unimodal genomic k-mer spectra for most species. The exceptions were the tetrapods, including all mammals, whose genomes and most genomic sub-regions exhibited multimodal k-mer spectra. They also underscored the suitability of low order (l<k/2) Markov models in recapitulating the multimodal spectra and heavy tails of such distributions, refuting the DPLN model 
. The 8-mer genomic spectra reported for S.cerevisiae
by Chor et al. is typically unimodal with the mode at a relatively low k-mer copy number, at around 200.
In order to glean more insights into the nature of k-mer spectra, we decided to analyze comparative k-mer spectra to learn more about the modality of the resulting distributions. Specifically, we addressed the question “how many k-mers in the genomic regions of S.cerevisiae exhibit similar fold enrichment scores across related yeast species and what is the modality of the distribution arising from the comparative analysis?”
Interestingly, the analysis of 8-mers derived from the whole genome of the four yeast species showed a positively skewed normal distribution for some combinations of species. However, for comparisons involving only S.paradoxus and S.bayanus, the resulting distributions were multimodal. This implies that there does not seem to be additional evolutionary selection pressure at both the high and low ends of the fold enrichment score ranked k-mers, at least at the genome scale level.
We were unable to observe any consistent trends across neighboring values of k or across the fold enrichment calculation method used for k-mers derived from either the regions 1 kb up- and downstream of all ORFs or for the intergenic regions. However, the multimodal distributions observed in some of these graphs suggests the existence of distinct classes of k-mers (reflecting diverse functional or structural elements), whose relative enrichment in the genome is regulated by evolution to some extent. We suspect that major changes in the relative abundances of these classes of k-mers may have deleterious effects for the organism.