Genome-wide surveys of genetic variation can be approached today with NGS techniques that were not available only 5 years ago. The efforts put since 2009 in determining the full genomic sequences on a large number of
S. cerevisiae strains, have accumulated a huge amount of data. In fact, currently more than 60 strains have their genome sequenced in varying degrees of completeness, evidencing large variations not only in non-coding regions, that can affect gene expression level or determine an altered regulation, but also in coding sequences and, surprisingly, in ORFs that were previously stated as essential in the reference strains (
30). As 20% of
S. cerevisiae genes are essential in laboratory conditions, they might be unnecessary in other growth conditions, so their sequence variation could make the yeast still viable when living in a ‘wild’ environment. All these observations imply caution when a strain has to be assigned to a certain group or verified in production lots (e.g. in starter batches for beer or wine production).
It is not reasonable to systematically approach the problem of phylogenetic assignment of yeast strains using whole-genome sequencing, since, though the cost per genome is rapidly decreasing, the efforts put into assembling data still consume a great deal of time. What is feasible, instead, is to use the already available data to search for new markers that recapitulate the phylogenetic relationships among strains evidenced with genome-wide surveys. With this scheme in mind we developed a computational pipeline that collects all the genes shared by a population of sequenced strains (or organisms in general) to systematically evaluate the phylogenetic performances of single genes as well as virtual combinations of them against a generally accepted tree or other clustering schemes of interest, searching for specifically addressed molecular markers. The reference tree we used in this work was the one proposed by Liti and coworkers in 2009 (
12), that identified five clusters not clearly associated either with geography or type of production (the so-called ‘domesticated’ strains) or isolation source. Those clusters, and their composition in stains, were used as a ruler for screening the trees obtained with the different combinations of genes.
In the past 10 years, several hyper-variable loci have been proposed as molecular markers for phylogeny at the strain level. Many of them involved non-coding regions (e.g. microsatellites) that were classically used for the characterization of other organisms. Although interesting, hypervariation in ORFs was of much more interest to us since it evidenced alterations that could be more easily understood and discussed in terms of function and biological role.
Interestingly, we observed that SNPs/indels were present in all but 41 out of 5850 genes shared by the 39 strains (
Supplementary Table S2), so the variation in ORFs was higher than what one could forecast
a priori. It is interesting to note that many of the 41 genes that showed no variation among the strains of the learning set are part of the protein synthesis machinery (14 genes are included in the KEGG ribosome pathway as associated to ribosomes) or participate in processes such as respiration, cell division or active homeostasis. Those coding genes belonging to fundamental cell processes are highly conserved in many other fungi as well, though 100% conservation is not present in
S. paradoxus, the species most closely related to
S. cerevisiae. It is also intriguing to observe that 14 of such highly conserved genes are annotated as dubious (nine genes) or uncharacterized (five genes) in SGD. When additional eight genomes used as a validation set were included in the analysis, 24 out of 41 genes still presented no variations. Note that the quality of the validation genomes was not as high as that used in the learning genomes, showing problems with annotations and coverage, indeed only 36 of these 41 genes were present in all this new genomes. The observation that some of these genes, whose function is still unknown, are highly conserved highlights the existence of a still unexplored universe that surely will deserve further analysis even in the well-known organism
S. cerevisiae.
We assessed the 24 conserved
S. cerevisiae genes in 36
S. paradoxus strains. We found that in some of the
S. paradoxus strains 4 of the 24 genes were not annotated and potentially not present. Some of the genes showed very little conservation but in general at least 2% variation was observed (mean 10.62

±

1.51%). This analysis reinforces the idea that at least 24 highly conserved genes are characteristic of the
S. cerevisiae species.
We confirmed that, using the SNPs/indels extracted from entire genome sequences, we could separate the 39 strains into the five lineages (Malaysian, West African, North American and Wine/European). In the original work of Liti and coworkers 249

178 mutational events were described using both coding and non-coding regions. We were able to reproduce Liti's tree with 226

961 variations (both SNPs and indels) in coding regions only, meaning that the largest part of the variability was in ORFs and that possibly the driving force of evolutionary separation between strains acted on ORFs.
This result confirms previous reports describing a dominant role of
cis variations in shaping functional differences between yeast strains (
3,
31).
It is beyond the scope of this work to discuss in detail the biological implications of having alterations in the 13 genes we propose as candidate markers with optimal phylogenetic performances and resolution. Nevertheless it is intriguing to notice that the 13 genes we propose as candidate markers are described to be involved in only three major cellular processes: cell wall and cell membrane metabolism (YAR042W, YJL099W, YKL068W and YNL125C), DNA synthesis (YML056C), replication (YBR163W) and expression (YBL052C) or translation (YML080W, YOR133W and YPR152C) (see for a more detailed description).
The study of relative rates of nucleotide changes could help to identify genes under particular biochemical or ecological constraints (
32). If the nucleotide variation in a coding region introduces a non-synonymous change in the coded amino acid, it may influence protein function. To understand the dynamics of genomic sequence evolution, we studied the relationships between synonymous and non-synonymous changes (dN/dS) among the 13 genes we propose as genomic markers. From a formal point of view, we are considering non-synonymous to synonymous-site polymorphisms (that could be referred to as pN/pS), since the divergence time of strains is not sufficient for a mutational event to be fixed. Considering this rate averaged for all the strains, the majority of genes presented a very low ratio (<0.25) indicating that, on average, synonymous mutations occur much more often than non-synonymous, and therefore they probably are under strong purifying selection. Two genes, however, present a dN/dS

>

0.4. Even if this rate is lower than 1, there is evidence from population genetics studies that positive selection is expected to produce dN/dS <1 within members of the same species (
33). Since it has been suggested that in this context the dN/dS is relatively insensitive to the selection pressure (
32,
33), we cannot exclude that the two genes are under a positive, rather undetectable selection.
We then investigated the nucleotide changes among lineages in order to detect, for instance, adaptive evolution (
34) or relaxation of selective constraints (
35) between them. The maximum likelihood analyses showed that the dN/dS ratios are highly variable among different evolutionary lineages. In particular, the genes YJL051W and YKL068W, involved in molecular intracellular transport, show strong positive selection in European strains or wine producers. The variation in amino acid sequences of transporters, interacting with other molecular entities (i.e. other proteins, lipids, RNA), ensures the ability to maintain essential physiological functions of the cell, even if the transported entity changes in conformation or composition. These results are in agreement with previous findings (
36) showing the transcriptional cascade caused by natural sequence variation in the sensor
SSY1, master regulator of transport of essential amino acids (
3).
In order to compare the amount of molecular variation within a species to the divergence between species, we applied the MKT (
29). Under the null hypothesis, the ratio pN/pS is expected to be equal to the dN/dS, because all non-synonymous mutations are expected to be neutral. Briefly, under neutrality, pN/pS is equal to dN/dS and thus the NI (computed as [pN/pS]/[dN/dS]) is equal to 1. However these ratios will not be equal if some non-synonymous variation is under either positive or negative selection (
25). A total of six genes had NI values significantly >1 and it can hardly be explained by chance that these genes are all involved in membrane and cell wall metabolism. These sequences presented deleterious mutations that rarely become fixed in the population and that, though not truly contributing to the divergence, still made a significant contribution to polymorphisms. In other words, negative selection is acting in order to prevent the fixation of harmful mutations in these genes.
Two genes (ETF1, IMD4, corresponding to YOR133W and YML056C ORFs, respectively) were identified as having a high codon adaptation index, resembling the codon usage of highly expressed genes. Certainly, these correspond to highly conserved proteins and their very low dN/dS ratio may be a reflection of their importance in the cell.
The gene (YNL161W) presents a nucleotide variation that introduces alterations of the coded protein, presenting large deletions in some strains. Intriguingly, three of those four genes code for proteins involved in morphogenesis (YNL161W) or in molecular intracellular transport (YJL051W and YKL068W).
The hypothesis of a strict association between evolution and cell and colony morphogenesis has been proposed by expression data (
37) and later suggested in a review recapitulating the evolution of yeast morphologic diversity (
38). Our results, indicating the YNL161W gene (
CBK1) as sufficient to recapitulate by itself the entire genomic intra-strain diversity, represent an exciting cue for the investigation of evolutionary traits of yeast.
All the ORFs proposed in the past as candidate phylogenetic markers failed, in our hands, to reproduce the Liti tree if used as the sole variation source. We tested
CCA1,
CYT1,
MLS1,
PDR10,
ZDS2 (
7),
FZF1,
SSU1,
CDC19,
PHD1 (
5) and others, combining them to test whether we can improve their classification power. In effect, we found a certain degree of correct separation using doublets or triplets, but they were
de facto unable to resolve strains sharing the same cluster. Possibly their variability was insufficient or not sufficiently spread along the sequence to allow them to be considered as true phylogenetic markers, given the new rules.
We finally found that classification rules (e.g. different clusters formed by different strains) other than those proposed by Liti failed to be matched by any gene or gene combinations. This tends to confirm the validity of the five-cluster rule and to keep that classification scheme into consideration for future studies on phylogenesis of natural, industrial or clinical strain populations.
The pipeline we designed and implemented was exhaustively tested and supplemented with functions that precisely addressed biological questions such as (i) what are the best molecular markers for that population? (ii) do those markers reflect some already proposed classification scheme? (iii) is there a way of improving existing phylogenetic markers? According to the various questions, the pipeline can autonomously test singletons, doublets and triplets of markers as well as arbitrarily supplied combinations of a desired number of genes (or all genes, if a full genome survey is needed). Future implementations will allow improvements of the currently employed phylogenetic methods (distance-based methods associated to neighbor-joining and bootstrapping) and introduction of phylogenetic tests to statistically assess population structures. Sequencing all or a set of the proposed genes in a wider population promises to reveal the ecology and evolutionary history of S. cerevisiae.
We show how the availability of a whole-genome-based population structure makes it possible for the first time to probe the complete gene-pool of a population for genes or combinations of genes that recapitulate the phylogenetic relationships between the individuals of a population. Sequencing entire genomes is indeed of profound interest. Our approach does not substitute sequencing but rather uses whole-genome information to select genes useful as molecular clocks at the population level, genes whose sequence is sufficient to reconstruct the population structure much faster and with much reduced costs. To date, the approach we propose as successful in yeast strain population analysis could in principle be generalized to any other population of individuals, provided that their population structure have clearly defined lineages that can be addressed unambiguously. Importantly, since we evidenced that sequence quality can be a major concern when identifying variation across very similar organisms, we recommend only top-quality sequences to be introduced into the pipeline, preferring a reduced CDSs set rather than low quality full genomes. Given that, our approach can therefore be used for screening purposes on large collections of samples to select those of interest for further whole-genome sequencing. It must be emphasized that arbitrary clustering schemes can be imposed to satisfy scopes not necessarily addressed to a population study.