To study the impact of background information on explanatory variables for genotype-phenotype relations in yeast, we used a 2-stage stepwise backward elimination procedure in L-PLS. We modelled each phenotype separately, fitting 20 models in total, one per phenotype. The algorithm is illustrated using Melibiose 2% Rate as an example; performance was very similar for the other responses (phenotypes), as presented in Table . First, a genotype predictor matrix was derived by blasting the genes of each genome against an S. cerevisiae reference genome, and the best hit scores were used as numerical inputs to the genotype matrix [4]. Gene ontology terms, reflecting functional relatedness with regard to the participation of gene products in similar molecular processes, together with data on gene dispensability (essential/not essential) and on the number of gene paralogs present in the founder genome, were used as background information. These data essentially reflect gene relationships in the S288C reference genome, relationships which may or may not be conserved in the species as a whole. We also included population genomic data reflecting the presence or absence, in each specific strain, of genetic variations with a potentially large impact on phenotypes. Gene copy number variations reflect potential gain-of-function mutations in particular lineages, whereas frameshift and premature stop codon mutations reflect potential loss-of-function mutations in the respective lineages. The proposed model was fitted to each of the 20 phenotypes, and results are summarized in Table . Figure exemplifies the progression of the 2-stage variable elimination for the phenotype Melibiose 2% Rate, which represents the growth rate of the set of yeast strains when supplied with carbon exclusively in the form of melibiose. The number of genotype variables, X
, and background information variables, Z, remaining after each iteration is given in the figure. In the first stage, variable reduction with respect to X was carried out in eight iterations. In the second stage, the remaining three iterations eliminated variables in Z. We refer to this procedure, which includes both gene and background information in an L-PLS approach, as Method 1 (M1). We compared M1 with a similar PLS approach, M2, which implements stepwise variable elimination on genes with no background information included, and with M3, i.e. ST-PLS, again applied to genes exclusively, with no background information utilized. Hence, M1 utilizes background information in the modeling, while M2 and M3 do not. Figure presents the distribution of the information content of Z in M1, indicating to what extent Z is relevant for explaining genotype-phenotype relations, together with a comparison of the complexity of the models, the number of selected variables and the root mean square error (RMSE) on training and test data. For each split of the data, the information content of Z in M1, the number of components used and the number of X-variables were obtained. The upper left panel presents the degree of influence of the Z matrix in mapping genotype-phenotype relations. The degree of influence, α, ranges from 0 to 1, where a higher value indicates a stronger influence of background information on the genotype-phenotype mapping. With an average α
7, this indicates that the background information, in general, has a very considerable impact on genotype-phenotype mapping. In the top right panel, we see that genotype-phenotype mapping with the stepwise elimination procedure adopted in M1 and M2 requires fewer PLS components than M3 to explain the phenotype pattern. Hence, M1 and M2 constitute less complex models than M3, which ends with more components and more selected variables. The lower left panel indicates that M1 selects significantly fewer genes for the genotype-phenotype mapping than M2 and M3. This means that noise, in the form of genes that do not actually contribute to explaining the phenotype, is substantially reduced when background information is included in the modeling step. The lower center panel shows that, for the training data, there was no significant difference in RMSE between M1 and M2, but both were lower than the RMSE for M3 (p
1). When applied to test data, all methods resulted in acceptable and similar RMSE, indicating that the methods perform equally well overall (Figure (e), lower right panel). However, M1 achieved this performance using a much smaller number of variables. The number of variables required is a measure of the interpretability of the model; hence, we conclude that M1, which includes background information in the PLS modeling, should allow for easier and more straightforward interpretation of results.
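The control flow of the 2-stage elimination described above can be sketched as follows. This is a minimal illustration in Python, not the L-PLS implementation used in the study: the `importance` and `error` callables stand in for the model-based variable ranking and the validation error, and every name and value in the example (`backward_eliminate`, `two_stage_elimination`, `drop_fraction`, the toy variables `g0`..`g7` and `z0`..`z3`) is hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def backward_eliminate(
    variables: Sequence[str],
    importance: Callable[[List[str]], Dict[str, float]],
    error: Callable[[List[str]], float],
    drop_fraction: float = 0.25,
) -> Tuple[List[str], List[int]]:
    """Drop the least important variables until the error gets worse.

    Returns the retained variables and the number of variables kept
    at the start and after each accepted iteration.
    """
    kept = list(variables)
    best_error = error(kept)
    sizes = [len(kept)]
    while len(kept) > 1:
        scores = importance(kept)
        n_drop = max(1, int(len(kept) * drop_fraction))
        to_drop = set(sorted(kept, key=lambda v: scores[v])[:n_drop])
        candidate = [v for v in kept if v not in to_drop]
        candidate_error = error(candidate)
        if candidate_error > best_error:
            break  # removing more variables hurts: stop here
        kept, best_error = candidate, candidate_error
        sizes.append(len(kept))
    return kept, sizes


def two_stage_elimination(x_vars, z_vars, importance_x, importance_z, error):
    """Stage 1 reduces X with Z fixed; stage 2 reduces Z with X fixed."""
    x_kept, x_sizes = backward_eliminate(
        x_vars, importance_x, lambda xs: error(xs, list(z_vars)))
    z_kept, z_sizes = backward_eliminate(
        z_vars, importance_z, lambda zs: error(x_kept, zs))
    return x_kept, z_kept, x_sizes, z_sizes


# Toy stand-ins (assumptions): a real analysis would refit the model here.
RELEVANT = {"g0", "g1", "z0"}


def toy_importance(kept):
    return {v: 1.0 if v in RELEVANT else 0.0 for v in kept}


def toy_error(x_kept, z_kept):
    used = set(x_kept) | set(z_kept)
    missing = len(RELEVANT - used)  # relevant variables that were lost
    noise = len(used - RELEVANT)    # irrelevant variables still kept
    return 10.0 * missing + 0.1 * noise


x_kept, z_kept, x_sizes, z_sizes = two_stage_elimination(
    [f"g{i}" for i in range(8)], [f"z{i}" for i in range(4)],
    toy_importance, toy_importance, toy_error)
print(x_kept, z_kept, x_sizes, z_sizes)
```

The two lists of sizes play the role of the per-iteration variable counts reported in the figure. In a real analysis, the stopping rule and the fraction of variables dropped per iteration would follow the chosen elimination procedure rather than the fixed values assumed here.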
A key requirement of any multivariate analysis is the stability and selectivity of the results. To evaluate model stability and selectivity, we recently introduced a simple selectivity score [17]: if a variable is selected as one out of m variables, it receives a score of 1/m. Repeating the selection for each split of the data, we simply add up the scores for each variable. Thus, a variable with a large selectivity score tends to be repeatedly selected as one among a few variables. In Figure , the selectivity scores are sorted in descending order and presented for X-variables (genes) in the upper left panel for M1, the upper right panel for M2 and the lower left panel for M3. The selectivity scores indicating the stability of the selected Z-variables (GO terms) obtained from M1 are presented in the lower right panel. M1 indeed selected many genes in a stable way, which is a fundamental requirement for any further analysis. A selectivity score above 0.2 for X-variables and above 0.06 for Z-variables is significantly larger than the corresponding scores obtained by repeatedly fitting models to randomly permuted phenotypes. Since traits are controlled by subsets of distinct genes [4], and some genes in the genome are of overall importance for handling variations in the external environment and affect a disproportionate number of phenotypes [26], we expect any method extracting relevant biological information to have a higher selectivity score than a random selection of genes. This was indeed the case for our proposed method M1. In fact, using the 2-stage L-PLS procedure, only 30 genes were selected by M1, corresponding to substantially higher selectivity than M2 and M3. Not surprisingly, these genes (OLI1, YEH1, ATP8, PSY3, IFM1, SUV3, CAR1, ERG6, ILS1, YDR374C, SHO1, YDR476C, GLO3, APL5, RIX1, GPR1, VAR1, TTI2, YLR410WB, YDL211C, YDL218W, EHD3, MRPL28, RPT6, COX17, STE11, SUR4, YAP1, MRPL39, YNL320W) were involved in cellular functions directly related to variations in the environment: transport, stress response, response to chemical stimulus and metabolism. They also tended to be affected by both strong loss-of-function (premature stop codons, frameshifts) and gain-of-function (copy number variation) mutations, as presented in Table . For Melibiose 2% Rate, we found a 72.% gene overlap between M1 and M2 and a 67.9% gene overlap between M1 and M3. The selection of background variables can be missed if only a few of the corresponding genes are significant [14], but the powerful structure of L-PLS, coupled with the 2-stage stepwise elimination procedure, yields a stable list of genes and background information variables that maps the genotype-phenotype relation. Finally, we have listed the mapped background information and genes for all 20 phenotypes in Table and Additional file 1: Table S1, respectively.
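The selectivity score defined above (a contribution of 1/m for each data split in which a variable is one of m selected variables, summed over splits) can be computed as in the following minimal sketch. The function names and the three example selections are hypothetical; the gene names are used only as labels and the selections are not results from the study.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def selectivity_scores(selections: Iterable[Set[str]]) -> Dict[str, float]:
    """Sum 1/m over data splits, where m is the number of selected variables."""
    scores: Dict[str, float] = defaultdict(float)
    for selected in selections:
        m = len(selected)
        for variable in selected:  # an empty selection contributes nothing
            scores[variable] += 1.0 / m
    return dict(scores)


def ranked(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort variables by selectivity score in descending order."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical selections from three splits of the data.
splits = [{"OLI1", "YEH1"}, {"OLI1"}, {"OLI1", "ATP8", "PSY3", "IFM1"}]
scores = selectivity_scores(splits)
for gene, score in ranked(scores):
    print(f"{gene}\t{score:.2f}")
```

Here OLI1 accumulates 1/2 + 1 + 1/4 = 1.75, because it is selected in every split and twice as one among very few variables, whereas a gene selected once among four variables scores only 0.25.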
Assuming that phenotypic variation within the species is controlled by lineage-specific adaptive mutations, emerging as a consequence of lineage-specific positive selection, by neutral variation, emerging as a consequence of lineage-specific relaxation of selective pressure that allows loss-of-function mutations to accumulate, or by both, we expected phenotype-defining genes to show faster evolution than non-influential genes. This corresponds to a prediction of a higher ratio of nonsynonymous to synonymous mutations since the split between S. cerevisiae and its closest relative Saccharomyces paradoxus
]. Indeed, we found genes identified as influential through M1 to have been evolving 29% faster than non-influential genes (p
10). This indicates that these genes, as a group, have been subjected to either stronger positive selection or somewhat relaxed negative selection during recent yeast history, which supports the conclusion that M1 extracts biologically relevant information.
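A comparison of this kind can be sketched as below, taking the ratio of nonsynonymous to synonymous mutations per gene as the measure of evolutionary rate. The text does not state which statistical test was used, so the one-sided permutation test here is an assumption, and the function names and all ratio values are hypothetical illustrations rather than data from the study.

```python
import random
from statistics import mean
from typing import Sequence


def relative_rate_difference(influential: Sequence[float],
                             others: Sequence[float]) -> float:
    """Relative excess of the mean ratio in influential genes (0.3 = 30% faster)."""
    return mean(influential) / mean(others) - 1.0


def permutation_p_value(influential: Sequence[float],
                        others: Sequence[float],
                        n_permutations: int = 2000,
                        seed: int = 0) -> float:
    """One-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = mean(influential) - mean(others)
    pooled = list(influential) + list(others)
    k = len(influential)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        if mean(pooled[:k]) - mean(pooled[k:]) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# Hypothetical per-gene nonsynonymous/synonymous ratios.
influential_genes = [0.12, 0.15, 0.11, 0.14]
other_genes = [0.10, 0.09, 0.11, 0.10, 0.10]

excess = relative_rate_difference(influential_genes, other_genes)
p_value = permutation_p_value(influential_genes, other_genes)
print(f"{excess:.0%} faster, permutation p = {p_value:.3f}")
```

With these toy values the group means are 0.13 and 0.10, so the influential group evolves about 30% faster. The p-value depends on the sample sizes and on the number of permutations.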