We obtained textual descriptions of phenotypes, each with a reference to its associated gene, from the PhenomicDB database. For text mining, the descriptions first had to be adapted and prepared (stemming, etc.). In the following, we use the working term 'phenodoc' for this adjusted form of a phenotype description and 'phenocluster' for a cluster of 'phenodocs'. We clustered the resulting 39,610 'phenodocs', associated with 15,426 genes from 7 different species, into 1,000 clusters using the k-means algorithm on a vectorized representation of the documents, with the cosine distance between 'phenodocs' as the distance measure. We studied the resulting groups from a number of perspectives to assess whether the grouping itself is biologically reasonable. Finally, we predicted gene function within each cluster and evaluated this method using cross-validation. All methods are described in detail in the 'Methods' section.
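The clustering step can be illustrated with a minimal sketch. It assumes that each 'phenodoc' is already a list of stemmed tokens, uses a smoothed TF-IDF weighting as the vectorized representation, and seeds k-means deterministically with a farthest-first rule. The weighting, the seeding and all function names are assumptions made for illustration and are not taken from the 'Methods' section.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map non-empty tokenized documents to unit-length TF-IDF vectors."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        # Smoothed inverse document frequency (an assumed weighting scheme).
        weights = {t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0)
                   for t, tf in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

def cosine_distance(a, b):
    """1 - cosine similarity; both inputs are assumed to be unit length."""
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())

def farthest_first(vectors, k):
    """Deterministic seeding: repeatedly add the vector farthest from all seeds."""
    seeds = [vectors[0]]
    while len(seeds) < k:
        seeds.append(max(
            vectors,
            key=lambda v: min(cosine_distance(v, s) for s in seeds)))
    return seeds

def kmeans_cosine(vectors, k, iterations=20):
    """Assign each vector to one of k clusters by cosine distance."""
    centroids = farthest_first(vectors, k)
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: cosine_distance(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            total = Counter()
            for v, label in zip(vectors, labels):
                if label == c:
                    for t, w in v.items():
                        total[t] += w
            norm = math.sqrt(sum(w * w for w in total.values()))
            if norm > 0:  # keep the previous centroid if the cluster is empty
                centroids[c] = {t: w / norm for t, w in total.items()}
    return labels
```

Renormalizing each centroid to unit length keeps the dot product equal to the cosine similarity, so the assignment step really minimizes cosine distance rather than Euclidean distance.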
Of the 1,000 clusters, 90.4% contain genes from a single species. Figure 1 shows the distribution of cluster sizes. Figure 2 details the distribution of genes by species (independent of the clustering) and the distribution of species in clusters (dependent on the clustering).
Figure 1 Distribution of cluster sizes. The diagram shows the number of clusters of each size.
Figure 2 Cross-species phenotype data distribution. The left pie chart depicts the distribution of genes by species, i.e. the relative number of genes in our gene set according to species affiliation. The right pie chart shows the distribution of clusters according ...
Proteins of genes within a 'phenocluster' intensively interact with each other
To test whether 'phenoclusters' consist of genes with a high chance of being part of a common biological process, we studied whether the proteins of genes within one cluster interact with each other more often than those of genes in random control groups. This approach derives from the observation that physically interacting proteins are more likely to be part of the same biological process or pathway than non-interacting proteins [18]. To this end, we downloaded protein-protein interactions from the BioGrid database, represented by Entrez Gene IDs (see 'Methods'). We then analyzed the degree of interaction among the members of a given 'phenocluster' and compared those figures to random gene groups of similar size.
In 60 of the 1,000 clusters, comprising 1,858 genes, every gene interacts with at least 75% of the other genes in the same group via at most two intermediates (empirical p-value < 0.05). These clusters thus consist of genes that almost form cliques in the protein-protein interaction network. Such quasi-cliques have previously been associated with functional modules [19]. In another 138 clusters, comprising a total of 4,322 genes, every gene interacts with at least 33% of the other genes in its group. We compared these numbers to 200 repetitions of randomly sampled control groups. In these control data, on average only one group reaches the 75% threshold and two groups reach the 33% threshold.
These figures show that clustering 'phenodocs' yields gene groups whose members interact with each other much more often than expected by chance and thus represent coherent biological knowledge. However, the interaction score of the remaining clusters is not significantly higher than in the control groups. We later exploit this difference by sorting clusters on this score to see whether function prediction improves in highly interacting clusters.
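The connectivity criterion and its comparison against random control groups can be sketched as follows. The sketch assumes that "at most two intermediates" means a path of at most three interactions, that intermediates may lie outside the group, and that the empirical p-value is the add-one-corrected share of random groups scoring at least as high as the observed group. These readings, and all function names, are assumptions for illustration only.

```python
import random
from collections import deque

def build_graph(edges):
    """Build an undirected interaction graph as {gene: set of partners}."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def reachable_within(graph, start, max_edges):
    """Genes reachable from `start` using at most `max_edges` interactions."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == max_edges:
            continue  # do not expand beyond the allowed path length
        for partner in graph.get(node, ()):
            if partner not in depth:
                depth[partner] = depth[node] + 1
                queue.append(partner)
    return set(depth) - {start}

def min_connectivity(graph, group, max_intermediates=2):
    """Smallest fraction of the other group members that any gene reaches."""
    group = set(group)
    if len(group) < 2:
        return 0.0
    return min(
        len(reachable_within(graph, g, max_intermediates + 1) & group)
        / (len(group) - 1)
        for g in group)

def empirical_p_value(graph, group, all_genes, repetitions=200, seed=0):
    """Share of random same-size groups that are at least as connected."""
    rng = random.Random(seed)
    observed = min_connectivity(graph, group)
    hits = sum(
        min_connectivity(graph, rng.sample(all_genes, len(group))) >= observed
        for _ in range(repetitions))
    return (hits + 1) / (repetitions + 1)
```

A cluster would meet the 75% criterion in this reading when `min_connectivity` returns at least 0.75, and the 33% criterion when it returns at least 0.33.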
We believe that the large number of non-interacting clusters is mostly an artefact of the current incompleteness of the PPi data sets in BioGrid, with the notable exception of Saccharomyces cerevisiae. Even highly interacting 'phenoclusters' will therefore not necessarily mimic the PPi network, owing to the diverse nature of phenotypes or a lack of data (on both the PPi and the phenotype side). Figure 3 exemplifies the lack of phenotype data: it shows genes from a 'phenocluster' with many connected proteins (blue nodes), to which we have added a posteriori, coloured in red, those genes that have interacting proteins but no phenotype described yet. In contrast, in Figure 4 the clustered blue nodes are again supplemented by nodes from the PPi data. Here, the nodes added a posteriori have phenotype data (green) but have been clustered elsewhere, and there is one single unconnected node. Both examples show that phenotype data is only partly congruent with PPi data (see 'Discussion'). Nevertheless, our 'phenoclusters' give insight into the structure of biological networks and can be used to identify new members of a sub-network not detected by other methods; the only unconnected node in Figure 4 could be such a case.
Figure 3 Protein-Protein interactions derived from one 'phenocluster' and genes lacking phenotype data. The figure shows an example for interactions between proteins from genes in a 'phenocluster'. Depicted is a network with many genes from the same 'phenocluster' ...
Figure 4 Protein-Protein interactions of proteins derived from several 'phenoclusters'. The figure shows an example for interactions between proteins from genes of several 'phenoclusters'. Depicted is a network with many genes from the same 'phenocluster' (blue ...)
Genes in 'phenoclusters' have coherent GO-annotations
The Gene Ontology (GO) has been widely recognized as the most comprehensive functional classification system and has become a de facto international standard for functional annotation and prediction [20]. It should be noted that in PhenomicDB, Gene Ontology terms are associated with the gene descriptions and are not part of our 'phenodocs' (unless by rare coincidence, i.e. when authors used terms in the free-text descriptions that also occur in GO). Therefore, as a second way of evaluating the biological meaning of 'phenoclusters', we computed the similarity of the GO-terms assigned to the genes of a group (see 'Methods' for calculation and interpretation of the following similarity scores). In the analysis of our 1,000 'phenoclusters', we found 206 clusters containing 1,800 genes with a GO-similarity score ≥ 0.4. For each distinct group size, we built 200 control groups from randomly picked genes. Only two control groups reached this threshold by chance. We furthermore computed the correlation of the average GO-similarity with the average phenotype similarity of clusters. The Pearson correlation coefficient r was 0.41, indicating a shared variance between the two similarity scores of approximately 16%.
This shows that phenotype similarity indicates a high probability that the associated genes share GO-annotations. In Table we present an exemplary cluster with a GO-similarity score of 0.9. Of all terms associated with this group, 5 terms are annotated to 14 of the 17 genes. Given the homogeneous nature of the annotations, one can hypothesize that the remaining 3 genes should receive the same common annotation as the other 14 genes. We shall build on this idea later when we predict GO-terms in 'phenoclusters'.
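The correlation between average GO-similarity and average phenotype similarity mentioned above can be computed as in the following minimal sketch. It assumes two equally long lists holding one average score per cluster. The function names are hypothetical, and the squared coefficient is the usual measure of shared variance.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

def shared_variance(r):
    """Fraction of variance shared by two variables with correlation r."""
    return r * r
```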
'Phenocluster' with 17 associated genes with a GO-score of 0.9 in the Biological Process subtree.
Phenocopies co-occur in 'phenoclusters'
A phenocopy is an environmentally induced trait (phenotype) that mimics a trait produced by a gene, while the gene itself is intact, i.e. wild-type. However, there are also phenocopies that are independently induced by different genes. In an extensive manual search of the Medline literature on phenocopies induced by genes, we identified 27 such phenocopies, induced by 57 genes in total (see Additional file 1 for details on the phenocopies and the literature evidence). If our 'phenoclusters' properly reflect phenotype similarity on a biological basis, the genes causing a phenocopy should co-occur within the same cluster. Of the 27 phenocopies induced by 57 genes that we retrieved from the literature, 25 phenocopies (55 genes) were in our data set. In our 1,000 'phenoclusters', the genes of 13 (54.2%) phenocopies co-occurred in a cluster. In 1,000 random clusters of the same sizes, none of those genes co-occurred in any cluster.
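The co-occurrence check can be sketched as below, assuming that a phenocopy counts as co-occurring when at least two of its inducing genes appear in one common cluster, and that a gene may belong to several clusters because it can have several 'phenodocs'. Both the criterion and the data layout are assumptions for illustration.

```python
from collections import Counter

def co_occurring_phenocopies(phenocopy_genes, gene_to_clusters):
    """Return the phenocopies with at least two inducing genes in one cluster.

    phenocopy_genes:  {phenocopy name: list of inducing genes}
    gene_to_clusters: {gene: set of cluster ids containing that gene}
    """
    hits = []
    for name, genes in phenocopy_genes.items():
        # Count, per cluster, how many inducing genes it contains.
        per_cluster = Counter(
            cluster
            for gene in set(genes)
            for cluster in gene_to_clusters.get(gene, ()))
        if per_cluster and max(per_cluster.values()) >= 2:
            hits.append(name)
    return hits
```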
Predicting gene function within 'phenoclusters'
Based on the previous results, we hypothesized that gene function can be predicted based on the association of genes to 'phenoclusters'. If gene groups based on 'phenoclusters' have a coherent GO-annotation, we should be able to predict similar functions in genes from the same cluster (see 'Methods').
In evaluating the correctness of a GO-annotation prediction, one has to consider the structure of the Gene Ontology. Recall that GO-terms form an ontology in which terms are connected by IS-A and PART-OF relationships. The simplest approach is to consider a prediction correct only when it appears exactly as it is in the test data. However, this measure is overly harsh, since terms that are slightly more general or more specific are also very useful from a biological point of view. In the following, we therefore give results for different definitions of the 'correctness' of a predicted term. In the most stringent case, we consider a term to be correct only if it appears itself in both the test and the training set. Thus, predicting a child of a term actually counts negative twice: as a false positive and as a false negative. Because this measure is much stricter than that of other studies (see for instance [23]), we also studied how our figures change when we apply a less stringent criterion for the 'equality' of GO-terms.
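The difference between the strict and the relaxed criterion can be made concrete with a small sketch. It assumes a hypothetical `parents` mapping from each term to its direct IS-A or PART-OF parents and counts a predicted term as correct if it equals a true term or lies at most `steps` levels above one. The term names in the usage example are placeholders, not real GO identifiers.

```python
def ancestors_within(term, parents, steps):
    """The term itself plus all terms at most `steps` levels towards the root."""
    frontier, seen = {term}, {term}
    for _ in range(steps):
        frontier = {p for t in frontier for p in parents.get(t, ())} - seen
        seen |= frontier
    return seen

def precision_recall(predicted, true_terms, parents, steps=0):
    """Precision and recall of predicted terms; steps=0 is the strict case."""
    predicted, true_terms = set(predicted), set(true_terms)
    accepted = {t: ancestors_within(t, parents, steps) for t in true_terms}
    # Predictions that match some true term or one of its allowed ancestors.
    correct = {p for p in predicted
               if any(p in terms for terms in accepted.values())}
    # True terms that are matched by at least one prediction.
    covered = {t for t, terms in accepted.items() if terms & predicted}
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(covered) / len(true_terms) if true_terms else 0.0
    return precision, recall
```

With `parents = {"child": ["parent"]}`, predicting `"parent"` for a gene annotated with `"child"` scores zero precision and zero recall under `steps=0`, which is the double penalty described above, and full precision and recall under `steps=1`.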
In the following section, we present values for precision and recall of GO term predictions for different subgroups of genes from our 'phenodocs'. Our 'predictions' show the percentage of overlaps between the true annotations of a set of genes (test set) and 'predicted' terms which are derived from a training set (according to the Entrez Gene2GO annotations – see 'Methods' for further details).
To explore the upper limit of 'predictability' of GO-terms based on phenotype clustering using our method (the so-called 'precision ceiling'), we first ran the following experiment. We performed function prediction for all gene groups based on the clustering of the 'phenodocs'. For each group, we computed precision and recall of the predictions. We then selected the 10% highest-scoring clusters, ranked by the harmonic mean of recall and precision (the so-called F-Measure). Thus, clusters were selected a posteriori based on their performance in prediction. Of course, this measure cannot be extrapolated to predictions on unknown groups; however, it gives a good estimate of the maximum performance achievable using our data set and our approach. Function prediction from only these groups yielded an average of 81.5% precision and 61.2% recall. Taking this as the upper limit, we sought criteria for selecting appropriate gene groups a priori.
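The a posteriori selection described above can be sketched in a few lines. The sketch assumes a hypothetical mapping from cluster identifiers to (precision, recall) pairs. How ties and rounding of the 10% cut-off were handled is not stated, so the choices below are assumptions.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def top_fraction(cluster_scores, fraction=0.10):
    """Cluster ids of the best-scoring fraction, ranked by F-measure.

    cluster_scores: {cluster id: (precision, recall)}
    """
    ranked = sorted(cluster_scores,
                    key=lambda c: f_measure(*cluster_scores[c]),
                    reverse=True)
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]
```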
Results for different filter criteria
We defined a number of filters for selecting clusters, based on criteria such as the number of genes they contain, the number of available annotations, and their scores for in-group annotation coherence and in-group connectedness. We defined five different filters which are described in Table . We calculated precision and recall of function prediction in all clusters selected by different combinations of those filters, see Table (refer to 'Methods' for details on filters and evaluation). Using the least stringent filter (Filter 1) together with the strict criterion for judging the identity of GO terms, the number of clusters was reduced to 856 by removing all clusters containing fewer than 3 genes, and further to 295 by removing all clusters without any descriptive GO-terms (i.e. Biological Process terms assigned to at least 50% of the cluster members). We predicted 345 distinct GO-terms from the Biological Process subtree at a precision of 67.9% and a recall of 23.0%, averaged over all selected clusters.
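Filter 1 and the notion of descriptive terms can be sketched as follows, assuming that a cluster is given as a mapping from genes to their sets of Biological Process terms. The leave-one-out prediction shown here is a simplified stand-in for the cross-validation described in 'Methods', and all names are hypothetical.

```python
from collections import Counter

def descriptive_terms(cluster_annotations, min_share=0.5):
    """Terms annotated to at least `min_share` of the genes in a cluster.

    cluster_annotations: {gene: set of GO terms}
    """
    n = len(cluster_annotations)
    counts = Counter(term
                     for terms in cluster_annotations.values()
                     for term in terms)
    return {term for term, count in counts.items() if count / n >= min_share}

def passes_filter_1(cluster_annotations, min_genes=3):
    """At least `min_genes` genes and at least one descriptive term."""
    return (len(cluster_annotations) >= min_genes
            and bool(descriptive_terms(cluster_annotations)))

def predict_for_gene(cluster_annotations, gene):
    """Predict terms for one gene from the remaining cluster members."""
    training = {g: terms for g, terms in cluster_annotations.items()
                if g != gene}
    return descriptive_terms(training)
```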
Different criteria for filtering clusters for function prediction
Results for different filters applied to gene groups (k = 1,000).
Relaxing the criteria for GO-term identity, now allowing for a single deviation towards the root (i.e., a predicted term is considered correct if it exactly matches a removed term or if it matches a parent of the removed term) resulted in an average 75.6% precision and 28.7% recall (191 unique terms for 2,686 genes in 279 groups). Allowing one more step towards the root, we predicted 151 unique terms with 76.3% precision and 30.7% recall.
If we used for function prediction only those clusters that pass Filter 1 and that show an average GO-similarity ≥ 0.4 (Filter 2), the average precision dropped slightly to 62.5% and recall increased to 26.2% (74 groups, 711 genes and 159 distinct predicted GO-process terms). This drop in precision and increase in recall is due to the increasing number of predictions made per gene and group and is explained in more detail in the following sections. Applying again a less stringent criterion for the identity of GO-terms as explained above, we obtained an average of 75.3% precision and 31.7% recall for the first step towards the root (91 unique terms for 612 genes in 80 groups). When we selected only those clusters containing genes from a single species (Filter 4), the values for precision and recall stayed roughly the same. This was expected, as 90% of all clusters met this condition (see 'Discussion'). When we used only cross-species clusters (Filter 5), precision dropped slightly and recall dropped quite dramatically.
To our surprise, average precision and recall dropped (to 60.5% and 19.8%, respectively; 53 groups, 409 genes and 102 GO-terms) when we used only those clusters that show a PPi-connectivity of at least 33% (Filter 3). A recent study [24] reported that 35% of interactions occur between proteins with no common functional annotation. We believe that this lack of common functional annotations in relatively small groups of immediate neighbours in PPi-networks explains the surprising drop in precision and recall when using only these groups. Nevertheless, both the enrichment in pairwise interactions and the common GO-terms show the high biological coherence of 'phenoclusters'. We conclude that, despite some shortcomings in the data, 'phenoclusters' appear to be another suitable source for functional annotation prediction.
Selecting gene groups from PPi-cliques
To see whether our prediction method using 'phenoclusters' outperforms another non-random gene selection method, we grouped our 13,068 initial genes based on direct pair-wise interaction. We found 2,875 groups in which each gene interacts with every other gene (i.e. cliques in the PPi graph). Applying Filter 1 to this data set, we derived 720 groups resulting in 3,692 predictions with a precision of 56.4% and a recall of 32.3%. Thus, the precision of this approach (which is similar to the method applied in [19]) was about 10–20% lower than that of our method of clustering genes based on 'phenodoc' similarity.
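Groups in which each gene interacts with every other gene can be enumerated with the Bron-Kerbosch algorithm, sketched below without pivoting. Whether the groups reported above are maximal cliques, and which algorithm was actually used, is not stated, so this is an assumed procedure. The graph is a hypothetical mapping from each gene to the set of its direct interaction partners, without self-interactions.

```python
def maximal_cliques(graph):
    """Enumerate all maximal cliques of an undirected graph.

    graph: {gene: set of direct interaction partners}, no self-loops.
    """
    cliques = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(clique)  # cannot be extended any further
            return
        for gene in list(candidates):
            expand(clique | {gene},
                   candidates & graph[gene],
                   excluded & graph[gene])
            candidates = candidates - {gene}
            excluded = excluded | {gene}

    expand(set(), set(graph), set())
    return cliques

def build_graph(edges):
    """Build {gene: set of partners} from a list of interaction pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph
```

Groups with fewer than 3 genes can then be discarded before prediction, in line with the size criterion of Filter 1.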
Clustering phenotypes with different values of k
K-Means is a clustering method that requires the a priori determination of the number of clusters k. Typically, internal and external measures are evaluated to assess cluster quality [25]. External measures, such as a comparison with a gold standard, cannot be applied here because no gold standard exists for clustered 'phenodocs'. As an internal measure of cluster quality, we sought insight into how the data structure changes for different values of k, ranging from 500 to 3,000 (Table ). The results show a number of interesting facts. Firstly, the average number of genes per cluster clearly decreased with increasing k. However, the percentage of clusters that comply with Filter 1 in Table stayed roughly the same. Although those clusters contained fewer genes on average, the number of predicted annotations and affected genes increased considerably with increasing k. This indicates that the top clusters, selected by Filter 1, become more homogeneous with increasing k, as more clusters have more terms that are annotated to more than 50% of their members. Partly, this is also a statistical effect of the decreasing cluster sizes, which naturally leads to more homogeneous groups. At the same time, precision drops slightly with increasing k while recall increases considerably. This means that more predictions come along with more errors, but the ratio of errors to the overall amount of predictions decreases. Another effect is that in smaller clusters, there is usually only a single gene left in the test set. The increasing recall shows that more terms from the test set are descriptive in the training set, but the decreasing precision means that the number of terms associated with a single gene cannot compensate for the number of suggestions derived from the training set.
The distribution of clusters with their characteristics given different values for k (the number of clusters) from 500 to 3,000.
While the correlation between GO-similarity and phenotype similarity drops significantly with increasing k, the percentage of single-species clusters increases. This indicates that the homogeneity within clusters mentioned above shifts from a functional to a methodical, i.e. descriptive, homogeneity, owing to the fact that similar vocabulary from the same species yields less variance than similar function.
Thus, k is an important parameter to balance the trade-offs between precision, recall and number of predictions. One can either choose a small k-value, resulting in few high quality predictions, or a larger k-value, resulting in a much larger number of less accurate predictions. Clearly, the choice of the k-value depends on the concrete application. As our goal was the best precision with acceptable recall, we found k = 1,000 most suited, although a large k (k = 3,000) resulting in many small clusters yields the best technical solution with an F-Measure of 0.385 (precision = 60.3% and recall = 28.3%).