Here we have described a global strategy to predict candidate sequences for orphan enzymes. Candidate sequences were obtained using a combination of metabolic pathway adjacency and genomic neighbourhood information. Overall, a lower proportion of candidate sequences was obtained from metagenomic data than from genomic data, but this may simply reflect the restriction we had to impose: we used Sanger and 454 samples, which have low coverage of the respective genomes. Although many novel enzymes and organisms may be represented in metagenomic samples, the human gut and marine metagenomes that we used are complex communities with hundreds of species (Qin et al, 2010) and a long tail of low-abundance organisms (Arumugam et al, 2011), which limits the coverage of each individual genome and thus the extent of assembly. Consequently, the majority of the contigs that we analysed contained only two genes, limiting the number of neighbouring gene pairs that can be detected (Supplementary Figures 7 and 8). Although some available metagenomic datasets have a large number of long contigs, these are usually dominated by a few genomes and thus would not offer access to an increased number of genomes (Tyson et al, 2004; Garcia Martin et al, 2006). In the future, contigs will become longer owing to increases in read length and improvements in assembly algorithms, which should enhance the ability of this pipeline to make predictions from metagenomic data and give greater access to the novel activities hidden in environmental samples.
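The effect of contig length on detectable neighbour pairs can be illustrated with a minimal sketch. It assumes, as a simplification, that only directly adjacent genes count as neighbours, so a contig carrying n genes offers n - 1 pairs. The function name and the gene counts are hypothetical and are not taken from our data.

```python
def adjacent_gene_pairs(genes_per_contig):
    """Adjacent gene pairs per contig: n - 1 for a contig with n genes, never below 0."""
    return [max(n - 1, 0) for n in genes_per_contig]

# Hypothetical assemblies of the same twelve genes (illustrative numbers only).
fragmented = [2] * 6   # six contigs carrying two genes each
contiguous = [12]      # one contig carrying all twelve genes

print(sum(adjacent_gene_pairs(fragmented)))  # 6
print(sum(adjacent_gene_pairs(contiguous)))  # 11
```

Under this assumption, the fragmented assembly exposes roughly half as many neighbour pairs as the contiguous one, although both contain the same genes.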
In addition to the benchmarking, we supported our predictions by experimentally validating the proposed enzymatic function for two out of six heterologously expressed candidate proteins. This success ratio is lower than the 70% expected accuracy. However, we would not expect the ratio of experimental successes to equal the theoretical prediction accuracy, because validating a specific enzymatic function experimentally is a complex process involving many variables. First, an enzyme can be purified in a soluble form but may become inactive during purification owing to improper handling or exposure to unfavourable conditions such as oxygen. In addition, the proteins purified in this study carried a histidine tag (his-tagged), as many heterologously expressed proteins do. The addition of a terminal his-tag can dramatically decrease the activity of a protein (Kadas et al, 2008) or render it totally inactive (Albermann et al, 2000; Halliwell et al, 2001). Moreover, there are many variables to optimize in the enzymatic activity tests. An assay might become successful only after adjusting the buffer type, buffer pH, cofactors, incubation time, incubation temperature or the analytical methods used. For example, in assay optimization trials for EC 188.8.131.52, changing the mobile phase for the LC/MS from 10 mM ammonium acetate to water increased the peak area of the product glutamate more than 11-fold (Supplementary Figure 16). However, there is a practical limit to how many permutations of experimental conditions can be attempted, and further optimization is feasible only if the initial screening assay is already close to the optimal conditions. Nevertheless, the two validations in hand are a proof of principle for our approach, and even without further experimental validation the benchmarks indicate high-accuracy candidate sequences for 131 orphan enzymes, more than a third of the tractable enzymes stored in pathway databases.
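The gap between prediction accuracy and experimental success can be made concrete with a short binomial calculation. The sketch below assumes the six tests are independent and takes the 70% accuracy from the benchmark. The per-assay confirmation probability of 0.5 is a purely hypothetical value chosen for illustration, not a measured quantity, and `binom_cdf` is a helper defined here.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

prediction_accuracy = 0.70        # expected accuracy from the benchmark
n_tested, n_validated = 6, 2      # candidates expressed and functions confirmed

# Idealized case: every correct prediction is always confirmed by the assay.
p_ideal = binom_cdf(n_validated, n_tested, prediction_accuracy)

# Hypothetical case: a correct prediction is confirmed only half of the time
# (0.5 is an illustrative assumption, not an estimate from this study).
assay_success = 0.5
p_with_assay_loss = binom_cdf(n_validated, n_tested, prediction_accuracy * assay_success)

print(f"P(at most 2 of 6 | ideal assay)   = {p_ideal:.3f}")
print(f"P(at most 2 of 6 | assay loss 0.5) = {p_with_assay_loss:.3f}")
```

Under these assumptions, two or fewer confirmations out of six would be unlikely if every correct prediction were always confirmed (about 0.07). It becomes the more probable outcome once a realistic share of correct predictions fails at the bench.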
To assess the impact of this expanded enzyme knowledge on systems biology, we compared the currently available genome-scale metabolic models with and without the addition of the orphan enzymes with high-confidence predictions. Gene-knockout simulations showed that some genes considered essential in the current models became non-essential after the addition of the orphan enzymes. The addition of these orphan enzymes increased the accuracy of the models, as the predictions for all genes whose essentiality changed now agree with their experimentally determined essentiality status. Interestingly, several of the reactions whose predicted status changed from essential to non-essential had been introduced by the automated gap-filling procedure during the reconstruction process. This observation suggests that the orphan enzyme reactions will not only influence the model simulations but are also likely to affect the gap-filling procedure, and thereby the reaction content of the final model, beyond the simple addition of a few new reactions. Taken together, the percentage of novel reactions, the FCA and the improved gene-essentiality predictions indicate that our findings will improve both the automatic and the manual reconstruction of genome-scale metabolic models and their applications (Oberhardt et al, 2009).
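The way an added orphan enzyme reaction can turn an essential gene into a non-essential one can be shown with a toy example. This sketch uses a simple reachability check in place of the constraint-based simulations normally applied to genome-scale models, so it only illustrates the logic. The network, the gene names and the helper functions are all hypothetical.

```python
def can_produce(target, seeds, reactions, knocked_out=None):
    """True if `target` is reachable from `seeds` without the knocked-out gene."""
    available = set(seeds)
    active = [r for r in reactions if r["gene"] != knocked_out]
    changed = True
    while changed:
        changed = False
        for r in active:
            products = set(r["products"])
            # Fire a reaction once all substrates exist and it yields something new.
            if set(r["substrates"]) <= available and not products <= available:
                available |= products
                changed = True
    return target in available

def essential_genes(target, seeds, reactions):
    """Genes whose single knockout abolishes production of `target`."""
    genes = {r["gene"] for r in reactions}
    return {g for g in genes if not can_produce(target, seeds, reactions, knocked_out=g)}

# Hypothetical toy model: S -> M1 -> biomass, one gene per reaction.
base_model = [
    {"gene": "geneA", "substrates": ["S"], "products": ["M1"]},
    {"gene": "geneB", "substrates": ["M1"], "products": ["biomass"]},
]
# Hypothetical orphan enzyme reaction offering an alternative route to M1.
orphan = {"gene": "orphanX", "substrates": ["S"], "products": ["M1"]}

print(sorted(essential_genes("biomass", ["S"], base_model)))             # ['geneA', 'geneB']
print(sorted(essential_genes("biomass", ["S"], base_model + [orphan])))  # ['geneB']
```

In the toy model, geneA is predicted to be essential only because no alternative route to M1 is represented. Adding the orphan reaction removes that artefact, while geneB stays essential.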
About 70% of the orphan enzymes in KEGG do not have pathway neighbours and are thus not amenable to our current pipeline. However, in the future, our candidate gene identification pipeline could be modified to identify other genes that might be functionally related to the orphan enzymes through the integration of genome-scale functional data, such as gene lethality screens (Nichols et al, 2011), genetic interactions (Costanzo et al, 2010) or gene-expression profiles. Candidate genes could then be retrieved by searching the gene neighbourhood of the orthologs of these functionally related genes. Furthermore, the current pipeline is only applicable to prokaryotic genomes. However, it could be extended to a partial analysis of fungal genomes, as certain secondary metabolite pathways are known to be organized in gene clusters (Regueira et al, 2011).
The linkage of sequences to these orphan functions means that these functions can now be used in genome-, transcriptome- and proteome-based methods. Here we illustrated the impact on genome-scale metabolic models. This benefit will propagate into many different biological systems, as these sequences will act as bait so that newly sequenced genomes can be ascribed these functions through homology-based annotation methods. This is the first systematic approach to retrieve sequences for many orphan enzymes, and the computational framework developed here can be applied to additional genomes and metagenomes as they are sequenced.