Metabolic pathways in eubacteria and archaea are often encoded by operons and/or gene clusters (genome neighborhoods) that provide important clues for assignment of both enzyme functions and metabolic pathways. We describe a bioinformatic approach (genome neighborhood network; GNN) that enables large-scale prediction of the in vitro enzymatic activities and in vivo physiological functions (metabolic pathways) of uncharacterized enzymes in protein families. We demonstrate the utility of the GNN approach by predicting in vitro activities and in vivo functions in the proline racemase superfamily (PRS; InterPro IPR008794). The predictions were verified by measuring in vitro activities for 51 proteins in 12 families in the PRS that represent ∼85% of the sequences; in vitro activities of pathway enzymes, carbon/nitrogen source phenotypes, and/or transcriptomic studies confirmed the predicted pathways. The synergistic use of sequence similarity networks and GNNs will facilitate the discovery of the components of novel, uncharacterized metabolic pathways in sequenced genomes.
DNA molecules are polymers in which four nucleotides (guanine, adenine, thymine, and cytosine) are arranged along a sugar-phosphate backbone. The sequence of these four nucleotides along the DNA strand determines the genetic code of the organism, and can be deciphered using various genome sequencing techniques. Microbial genomes are particularly easy to sequence because they typically contain no more than several million nucleotides, compared with the 3 billion or so nucleotides that are present in the human genome.
Reading a genome sequence is straightforward, but predicting the physiological functions of the proteins encoded by the genes in the sequence can be challenging. In a process called genome annotation, the function of a protein is predicted by comparing the relevant gene to the genes of proteins with known functions. However, microbial genomes and proteins are hugely diverse and over 50% of the microbial genomes that have been sequenced have not yet been related to any physiological function. With thousands of microbial genomes waiting to be deciphered, large-scale approaches are needed.
Zhao et al. take advantage of a particular characteristic of microbial genomes. DNA sequences that code for two proteins required for the same task tend to be closer to each other in the genome than two sequences that code for unrelated functions. Operons are an extreme example; an operon is a unit of DNA that contains several genes that are expressed as proteins at the same time.
Zhao et al. have developed a bioinformatic method called the genome neighborhood network (GNN) approach to work out the function of proteins based on the position of their genes relative to other genes in the genome. When applied to the proline racemase superfamily (PRS), which contains enzymes with similar sequences that can catalyze three distinct chemical reactions, the new approach was able to assign a function to the majority of proteins in a public database of PRS enzymes, and also revealed new members of the PRS. Experiments confirmed that the proteins behaved as predicted. The next challenge is to develop the GNN approach so that it can be applied to more complex systems.
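The core idea behind a genome neighborhood analysis, that genes located near members of a query family hint at a shared pathway, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `neighborhood_counts`, the toy genomes, the family labels, and the window size are all hypothetical, and the sketch ignores strand orientation and circular chromosomes. It simply counts how often each protein family appears within a fixed number of genes of any query-family gene.

```python
from collections import Counter


def neighborhood_counts(genomes, query_family, window=10):
    """Count protein families found within `window` genes of each
    query-family gene. `genomes` maps a genome name to an ordered list
    of (gene_id, family) tuples; family is None for unannotated genes.
    The default window of 10 is an illustrative assumption."""
    counts = Counter()
    for genes in genomes.values():
        for i, (_, family) in enumerate(genes):
            if family != query_family:
                continue
            lo = max(0, i - window)
            hi = min(len(genes), i + window + 1)
            # A set ensures each family is counted once per neighborhood.
            neighbors = {
                fam
                for j, (_, fam) in enumerate(genes[lo:hi], start=lo)
                if j != i and fam is not None
            }
            counts.update(neighbors)
    return counts


# Hypothetical toy data: gene order along three invented genomes.
toy_genomes = {
    "genome_A": [("a1", "transporter"), ("a2", "PRS"),
                 ("a3", "dehydrogenase"), ("a4", "kinase")],
    "genome_B": [("b1", "dehydrogenase"), ("b2", "PRS"),
                 ("b3", "transporter")],
    "genome_C": [("c1", "PRS"), ("c2", "dehydrogenase"),
                 ("c3", None), ("c4", "kinase")],
}

counts = neighborhood_counts(toy_genomes, "PRS", window=1)
for family, n in counts.most_common():
    print(family, n)
```

In this toy example, a family that recurs next to PRS genes across several genomes (here the invented "dehydrogenase" label) would be a candidate for participating in the same pathway, whereas a family that never falls inside the window is not counted at all. A real analysis would also need to cluster the query family by sequence similarity first, so that neighbors are tallied separately for each putative isofunctional group.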