Genome sequencing has revealed that the secondary metabolite potential of even well-studied bacteria has been severely underestimated [1], [2]. This revelation has led to an explosion of interest in genome mining as an approach to natural product discovery [3], [4], [5], [6], [7], [8]. Considering that natural products remain one of the primary sources of therapeutic agents [9], [10], sequence analysis provides opportunities to identify strains with the greatest genetic potential to yield novel secondary metabolites prior to chemical analysis and thus increase the rate and efficiency with which new drug leads are discovered. In addition, community or metagenomic analyses can be used to identify environments with the greatest secondary metabolite potential and to address ecological questions related to secondary metabolism. To capitalize on these opportunities, it is critical that new bioinformatics tools be developed to handle the massive influx of sequence data generated by next-generation sequencing technologies [11].
Polyketide synthases (PKSs) and non-ribosomal peptide synthetases (NRPSs) are large enzyme families that account for many clinically important pharmaceutical agents. These enzymes employ complementary strategies to sequentially construct a diverse array of natural products from relatively simple carboxylic acid and amino acid building blocks using an assembly line process [12], [13]. The molecular architectures of PKS and NRPS genes have been reviewed in detail and minimally consist of activation (AT or A), thiolation (ACP or PCP), and condensation (KS or C) domains, respectively [14], [15], [16], [17], [18]. These genes are among the largest found in microbial genomes and can include highly repetitive modules that create considerable challenges for accurate assembly and subsequent bioinformatic analysis [8].
Where the challenges associated with PKS and NRPS gene assembly can be overcome, a number of effective bioinformatics tools are available for domain parsing [19], [20] and domain string analysis [21], [22]. In cases of modular type I PKSs and NRPSs where domain strings follow the “co-linearity rule”, such that substrates are incorporated and processed according to the precise domain organization observed in the pathway, bioinformatics has been used to make accurate structural predictions about the metabolic products of those pathways [23]. However, the increasing number of exceptions to co-linearity, such as module skipping and stuttering [24], limits precise, sequence-based structure prediction. The bioinformatic tools currently available for secondary metabolism have been reviewed [25], [26] and are complemented by the recent release of antiSMASH, which has the capacity to accurately identify and provide detailed sequence analysis of gene clusters associated with all known secondary metabolite chemical classes [27]. While all of these tools have useful applications, NaPDoS, the tool introduced here, instead employs a phylogeny-based classification system that can be used to quantify and distinguish KS and C domain types from a variety of datasets, including the incomplete genome assemblies typically obtained using next-generation sequencing technologies. These specific domains were selected because they are highly conserved and have proven to be among the most informative in a phylogenetic context [28], [29].
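The co-linearity rule mentioned above can be illustrated with a toy sketch. It is not drawn from any of the cited tools; the function names and the example domain string are hypothetical, and it assumes an idealized modular type I PKS in which each extension module begins with a KS domain and no skipping or stuttering occurs. Under those assumptions, the reductive domains present in each module (KR, DH, ER) indicate the expected oxidation state of the corresponding β-carbon in the product.

```python
# Toy illustration of co-linearity for an idealized modular type I PKS.
# All names and the example domain string are hypothetical.

def split_modules(domains):
    """Group an ordered list of domains into modules; a new module starts at each KS."""
    modules = []
    for domain in domains:
        if domain == "KS" or not modules:
            modules.append([])
        modules[-1].append(domain)
    return modules

def reduction_state(module):
    """Infer the beta-carbon state from the reductive domains present in a module."""
    present = set(module)
    if {"KR", "DH", "ER"} <= present:
        return "methylene"  # fully reduced
    if {"KR", "DH"} <= present:
        return "alkene"     # reduced and dehydrated
    if "KR" in present:
        return "hydroxyl"   # ketoreduction only
    return "keto"           # no reductive processing

# Hypothetical domain string: a loading module followed by three extension modules.
domains = [
    "AT", "ACP",
    "KS", "AT", "KR", "ACP",
    "KS", "AT", "DH", "KR", "ACP",
    "KS", "AT", "DH", "ER", "KR", "ACP", "TE",
]

modules = split_modules(domains)
# Only extension modules (those containing a KS) add a processed ketide unit.
predicted = [reduction_state(m) for m in modules if "KS" in m]
print(predicted)
```

Real pathways frequently depart from this idealized picture, which is why a sketch of this kind conveys only the logic of domain string analysis rather than a usable structure predictor.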
Phylogenomics provides a useful approach to infer gene function based on phylogenetic relationships as opposed to sequence similarities [30], [31]. While the evolutionary histories of PKS and NRPS genes are largely uninformative due to their size and complexity, KS and C domain phylogenies reveal highly supported clustering patterns. These patterns have been used to distinguish type II PKSs associated with spore pigment and antibiotic biosynthesis [32], type I modular and hybrid PKSs [33], and subsequently to identify many different PKS types [34]. KS phylogeny has also been used to predict pathway associations [26], [35] and, in some cases, the secondary metabolic products of those pathways [28], [36], [37]. Phylogenetics has also been used to successfully identify PKS sequences from complex metagenomic datasets [38]. Likewise, C domain phylogeny clearly delineates functional subtypes as opposed to species relationships [39] and has been used to identify new functional classes, such as the “starter” C domain [29]. Taken together, the established phylogenetic relationships of KS and C domains provide an effective framework within which to assess secondary metabolite gene richness and diversity and to identify new functional classes that may be associated with uncharacterized biosynthetic mechanisms.
Here we introduce the web tool Natural Product Domain Seeker (NaPDoS), which extracts and rapidly classifies KS and C domains from a wide range of sequence data. The results can be used to assess the potential for PKS and NRPS secondary metabolite biosynthesis in organisms or environments and to identify new phylogenetic lineages, which can subsequently be investigated as a source of new mechanistic biochemistry. We tested NaPDoS on four draft bacterial genome sequences and two metagenomic datasets. The results reveal a remarkable level of secondary metabolite gene diversity among closely related strains and provide a mechanism to assess secondary metabolism from poorly assembled genomic data.