Insertions and deletions (indels) are important sequence variants that are considered as phylogenetic markers that reflect evolutionary adaptations in different species. In an effort to systematically study indels specific to the phylum Nematoda and their structural impact on the proteins bearing them, we examined over 340,000 polypeptides from 21 nematode species spanning the phylum, compared them to non-nematodes and identified indels unique to nematode proteins in more than 3,000 protein families. Examination of the amino acid composition revealed uneven usage of amino acids for insertions and deletions. The amino acid composition and cost, along with the secondary structure constitution of the indels, were analyzed in the context of their biological pathway associations. Species-specific indels could enable indel-based targeting for drug design in pathogens/parasites. Therefore, we screened the spatial locations of the indels in the parasite’s protein 3D structures, determined the location of the indel and identified potential unique drug targeting sites. These indels could be confirmed by RNA-Seq data. Examples are presented that illustrate the close proximity of the indel to established small-molecule binding pockets that can potentially facilitate selective targeting to the parasites and bypassing their host, thus reducing or eliminating the toxicity of the potential drugs. The study presents an approach for understanding the adaptation of pathogens/parasites at a molecular level, and outlines a strategy to identify such nematode-selective targets that remain essential to the organism. With further experimental characterization and validation, it opens a possible channel for the development of novel treatments with high target specificity, addressing both host toxicity and resistance concerns.
The phylum Nematoda is one of the largest and most diverse phyla on the planet. At least 25,000 distinct nematode species have been described, and it is estimated that the actual species count may go well into the millions (Hugot et al., 2001). Members of this phylum are found in hot springs, polar ice, and almost everywhere in between, and their lifestyles vary from free-living to parasitic (with hosts including plants, vertebrates, insects, and even other nematodes). Plant and animal parasitic nematodes are of special concern because of their detrimental effect on the economy and global health. It is estimated by the WHO that 2.9 billion people are infected with parasitic nematodes (Hotez et al., 2007). In addition, parasitic nematodes cost the agricultural industry more than $80 billion per year in crop treatment and lost product (Nicol et al., 2011). Currently, the anthelmintic drugs used to treat and prevent nematode infections are becoming less effective as drug resistance increases among populations (e.g., (Wolstenholme et al., 2004; Wrigley et al., 2006)). As resistance increases, drugs with novel mechanisms of action and/or alternate therapeutic approaches for control are needed to combat these parasites.
In the past decade, fast-evolving DNA and RNA sequencing technology has greatly enriched our understanding of many organisms (including many nematodes) from a genomic perspective. The rapid growth of genome information for nematodes has led to many in-depth studies of their genetics, genomics and functional evolution (Brindley et al., 2009; Dieterich and Sommer, 2009; Mitreva et al., 2007; Sommer and Streit, 2011). This genomic data can also be exploited to better understand parasite adaptations at a molecular level, and to facilitate the pursuit of novel treatments for prevention and/or control. Parasite genes or proteins are often examined in terms of their potential to serve as targets of new treatments for parasite control. There are two main groups of proteins that can be exploited for these purposes: i) proteins that are specific to the parasite or ii) proteins that are highly homologous between the parasite and the host, but have diverged sufficiently to enable selective targeting in the parasite. These two groups of potential targets are non-overlapping and potentially provide promising targets for the development of drugs with low toxicity to the host.
Previous studies have examined drug targets unique to the target organism in order to minimize or eliminate toxic effects to the host (e.g. (Galperin and Koonin, 1999)), but if conserved (i.e. non-unique) essential proteins are eliminated from the target pool, then only a small fraction of the proteins are left for further exploration. For example, in a study of Bacillus subtilis, 96% of essential genes were found to be conserved in other bacteria and nearly 70% were found to be conserved in Archaea and Eukaryotes (Kobayashi et al., 2003). This indicates that if only the proteins that are unique to B. subtilis are examined for potential drugs, most of the proteome would need to be excluded. The majority of the nematode-specific proteins remain so distinct from host proteins that extrapolation of distant homology based on protein folds can only be used to infer putative functions for ~10% of the novel proteins (Yin et al., 2009). Therefore, selecting nematode-specific proteins as drug targets requires extensive experimental characterizations of their functions. This is also reflected in the fact that few of the current anthelmintics are targeted against species-specific or nematode-specific proteins.
On the other hand, proteins that are essential and conserved in multiple species are likely to be involved in core cellular processes (Kobayashi et al., 2003). A set of 458 core proteins shared among most eukaryotes has been previously defined (Parra et al., 2007), and these could prove to be more effective targets than species-unique proteins. However, unless differentiated regions are identified in order to facilitate specific targeting, there is a possibility of high toxicity to the host. The differential regions within these proteins can range from single amino acid changes to the insertion or deletion (indels) of multiple amino acids (Thorne, 2000). Indels have been shown to have a greater effect on protein structure and function than single amino acid changes that result from substitutions (Hormozdiari et al., 2009; Salari et al., 2008), and can also create a unique ligand binding site on the protein surface (Studer et al., 2013). It has been shown that indels rarely affect the structural scaffold of a protein, but much more often alter peripheral elements (Studer et al., 2013), which may lead to changes in binding sites that facilitate specific ligand binding.
A study (Wang et al., 2009) identified important roles of indels in nematode adaptation, but the focus was on the relevance of the indels for evolutionary adaptations, so many aspects related to the structural impact of the indels were not investigated. Comparisons of homologous protein structures in the Protein Data Bank have shown that indels are located in a non-random manner within a protein; specifically, they tend to be located in loop regions more frequently than elsewhere (Fechteler et al., 1995). In one study, up to 85% of indels were found in coiled regions of proteins (Pascarella and Argos, 1992). Indels have also been shown to vary in composition from other sections of proteins (Hsing and Cherkasov, 2008), and tend to be enriched in amino acids with small side chains and flanked by highly structured regions (Wrabl and Grishin, 2004). In a large scale study of bacterial/human homologs, sizeable indels were shown to exist in 5–10% of bacterial proteins with human homologs, and this number is even larger (~25%) for some protozoan pathogens (Cherkasov et al., 2006). A study of model species with available high-quality protein essentiality data (Bacillus subtilis, Escherichia coli, and Saccharomyces cerevisiae) has shown that indels are more prevalent in essential than in non-essential proteins (Chan et al., 2007).
By utilizing information about indels, it is possible to rationally design a ligand/drug that specifically targets a conserved protein in one organism without interfering with its homologue in another species by binding at the unique site (“indel targeting” (Cherkasov et al., 2005)). Indel targeting was successfully performed for the elongation factor 1-α (EF-1α) protein in the pathogen Leishmania donovani, a virulence factor that allows the intracellular pathogen to persist in the human macrophage (Nandan et al., 2007). EF-1α has greater than 80% sequence identity with its human homolog, but a 12 amino acid deletion in the L. donovani ligand binding site was exploited to design small molecules that selectively bind to L. donovani EF-1α. Other more recent reports have also explored sequence diversification among host and pathogen homologous proteins and studied their therapeutic potential (e.g. (Fox et al., 2009; Jansen et al., 2013; Kerr et al., 2010; Urbaniak et al., 2013; Wang et al., 2012)).
In this report, we present a systematic approach to identifying novel drug targets that leverages existing sequence information to expand and improve our knowledge of proteins bearing nematode-specific indels. The study is based on sequence data from 21 different nematode species (over 340,000 polypeptides derived from whole genome sequencing), and investigates nematode-specific indels in nematode proteins and their underlying biological function, amino acid composition, cost, druggability, and location in the spatial structure. We present a few cases that demonstrate whether specific inhibitions could be achieved by selectively targeting the indel regions on the protein structures. Our methodical evaluation and results provide key information useful for the selection of proteins to examine further as new drug targets, based on indel-selective targeting.
The workflow for the systematic analysis in this study is shown in Figure 1. Protein sequences from 21 nematodes (both parasitic and non-parasitic) were examined. The complete proteome datasets comprise 348,635 proteins generated in nematode genome-sequencing projects (http://nematode.net (Martin et al., 2015), and Wormbase-Parasite (Howe et al., 2015), Table 1). They were compared with 386,017 proteins from 11 outgroup or host species. For each species, isoforms of these protein sequences were examined against the coding genes, and only the longest isoform was kept when applicable. Protein families (orthologous groups) were defined from the final proteome datasets utilizing the Markov cluster algorithm (Enright et al., 2002) as implemented in the OrthoMCL package (Fischer et al., 2011; Li et al., 2003) with an inflation factor of 1.5. Each protein family consists of at least two proteins from one or more species. Among them, the final dataset for nematode-specific indel analysis consisted of those protein family clusters (PFCs) containing both NemFams (having sequences from at least 1 nematode species) and RefFams (having at least 1 non-nematode homolog). The NemFams and RefFams within each PFC were then split for the sequence alignments as discussed below.
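The isoform filtering step can be illustrated with a minimal sketch. The record layout (gene identifier, isoform identifier, sequence), the function name and the tie-breaking rule are assumptions made for illustration; the text only states that the longest isoform was kept.

```python
from typing import Dict, Iterable, Tuple


def keep_longest_isoforms(
    records: Iterable[Tuple[str, str, str]]
) -> Dict[str, Tuple[str, str]]:
    """Map each gene to its longest isoform.

    records: (gene_id, isoform_id, protein_sequence) tuples (hypothetical layout).
    Returns gene_id -> (isoform_id, protein_sequence).
    """
    longest: Dict[str, Tuple[str, str]] = {}
    for gene_id, isoform_id, seq in records:
        current = longest.get(gene_id)
        # Assumption: on ties, the first isoform encountered is kept.
        if current is None or len(seq) > len(current[1]):
            longest[gene_id] = (isoform_id, seq)
    return longest


# Toy input with made-up identifiers and sequences.
example = [
    ("geneA", "geneA.1", "MKT"),
    ("geneA", "geneA.2", "MKTLLV"),
    ("geneB", "geneB.1", "MSS"),
]
result = keep_longest_isoforms(example)
```

In this toy input, geneA is represented by its second, longer isoform, while geneB keeps its only isoform.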
The whole alignment and indel detection process followed a published protocol (Wang et al., 2009). Briefly, aligning the NemFam sequences with the RefFam sequences was a multi-step process. Within each PFC, the NemFam and RefFam sequences were first aligned separately using MUSCLE (Edgar, 2004). Before performing the alignment with the NemFam sequences, any RefFam sequence in a RefFam group that deviated from the mean length by more than 30% was removed. Once the separate RefFam and NemFam alignments were complete, they were combined and aligned using the profile alignment function of CLUSTAL-Omega (Sievers et al., 2011). After the profile alignment, the NemFam and RefFam sequences were again split into separate files and the NemFam sequences were further curated to improve alignment and reduce redundancy. Exons in nematodes tend not to exceed 100 AA in length (average length), thus any gap larger than 2 times the average length (200 AA) was removed from consideration, as it is likely to be an artifact due to disparity in polypeptide and protein size. Automated sequence alignment programs sometimes return alignments containing stretches of gaps interrupted by very short stretches of AA sequence. For easier data analysis, these were combined into a single, long stretch of gap sequence. The flanking regions (10 AA upstream and downstream) of each reported gap were then examined, and only gaps whose flanking regions contained at least 10 AA in total (i.e., 50% of the 20 flanking positions) were kept. Other gaps were combined with their peripheral gaps into a single gap for downstream analysis (Figure 1). Additionally, sequences that exhibited a poor alignment were removed. Any sequence that had a maximum pairwise percent identity less than 10% of the average percent identity of the entire alignment or less than 0.34 fraction of length was removed. Alignments were rerun for the PFCs with erroneous sequences.
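Three of the curation rules above (the 30% length filter, the 200 AA gap cap and the flanking-region check) can be sketched as follows. This is an illustrative reading, not the published pipeline: the function names, the half-open gap coordinates, and the choice to compute the mean length once over the whole group are assumptions.

```python
from statistics import mean


def filter_by_length(seqs, max_dev=0.30):
    """Drop sequences whose length deviates from the group mean by more than max_dev.

    seqs: dict of name -> unaligned sequence.
    Assumption: the mean is computed once, over all sequences in the group.
    """
    avg = mean(len(s) for s in seqs.values())
    return {n: s for n, s in seqs.items() if abs(len(s) - avg) <= max_dev * avg}


def gap_is_kept(aligned_row, gap_start, gap_end,
                flank=10, min_residues=10, max_gap=200):
    """Apply the gap-size cap and the flanking-region check to one gap.

    aligned_row: one aligned sequence, '-' marking gap columns.
    The gap occupies columns [gap_start, gap_end) (half-open, an assumption).
    """
    if gap_end - gap_start > max_gap:
        return False  # longer than twice the average exon length
    upstream = aligned_row[max(0, gap_start - flank):gap_start]
    downstream = aligned_row[gap_end:gap_end + flank]
    residues = sum(ch != "-" for ch in upstream + downstream)
    return residues >= min_residues


# Toy data: lengths 100, 105, 95 and 30 give a mean of 82.5.
group = {"a": "M" * 100, "b": "M" * 105, "c": "M" * 95, "d": "M" * 30}
kept_names = sorted(filter_by_length(group))

well_flanked = "ACDEFGHIKL" + "-" * 5 + "MNPQRSTVWY"
poorly_flanked = "AC--------" + "-" * 5 + "--------WY"
```

Here sequence "d" is dropped by the length filter, the first gap passes the flank check (20 flanking residues), and the second does not (4 flanking residues).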
The resulting improved alignments were used for insertion and deletion detection as previously described (Wang et al., 2009). For the purposes of this study, a gap absent from the RefFam sequences was recorded as a ‘nematode specific deletion’, while a gap present only in the RefFam sequences was recorded as a ‘nematode specific insertion’. Gaps were determined to be ‘shared’ among sequences within a multiple alignment if they overlapped by more than one third of their total length or by more than half of any individual gap. The length of a shared gap is the average length of its member gaps. A ‘background’ sequence is defined as the areas of the protein alignments not containing gaps.
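The overlap rule for shared gaps can be made concrete with a small sketch. The wording of the rule leaves room for interpretation, so the following rests on assumptions: gaps are half-open column intervals, "total length" is read as the combined span (union) of the two gaps, and the function names are hypothetical.

```python
from statistics import mean


def gaps_shared(a, b):
    """Decide whether two gaps count as one shared gap.

    a, b: (start, end) half-open intervals in alignment columns.
    Assumption: 'total length' means the union of the two intervals.
    """
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return False
    len_a, len_b = a[1] - a[0], b[1] - b[0]
    union = len_a + len_b - overlap
    return overlap > union / 3 or overlap > len_a / 2 or overlap > len_b / 2


def shared_gap_length(gaps):
    """Length of a shared gap: the average length of its member gaps."""
    return mean(end - start for start, end in gaps)
```

For example, `gaps_shared((10, 20), (12, 22))` holds because the 8-column overlap exceeds one third of the 12-column union, whereas `gaps_shared((10, 20), (19, 40))` does not, since a single shared column satisfies neither criterion.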
For each PFC, all the protein sequences were screened against the KEGG database v70.0 (Kanehisa and Goto, 2000) to associate them with functions and corresponding pathways using KEGGscan (Wylie et al., 2008) (Table S1). The associated KEGG Orthology pathways (KOs) for each PFC were assigned based on the KOs of all the protein sequences, in a step-wise approach similar to that previously reported (Wang et al., 2015). Each PFC was then assigned to one of the five major KEGG categories (Metabolism, Environmental information processing, Cellular processes, Genetic information processing and Organismal systems) and their subcategories based on the pathways in which it participates. If a PFC participates in more than one pathway and these fall into multiple KEGG (sub)categories, the subcategory with the most KO associations is assigned as the final subcategory of the PFC. The total number of indels possessed by the proteins with associated KEGG categories and the mean number of families associated with each pathway and pathway category were then calculated (indel rate, total indels/total family members).
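The subcategory assignment and the indel rate can be sketched as below. The input layout (one subcategory label per KO association) and the tie-breaking behaviour are assumptions; the text does not say how ties were resolved.

```python
from collections import Counter


def assign_subcategory(ko_subcategories):
    """Return the subcategory with the most KO associations for one PFC.

    ko_subcategories: list with one subcategory label per KO association.
    Assumption: ties go to the label seen first (Counter.most_common order).
    """
    if not ko_subcategories:
        return None
    return Counter(ko_subcategories).most_common(1)[0][0]


def indel_rate(total_indels, total_family_members):
    """Indel rate as defined in the text: total indels / total family members."""
    return total_indels / total_family_members


# Toy PFC with three KO associations (labels are KEGG subcategory names).
labels = ["Signal transduction", "Membrane transport", "Signal transduction"]
final_label = assign_subcategory(labels)
```

With the toy labels above, the PFC is assigned to "Signal transduction", and 30 indels across 12 family members give an indel rate of 2.5.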
RNA-Seq reads of Brugia malayi across multiple life-cycle stages were obtained from previously published work (Choi et al., 2011) and downloaded from Array Express (http://www.ebi.ac.uk/arrayexpress/, accession number E-MTAB-811). Analytical processing of the Illumina short reads was performed using in-house scripts that masked regions of low compositional complexity by converting them into Ns. Subsequently, Ns were removed and reads without at least 25 bases of non-N sequence were discarded. Contamination screening was also carried out to filter out standard contaminants (bacteria, human and ribosomes). Gene expression for each sample was calculated by mapping the screened RNA-Seq reads to the whole genomic DNA sequences using Tophat2 (Kim et al., 2013) (version 2.0.8) and calculating depth and breadth of coverage per gene using Refcov (version 0.3, http://gmt.genome.wustl.edu/gmt-refcov/).
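The read-level filter on masked sequence can be sketched as follows. The in-house scripts are not described in detail, so this is only one plausible reading: reads are assumed to be already masked, terminal Ns are trimmed, and a read is kept if at least 25 non-N bases remain. The function name is hypothetical.

```python
def screen_reads(masked_reads, min_non_n=25):
    """Trim terminal Ns and keep reads with at least min_non_n non-N bases.

    masked_reads: iterable of read sequences in which low-complexity
    regions have already been replaced by 'N' (assumed input).
    Assumption: only leading and trailing Ns are trimmed.
    """
    kept = []
    for read in masked_reads:
        trimmed = read.strip("Nn")
        non_n = sum(base not in "Nn" for base in trimmed)
        if non_n >= min_non_n:
            kept.append(trimmed)
    return kept


# Toy reads: 28 usable bases, 8 usable bases, exactly 25 usable bases.
reads = ["NNN" + "ACGT" * 7 + "NN", "ACGTNNNNNNNNNNACGT", "A" * 25]
screened = screen_reads(reads)
```

Only the first and third toy reads survive; the second has just 8 non-N bases.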
The amino acid composition and cost, and the underlying structure of inserted, deleted, shared and background sequences were determined as follows. Sequences present in both NemFam and RefFam groups are defined as ‘background’. Regions with gaps in the sequence alignment for a subset of NemFam sequences as well as a subset of RefFam sequences were annotated as ‘shared’. Insertions and deletions for specific nematode sequences were compared to those of the rest of the NemFam group as well as the associated RefFam sequences.
Eight different methods to estimate amino acid biosynthetic cost (Barton et al., 2010; Craig and Weber, 1998; Heizer et al., 2006; Seligmann, 2003; Wagner, 2005) were used to estimate the difference in synthetic cost resulting from the indels in the study. The average cost was calculated by summing the cost of the individual amino acids in a position and dividing by the total number of amino acids. Amino acid composition was determined by calculating the percentage of each amino acid in a sequence.
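The two quantities defined above reduce to simple per-residue calculations, sketched here for a single sequence. The cost table is a placeholder with made-up values and does not correspond to any of the eight published cost scales; the function names are likewise hypothetical.

```python
from collections import Counter


def composition(seq):
    """Percentage of each amino acid in seq."""
    counts = Counter(seq)
    return {aa: 100.0 * n / len(seq) for aa, n in counts.items()}


def average_cost(seq, cost_table):
    """Sum of per-residue costs divided by the number of residues."""
    return sum(cost_table[aa] for aa in seq) / len(seq)


# Placeholder costs for illustration only, NOT taken from any published scale.
toy_cost = {"G": 1.0, "A": 2.0, "W": 7.0}

toy_composition = composition("GGAW")
toy_average = average_cost("GGAW", toy_cost)
```

For the toy sequence GGAW, glycine makes up 50% of the residues and the average cost under the placeholder table is (1 + 1 + 2 + 7) / 4 = 2.75.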
The nematode species included in this study represented parasitic and non-parasitic nematodes, including both animal and human parasitic nematodes. Parasitic nematode proteins containing indels were aligned with known tertiary structures from the PDB using BLAST (threshold 1e-05, 35% identity at over 50% fraction of length) against 300,191 sequences (including multiple chains for a single PDB entry) to identify homologs. Secondary structure annotations were downloaded from RCSB PDB (Joosten et al., 2011) as annotated by DSSP (Kabsch and Sander, 1983). The druggability of each PDB structure was assessed using the ChEMBL DrugEBIlity portal (Bento et al., 2014; Gaulton et al., 2012), which predicts the suitability of the binding site for small molecules. If a PDB chain is reported by the database to have a positive score (including any of tractable, druggable or ensemble score), it is labeled as a druggable PDB structure. The druggability of the nematode proteins was then determined based upon the BLAST match with the PDB sequence.
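The hit thresholds and the transfer of druggability from a PDB chain to a nematode protein can be sketched as a filter over parsed hits. The `Hit` record, the function names and the chain identifiers are hypothetical, and the fraction of length is assumed to be measured against the query; no BLAST command-line options are implied.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    pdb_chain: str       # made-up identifier, not a real PDB entry
    evalue: float
    pct_identity: float
    align_len: int
    query_len: int


def passes_thresholds(hit, max_evalue=1e-5, min_identity=35.0, min_fraction=0.5):
    """E-value <= 1e-05, identity >= 35%, aligned fraction of the query > 50%."""
    fraction = hit.align_len / hit.query_len  # assumption: relative to the query
    return (hit.evalue <= max_evalue
            and hit.pct_identity >= min_identity
            and fraction > min_fraction)


def is_druggable(hit, positive_chains):
    """A protein inherits druggability from a passing hit to a positive-scoring chain."""
    return passes_thresholds(hit) and hit.pdb_chain in positive_chains


hits = [
    Hit("hypo1_A", 1e-30, 42.0, 180, 300),  # passes all three thresholds
    Hit("hypo2_B", 1e-3, 60.0, 250, 300),   # fails the E-value threshold
    Hit("hypo3_A", 1e-12, 30.0, 280, 300),  # fails the identity threshold
]
kept = [h.pdb_chain for h in hits if passes_thresholds(h)]
```

Only the first toy hit is retained, and it is labeled druggable only if its chain appears in the set of chains with a positive score.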
In each PFC, proteins having a PDB hit were also evaluated for the possibility of specific targeting at the indel locations (relative to its top PDB hit structure) using SiteHound (Ghersi and Sanchez, 2009), to identify any potential ligand binding sites. The NemFam sequences were mapped to the matching PDB sequences. If an indel was detected within 3 AA of any binding site identified by SiteHound, the indel was classified as a target site of interest.
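The proximity criterion can be expressed as a short check. This sketch assumes that "within 3 AA" refers to distance along the sequence of the matched PDB chain, and that the predicted site is available as a set of residue positions on that chain; the function name and coordinates are hypothetical.

```python
def near_binding_site(indel_start, indel_end, site_residues, window=3):
    """True if any site residue lies within `window` positions of the indel.

    indel_start, indel_end: inclusive residue positions of the indel on the
    matched PDB sequence (assumed coordinate system).
    site_residues: positions of residues lining a predicted binding site.
    """
    low, high = indel_start - window, indel_end + window
    return any(low <= r <= high for r in site_residues)
```

For instance, an indel at positions 50 to 51 is flagged when a site residue sits at position 54, but not when the closest site residue is at position 55.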
Modeling of protein structures was carried out for three selected candidates with the I-TASSER Suite 2.1 (Roy et al., 2010) using default parameters. Alignments were based on the indel identification process above. These models were refined using NAMD following a published protocol (Phillips et al., 2005). Molecular dynamics was run with 10 separate trajectories of 1 ns, and the last 100 ps of each were averaged to create the refined models.
In this study proteins from 21 nematode species were compared to 11 non-nematode reference species to identify indels that are specific to nematode proteins (Table 1). The overall workflow is presented in Figure 1 and details of the approach are presented in the Methods section.
Markov clustering of 513,419 nematode and reference proteins resulted in 50,298 homologous protein families, of which 35,922 had at least one nematode sequence (NemFams). Of these, 7,102 NemFams had at least one homolog from the reference species (RefFam). Further alignment improvement resulted in the identification of 6,423 protein family clusters (PFCs, i.e. NemFam vs. RefFam sequence alignments) for further analysis. Out of these 6,423 PFCs, 4,158 were associated with biological pathways, with 3,892 PFCs matching the 5 main functional KEGG categories (see Methods). The sequences from 68,408 nematode proteins in these 3,892 PFCs were examined for indels specific to nematodes (Figure 1).
The number of deletions (70,704) observed was approximately 1.7 times higher than the number of observed insertions (41,062). On average, deletions were significantly longer than insertions (p-value < 2.2e-16, 19 vs 11 AA) and there was a higher frequency of long deletions than insertions (>10 AA; Table 2).
A detailed analysis of amino acid usage in insertions, deletions, shared and background sequences is summarized in Table 3 and Figure 2. Compared with insertions and shared sequences, deletions are highly enriched in seven amino acids (F, I, L, V, C, W, and Y), notably including the three amino acids with the highest synthetic cost (Aglucose; F, W, and Y). Most amino acids in insertions appear with lower frequency than in background, with a few exceptions: D, E, N, Q, G, P, S, and T. Among them, only T is auxotrophic in nematodes. As expected, shared sequences almost always have an amino acid composition somewhere between that of deletions and insertions.
The eight methods used to calculate biosynthesis cost (Barton et al., 2010; Craig and Weber, 1998; Heizer et al., 2006; Seligmann, 2003; Wagner, 2005) are all highly correlated with amino acid composition, so pairwise comparisons of the average costs for the four categories generated by these methods all show similar patterns (Table S2). We report one set of representative results in Table 3, which was obtained from a recently developed systems biology approach based on genome-scale metabolic models (Aglucose (Barton et al., 2010)). Insertions tended to incorporate amino acids with the lowest average biosynthetic cost (0.940) while deletions possess amino acids with the higher cost (0.981). Shared gap regions had an average cost between the insertions and deletions (0.951), and the cost in background sequences was the highest among the four categories examined, at 0.994. Overall the cost of the amino acids essential in nematodes (auxotrophs (Barrett, 1991)) was higher compared to the cost of the nonessential amino acids (Table 3 and Table S3).
The secondary structure of the PFC proteins was determined by comparison to the PDB entries. In the aligned sequences, deletions have a higher percentage of ordered structures than insertions (especially for the abundant structural categories α-helix and β-strand), while insertion regions have higher proportions of bends, turns and loops (coils) (Table 4). Again, the compositions of the secondary structures for the shared sequences fall in between the values for insertions and deletions, while background sequences have a composition more similar to deletions.
Selective pressure can vary according to the pathway in which a protein functions, and this difference in selective pressure may result in a distribution of insertions and deletions that varies according to the biological pathway. The frequency of insertions and deletions was examined in five KEGG pathway categories (Table 5). PFCs that were identified as being involved in ‘Genetic Information Processing’ had the lowest frequency of both sizable (> 4 AA) and all insertions and deletions (Table 5 and Table S1), while ‘Environmental Information Processing’ had the highest frequency of deletions, and ‘Organismal Systems’ had the highest frequency of insertions. The rates of sizable insertions/deletions show almost exactly the same trend in each category as indels of all sizes.
Among the previously identified 34,002 proteins from parasitic species (2,821 PFCs) with a match in the PDB, 13,396 proteins (1,423 PFCs) are identified as druggable. The vast majority of them (12,719 proteins in 1,409 PFCs) contained nematode-specific indels. The distribution of these druggable proteins in the KEGG categories follows the overall distribution of indel-bearing proteins, with the majority being involved in ‘Metabolism’ (Figure 3, Table 6).
Indels have been suggested to serve as candidates for pathogen-specific drug targeting to reduce the likelihood of host toxicity. In our approach, we compared all members of each PFC to the matching PDB structures, and determined proximity to predicted binding site residues. An indel at the immediate periphery of a ligand binding site could alter the size and residue locations of the site, sometimes creating a unique binding site compared to the homologous host proteins. Approximately 70% of the PFCs (2,843 out of 3,892) have at least one protein hitting a PDB structure. In about 30% of the PFCs (1,141), at least one protein had been identified with an indel close to a potential ligand binding site. Below we describe three examples to illustrate how these indel sites could be exploited to design specific ligands that achieve selectivity.
Indel frequency has been shown to vary across different organisms. Studies of genetic variation in Drosophila melanogaster and Caenorhabditis elegans have shown that indels represent between 16% and 25% of all genetic polymorphisms in these species (Berger et al., 2001; Wicks et al., 2001). It is estimated that human populations typically harbor a minimum of 1.56 million indels (Mills et al., 2006). Not only does the frequency of indels vary, but in general, deletions are more prevalent than insertions. In a recent study examining 5,000 indel events in noncoding regions of 17 taxonomic groups across the three domains of life (Kuo and Ochman, 2009), deletion events outnumbered insertions in all groups. Deletions also outweighed insertions more in prokaryotes than in eukaryotes. In the current study, we also found this to be the case for nematodes compared with the reference sets at the deduced proteome level. Deletions were ~1.7 times more abundant than insertions in proteins found to be associated with the five examined KEGG categories. We showed associations with different modes of existence and uneven functional evolution. However, given the draft nature of the available genomes, our findings could be refined when alternative splicing or exon skipping information becomes available for nematode species in the future (at present, genome-wide alternative splicing isoform information is available only for the non-parasitic C. elegans and not for any parasitic nematode).
In the present study, we focused on the structural categories the indels fall into, by comparing the nematode proteins with known protein structures to further understand their impact and the consequences for novel drug discovery. The highly specific indel content within nematodes is reflected in their overall amino acid composition. We observed that both deletions and insertions were comprised predominantly of amino acids with small side chains and high turn propensity, such as G, P, S, N and Q. In addition, insertions are also enriched in the two hydrophilic residues D and E. Notably, none of these amino acids are auxotrophic in nematodes. Compared to background, both deletions and insertions tended to be depleted in hydrophobic amino acids (27.60% and 23.67% vs. 29.10%) but enriched in ambivalent amino acids (37.32% and 39.54% vs. 35.61%). Deletions are also slightly depleted in hydrophilic amino acids relative to the background (35.07% vs. 35.29%), while insertions are enriched for them (36.79% vs. 35.29%). These results are in line with what has been reported in previous work (Roth and Liberles, 2006) and in other indel databases such as IndelFR (Zhang et al., 2012) and IndelPDB (Hsing and Cherkasov, 2008) (Table S5). In those databases, gaps were limited to short lengths (99.9% of gaps were < 100 AA, while 90% of gaps were < 10 AA long), in contrast to our longer allowed gap length. Also, the observations in those databases are limited to the protein structures in highly homologous species, so there may be some differences for a few individual amino acids.
Deletions have a higher average biosynthetic cost than insertions due to their higher proportion of residues with large side chains such as F, W, and Y. This suggests that, setting aside the penalties of gap opening/closing, the cost of extending a gap region on a per-residue basis will be lower than that of the background sequences, to compensate for the cost of opening it. The average cost ratio (using eight different methods) of essential (auxotrophic) to nonessential amino acids in nematodes is 2.27 (Table S3). Our analysis only considers the cost differences for indels in parasitic nematodes, and does not include differences with either the parasite hosts or free-living species. Such analysis may provide insight into the nature of parasitism, since differences in amino acid usage between parasites and their free-living relatives may be the result of parasitic adaptation.
Indels have been shown to vary not only in amino acid composition but also in structural constitution. We found that for insertions, the majority of amino acids did not align with any portion of a PDB protein and thus it was impossible to directly determine the underlying structure. As loop regions of proteins are often not resolved in crystal structures and commonly have lower sequence identity, it is expected that many indels occur in loop regions. In contrast, deletions have significantly higher proportions of sequences aligned with PDB structures. In the aligned deleted regions, about ¼ of the sequences adopt loop or random coil conformations, which is just slightly less than the proportion of α-helical conformations among all secondary structure categories, while in the aligned inserted regions, loops account for as much as 40% of the structures, further supporting the idea that loops play an important structural role for nematode-specific indels.
There was a biased distribution of indel events within KEGG pathways, which is likely due to differences in selective pressure. Proteins that were associated with ‘Environmental information processing’ had the highest number of sizable insertions and deletions per PFC, while proteins that were involved in ‘Genetic information processing’ had the lowest average number of insertions and deletions per cluster, as previously reported (Wang et al., 2009). KEGG subdivides the category ‘Environmental information processing’ into three main subcategories (membrane transport, signal transduction, and signaling molecules and interaction), and it has been previously shown that mutations in proteins involved with signal transduction can result in an increased longevity and stress resistance (Longo, 1999). It is possible that the indels in these proteins occur as a result of positive selection. Proteins in ‘Genetic information processing’ have stringent selective constraints, and are under strong negative selection to preserve their functions (Bergmiller et al., 2012). Accordingly, these proteins have the least number of insertions and deletions per PFC.
Out of the 34,002 parasitic proteins with a match in PDB, 13,396 were identified as druggable and 12,719 were druggable parasite proteins with indels. Over half (7,058 out of 12,719, 55%) of these proteins were classified as being involved in metabolism. This suggests that further optimization of candidate compounds based on indel information may result in new approaches to control or prevent nematode infection.
As an example from the ‘Metabolism’ category, prostatic acid phosphatase (PAP, EC: 3.1.3.2) is a ubiquitous lysosomal enzyme that hydrolyses organic phosphates at an acid pH (Muniyan et al., 2013), with 4 structures of the human protein available. Indel location analysis identified a 2 amino-acid deletion specific to the human filarial nematode Brugia malayi at the immediate periphery of its active site (Figure 5A). Using gene expression data, we confirmed the presence of this deletion and its expression at relevant stages (i.e. parasitic stages). The mRNA sequences from multiple RNA-seq libraries from different developmental stages also validated the sequences flanking the gap regions, and showed that the expression of the protein is almost ubiquitous across multiple developmental stages, with highest expression in adult stages (Table S4, Figure S1), thereby validating the existence of the indel and providing an expression profile of the indel-bearing gene. Furthermore, to explore the consequence of the indel on the nematode protein structure, we built a homology model for PAP of B. malayi based on its alignment against the human structure (PDB code: 1ND5). The PAP protein from H. sapiens shares ~33% sequence identity with that of B. malayi. In comparison with the human crystal structure, the deletion clearly created a larger pocket (Figure 5B, C, and D), and a non-selective inhibitor could potentially be modified to be more specific for the active site of B. malayi PAP.
Another example is the NemFam-encoding, retinoic acid-related orphan nuclear hormone receptor (ROR) in the ‘Signaling molecules and interaction’ category. In the host species, the RORs are involved in many physiological processes, including regulation of metabolism, development and immunity as well as the circadian rhythm (Kojetin and Burris, 2014). In C. elegans, ROR is required for all larval molts and for the hypodermal expression of other proteins essential for larval development and adult morphogenesis (Kostrouchova et al., 2001). As shown in Figure 6A, our sequence alignment reveals a small, 2 aa insertion in the ligand binding domain (LBD) that is present in almost all of the parasitic nematode species within the NemFam. Using the human whipworm species Trichuris trichiura as an example, a homology model based on the human template (PDB code: 1N83) shows that the insertion results in the intrusion of the N-terminus of H7 into the tightly packed binding pocket (Figure 6B–D). The bound cholesterol in the crystal structure interacts with this region, so even this small change in amino acid composition could potentially be used to design parasite-specific ligands.
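Whether an indel lies close to a bound ligand, as in the two examples above, can be screened with a simple distance test on the template structure. The sketch below is one possible way to do this and is not the procedure used in this study: the 8 Å cutoff, the function name, and the toy coordinates are all assumptions made for illustration.

```python
import math
from typing import Dict, Iterable, List, Tuple

Coord = Tuple[float, float, float]


def residues_near_ligand(
    residue_coords: Dict[int, Coord],
    ligand_coords: Iterable[Coord],
    cutoff: float = 8.0,  # assumed distance cutoff in angstroms
) -> List[int]:
    """Return residue numbers with a representative atom within `cutoff` of any ligand atom."""
    ligand = list(ligand_coords)
    near: List[int] = []
    for resnum, xyz in sorted(residue_coords.items()):
        if any(math.dist(xyz, atom) <= cutoff for atom in ligand):
            near.append(resnum)
    return near


if __name__ == "__main__":
    # Toy coordinates (hypothetical): one representative atom per residue
    # flanking an indel, and a single ligand atom.
    flanking = {10: (0.0, 0.0, 0.0), 11: (3.0, 4.0, 0.0), 12: (30.0, 0.0, 0.0)}
    ligand_atoms = [(0.0, 0.0, 5.0)]
    print(residues_near_ligand(flanking, ligand_atoms))
```

In practice the coordinates would come from a parsed PDB entry, and the cutoff would need to be chosen with the size of the ligand and the pocket in mind.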
With drug resistance and environmental concerns rising, there is an urgent need for new anthelmintic therapeutics. In looking for new drug targets in parasites, two groups of protein candidates are of special interest: i) targets that are unique to the pathogen, avoiding proteins that are evolutionarily conserved between host and pathogen to reduce toxicity, or ii) targets that share homology with the host proteins (essential proteins) but possess molecular features (such as indels) specific to the pathogen that enable selective targeting. In recent years, steady progress in genome sequencing projects has generated large amounts of genomic data for nematodes and provided an abundance of resources to study the evolution, adaptation, and unique features of nematode proteins, especially for parasites. Indel analysis (in combination with other approaches such as druggability analysis and structural and functional annotations) at a genome-wide scale provides a systematic method of identifying novel potential drug targets. Classification and understanding of indel location, structure, and composition is important, as it provides information on specific events that improve our understanding of protein evolution, and it allows researchers to take advantage of such events in approaches such as selective targeting. By identifying and selectively targeting these structurally unique regions with small molecules, the method could open a new route for anthelmintic drug discovery.
We developed and applied a systematic approach for identifying, analyzing and evaluating specific indels present in the phylum Nematoda (in comparison with their host organisms) in order to understand the unique structural features of the indels. By scanning the indel locations for the parasitic druggable proteins in each cluster against the corresponding PDB structure, we were able to narrow the set down to about 20% of these proteins, which have interesting indel target sites. Because the sites differ in gap length and in the 3D conformation of their cavities, not all of them may be feasible targets for small molecules. However, the results indicate that indels are often located at critical regions of proteins, hypothetically creating novel ligand binding sites by altering the shapes and amino acid compositions of these sites. Among these, we presented three examples of indels in the binding sites of nematode proteins compared to those of the hosts. In each example, the indel creates a structural change in the binding site which may be exploited to design small molecules capable of specific binding to the nematode target.
Future studies of the indel-bearing proteins identified and characterized in this communication may improve our understanding of protein evolution in parasites (and nematodes in general), and may lead to new drug targets, anthelminthic drugs, and new strategies to control these parasites of global importance.
Figure S1. RNA-Seq mapping results of the B. malayi PAP gene. The gap region in Figure 5 is marked by the vertical line.
The authors thank Zhengyuan Wang for assisting with technical issues related to indel identification and John Martin for technical assistance with HMM model building. This work was supported by the National Institutes of Health NIAID (R01 AI081803) and NIGMS (R01 GM097435) to M.M. We thank the parasite genomics group at the Wellcome Trust Sanger Institute for making some of the unpublished reference genomes used in this study available at WormBase-ParaSite.
The orthologous groups, the proteins, their annotation, indel location and related features have been made publicly available through an interactive web interface at http://nematode.net (http://nematode.net/indels.html).
Authors’ contributions: QW, EH and MM conceived and designed the experiments. QW, BAR and EH carried out experiments and analyses. QW, EH, BAR, SAW, JWJ and MM interpreted results and prepared the manuscript. All authors have read and approved the final manuscript.