DMT families are present in diverse kingdoms and phyla
To ascertain the distribution of DMT families [Table ] in a diverse set of kingdoms and phyla and to obtain a representative set of protein sequences for further analysis, we selected a dozen model organisms - 9 animals, 1 fungus, 1 Amoebozoa
, and 1 plant [additional file 1
: supplementary table S1] - and then searched the proteomes of each for matches to the ten human DMT families [additional file 2
: supplementary table S2]. This was done using standard Pfam tools (see Methods). Within the organisms selected, the representation of families ranges from high (e.g. EamA/Zip/Cation efflux families each have ~10 proteins per species) to low (e.g., UPF0546 ~1 protein per species). All sequences are listed in [additional file 3
: supplementary table S3].
Number of DMT sequences returned in Pfam mining
For each DMT family, the members found in the 12 selected organisms were aligned at the protein level using MAFFT-EINSI [28
]. Manual editing was then performed to realign/remove poorly aligned sequences using Maxalign (v1.1) [29
], with the additional goals of retaining all human protein sequences and retaining regions with conserved transmembrane characteristics (defined here as regions in which over 80% of the sequences are predicted to be transmembrane according to Phobius, a leading transmembrane topology predictor [30
]). The alignments can be found in the additional material (additional file 4
Domain architecture of human DMT families
The conception of a "domain" as a stably folding protein segment is not technically applicable to membrane-inserted proteins, and we will in this paper use "domain" to describe a part of a membrane inserted protein that has an independent evolutionary history [31
Seven of the human DMT families were cut into N-terminal and C-terminal domains by symmetry. Cutting by symmetry means that we divide a 10 TM family as 5 + 5 TM structure, or that it is a single domain family that will not be subject to cutting. When we refer to 'single domain' DMT families, this can either mean that the domain tends to exist in single copy inserted in a longer sequence (e.g. TPT) or that the whole sequence is only that domain (e.g. UPF0546).
The three remaining ('asymmetric') human families that could not have domains defined in this straightforward way are: DUF803, Cation efflux, and Zip. In the case of Zip, there is considerable length of aligning sequence after the last TM block that is part of Pfam's Zip domain.
Canonical example sequences of these anomalous ('asymmetric') families were analyzed with DLP-SVM (an SVM that recognizes domain linker peptides) and TMHMM [Figure ], suggesting a 4 + 5, 4 + 2, and 3 + 5 TM architecture in relation to the DMT domain border, respectively. Finally, placement of the domain boundary in the Zip and Cation efflux families was determined by the concurrent location of Pfam low complexity regions as well as a large gapped region in the alignment, and in the case of DUF803 by use of a generic HMM recognizing the DMT-1 and DMT-2 domains. Tables S4 [additional file 5
: supplementary table S4] and 2 list different evidence type used to locate domain border and the final conclusion for each family.
Figure 1 Figure showing TM structure in relation to overlaid DLP-SVM prediction. The figure shows TMHMM transmembrane predictions and DLP-SVM predictions for example sequences representing human asymmetric DMT families: NIPA1 (DUF803), SLC30A1 (Cation efflux), (more ...)
Resolved dendrograms of DMTs identify stable subfamilies in EamA, Cation efflux, and Zip
Using the knowledge of the domain architecture, we then extracted just the first domain (here termed DMT-1) from the overall alignments obtained previously. We made 10 DMT dendrograms, such as the one shown in Figure , using the DMT-1 domains with RAxML [32
]. We resolved the trees [additional file 6
: supplementary figure S1; figure ], i.e. ensured that they do not contain any nodes with bootstrap support <50%, using tools to edit the bootstrap forests (Methods; additional file 7
: human_dmt-1.dendr.tgz). The number of sequences in the resolved tree is smaller than the original number of sequences because of the editing process [additional file 8
: supplementary table S5].
Figure 2 Figure of edited EamA first domain maximum likelihood bootstrap forest. The red boxes indicate independent branches discovered in the edited dendrogram (AMAC, SLC35C/E, SLC35F). The oldest model organism sequence is indicated with an asterisk. Notable (more ...)
The organisms found in the branches can in theory be treated as markers of the age of the branch system, but due to the possibility of excessive sequence deletion or polychotomous tree formation, these trees should be viewed as hypothesis generating material, rather than accurate date estimates.
Only three of the dendrograms are found to contain stable independent branches that exist in at least 6 organisms: EamA, Zip (PF02535), Cation efflux (PF01545). The EamA, Zip, and Cation efflux dendrograms each contain four distinct branches [additional file 9
: supplementary table S6]. In the other seven families the low number of sequences in the analysis limits the number of identifiable branches.
In Cation efflux and Zip, the SLC39A11 and the so called 'chicken-specific branch' were formed in or before T. adhaerens
. This could be related to the ion transport needs of the proto-synaptic system of T. adhaerens
, and the subsequent emergence of a primitive nervous system in N. vectensis
[additional file 1
: supplementary table S1].
Visualization of the similarity relationships between nucleotide sugar transporter domains reveals the key role of the EamA family
Before further detailed analysis of the individual family dendrograms, we sought to identify which DMT family was the most likely origin of the nucleotide sugar transporters. The aim was to find which of the DMT families having a nucleotide sugar transporter function displayed most similarity to the other nucleotide sugar transporters with respect to its DMT-1 and DMT-2 domains.
To obtain a quantitative measure of similarity, we trained HMMs on each domain halve of the nucleotide sugar transporter families (EamA, TPT, DUF914, UAA, NST), and then found the similarity between every pairwise combination of domain HMMs using the HHsearch program. The HHsearch probability [33
] values indicate the probability that the two HMMs are significantly related. Two complementary methods were then employed to visualize the relationships embedded within that matrix.
Non-metric multidimensional scaling [34
] was used (see Methods) to construct a two-dimensional representation of the similarity data (Figure ) in which the domains are positioned so that the distances between them reflects as much as possible the original dissimilarity values. The resulting configuration shows a striking bipartitioning of the domains that recapitulates whether they are first or second domains. Furthermore, the domains nearest the notional "boundary" between the two clusters are EamA-1 (DUF6-1) and EamA-2 (DUF6-2). Our interpretation of this is that it suggests that EamA was the original family from which the other four nucleotide sugar transporter families evolved.
Figure 3 Multidimensional scaling analysis and distance analysis of EamA-derived nucleotide sugar transporters, first and second domains. Figure 3A: Two-dimensional representation of the similarity relationships between the domains of the nucleotide sugar transporter (more ...)
The second method was to select the most highly similar pairs of domains, connect them by "edges" in a graph, and seek a visual representation of that graph. We used 99.05% HHsearch similarity as a threshold to define edges, because empirically 99.05% is the lowest threshold we could set and still obtain a connected graph between all ten nucleotide sugar transporter domain halves. A 99%-cutoff translates to a p-value of 5.20E-18, in the HHsearch conversion.
The result (Figure ) was a planar graph (v-e + f = 2), where v is the number of vertices, e the number of edges and f the number of faces. It can be seen from Figure that the EamA (and TPT) family nodes have the highest degree: d(EamA-1) = 5, d(EamA-2) = 6, d(TPT-1) = 3, and d(TPT-2) = 5. These results suggests that the nucleotide sugar transporters may have differentiated from EamA.
In addition, if we plot the closest DMT-2 neighbors of the DMT1s and vice versa, we note that this also results in a graph (data not shown) with highest degree assigned to EamA/TPT. The closest DMT-2 neighbor of NST-1 is TPT-2 (98.1%), and the closest DMT-2 neighbor of DUF914-1/UAA-1 is EamA-2 (98.7 and 97.7% HHsearch prob.). The closest DMT-1 neighbor of NST-2/UAA-2/DUF914-2 is EamA-1 (95.2, 98.5, and 98.5% HHsearch prob.). Our interpretation of these results is that the nucleotide sugar transporters have differentiated from EamA.
Comparing inter-domain and EamA distances of human 5+5 TM structure DMTs reveals likely ancestral position of EamA
We produced a "divergence process table" [Table ], using the values from HHsearch for the first and second domains of DUF914, EamA, NST, TPT, UAA families. The table shows the increasing distance from 0.6 to 96.6%, measured in units of 100-HHsearch probability, between the first and second domains, as well as the increasing distance to EamA from zero to 2.4%, measured in units of 100-HHsearch probability. Note that the ordering of the interdomain distances exactly replicates the ordering of the distance from EamA, and that the NST interdomain halve similarity is surprisingly low (96.6% expressed as distance, despite close relationship with EamA).
Divergence process table of domain evolution from EamA
Thus the data is consistent with a scenario in which the nucleotide sugar transporters were formed from a duplication event in EamA. Next we examine the HMMs for the human 5+5 TM nucleotide sugar transporters and find recurrent motifs, before returning to the resolved dendrogram for EamA to study its component sequences.
Using the consensus sequences of HMMs for human EamA-derived 5+5 TM DMTs to study sequence evolution identifies G-X(6)-G motif
Table shows the matching TM segment-internal residues for pair-wise comparisons of HHsearch consensus sequences of pairs of DMT domain halves that are presumed, from Results, to have evolved from each other. A recurrent motif is identified in the 5th TM helix of the second DMT domain, G-X(6)-G. It appears to have been lost in NST, but exists in both the first and second domains of EamA, and in the second domain of TPT, DUF914, and UAA. The consensus sequence residues in the HHsearch models have a conservation of ~33% (depending on the amount of gaps - personal communication with Johannes Söding), but could be much higher. Inspection of the EamA alignment (additional file 4
: dmt.aln.tgz) reveals that the conservation of these glycines are 81% and 68%, respectively, in the glycine-6-glycine motif, making it the most conserved intra-helical motif in the EamA alignment. A MEME motif search returned G-X(6)-G as a constituent of a motif, having the regular expression K[VI][VL]GT[LI][VLI][CS][VI][GA]GAL[VL][ML]T[LF]YKGP and the e-value 3.0e-338.
Comparison of consensus sequence in first and second domain of EamA-derived 5+5 TM structure DMTs
Oldest model organism sequence in subfamilies in the edited EamA bootstrap forests reveals origin in Animalia of AMAC and SLC35C/E subfamilies
Both versions of the EamA tree contain the AMAC and SLC35C/E clusters, suggesting that the removal of sequences does not affect the main subfamilies of EamA. We selected EamA for exact dating of independent branches because it is established in Results that it is one of three human DMT families that display distinct branches, and in Results that EamA is the origin of human 5 + 5 TM structure DMTs. The following EamA subfamilies were found: AMACs, SLC35Fs, SLC35C/Es and PUPs. The subfamilies are named from their human sequence constituents.
The oldest model organism represented in the AMAC and SLC35C/E branches are respectively: D. discoideum (sequence [Dictybase:DDB0184471]) and T. adhaerens (sequence [JGI:e_gw18.104.22.168]). Because excessive editing of the bootstrap forests could result in erroneous deletion of ancient organism sequences, the steps 2.8-2.10 are undertaken to confirm these results. The oldest model organism sequence of the SLC35Fs is an A. thaliana sequence [TAIR:AT4G32140.1-P], thus making it impossible to date the SLC35F subfamily using this model organism selection.
We used third party annotation (TPA) in DNA databank of Japan, DDBJ, to supplement annotation to the oldest model organism sequence in the independent branches of the resolved bootstrap forest of EamA: D. discoideum TMEM20 (sequence [Dictybase:DDB0184471]) was super-annotated as [DDBJ:BR000891], and T. adhaerens SLC35E1 (sequence [JGI:e_gw22.214.171.124]) was super-annotated as [DDBJ:BR000889].
TBLASTN analysis confirms age of subfamilies in the resolved EamA dendrogram
We extend the analysis to a new set of 12 species (9 new species, T. adhaerens
, D. discoideum
, and A. thaliana
; see Figure ), because a dozen species obviously represents a rather limited sampling of species, making our ability to resolve the time of emergence of any subfamily limited. We compared TimeTree divergence time data (measured from H. sapiens
) for the new set of 12 species to the TimeTree divergence time data from the old set of 12 species (reporting the difference between the corresponding ordinal measurements in the right hand margin of Figure ) [35
Figure 4 TBLASTN 2.2.24+ confirmatory search of oldest model organism sequence identified in subfamilies in edited EamA dendrogram. Using TBLASTN 2.2.24+, the human sequences of TMEM20 and SLC35E1, representing the "AMAC" and "SLC35C/E" subfamilies, are used as (more ...)
A TBLASTN 2.2.24+ querying of the human counterparts (human TMEM20 and human SLC35E1) of the oldest model organism sequences against the "nr" database (limiting the search to one model organism at a time) show that D. discoideum TMEM20 and T. adhaerens SLC35E1 sequences are indeed the oldest constituents in their respective subfamilies [Figure ], because there is a sharp drop (to ~1/3 of its score) in sequence quality of hits in model organisms with a higher divergence time to H. sapiens than the oldest model organism sequences.
SLC35C1, TMEM22 are absent in Viridiplantae
We took the other human constituents of the branches, i.e. SLC35C1, TMEM22 and queried them against "nr" plants (taxid:3193); the quality of the hits is 45-57 (E-values: 4E-04 and 5E-09), with a coverage of 40-67% for the respective best hit, for H. sapiens SLC35C1 [Genbank:AAH01427] and H. sapiens TMEM22 [GenBank:AAH22557]. AMAC is not represented in plants at all. These hits are not of comparable quality to those obtained between the human sequence and the oldest model organism sequences identified above, where we had scores above 100 and cover the full-length sequences. Thus, it appears that our oldest model organism sequence from the branches found in EamA in the hypothesis generating section of the study agrees with the result from BLAST-based method.
Molecular clock analysis of eukaryotic EamA branches estimates SLC35E1 and TMEM20 emergence to 779 and 1567 Mya
We determined molecular clock distances between the oldest model organism sequence in the AMAC and SLC35C/E subfamilies of EamA and the respective orthologue sequences in human. The simpler (clocklike) tree is rejected on a significance level of 5%. The following distances were obtained: TMEM20 (1.79 units from oldest model organism sequence to human orthologue in the AMAC subfamily; this measurement refers to a pairwise sequence comparison), SLC35E1 (0.89 units), and SLC35F5 (1.86 units - as a reference measurement). These distances theoretically correspond to the divergence time estimate of H. sapiens-D. discoideum (1628 Million years ago; Mya), H. sapiens-A. thaliana (1628 Mya - using the Fungi/Metazoa group/Viridiplantae divergence time estimate), and H. sapiens-T. adhaerens (1009 Mya), using TimeTree divergence time estimates. Thus, using the SLC35F5 distance as a reference, the Mya estimates for the other sequence pairs are: TMEM20 (~1567 Mya) and SLC35E1 (~779 Mya). The divergence time of these sequences is congruent with an emergence of the AMAC and SLC35C/E subfamilies in Animalia, rather than in Viridiplantae.
Distribution in plants and bacteria provides strong evidence for EamA's ancestral position in nucleotide sugar transporter evolution
To gain a wider perspective on the evolution of the entire DMT clan, we also investigated the representation of all of the 19 DMT families listed in Pfam (i.e. the 10 families that have human members, and the other 9 families that do not). Using a separate extraction of 10+9 DMT families [additional file 2
: supplementary table S2; additional file 10
: supplementary table S7] in plants and bacteria, we compare the distribution of the number of sequences found [Table and additional file 11
: supplementary table S8].
Sequence retrieval of 19 DMTs in plants
The distribution of sequences between DMT families in plants and bacteria differs dramatically (compare Table and additional file 11
: supplementary table S8). In plants, the human DMT families we have discussed earlier are present in large numbers, using the sequence retrieval procedure presented in the methods section: TPT (512 sequences in plants), DUF914 (58 sequences), UAA (368 sequences), NST (75 sequences), DUF803 (90 sequences). In bacteria we obtain the following counts for the same families: TPT (3 sequences), DUF914 (1 sequence), UAA (0 sequences), NST (1 sequence), DUF803 (35 sequences). The DMT families not represented in human, presented in Table S7 [additional file 10
: supplementary table S7], display the opposite distribution pattern, having large numbers in the bacteria and small numbers in plants. For details about the sequences used to analyze DMT families not represented in human, see Table S9 [additional file 12
: supplementary table S9] and the additional alignments (additional file 4
The number of EamA sequences differs between the different plant species: poplar, maize, rice, and grape have more than 100 EamA proteins per species. Moss, algae and spruce have 20-50 copies per species. Peas, beans, and grass have less than 10 EamA proteins per species. Purple false brome, which is a monocot, has zero copies of EamA, using the sequence retrieval in Methods. The observation that EamA is the only 5+5 TM nucleotide sugar transporter that is present in both prokaryotes and eukaryotes provides convincing evidence (stronger than the HMM evidence) that EamA is the origin of the nucleotide sugar transporters found in H. sapiens.
MDR (Multi drug resistance) is identified as the likeliest single domain progenitor to EamA
In the EamA MAFFT-EINSI alignment (see Methods and additional file 4
: dmt.aln.tgz), 97.5% of the sequences have EamA in the first domain position. In addition, 44.2% of the sequences have EamA in the second domain position. The most prevalent non-EamA domain in the second position in the EamA alignment is TPT (found in 26.3% of sequences in the second position). This shows that EamA can exist in single copy form.
It also illustrates the high tendency for TPT to exist in heterogeneous constellation with other DMT domains. In the TPT alignment, all 138 sequences except one have TPT in the second DMT position. Of the sequences containing a TPT copy (the criteria for inclusion in TPT alignment), 43 are paired with an EamA domain in the first slot, and 3 sequences are paired with one each of UAA, NST or CRT-like. One sequence has a long C-terminal tail containing multiple non-DMT domains, and 91 sequences have no Pfam domain definition for the first "DMT slot", even though that area contains the same DMT-like TM helices as other annotated sequences.
Two HMMs were trained: one called 'EamA-1' trained on the first domain position in the EamA alignment (containing 97.5% EamA), and the second HMM on the first domain position in cases where the second position is not filled by EamA. The specialized single copy EamA HMM (called '1 × EamA') was taken to represent a more ancestral "unpaired" form of EamA.
Subsequently, 'EamA-1' and '1 × EamA' were queried against all single domain families having 4 or 5 TMs. Both of these EamA HMMs scored highest against MDR (97.5 and 96.4% HHsearch probability for 'EamA-1' and '1 × EamA', respectively). This result may indicate a close evolutionary relationship between EamA and the single domain family MDR [Table ]. MDR has a G-X(6)-G motif in its fourth trans-membrane region, with prevalence in the G positions of 47-88%, indicating that the 4th TM in MDR may correspond to the 5th TM in EamA.
Comparison of single domain DMTs with EamA
Further identification of recurrence of G-X(6)-G motif in paired DMT domains, excluding EamA-derived cases and Cation efflux (PF01545), confirms the importance of the motif
To buttress previously presented results concerning G-X(6)-G (Results), we identified G-X(6)-G in the remaining two domain DMTs. A measurement of >65% glycine frequency in TM-internal positions are recorded (Methods) [additional file 13
: supplementary table S10]. Three observations of G-X(6)-G are found in the 5th TM helix of either DMT-1 or DMT-2 of DUF1632, DUF803, Zip, CRT-like, SugT, RhaT, and FAE 3-ketoacyl-CoA synthase 1 family. Six additional observations of G-X(6)-G are found in the remaining TM segments of DMT-1 of DUF1632, DUF803, Zip, CRT-like, SugT, RhaT, and FAE 3-ketoacyl-CoA synthase 1 family.
The total frequency of G-X(6)-G (3 + 6 copies) in TM segments is almost twice as high as expected (~5 copies) from a simulation of random sequence, assuming a TM length of ~20 residues and a glycine frequency of 7% matching that found in our consensus sequences (Methods). Furthermore, if we consider only the 5th TM segment, where we expect to see ~0.5 G-X(6)-G, we have a 6-fold increase (we have 3 copies). Cation efflux (PF01545) does not contain any intra-helical glycine residues that are represented at a conservation level >65%, implying that this family differs in this structurally important aspect.
Creation of "breadth-first" clustering of first domains of 19 DMTs reveals three major groups: the EamA, DUF1632 and metal transporter clusters
Three clusters are discovered in a "breadth-first" clustering made using HHsearch probability to closest neighbor for the DMT-1 domain. These clusters are named EamA, DUF1632, and metal transporters [Figure ], from their human or most notable member family. The clustering principle is to join any nearest neighbor, making the clustering independent of any cutoff.
Figure 5 Breadth-first clustering of first domain of 19 DMTs using HHsearch HMMs. This figure was prepared using HHsearch all-against-all comparison of first domain of DMTs, to establish closest neighbor of each DMT. The arrows indicate in relation to which family (more ...)
Three clusters are defined: EamA (purple; based on nearest neighbor principle), DUF1632 (green; based on membrane orientation), and metal transporters (turquoise; based on TM and substrate profile).
The connections between DUF606/DUF1632 and FAE/SugT/RhaT are in the 70-80% HHsearch probability range, and DUF1632-1 displays 93.45% HHsearch probability to be related with TPT-2. DUF1632-2 displays 96.6% HHsearch probability to be related with EamA-1. The metal transporters display good similarity between DMT-1 and DMT-2 within the same faimily (93.8% for Cation and 75-97% for Zip). The similarity between Cation/Zip and other DMT families falls below 1% HHsearch probability (see additional file 14
'ALL HHS.ods' in HMM tar archive). CRCB is only very weakly similar to DUF606 (2.2% HHsearch probability).
All symmetrical two domain DMTs in the EamA cluster have cytosolic N- and C-termini, whereas all symmetrical two domain DMTs in the DUF6132 cluster have Golgi/endocytoplasmic reticulum/extracellular space-oriented (i.e. non-cytosolic) N- and C-termini, thus providing structural evidence corroborating the quality of the clustering. The membrane insertion is determined using Phobius prediction, but also agrees with the positive (K, R) inside rule. For example, in the DUF1632 alignment, 7 out of 9 K, R positions are placed between inbound and outbound (cytosolic segment) Phobius-predicted TM helices. Large TM variation could be observed in DUF606 [26
The membrane orientation illustrated using TMRPres2D [36
] in Figure shows, using example structures listed in the corresponding figure legend, the membrane orientation >50%. The lowest score for a two-domain case is Zip (58% non-cytosolic insertion), suggesting that this family differs substantially in this respect, and is unusually prone to variability in membrane insertion for a double domain DMT.
Examination of FAE 3-ketoacyl-CoA synthase 1 (PF07168) shows that it is unlikely to be correctly annotated as homologous to enzymes
FAE 3-ketoacyl-CoA synthase 1 is a member of the newly found DUF1632 cluster (Methods). The annotation on the PF07168 is: "This family contains fatty acid elongase 3-ketoacyl-CoA synthase 1, a plant enzyme approximately 350 residues long." In the five seed sequences of PF07168, however, all sequences are found to have the 5+5 TM structure typical of DMT solute carriers, raising doubts if the enzyme annotation in Pfam is correct. A search of "PF07168" in UniProt returns six sequences that have status "reviewed", which are annotated as "ureide permease 1-5", and "ureide permease A3", thus raising further doubts if the enzyme annotation in Pfam is correct.
To compare the alleged FAE 3-ketoacyl-CoA synthase 1 sequences with expert annotated FAEs, the B. napus
ketoacyl-CoA synthase (KCS) sequence [GenBank:AF009563
] and L. annua
KCS sequence [GenBank:EU871787
] were obtained [37
]. These proteins are only attached to the membrane by 2 TMs in the N-terminal fifth of the sequence, thus presenting a completely different TM architecture.
A protein sequence search on the NCBI website, limiting the search field to "title" and "fatty acid elongase 3-ketoacyl-CoA synthase 1", returned 11 hits. The TM structure of these sequences in six cases was the KCS-like structure with N-terminal transmembrane attachment through 2 TMs. In the remaining cases, transmembrane helices could not be reliably predicted, and in one case [GenBank:AAM64564.1], there was a conflicting "ureide permease" annotation. In the next release of Pfam (25.0), the above results will form the basis of re-annotation of PF07168 from fatty acid elongase 3-ketoacyl-CoA synthase 1, to ureide permease (communication with Pfam).
A possible background to the anomalous annotation may be the fact that e.g. SLC35F5 in M. musculus and B. taurus have an extended gap between the first 2 TM segments and the remaining 8 TM segments, meaning that with incomplete sequence data, the overall N-terminal TM structure of SLC35F5 resembles the overall N-terminal TM structure of many plant fatty acid elongases.