|Home | About | Journals | Submit | Contact Us | Français|
The drug/metabolite transporter superfamily comprises a diversity of protein domain families with multiple functions including transport of nucleotide sugars. Drug/metabolite transporter domains are contained in both solute carrier families 30, 35 and 39 proteins as well as in acyl-malonyl condensing enzyme proteins. In this paper, we present an evolutionary analysis of nucleotide sugar transporters in relation to the entire superfamily of drug/metabolite transporters that considers crucial intra-protein duplication events that have shaped the transporters. We use a method that combines the strengths of hidden Markov models and maximum likelihood to find relationships between drug/metabolite transporter families, and branches within families.
We present evidence that the triose-phosphate transporters, domain unknown function 914, uracil-diphosphate glucose-N-acetylglucosamine, and nucleotide sugar transporter families have evolved from a domain duplication event before the radiation of Viridiplantae in the EamA family (previously called domain unknown function 6). We identify previously unknown branches in the solute carrier 30, 35 and 39 protein families that emerged simultaneously as key physiological developments after the radiation of Viridiplantae, including the "35C/E" branch of EamA, which formed in the lineage of T. adhaerens (Animalia). We identify a second cluster of DMTs, called the domain unknown function 1632 cluster, which has non-cytosolic N- and C-termini, and thus appears to have been formed from a different domain duplication event. We identify a previously uncharacterized motif, G-X(6)-G, which is overrepresented in the fifth transmembrane helix of C-terminal domains. We present evidence that the family called fatty acid elongases are homologous to transporters, not enzymes as had previously been thought.
The nucleotide sugar transporters families were formed through differentiation of the gene cluster EamA (domain unknown function 6) before Viridiplantae, showing for the first time the significance of EamA.
Transmembrane helix (TM) proteins form 27% of the human proteome [1,2]. Solute carriers (SLCs) constitute the second largest family of TM proteins . There are 51 SLC classes, according to sequence similarity and functional properties, containing at least 386 human SLCs [3,4]. Three of the largest SLC families, SLC30, SLC35 and SLC39, comprising at least 10, 23, and 14 human proteins, respectively, contain protein domains that are members of the drug/metabolite transporter ("DMT") clan CL0184 in Pfam 24.0 . A recent study presented evidence that the DMT-containing proteins are relatively dissimilar from other SLCs, and were present before the divergence of Bilateria .
The DMT clan comprises transporter proteins that have a remarkably wide substrate range, from proteins that transport nucleotide-sugar conjugates in the Golgi apparatus (SLC35), to metal ion transporters (SLC30, SLC39) and bacterial proteins that transport toxins, such as camphor, chloroquine, or ethidium bromide [7-9]. Interestingly, the SLC35 proteins could constitute one of the evolutionary bottlenecks of the emergence of multicellularity that depends on proteoglycans, which are built from nucleotide sugar-conjugates and exported to the extracellular matrix. Thus, evolutionary steps in cell surface molecule-dependent human biology may be reflected in DMT domain evolution [10,11]. For example, the Notch receptor, involved in cell fate determination, as well as T-cell lineage development commitment, intercellular communication and neuronal development, requires fucosylation and GDP-fucose transport by SLC35C1, SLC35C2 to function . Gene expression and copy number variation studies have identified SLC35E2 as a tumor suppressor gene in neuroblastoma , and SLC35E3 as an overexpressed gene in glioblastoma . SLC35F1 and SLC35F3-4 are uncharacterized but have been found to be expressed in brain, specifically in the cerebellum .
The superfamily of DMTs was defined and expanded using iterative homology search, using 14 pre-existing families, of which 6 are exclusively prokaryotic by Jack DL, Yang NM, Saier MH, Jr. (2001) . In the transporter classification database, there are 26 DMT families , using nomenclature adapted from the enzyme commission. Jack DL, Yang NM, Saier MH, Jr. (2001) presented a theory from their observations of TM structures in microorganism and bacteria-based phylogenies, that the DMTs have undergone duplications such that 4 TM proteins (i.e. proteins with 4 transmembrane-spanning regions) gained an extra helix, and then duplicated to a 10 TM state, or possibly duplicated first and then gained 1-2 extra helices [16,18]. Subsequently, Pfam, the most widely used domain database, introduced its DMT clan (CL0184), by applying a standard clan definition scheme on the iterative homology data set of Jack DL, Yang NM, Saier MH, Jr. (2001). In its present form, CL0184 has 19 member families, including the large and diverse EamA family, which is named after the O-acetylserine/cysteine export gene in E. coli and was previously known as "DUF6" . The Pfam DMT clan encompasses data from more species than Jack DL, Yang NM, Saier MH, Jr. (2001), but does not explicitly recognize that DMT-encoding genes have either single or double copies of DMT domains. This means that Pfam has been instrumental in defining the DMT superfamily through an automated scheme that does not incorporate the current evolutionary model of DMT proteins. The consequence is that the inclusion criteria is based on full-length sequence data, and may cluster sequences incorrectly.
Alphabetical nomenclature of SLC classes was introduced by HUGO/HGNC [20-22]. The work was, analogously to Pfam, performed independently of the original DMT phylogeny, and has been elaborated since 2004 in databases such as SLC tables (http://www.bioparadigms.org/slc/menu.asp). The SLC classification system has been shaped on the level of complete proteins -not domains- and is not specifically based on work that incorporates the two domain conjecture. In this paper, we delineate the evolution of these protein domain families considering the two parallel classification systems, one used for bacteria (DMT), and one for animals/H. sapiens (SLC). Until now, no comprehensive study has been published on DMT that is based on the two domain conjecture and uses all the 19 DMT families in Pfam, in animals, plants and bacteria. To focus our study, we will primarily investigate the evolutionary origin of nucleotide sugar transporters (NSTs) in the context of the DMT evolution. The term nucleotide sugar transporter is taken here to apply to all transporters of nucelotide sugars, not only the Pfam DMT family explicitly called "NST", but also other DMT families that transport such substances. The main difference between the DMT and SLC terms is that DMT is a type of domain, referring to the original study of Jack DL, Yang NM, Saier MH, Jr. (2001), and the machine annotated Pfam superfamily, whereas SLC is a large family of full-length proteins, which contains some subfamilies that contain DMT domains.
There are at least four reports that strengthen the theory that single domain proteins gain a helix and duplicate to a ~10 TM configuration. Firstly, it has been shown that if DMTs exist in a single domain form, their lysine and arginine residues (but not other positive residues; personal communication Prof. von Heijne) are carefully balanced in the membrane to enable 'dual topology' insertion [23,24]. 'Dual topology' means that a single domain DMT protein can insert into the membrane facing either direction, and implies that transport activity would necessitate interaction between two oppositely oriented DMT single domain units. Secondly, if DMT proteins exist in paired form in the same gene, the halves have permanent and opposite orientations, having their positive residues on the cytoplasmic side. The insertion direction of the EmrE gene product (Multidrug resistance family) can be controlled in a model system, to show that interaction between two oppositely oriented single DMT domains is responsible for protein activity . Thirdly, in confirmation there are genomically paired but un-fused single domain genes in the DUF606 family that are already locked in their lysine and arginine bias. Nine independent duplication events in DUF606 (5 paired, and 4 fused) can be demonstrated with evidence of evolution from 4 TM units to paired or fused genes that give fixed and opposite membrane orientation of the two DMT domains and resulting protein activity . Finally, a consensus of the EmrE structure was established from a 4 + 4 TM asymmetric structure, by cryoelectron microscopy and a revised X-ray structure, which differ by only 1.4 Ångström in root mean square deviation, confirming the asymmetric paired structure .
The current study defines the first and second domains in relation to the TM segments in the ten DMT families containing human proteins, henceforth referred to as the "human DMT families" in this paper. Maximum likelihood (ML) phylogenetic trees are made for the first domains, to find models of DMT subfamily evolution. The phylogenetic trees are resolved, and the oldest model organism sequence in each branch is used to estimate the age of the subfamily. To identify the origin of human 5 + 5 TM architecture DMTs, hidden Markov model (HMM) comparison was applied on first and second domains. The use of both maximum likelihood bootstrap forests and hidden Markov models is particularly applicable to DMTs, as they constitute a large and diverse Pfam clan. Here we give a detailed definition and description of the human DMT/SLC35 family and present an intriguing evolutionary history which supports an ancient internal duplication in EamA.
To ascertain the distribution of DMT families [Table [Table1]1] in a diverse set of kingdoms and phyla and to obtain a representative set of protein sequences for further analysis, we selected a dozen model organisms - 9 animals, 1 fungus, 1 Amoebozoa, and 1 plant [additional file 1: supplementary table S1] - and then searched the proteomes of each for matches to the ten human DMT families [additional file 2: supplementary table S2]. This was done using standard Pfam tools (see Methods). Within the organisms selected, the representation of families ranges from high (e.g. EamA/Zip/Cation efflux families each have ~10 proteins per species) to low (e.g., UPF0546 ~1 protein per species). All sequences are listed in [additional file 3: supplementary table S3].
For each DMT family, the members found in the 12 selected organisms were aligned at the protein level using MAFFT-EINSI . Manual editing was then performed to realign/remove poorly aligned sequences using Maxalign (v1.1) , with the additional goals of retaining all human protein sequences and retaining regions with conserved transmembrane characteristics (defined here as regions in which over 80% of the sequences are predicted to be transmembrane according to Phobius, a leading transmembrane topology predictor ). The alignments can be found in the additional material (additional file 4: dmt.aln.tgz).
The conception of a "domain" as a stably folding protein segment is not technically applicable to membrane-inserted proteins, and we will in this paper use "domain" to describe a part of a membrane inserted protein that has an independent evolutionary history .
Seven of the human DMT families were cut into N-terminal and C-terminal domains by symmetry. Cutting by symmetry means that we divide a 10 TM family as 5 + 5 TM structure, or that it is a single domain family that will not be subject to cutting. When we refer to 'single domain' DMT families, this can either mean that the domain tends to exist in single copy inserted in a longer sequence (e.g. TPT) or that the whole sequence is only that domain (e.g. UPF0546).
The three remaining ('asymmetric') human families that could not have domains defined in this straightforward way are: DUF803, Cation efflux, and Zip. In the case of Zip, there is considerable length of aligning sequence after the last TM block that is part of Pfam's Zip domain.
Canonical example sequences of these anomalous ('asymmetric') families were analyzed with DLP-SVM (an SVM that recognizes domain linker peptides) and TMHMM [Figure [Figure1],1], suggesting a 4 + 5, 4 + 2, and 3 + 5 TM architecture in relation to the DMT domain border, respectively. Finally, placement of the domain boundary in the Zip and Cation efflux families was determined by the concurrent location of Pfam low complexity regions as well as a large gapped region in the alignment, and in the case of DUF803 by use of a generic HMM recognizing the DMT-1 and DMT-2 domains. Tables S4 [additional file 5: supplementary table S4] and 2 list different evidence type used to locate domain border and the final conclusion for each family.
Using the knowledge of the domain architecture, we then extracted just the first domain (here termed DMT-1) from the overall alignments obtained previously. We made 10 DMT dendrograms, such as the one shown in Figure Figure2,2, using the DMT-1 domains with RAxML . We resolved the trees [additional file 6: supplementary figure S1; figure figure2],2], i.e. ensured that they do not contain any nodes with bootstrap support <50%, using tools to edit the bootstrap forests (Methods; additional file 7: human_dmt-1.dendr.tgz). The number of sequences in the resolved tree is smaller than the original number of sequences because of the editing process [additional file 8: supplementary table S5].
The organisms found in the branches can in theory be treated as markers of the age of the branch system, but due to the possibility of excessive sequence deletion or polychotomous tree formation, these trees should be viewed as hypothesis generating material, rather than accurate date estimates.
Only three of the dendrograms are found to contain stable independent branches that exist in at least 6 organisms: EamA, Zip (PF02535), Cation efflux (PF01545). The EamA, Zip, and Cation efflux dendrograms each contain four distinct branches [additional file 9: supplementary table S6]. In the other seven families the low number of sequences in the analysis limits the number of identifiable branches.
In Cation efflux and Zip, the SLC39A11 and the so called 'chicken-specific branch' were formed in or before T. adhaerens. This could be related to the ion transport needs of the proto-synaptic system of T. adhaerens, and the subsequent emergence of a primitive nervous system in N. vectensis [additional file 1: supplementary table S1].
Before further detailed analysis of the individual family dendrograms, we sought to identify which DMT family was the most likely origin of the nucleotide sugar transporters. The aim was to find which of the DMT families having a nucleotide sugar transporter function displayed most similarity to the other nucleotide sugar transporters with respect to its DMT-1 and DMT-2 domains.
To obtain a quantitative measure of similarity, we trained HMMs on each domain halve of the nucleotide sugar transporter families (EamA, TPT, DUF914, UAA, NST), and then found the similarity between every pairwise combination of domain HMMs using the HHsearch program. The HHsearch probability  values indicate the probability that the two HMMs are significantly related. Two complementary methods were then employed to visualize the relationships embedded within that matrix.
Non-metric multidimensional scaling  was used (see Methods) to construct a two-dimensional representation of the similarity data (Figure (Figure3A)3A) in which the domains are positioned so that the distances between them reflects as much as possible the original dissimilarity values. The resulting configuration shows a striking bipartitioning of the domains that recapitulates whether they are first or second domains. Furthermore, the domains nearest the notional "boundary" between the two clusters are EamA-1 (DUF6-1) and EamA-2 (DUF6-2). Our interpretation of this is that it suggests that EamA was the original family from which the other four nucleotide sugar transporter families evolved.
The second method was to select the most highly similar pairs of domains, connect them by "edges" in a graph, and seek a visual representation of that graph. We used 99.05% HHsearch similarity as a threshold to define edges, because empirically 99.05% is the lowest threshold we could set and still obtain a connected graph between all ten nucleotide sugar transporter domain halves. A 99%-cutoff translates to a p-value of 5.20E-18, in the HHsearch conversion.
The result (Figure (Figure3B)3B) was a planar graph (v-e + f = 2), where v is the number of vertices, e the number of edges and f the number of faces. It can be seen from Figure Figure33 that the EamA (and TPT) family nodes have the highest degree: d(EamA-1) = 5, d(EamA-2) = 6, d(TPT-1) = 3, and d(TPT-2) = 5. These results suggests that the nucleotide sugar transporters may have differentiated from EamA.
In addition, if we plot the closest DMT-2 neighbors of the DMT1s and vice versa, we note that this also results in a graph (data not shown) with highest degree assigned to EamA/TPT. The closest DMT-2 neighbor of NST-1 is TPT-2 (98.1%), and the closest DMT-2 neighbor of DUF914-1/UAA-1 is EamA-2 (98.7 and 97.7% HHsearch prob.). The closest DMT-1 neighbor of NST-2/UAA-2/DUF914-2 is EamA-1 (95.2, 98.5, and 98.5% HHsearch prob.). Our interpretation of these results is that the nucleotide sugar transporters have differentiated from EamA.
We produced a "divergence process table" [Table [Table2],2], using the values from HHsearch for the first and second domains of DUF914, EamA, NST, TPT, UAA families. The table shows the increasing distance from 0.6 to 96.6%, measured in units of 100-HHsearch probability, between the first and second domains, as well as the increasing distance to EamA from zero to 2.4%, measured in units of 100-HHsearch probability. Note that the ordering of the interdomain distances exactly replicates the ordering of the distance from EamA, and that the NST interdomain halve similarity is surprisingly low (96.6% expressed as distance, despite close relationship with EamA).
Thus the data is consistent with a scenario in which the nucleotide sugar transporters were formed from a duplication event in EamA. Next we examine the HMMs for the human 5+5 TM nucleotide sugar transporters and find recurrent motifs, before returning to the resolved dendrogram for EamA to study its component sequences.
Table Table33 shows the matching TM segment-internal residues for pair-wise comparisons of HHsearch consensus sequences of pairs of DMT domain halves that are presumed, from Results, to have evolved from each other. A recurrent motif is identified in the 5th TM helix of the second DMT domain, G-X(6)-G. It appears to have been lost in NST, but exists in both the first and second domains of EamA, and in the second domain of TPT, DUF914, and UAA. The consensus sequence residues in the HHsearch models have a conservation of ~33% (depending on the amount of gaps - personal communication with Johannes Söding), but could be much higher. Inspection of the EamA alignment (additional file 4: dmt.aln.tgz) reveals that the conservation of these glycines are 81% and 68%, respectively, in the glycine-6-glycine motif, making it the most conserved intra-helical motif in the EamA alignment. A MEME motif search returned G-X(6)-G as a constituent of a motif, having the regular expression K[VI][VL]GT[LI][VLI][CS][VI][GA]GAL[VL][ML]T[LF]YKGP and the e-value 3.0e-338.
Both versions of the EamA tree contain the AMAC and SLC35C/E clusters, suggesting that the removal of sequences does not affect the main subfamilies of EamA. We selected EamA for exact dating of independent branches because it is established in Results that it is one of three human DMT families that display distinct branches, and in Results that EamA is the origin of human 5 + 5 TM structure DMTs. The following EamA subfamilies were found: AMACs, SLC35Fs, SLC35C/Es and PUPs. The subfamilies are named from their human sequence constituents.
The oldest model organism represented in the AMAC and SLC35C/E branches are respectively: D. discoideum (sequence [Dictybase:DDB0184471]) and T. adhaerens (sequence [JGI:e_gw184.108.40.206]). Because excessive editing of the bootstrap forests could result in erroneous deletion of ancient organism sequences, the steps 2.8-2.10 are undertaken to confirm these results. The oldest model organism sequence of the SLC35Fs is an A. thaliana sequence [TAIR:AT4G32140.1-P], thus making it impossible to date the SLC35F subfamily using this model organism selection.
We used third party annotation (TPA) in DNA databank of Japan, DDBJ, to supplement annotation to the oldest model organism sequence in the independent branches of the resolved bootstrap forest of EamA: D. discoideum TMEM20 (sequence [Dictybase:DDB0184471]) was super-annotated as [DDBJ:BR000891], and T. adhaerens SLC35E1 (sequence [JGI:e_gw220.127.116.11]) was super-annotated as [DDBJ:BR000889].
We extend the analysis to a new set of 12 species (9 new species, T. adhaerens, D. discoideum, and A. thaliana; see Figure Figure4),4), because a dozen species obviously represents a rather limited sampling of species, making our ability to resolve the time of emergence of any subfamily limited. We compared TimeTree divergence time data (measured from H. sapiens) for the new set of 12 species to the TimeTree divergence time data from the old set of 12 species (reporting the difference between the corresponding ordinal measurements in the right hand margin of Figure Figure4)4) .
A TBLASTN 2.2.24+ querying of the human counterparts (human TMEM20 and human SLC35E1) of the oldest model organism sequences against the "nr" database (limiting the search to one model organism at a time) show that D. discoideum TMEM20 and T. adhaerens SLC35E1 sequences are indeed the oldest constituents in their respective subfamilies [Figure [Figure4],4], because there is a sharp drop (to ~1/3 of its score) in sequence quality of hits in model organisms with a higher divergence time to H. sapiens than the oldest model organism sequences.
We took the other human constituents of the branches, i.e. SLC35C1, TMEM22 and queried them against "nr" plants (taxid:3193); the quality of the hits is 45-57 (E-values: 4E-04 and 5E-09), with a coverage of 40-67% for the respective best hit, for H. sapiens SLC35C1 [Genbank:AAH01427] and H. sapiens TMEM22 [GenBank:AAH22557]. AMAC is not represented in plants at all. These hits are not of comparable quality to those obtained between the human sequence and the oldest model organism sequences identified above, where we had scores above 100 and cover the full-length sequences. Thus, it appears that our oldest model organism sequence from the branches found in EamA in the hypothesis generating section of the study agrees with the result from BLAST-based method.
We determined molecular clock distances between the oldest model organism sequence in the AMAC and SLC35C/E subfamilies of EamA and the respective orthologue sequences in human. The simpler (clocklike) tree is rejected on a significance level of 5%. The following distances were obtained: TMEM20 (1.79 units from oldest model organism sequence to human orthologue in the AMAC subfamily; this measurement refers to a pairwise sequence comparison), SLC35E1 (0.89 units), and SLC35F5 (1.86 units - as a reference measurement). These distances theoretically correspond to the divergence time estimate of H. sapiens-D. discoideum (1628 Million years ago; Mya), H. sapiens-A. thaliana (1628 Mya - using the Fungi/Metazoa group/Viridiplantae divergence time estimate), and H. sapiens-T. adhaerens (1009 Mya), using TimeTree divergence time estimates. Thus, using the SLC35F5 distance as a reference, the Mya estimates for the other sequence pairs are: TMEM20 (~1567 Mya) and SLC35E1 (~779 Mya). The divergence time of these sequences is congruent with an emergence of the AMAC and SLC35C/E subfamilies in Animalia, rather than in Viridiplantae.
To gain a wider perspective on the evolution of the entire DMT clan, we also investigated the representation of all of the 19 DMT families listed in Pfam (i.e. the 10 families that have human members, and the other 9 families that do not). Using a separate extraction of 10+9 DMT families [additional file 2: supplementary table S2; additional file 10: supplementary table S7] in plants and bacteria, we compare the distribution of the number of sequences found [Table [Table44 and additional file 11: supplementary table S8].
The distribution of sequences between DMT families in plants and bacteria differs dramatically (compare Table Table44 and additional file 11: supplementary table S8). In plants, the human DMT families we have discussed earlier are present in large numbers, using the sequence retrieval procedure presented in the methods section: TPT (512 sequences in plants), DUF914 (58 sequences), UAA (368 sequences), NST (75 sequences), DUF803 (90 sequences). In bacteria we obtain the following counts for the same families: TPT (3 sequences), DUF914 (1 sequence), UAA (0 sequences), NST (1 sequence), DUF803 (35 sequences). The DMT families not represented in human, presented in Table S7 [additional file 10: supplementary table S7], display the opposite distribution pattern, having large numbers in the bacteria and small numbers in plants. For details about the sequences used to analyze DMT families not represented in human, see Table S9 [additional file 12: supplementary table S9] and the additional alignments (additional file 4: dmt.aln.tgz).
The number of EamA sequences differs between the different plant species: poplar, maize, rice, and grape have more than 100 EamA proteins per species. Moss, algae and spruce have 20-50 copies per species. Peas, beans, and grass have less than 10 EamA proteins per species. Purple false brome, which is a monocot, has zero copies of EamA, using the sequence retrieval in Methods. The observation that EamA is the only 5+5 TM nucleotide sugar transporter that is present in both prokaryotes and eukaryotes provides convincing evidence (stronger than the HMM evidence) that EamA is the origin of the nucleotide sugar transporters found in H. sapiens.
In the EamA MAFFT-EINSI alignment (see Methods and additional file 4: dmt.aln.tgz), 97.5% of the sequences have EamA in the first domain position. In addition, 44.2% of the sequences have EamA in the second domain position. The most prevalent non-EamA domain in the second position in the EamA alignment is TPT (found in 26.3% of sequences in the second position). This shows that EamA can exist in single copy form.
It also illustrates the high tendency for TPT to exist in heterogeneous constellation with other DMT domains. In the TPT alignment, all 138 sequences except one have TPT in the second DMT position. Of the sequences containing a TPT copy (the criteria for inclusion in TPT alignment), 43 are paired with an EamA domain in the first slot, and 3 sequences are paired with one each of UAA, NST or CRT-like. One sequence has a long C-terminal tail containing multiple non-DMT domains, and 91 sequences have no Pfam domain definition for the first "DMT slot", even though that area contains the same DMT-like TM helices as other annotated sequences.
Two HMMs were trained: one called 'EamA-1' trained on the first domain position in the EamA alignment (containing 97.5% EamA), and the second HMM on the first domain position in cases where the second position is not filled by EamA. The specialized single copy EamA HMM (called '1 × EamA') was taken to represent a more ancestral "unpaired" form of EamA.
Subsequently, 'EamA-1' and '1 × EamA' were queried against all single domain families having 4 or 5 TMs. Both of these EamA HMMs scored highest against MDR (97.5 and 96.4% HHsearch probability for 'EamA-1' and '1 × EamA', respectively). This result may indicate a close evolutionary relationship between EamA and the single domain family MDR [Table [Table5].5]. MDR has a G-X(6)-G motif in its fourth trans-membrane region, with prevalence in the G positions of 47-88%, indicating that the 4th TM in MDR may correspond to the 5th TM in EamA.
To buttress previously presented results concerning G-X(6)-G (Results), we identified G-X(6)-G in the remaining two domain DMTs. A measurement of >65% glycine frequency in TM-internal positions are recorded (Methods) [additional file 13: supplementary table S10]. Three observations of G-X(6)-G are found in the 5th TM helix of either DMT-1 or DMT-2 of DUF1632, DUF803, Zip, CRT-like, SugT, RhaT, and FAE 3-ketoacyl-CoA synthase 1 family. Six additional observations of G-X(6)-G are found in the remaining TM segments of DMT-1 of DUF1632, DUF803, Zip, CRT-like, SugT, RhaT, and FAE 3-ketoacyl-CoA synthase 1 family.
The total frequency of G-X(6)-G (3 + 6 copies) in TM segments is almost twice as high as expected (~5 copies) from a simulation of random sequence, assuming a TM length of ~20 residues and a glycine frequency of 7% matching that found in our consensus sequences (Methods). Furthermore, if we consider only the 5th TM segment, where we expect to see ~0.5 G-X(6)-G, we have a 6-fold increase (we have 3 copies). Cation efflux (PF01545) does not contain any intra-helical glycine residues that are represented at a conservation level >65%, implying that this family differs in this structurally important aspect.
Three clusters are discovered in a "breadth-first" clustering made using HHsearch probability to closest neighbor for the DMT-1 domain. These clusters are named EamA, DUF1632, and metal transporters [Figure [Figure5],5], from their human or most notable member family. The clustering principle is to join any nearest neighbor, making the clustering independent of any cutoff.
Three clusters are defined: EamA (purple; based on nearest neighbor principle), DUF1632 (green; based on membrane orientation), and metal transporters (turquoise; based on TM and substrate profile).
The connections between DUF606/DUF1632 and FAE/SugT/RhaT are in the 70-80% HHsearch probability range, and DUF1632-1 displays 93.45% HHsearch probability to be related with TPT-2. DUF1632-2 displays 96.6% HHsearch probability to be related with EamA-1. The metal transporters display good similarity between DMT-1 and DMT-2 within the same faimily (93.8% for Cation and 75-97% for Zip). The similarity between Cation/Zip and other DMT families falls below 1% HHsearch probability (see additional file 14 'ALL HHS.ods' in HMM tar archive). CRCB is only very weakly similar to DUF606 (2.2% HHsearch probability).
All symmetrical two domain DMTs in the EamA cluster have cytosolic N- and C-termini, whereas all symmetrical two domain DMTs in the DUF6132 cluster have Golgi/endocytoplasmic reticulum/extracellular space-oriented (i.e. non-cytosolic) N- and C-termini, thus providing structural evidence corroborating the quality of the clustering. The membrane insertion is determined using Phobius prediction, but also agrees with the positive (K, R) inside rule. For example, in the DUF1632 alignment, 7 out of 9 K, R positions are placed between inbound and outbound (cytosolic segment) Phobius-predicted TM helices. Large TM variation could be observed in DUF606 .
The membrane orientation illustrated using TMRPres2D  in Figure Figure55 shows, using example structures listed in the corresponding figure legend, the membrane orientation >50%. The lowest score for a two-domain case is Zip (58% non-cytosolic insertion), suggesting that this family differs substantially in this respect, and is unusually prone to variability in membrane insertion for a double domain DMT.
FAE 3-ketoacyl-CoA synthase 1 is a member of the newly found DUF1632 cluster (Methods). The annotation on the PF07168 is: "This family contains fatty acid elongase 3-ketoacyl-CoA synthase 1, a plant enzyme approximately 350 residues long." In the five seed sequences of PF07168, however, all sequences are found to have the 5+5 TM structure typical of DMT solute carriers, raising doubts if the enzyme annotation in Pfam is correct. A search of "PF07168" in UniProt returns six sequences that have status "reviewed", which are annotated as "ureide permease 1-5", and "ureide permease A3", thus raising further doubts if the enzyme annotation in Pfam is correct.
To compare the alleged FAE 3-ketoacyl-CoA synthase 1 sequences with expert annotated FAEs, the B. napus ketoacyl-CoA synthase (KCS) sequence [GenBank:AF009563] and L. annua KCS sequence [GenBank:EU871787] were obtained . These proteins are only attached to the membrane by 2 TMs in the N-terminal fifth of the sequence, thus presenting a completely different TM architecture.
A protein sequence search on the NCBI website, limiting the search field to "title" and "fatty acid elongase 3-ketoacyl-CoA synthase 1", returned 11 hits. The TM structure of these sequences in six cases was the KCS-like structure with N-terminal transmembrane attachment through 2 TMs. In the remaining cases, transmembrane helices could not be reliably predicted, and in one case [GenBank:AAM64564.1], there was a conflicting "ureide permease" annotation. In the next release of Pfam (25.0), the above results will form the basis of re-annotation of PF07168 from fatty acid elongase 3-ketoacyl-CoA synthase 1, to ureide permease (communication with Pfam).
A possible background to the anomalous annotation may be the fact that e.g. SLC35F5 in M. musculus and B. taurus have an extended gap between the first 2 TM segments and the remaining 8 TM segments, meaning that with incomplete sequence data, the overall N-terminal TM structure of SLC35F5 resembles the overall N-terminal TM structure of many plant fatty acid elongases.
In this paper we present a comprehensive hierarchical order for the different clusters of human DMT containing proteins that is based on the suggestion that DMTs consist of either one or two DMT domains. It has been our ambition to identify the independent branches of DMT families and trace how they have evolved in key model organisms, and to compare the relationship between DMT families to understand how they have been formed by domain duplications, presumably meiotic unequal crossing over. To this end, a bioinformatics strategy that combines the strengths of hidden Markov models and resolved maximum likelihood dendrograms was found to be the most successful approach to gain these results.
Here we propose that the DMT families were formed from EamA, most likely through a domain duplication event before the radiation of Viridiplantae. The evidence supporting this statement is that the sequence dissimilarity between domain 1 and 2 in the 5+5 TM structure nucleotide sugar transporters successively increases from 0.6 to 96.6% HHsearch probability (given as 100-p) between the two domain in each of: EamA, TPT, DUF914, UAA, and NST [Table [Table2].2]. Furthermore, the average sequence dissimilarity between each human 5+5 TM structure DMT family and EamA increase from 0.7 to 2.4% HHsearch probability, in the exact same order: TPT or DUF914, UAA, and NST.
The EamA-derived 5+5 TM structure families that are represented in H. sapiens, change their sequence number distribution pattern in Pfam greatly between bacteria and plants [Table [Table44 and additional file 11: supplementary table S8]. In bacteria, the EamA-derived families are present in very small numbers: TPT (3 sequences), DUF914 (1 sequence), UAA (0 sequences), NST (1 sequence). This can be compared to EamA, which has 13,428 sequences in the bacterial phylum listed in [additional file 11: supplementary table S8], providing further -and probably stronger- evidence that EamA was the bacterial ancestor of TPT, DUF914, UAA, and NST.
It can be noted in UniProt that there is some limited evidence pointing to a trend of increased specialization in known substrate repertoire for human, EamA-derived nucleotide sugar transporters, corresponding to the suggested evolutionary distance from EamA (Figures (Figures3A3A and and3B).3B). For instance, SLC35A1-A3 (NSTs) have one documented substrate each, whereas SLC35B2-3 (UAA) both share the same substrate, and SLC35D1 and D2 (TPT) have multiple substrates per transporter [additional file 15: supplementary table S11].
Thus, the EamA-derived families encompass all human SLC35 sequences: SLC35A1-5 (NST), SLC35B1-4 (UAA), SLC35F1-2 (DUF914), SLC35C1-2 (EamA, TPT), SLC35D1-3 (TPT), SLC35E1-4 (EamA, TPT), SLC35F3-5 (EamA).
We found that "AMAC" and SLC35C/E form two independent subfamilies, exhibiting bootstrap support >50% within the EamA tree [Figure [Figure2].2]. Moreover, the oldest model organism sequences within each subfamily indicate that the AMAC subfamily (but not AMAC itself) was formed in the lineage of D. discoideum [DDBJ:BR000891], and that the SLC35C/E subfamily was formed in the lineage of T. adhaerens [DDBJ:BR000889]. The human orthologues of these sequences are TMEM20 and SLC35E1 respectively.
By using H. sapiens TMEM20 and SLC35E1 as TBLASTN 2.2.24+ queries against a set of eukaryotic model organisms, it was confirmed that there is a sharp drop in sequence similarity after the presumed divergence time of the oldest model organism sequences [Figure [Figure4],4], well after the radiation of Viridiplantae.
As further evidence supporting that "AMAC" and SLC35C/E were formed in Animalia, a TreePuzzle Molecular Clock experiment shows that the estimated divergence time between TMEM20 in H. sapiens and D. discoideum is 1567 Mya. The Mya estimate between SLC35E1 in H. sapiens and T. adhaerens is 779 Mya. These results use the TreePuzzle divergence time estimate between H. sapiens and A. thaliana SLC35F5 as reference measurement, assuming that they diverged 1628 Mya.
These results indicate that the SLC35C/E lineage, which contains SLC35C1-2 transporters necessary for fucosylation of the Notch receptor , and the SLC35E1-4 transporters known to function as oncogenes/tumor suppressor genes in e.g. neuroblastoma  and glioblastoma , was probably formed in Animalia. TMEM22, from the "AMAC" subfamily, is also involved in cancer .
The AMAC (acyl-malonyl condensing enzyme) subfamily was formed in multicellular organisms, with a smaller divergence time from human than Viridiplantae. It is suspected "AMAC" is an incorrect enzyme annotation on a 5+5 TM structure DMT transporter , because of its high similarity to solute carriers, especially EamA domain containing proteins such as TMEM20. It should be noted however, that AMAC has a G-X(5)-G motif (not G-X(6)-G) in its final TM segment in DMT-2, showing that if it is a DMT, it differs in this structural respect. AMAC is an interchangeable, but more general biochemical term than FAE 3-ketoacyl-CoA synthase 1, which would refer only to synthase #1. HGNC has informed that, based on this study, the AMAC1 and AMAC-like (AMAC1L1, AMAC1L2, AMAC1L3) sequences, will be re-named to SLC35Fs in RefSeq for Human and Mouse in the future.
Glycine, which is a helix breaker in globular proteins, is found in TM helices . The G-X(6)-G motif is found in both EamA domains, the TPT-2 domain, the DUF914-2 domain, and in UAA-2 in the HHsearch consensus sequences [Table [Table3].3]. For two domain DMTs that are not EamA-derived, nine TM-internal G-X(6)-G are found that have a >=65% frequency, of which three are found in the 5th TM helix of DMT-2 [Table [Table3].3]. This motif may have the role of introducing flexibility, or permit ion passage in the helix, considering that a shift of seven residues would be the optimal distance to orient two residues one helix turn apart. It is stated on the DMT Pfam entry (CL0184) that many sequences contain a characteristic glycine-rich motif in the C-terminal sequence, but we have reported a more detailed characterization of this feature. Apart from the G-X(6)-G feature, and the two domain structure, it can be noted that DMTs (except metal transporters) tend to have short loops (~10 amino acids) between the TM segments, making the occurrence of any secondary structure in such linker peptides unlikely.
Three main clusters are defined by "breadth first" clustering of the first domain of 19 DMT families, connecting nearest neighbors [Figure [Figure5].5]. The DUF1632 and EamA clusters are reinforced by conserved membrane insertion orientation in 6 out of 11 of the families in the EamA cluster, and 4 out of 6 of the families in the DUF1632 cluster. Furthermore, the EamA cluster is reinforced by the fact that all nucleotide sugar transporters (TPT, DUF914, UAA, NST) are contained within the cluster. The strongest relations between the DUF1632 and EamA clusters (93.5 and 96.6% HHsearch probability between each halve of DUF1632 and TPT-2 or EamA-1) is >100-fold stronger than the intercluster edge between the metal transporter cluster and any member of the EamA cluster. The lower limit for sequence composition as measured by HHsearch "edges" between the families in the EamA cluster is >55% HHsearch probability (UPF0060 to MDR), compared to >1% in the metal transporter cluster. The paired, but asymmetric transmembrane structure of the metal transporters, and metal ion substrates, highlights the fact that the metal transporters differ substantially in these respects from the other DMTs. The high similarity between DMT-1 and DMT-2 in the metal transporters highlights that they appear to have been formed by domain duplication.
The annotation of PF07168 as fatty acid elongase 3-ketoacyl-CoA synthase 1 appears incorrect because the proteins containing this DMT domain have the same 10 TM configuration and sequence similarity with other solute carriers in the DMT superfamily, and not the 2 TM configuration of well documented fatty acid elongases. The UniProt reviewed annotation for FAE containing proteins is "ureide permease". In the next release of Pfam (25.0), this result will form the basis of re-annotation of PF07168 from fatty acid elongase 3-ketoacyl-CoA synthase 1, to ureide permease (communication with Pfam).
We have established that the SLC35 nucleotide sugar transporters were formed from a duplication event in EamA, probably before the radiation of Viridiplantae. A cluster of DMTs, called DUF1632, have non-cytosolic N- and C-termini and appear to come from a different duplication event than the nucleotide sugar transporters. We have discovered two independent branches within EamA that formed after the radiation of Viridiplantae. These independent branches are called "AMAC" and "SLC35C/E", from their human constituents. The AMACs (and fatty acid elongase 3-ketoacyl-CoA synthase 1) are probably not enzymes, but solute carriers similar to nucleotide sugar transporters. A new motif has been characterized, G-X(6)-G, strongly overrepresented in the 5th TM helix of DMT-2.
Complete protein sequence data for 12 model organisms was obtained on August 1, 2009 from the following databases: Ensembl, JGI, Dictybase, TAIR. A Pfam script, 'pfam_scan.pl' (v1.21), was then used to search the protein sequences with the Pfam-A (v23) hidden Markov models  for each of the ten DMT families represented in human, using Pfam's default parameters, assuring that hits did not overlap.
The sequences were aligned, family-by-family, using MAFFT-EINSI (v6.624b) , producing 10 alignments. The settings used were: defaults, 10 rebuilds. Transmembrane predictions with Phobius (v1.04) , a leading transmembrane topology predictor, using default settings, were made for all sequences. The alignments and transmembrane predictions were viewed in Jalview (v2.5.1) . The alignments were edited to remove poorly aligned sequences, using the Maxalign (v1.1)  exclude selection, with the additional goal to retain all human sequences and achieve aligning TM segments in >80% of the sequences for each conserved TM block. All editing was done by sequence removal, i.e. not by allowing insertion of gaps. The edited alignments can be found in the additional material (additional file 4: dmt.aln.tgz), and the new number of sequences after this step is indicated in parentheses in Table Table11.
Assuming, from previous theory , that DMT containing proteins consist of either one or two units of 4-5 TM segments, two domain-containing alignments were divided in two halves, using the following methods. Seven out of ten of the human DMTs can be divided by symmetry, except: DUF803, Cation efflux, and Zip families. DLP-SVM  was applied on canonical sequences from the asymmetric families, with offset 30, threshold 0.5, and rank 1 [see Figure Figure11].
To successfully divide the DUF803 alignment, two "generic" HMMs (one for DMT-1 and one for DMT-2), were first trained on the first and second domains in the seven symmetric human DMT families using HMMER 3. We used hmmbuild, hmmpress, hmmscan --tblout against the sequences. Each of these generic HMMs were then applied to the sequences in the DUF803 alignment using the criteria for classification as having more than 50% of the sequences in the proposed domain exceeding a 1e-10 E-value cutoff. In the Cation efflux and Zip families, a domain border was determined using identification of ~100 residue length alignment gaps concurring with low complexity regions in the dominant Pfam architecture [Table [Table66].
RAxML-III  was used via the 'easyRax.pl' script with the 'fast & easy' settings as described in the manual: bootstrap (BS) maximum likelihood protocol, WAG model, estimate proportion of invariable sites, empirical base frequencies, generating 100 bootstraps. The BS forests were edited with aid of two tools, Summary Tree Explorer (v1; STE), an open-source Java application for interactively exploring sets of phylogenetic dendrograms developed by Mark Derthick, Carnegie Mellon University, Pittsburgh, and "P4" (v0.88.r142), a Python package for phylogenetics developed by Peter G. Foster, The Natural History Museum, London.
The goal was to produce 10 phylogenetic trees that do not contain any nodes that have bootstrap support falling under 50%, by removing sequences that do not stably integrate in any branch system (subfamily). Ten dendrograms having bootstrap >50% were made by iteratively removing a group of <10 unstably clustering sequences, re-aligning the sequences and re-generating the bootstrap forest, and making a new assessment of the phylogenetic tree. This process was repeated until the dendrograms were resolved. STE can be used to see the probable effect of a certain sequence removal operation in advance, to save the time and effort of realigning and re-generating the bootstrap forest. The experimental work in STE is based on Leaf Support Values from P4, showing which sequences are most likely to be unstably clustered.
Assuming that the human nucleotide sugar transport families (based on UniProt annotation) that have a 5+5 TM structure are likely to have a common origin, we used HHsearch  to train hidden Markov models (HMMs) on the first and second domains of human 5+5 TM structure nucleotide sugar transport DMTs: DUF914, EamA, NST, TPT, UAA. The alignments were converted to 'a3m' format (using reformat.pl with the -M 50 flag) and also subjected to buildali.pl (using 0 PSI-BLAST iterations), to enable calculation of probability of homology. We used a common calibration database containing SCOP folds which was provided with HHsearch, to be able to calculate E-values and generate comparable models. We compared HMMs against HMMs, not against sequence. We used a 99.05% HHsearch probability cutoff to count so called "edges", because 99.05% was the highest cutoff that retained a connected graph. We then applied Euler's planar graph theorem to test if the graph was planar. We loaded the planar graph in Cytoscape 2.6.3 , and organized the graph such that it had no overlapping edges [Figure [Figure3].3]. From this, we postulated a hypothesis as to which of the five families (DUF914, EamA, NST, TPT, UAA) was most likely to be the origin of the other four families, based on which family had the highest degree, d.
Non-metric multidimensional scaling was performed using the ALSCAL algorithm , as implemented in SPSS v.14, with the s-stress convergence parameter set at 0.0001. Similarity values were transformed into dissimilarities through the rule (100 - similarity %) prior to ALSCAL analysis.
We used the HHsearch (HMM) consensus sequences from each pair-wise comparison of nucleotide sugar transporters, comparing domain halves presumed (from Results) to have evolved from each other, to study how this was reflected on the sequence level. We recorded matching conserved residues in the aligned consensus sequences, as can be found in the '.hhr' files generated by HHsearch. Furthermore, MEME (http://meme.sdsc.edu/meme4_6_0/cgi-bin/meme.cgi) was used, with default settings on the EamA alignment, to identify the three most common motifs in EamA.
Two branches that lack proteins from A. thaliana, and hence appear to have formed in Animalia, have been found in our resolved EamA dendrogram: AMAC and SLC35C/E (see Figure Figure2).2). The oldest model organism sequences, [Dictybase:DDB0184471] and [JGI:e_gw18.104.22.168], were used as queries in TBLASTN 2.2.24+ searches of the "nr" database. The results showed that [Dictybase:DDB0184471] is a TMEM20-like sequence and [JGI:e_gw22.214.171.124] an SLC35E1-like sequence, confirming their identity as seen in the resolved EamA phylogenetic tree. These sequences were annotated in DDBJ, using third party annotation (TPA), according to §15 of INSDC TPA policy.
We took the full-length TMEM20 and SLC35E1 sequences from H. sapiens [Ensembl:NM_001134658, Ensembl:NM_024881], and used them as queries in TBLASTN 2.2.24+ searches of "nr", limiting the search to a set of 12 model organisms (different from selection in Methods - except retention of the species containing oldest model organism sequences (underlined)): P. troglodytes (Ptr), R. norvegicus (Rno), T. guttata (Tgu), C. milii (Cmi), S. purpuratus (Spu), N. vitripennis (Nvi), T. castaneum (Tca), H. magnipapillata (Hma), T. adhaerens (Tad), C. albicans (Cal), D. discoideum (Ddi),A. thaliana (Ath). We recorded the TBLASTN 2.2.24+ score for these hits, and observed a drop in score (of approx. 1/3) in or after the organisms containing the oldest model organism sequence found in our resolved dendrogram, i.e. T. adhaerens and D. discoideum.
We established in Methods that the AMAC and SLC35C/E subfamilies were present in D. discoideum (TMEM20) and T. adhaerens (SLC35E1), respectively. To confirm that no other constituent of the AMAC and SLC35C/E subfamilies, such as TMEM22 or SLC35C1, are actually older, we performed TBLASTN 2.2.24+ searches with full-length TMEM22 [GenBank:AAH22557] and SLC35C1 [GenBank:AAH01427] against "nr" plants (taxid:3193).
We took the oldest model organism sequences from each subbranch of AMAC, SLC35C/E and SLC35F: D. discoideum [DDBJ:BR000891] (TMEM20), T. adhaerens [DDBJ:BR000889] (SLC35E1), and A. thaliana SLC35F5 [NCBI RefSeq: NP_187364]. We compared these sequences with their human counterparts ([GenBank: AAI04815.1], [SwissProt: Q96K37.2], [GenBank:AAH18537.1]) in the following way.
The second DMT domain of the sequences were first aligned in MAFFT-EINSI (v6.624b) with default settings, using 10 rebuilds, and converted to Nexus format. A guide tree is made using TreePuzzle (v5.2), using the tree reconstruction option, quartet puzzling tree search procedure, clocklike branch lengths off, approximate parameter estimates, "VT" substitution model, and a uniform model of rate heterogeneity. Secondly, the guide tree was fed to TreePuzzle with the following options to calculate distances. Tree reconstruction option, uses user defined tree search procedure, clocklike branch lengths on, exact parameter estimates, "JTT" substitution model, mixed model of rate heterogeneity (1 invariable + 8 gamma rates). The A. thaliana SLC35F5 orthologue was specified as outgroup.
Selecting 9 additional DMT families not found in human, and selecting 9 plant organisms [additional file 16: supplementary table S12], we downloaded the full sequence set for 19 DMT families [additional file 2: supplementary table S2; additional file 10: supplementary table S7] using UniProt accession numbers as queries, to undertake a brief survey. Because this exercise is only performed for qualitative comparison purpose with bacteria, pfam_scan.pl is not used.
We also recorded the number of sequences in the Pfam species distribution on the Pfam website for Cyanobacteria, Proteobacteria, Bacteroidetes, Actinobacteria, Firmicutes, Archaea [additional file 11: supplementary table S8]. These sequences were aligned as in Methods, and domain architecture was determined in relation to TM structure as in Methods.
In addition to the abovementioned retrievals in 4.12, we also obtained all seed sequences for each of the non-human DMT families and used this as the basis for the ensuing HMM work with non-human DMTs.
HMMs were trained on bacterial DMT domains, using HHsearch as in Methods [additional file 14: dmt-hhm.tgz]. The sequences were taken from Pfam full or "seed" data sets, if the "seed" contained >= 10 sequences. These sequences were aligned and a first domain HMM was trained on them, as in previous methods. A HMM was trained on subsections of EamA alignment that contained only single copy of EamA (1 × EamA). The full set first domain EamA and 1 × EamA HHsearch HMMs were queried against the single domain families: UPF0546, UPF0060, DUF486 and MDR.
From the alignments of two domain DMT families, excluding EamA derived cases that have been analysed by HMM consensus sequence comparison, any glycine (G) position with >=65% frequency inside a TM segment, was recorded in [additional file 13: supplementary table S10]. The hypothesis from the EamA HMMs was an enrichment of G-X(6)-G constellations in the 5th TM segment of DMT-2 domains.
Assuming a glycine frequency of 7-8%, and simulating in a Perl script random sequences of glycines and non-glycines, in TM helices ~20 residues long, we established that on average, 7-8% of the TM helices should have at least one G-X(6)-G. Thus, in the ~70 (67) TM segments in [additional file 13: supplementary table S10], we expected to see ~5 TM segments with G-X(6)-G. Furthermore, in the 5th TM segment of DMT-2 domains (7 TM segments), we would expect to find only ~0.5 TM segments with G-X(6)-G.
A breadth-first clustering was made using HHsearch similarity score to closest neighbour for first domain, using 19 DMT families. "Breadth first" is defined as a clustering strategy where clusters are expanded breadth-wise. In the cartoon representation (Figure (Figure5),5), the spatial orientation of nearest domain family to a given query domain family is arbitrary. Note that this strategy obviates the need to use a clustering cutoff and deterministically returns the same number of clusters for the same input data.
The Phobius prediction for each alignment is used to determine the prevalent membrane orientation. Intercluster distance, i.e. the highest scoring intercluster connector of all possibilities, was determined from HHsearch scores. Furthermore, we identified the DMT family having the lowest HHsearch similarity score to its nearest neighbour, thus defining the lower limit of similarity of any DMT family to the other families.
AMAC: acyl-malonyl condensing enzyme; FAE: fatty acid elongase; DUF: domain unknown function; UPF: unknown protein function; DMT-1: first DMT domain; DMT-2: second DMT domain.
HBS, RF conceived the project. MSA designed the original workflow. AV, MWS and MSA completed the analysis. AV, MWS, MSA, and HBS wrote the paper. All authors read and approved the final manuscript.
List of model organisms in study. List of name, kingdom, phylum, class, divergence time from H. sapiens, database, and reason for inclusion. The asterisk indicates that the divergence time is the average estimate in Time Tree, not the TimeTree "expert" estimate.
DMT families found in H. sapiens. List of DMT name, Pfam identifier, description, and status whether present in H. sapiens. DUF stands for domain unknown function, and UPF stands for unknown protein function.
All DMT sequences from Pfam mining (Methods) except sequences removed in alignment editing. List of all sequences in the mining after alignment editing.
Alignments for 19 DMT families in Jalview format. Alignments for 19 DMT families in Jalview format. The archive file contains alignment files that can be loaded in Jalview, and '.mup' files that contain Phobius TM predictions and pfam_scan domain location predictions as Jalview loadable markup language.
Summary of evidence used to define alignment border. The table lists DMT families from [additional file 2: supplementary table S2], and whether alignment border can be supported by the following evidence types: symmetry, support from Jack DL, Yang NM, Saier MH, Jr. (2001), Pfam low complexity region, Jalview Quality Track (JQT), length gap, domain linker peptide SVM.
Unedited bipartitions tree of maximum likelihood bootstrap forest of first domain of EamA DMTs. This figure is included for comparison purpose with Figure 2. There is no lower bootrstrap support cutoff. The number of sequences is the same as in the parenthesized numbers in [Table [Table11].
Resolved dendrograms for human DMT-1, except EamA (treated in paper). The file contains the resolved dendrograms for: Cation efflux, TPT, UAA, NST, Zip, DUF914, DUF803, DUF1632, and UPF0546.
Table listing number of DMT sequences in resolved maximum likelihood bootstrap forests. The table lists number of DMT sequences in resolved maximum likelihood bootstrap forests. Due to the editing necessary to achieve dendrogram resolution, the sequence numbers are reduced as compared to Table Table11.
Table listing the number of sequences in independent branches found in resolved bootstrap forests. Summary for three of the bootstrap forests that contained independent subfamilies, showing the number of sequences in the branch system in the model organisms in [additional file 1: supplementary table S1]. The three-letter abbreviations are taken from the Latin names. The "chicken-specific branch" exists in older organisms than chicken, but does not contain any model organism sequences in M. musculus and H. sapiens. The numbers indicated can be subtracted from [Table [Table1]1] to obtain the number of sequences not members of the independent branches.
Table listing DMT families not present in human. The table lists the DMT name, Pfam identifier, description, and condition whether present in H. sapiens. The recommended domain border is shown, following Methods. All the DMTs not found in H. sapiens are either symmetric 5+5 or single domain DMTs.
Extraction of 19 DMTs in bacteria. Table lists number of bacterial full-length sequences found in Pfam species distribution for the 19 DMT families. Comparing these numbers to the numbers in plants in Table Table66 shows that the distribution pattern is drastically different, where only EamA, Cation efflux and Zip families are represented in large numbers between these tables.
Complete listing of sequences from Pfam seed or full sequence for the DMT families not present in human. The DMTs are the non-human DMTs in [additional file 10: supplementary table S7]. The table lists DMT type, UniProt identifier, whether Bacteria, Archaea, or Eukaryota, the phylum, and full name. UniProt identifiers that are listed without species details are present in the current version of Pfam, but obsolete in UniProt. If the Pfam seed was smaller than 10, the full sequence set was used.
Presence of glycine constellations in TM segments in first and second DMT domains in two domain DMTs not found to be derived from EamA, excluding Cation efflux. Using the sequence conservation criteria in the methods section (>65%), the glycine constellations are found in two-domain DMTs not found to be derived from EamA. The presumed domain border of the DMTs is indicated in brackets in A+B form. N/A means that the TM does not exist in the given protein. TM(B) indicates a TM in the DMT-2 domain. The notation G(6)G indicates two glycines separated by six residues, i.e. G-X(6)-G. Cation efflux does not contain the G6G domain (see Results). SugT, RhaT, and FAE are not found in H. sapiens.
Archive of HMM files. The gzipped archive contains the 32 HMM files, one for each first domain of the 19 DMTs, and 2nd domains for the DMT families that have a second domain. The archive also contains 'ALL HHS.ods', a spreadsheet containing the values for all pairwise comparisons between the HMMs.
Known substrates in H. sapiens of DMT nucleotide sugar transporters. The data are taken from UniProt annotation, having "reviewed" status.
Table listing 13 plant organisms, used in a separate extraction of DMT in plants. The table lists UniProt identifier, species name, common name, and reason for inclusion. The average divergence time, taken from TimeTree, is the average distance to the other representatives of Monocots, Dicots, Gymnosperms, Bryophytes, and Algae, from current example excluding its classification from the average.
We thank Samira Yazdi for completing the initial Pfam data mining; Karl JV Nordström (Max Planck Institute for Plant Breeding Research, Cologne, Germany) for designing parts of the original workflow; and Takehide Kosuge, for kind support with third party annotation in DDBJ. The study was supported by the Swedish Research Council, The Novo Nordisk Foundation, Swedish Royal Academy of Sciences, and Magnus Bergvall Foundation. RF was supported by the Göran Gustafssons foundation. Thank you Prof. Mary Norval, Edinburgh University, and Jonathan Burgos, for your advice on this project.