Identification of new members of the ependymin protein family
The InterPro [26
] Epd protein family (IPR001299) is composed of protein sequences bearing only the Epd domain. The architecture and signatures defining this domain are very distinctive [6
]. The fact that there are no domains/families known to be related to [InterPro: IPR001299] means that a protein can be assigned to the Epd family if it displays the sequence patterns established for the domain.
Taking advantage of recently sequenced genomes and ongoing EST projects, we were able to find 39 new members of the Epd protein family [see additional file 1
: Table_S1]. The assignment of each sequence to this family was corroborated by searches in the Conserved Protein Domains (CDD) database [27
], detection of potential N-glycosylation sites [28
], comparison of hydropathic profiles [29
], prediction of cysteines participating in disulfide bond formation [30
], and the presence of certain amino acids in conserved positions as compared to previously well-known Epd proteins. Most of the sequences passed every one of these tests. Cases where candidate ESTs possessed an N-terminal sequence were also verified by signal peptide screening [31
]. All the newly identified epd
sequences had significant similarity only with other Epd proteins as determined by Position-Specific Iterated Blast (PSI-BLAST) and standard BlastP searches [see additional file 2
: Additional_Text for detailed methods].
Interestingly, four of the new epd
sequences found during this study were from protostomes [see additional file 1
: Table_S1]: the mollusks Crassostrea gigas
(Oyster_Cgi), Crassostrea virginica
(Aplysia), and Biomphalaria glabrata
(Biomphala). The working names used here are given in parenthesis. The finding of Epd proteins in mollusks is remarkable since the Epd proteins were once thought to be chordate specific [9
] and later deuterostome-specific [12
]. The presence of Epd in mollusks clearly indicates that this family is older than previously thought and its origin can be traced at least to the origin of the protostomes. We failed to find Epd domain-containing proteins in other protostomes for which genome and EST data is available, such us Drosophila
and Caenorhabditis elegans
, but this does not rule out the possibility of finding additional family members in other protostomes once more metazoan species are sequenced. In addition, it has recently been proposed that model species such as Drosophila
and C. elegans
have suffered extensive gene loss during their natural history and that this hinders their utility in big scale genome comparisons [32
]. An alternative explanation is that the Epd protein family is restricted to the deuterostomes and lophotrochozoans while absent from the ecdysozoans.
Additional new members of the Epd protein family were found in the taxonomic groups: Echinodermata, Urochordata, Cephalochordata, Amphibia, Aves, Elasmobranchii, and fishes [see additional file 1
: Table_S1]. In the latter group, it is notable that this gene was previously known to be only expressed in the brain of teleost fishes [13
]. However, we found several new members of the protein family expressed in other fish tissues that appear to be duplications of the previously described brain genes. Thus, besides the genes known to be expressed in brain, we found three additional epd
genes in zebrafish and Fugu
; two additional copies in Tetraodon
, medaka, and salmon; and an additional copy in the catfish Ictalurus punctatus
. All these new genes have a broader expression pattern (non-restricted to brain tissue), as can be inferred from the tissue source information deposited by the authors in the corresponding ESTs databases.
The 39 newly identified Epd proteins [see additional file 1
: Table_S1] were combined with the 35 previously recognized members [see additional file 3
: Table_S2] to produce an alignment of 74 members of the family which served as the basis for our molecular evolutionary analyses.
The protein sequence logo [33
] representing the alignment of all the 74 Epd sequences (Figure ) confirmed that amino (N) and carboxyl (C) ends are not well conserved, increasing similarity in the Epd domain region as pointed out in previous studies [6
]. Inside the domain, several amino acids provided a hallmark of Epd molecules. The most relevant were four cysteines within the molecule primary sequence at positions 45, 118, 188 and 232 (numbered according to the WebLogo; Figure ). These residues have also been predicted by the Disulfind server [30
] to participate in disulfide bond formation, pointing to a key role in protein folding and/or dimeric interactions and ultimately to the biological function of the Epds [2
]. Two proteins (Ictalur_GS and Danio_Tj) were found to lack one of these Cys but this was probably due to sequencing errors on these ESTs. Pro residues were usually found one to three residues from the Cys, particularly next to the first and fourth Cys. In addition, all Epds showed an Asp about 29 residues down from the initial Cys (D74
in the WebLogo alignment; Figure ). A Tyr residue (Y73
) was found to precede the common Asp in all species except for echinoderms which showed a Phe at this position. Other residues common to Epds included: (i) the Gly (G146
) at about 28 residues from the second Cys, (ii) the Pro (P134
) at a halfway distance between the second Cys and the common Gly, and (iii) the Trp (W160
), 11–14 residues after the common Gly. Only 3 species lack this Trp (Oysters have either a Tyr or a Phe, and the tunicate Diplosoma listerianum
has a Tyr). Also notable was the high conservation of the residues P44
, and P229
. In view of their high positional conservation, and the yet to be determined tridimensional structures of Epd proteins, we can only suggest that these amino acids are especially important to the overall Epd function.
Figure 1 Sequence logo representation of the ependymin protein family. Overall comparison among all the 74 ependymin protein sequences used in this study in which the height of a given letter (amino acid residue) represents its frequency of occurrence at that (more ...)
The Epd protein family members can show highly divergent amino acid sequences but still maintain very similar hydropathy profiles, suggesting that the overall functional properties of the proteins are conserved [6
]. These hydropathic profiles have been used to link Epd amino acid similarity with their properties as secreted glycoproteins [6
]. As shown in Figure , the hydropathic profiles of Epd proteins are very similar. Thus, this family of proteins is mainly hydrophilic without transmembrane domains, but possesses a small, highly hydrophobic, region corresponding to the N-terminal signal peptide typical of secreted proteins.
Figure 2 Comparison of hydropathic profiles for selected ependymin proteins. The hydropathic profiles of two previously recognized ependymin proteins, goldfish 1 (1Carassius) and human (Epdr1_Homo) are compared with the profiles of a subset of the new members (more ...)
sequences were found in the zebrafish genome, all of which mapped to different linkage groups (LG) in the zebrafish assembly Zv6. The first sequence was the previously well-known zebrafish epd
(mapped to the LG 5). The other zebrafish Epd genes analyzed were: Danio_Tj
(LG 7), Dan_LvItEm
(LG 21) and Danio_MERP
(LG 2) [see additional file 4
: Table_S3 for detailed results from mapping]. Two of the zebrafish protein sequences (Danio_Bra and Danio_MERP) had higher similarity to other vertebrate Epd sequences such as the goldfish and the human Epd protein than to the other zebrafish sequences (Figure ). For instance, the amino acid similarity between the Danio_Bra and the goldfish Epd (1Carassius) expressed in brain was 97.2%, the similarity between the Danio_MERP and the human Epd protein (Epdr1_Homo) expressed in several tissues was 75.9%, while the similarity between Danio_Bra and Danio_MERP was only 60.5%.
Figure 3 Similarity relationship among the zebrafish ependymins and two previously known ependymin sequences. The amino acid similarity of two zebrafish ependymins (Danio_Bra and Danio_MERP) is higher across species (e.g., goldfish and human) than among the four (more ...)
Using genome and ESTs information from Tetraodon
, we found that three different epd
genes (Tetraod_Br, Tetraod_Tj, and Tetrao_MEL
) mapped to different genomic locations [see additional file 4
: Table_S3]. Pairwise comparisons of the three Tetraodon
protein sequences yielded percentages of similarity ranging between 35% and 43%, much less than the similarity expected if they were the same gene. Since the Tetraod_Br
genes were predicted from genome sequence and are not well supported by Tetraodon
cDNAs (as is the case of the Tetrao_MEL
gene), we have proposed that they are expressed by different tissues (i.e., brain and non-brain tissue), according to their location in the preliminary gene trees that we generated from the gathered data.
We found four Fugu
Epd sequences (Fugu_Brain, Fugu_HerGi, Fugu_Tj
, and Fugu_MERP
) localized to different scaffolds of the current Fugu
genome assembly [see additional file 4
: Table_S3]. Amino acid similarity in pairwise comparisons among them ranged from 50.7% to 66.1%.
Three different medaka Epd sequences (Medaka_LW1
, and Medak_MERP
) were found and mapped to different scaffolds in the draft assembly of the medaka (Oryzias latipes
) HdrR genome [see additional file 4
: Table_S3]. Medaka_LW1 and Medaka_LW2 Epd proteins were 80% similar and only 46.5% identical, and each one of them differed from Medak_MERP by 41.3% and 43.11%, respectively.
Two Ciona epd
domain-containing genes (Ciona_Tun1, Ciona_Tun2
) were predicted from the Ciona intestinalis
genome assembly [see additional file 4
: Table_S3]. Although the two Ciona Epds were located on the same chromosome (12q), they were mapped to different regions and each one was encoded by several non-overlapping ESTs. The identity between the Ciona
Epds was only 33%, providing further support for being different genes.
We were able to map only one epd
gene for the frog (Xenopus tropicalis
), chicken, human, mouse, rat, and chimpanzee genomes [see additional file 4
: Table_S3]. Although two different mouse Epd proteins were available on the databases, our analyses indicated that they may not be two different genes [see additional file 2
: Additional_Text for details on this issue]. We doubt that the Epdr1_Mus
] is a mouse gene, but we included this sequence as well as the reliable mouse sequence Epdr2_Mus
] in subsequent analyses since they are both currently acknowledged in GenBank as separate mouse epd
Phylogenetic distribution and relationships of ependymins
Since previously published phylogenies included only 25 sequences expressed in fish brain [6
], a few sequences from mammals, one from an amphibian [7
], and three from echinoderms [12
], our dataset of 74 family members represents a substantially more comprehensive sample of Epd diversity. Our results using various methods for phylogenetic inference (Bayesian, Maximum Likelihood, Neighbor-Joining, and Maximum Parsimony) showed essentially the same topology (Figure ). The sole difference was that with maximum parsimony, several unresolved polytomies were obtained in the more derived branches within each main clade (data not shown). We note that bootstrap values and posterior probabilities were low for certain branches. However, an assumption for the bootstrapping method is even distribution of the phylogenetic signal throughout the data set [37
]. In gene families like epd
, with complete conservation of some sites and large divergence in others across the phylogeny, this assumption is not met. Thus, high bootstrap values are not always expected to be obtained for many nodes, because the sites supporting the existence of the main clades may differ from the sites that are useful for resolving the relationships among more derived groups [38
Figure 4 Phylogenetic analysis of the ependymin protein family. The results from a Maximum likelihood (ML) bootstrap analysis are shown above the branches, whereas the values below the branches result from a Neighbor-Joining (NJ) bootstrap analysis. The dashed (more ...)
Four clade groups could be inferred from our phylogenetic analyses (Figure ): a first group that we have called "FishBrain" is composed of Epd proteins specifically expressed in teleost fish brain. The original Epd sequences were located in this group which has been the best studied group with more than 30 member sequences. In fact, most of the available Epd sequences from fishes came from an extensive study [6
] which used brain tissue to clone orthologues of goldfish epd
in other teleost fishes.
A second group that we named "FishTj" has remained unnoticed until now. This group is only present in fishes, but unlike the FishBrain group, its expression is not restricted to the brain. This group was represented by twelve Epd sequences that we placed as a putative sister clade to the FishBrain group (Figure ). The FishTj group was composed of complete genes obtained from genome and ESTs sequencing projects in which the tissue was not brain derived. In fact, recently the EST sequences Salmo_Tj
from Salmo salar
from the medaka fish Oryzias latipes
were shown to be upregulated in immunologically challenged liver: in the case of the salmonid sequence, after being exposed to the pathogen Aeromonas salmonicida
], and in the case of medaka, after being exposed to the aryl hydrocarbon receptor agonist TCDD [40
]. Thus, from the response displayed by these animals under toxic stress, we suggest that these FishTj Epds might be involved in repair following hepatic injury.
All the teleost fish species that have an epd gene in the FishTj group also have an unlinked epd copy in the FishBrain group (Figure ). The only exception to this observation is the medaka where no brain-specific Epd protein was found, probably due to incomplete sequencing or assembling. In our results, the bootstrap value and posterior probability that support the FishTj group as a monophyletic clade are low, but the cohesion among the FishTj proteins is graphically strong. We predict that as more piscine epd sequences isolated from tissues other than brain become available, the support value for this group will raise dramatically.
It is interesting to note the position of the shark Epd sequence [Shark_Squa] (Figure ). With the current information, this sequence was placed as the putative root of a big clade containing sequences only present in fishes that became divided into the FishBrain and FishTj paralog groups. Since this part of the tree resembles the duplication topology described by Meyer and colleagues [18
], it could reflect the whole genome duplication postulated to have occurred during the natural history of teleost fishes [41
]. Thus, it will be important to determine if the shark position remains unchanged after the addition of more sequences.
The monophyly of the third group that can be recognized from the tree (Figure ) was highly supported. This group was composed of protein sequences isolated from echinoderms, amphibians, birds, mammals, an Elasmobranchii [Raja_erina], and four sequences from teleost fishes [Danio_MERP, Medak_MERP, Fugu_MERP, and Tetrao_MEL]. Since the first acknowledged member of this group was the human gene epdr1
formerly called MERP1
], we have named this clade the "MERPs" group. According to published experimental evidence [7
], and to information deposited in the databases, these sequences were isolated from a variety of tissue sources, including but not restricted to brain, heart, skeletal muscle, prostate, kidney, liver, small intestine, colon, spleen and gonads in human and mouse, and in intestine, esophagus, mesenteries, gonads, respiratory trees, and tentacles in echinoderms. Clearly, the expression pattern of the Epd proteins in the MERPs group is not tissue-specific.
Although statistical support was poor (Figure ), we also postulate the existence of a fourth Epd group that includes the evolutionary more basal species. This group, that we named "Basal", clusters epd
genes from protostomes (mollusks) and deuterostomes (tunicates and amphioxus). Since all Basal group Epds are from invertebrates, we expected echinoderm sequences to fall within this group as well. Therefore, we performed the SH [42
], KH and RELL [43
] tests to compare the expected placement of echinoderm sequences within the Basal group as opposed to their placement inside the MERPs group observed in the inferred tree topology (Figure ). However, all tests indicated significantly better support (p
< 0.0001) for their assignment to the MERPs group. A notable misplacement in the Basal group was the location of the epd
gene from oysters as sister to the amphioxus epd
gene; instead of being sister group of the other mollusks: Aplysia californica
[Aplysia] and Biomphalaria glabrata
[Biomphala]. This misplacement (Figure ) did not fit the data significantly better than the expected phylogenetic placement of the oysters as sisters of the other mollusks (p
KH = 0.107, p
SH = 0.112, p
RELL = 0.101).
Interestingly, the epd
sequence isolated from the oyster Crassostrea gigas
[Oyster_Cgi] has been recently shown to be up-regulated in the digestive gland of these animals after a week of exposure to hydrocarbon contamination [44
]. Since the digestive gland, similar to the liver, is known to plays major roles in metabolism and detoxification, this may imply that Epds in mollusks and vertebrates have conserved functions.
We expected that within each Epd paralog group, the species gene tree would be obtained after phylogenetic reconstruction. This was achieved for the FishBrain group where the branching pattern agreed considerably with previous studies focused on the phylogenetic relationship of teleost fishes [6
]. In spite of the awkward, but statistically well supported positioning of the Epd sequences from echinoderms, the overall branching pattern inside the MERPs group also agreed satisfactorily with the expected species tree. However sampling bias against the metazoan groups that may carry genes belonging to any of the previously unidentified Epd groups (FishTj and Basal) appears to be the main impediment to obtain good statistical support for their monophyly and resolve the species relationships among the proteins inside each group. Thus, our reconstruction of the Epd phylogeny strongly agrees with the 2R hypothesis [41
], providing a good example for the two rounds of genome duplications proposed to have occurred early in the vertebrate lineage; all exceptions found in the inferred topology are highlighted by low support values.
Descriptive analysis of the ependymin protein family subgroups
To analyze the particular characteristics of each Epd group we selected only complete sequences with the initial methionine and the final stop codon. These included 13 Epds from the FishBrain group, 11 from the MERPs, 9 from the FishTj, and 6 from the Basal. Table shows the major findings of these comparisons. If we use the Basal group to represent the putative original molecule it serves as a point of comparison on how the other group molecules have diverged. We are conscious that this Basal group contains species from three different phyla that might be quite distant from one another. Nonetheless, if we do the same analysis using only the two mollusk complete sequences, the variability they show between them is similar to that when compared with the other two phyla, thus essentially the same results are obtained.
Quantitative survey of ependymin protein features (Average ± SE)
When compared to the Basal group, Epds from the other three groups show an increase in size. This increase is particularly evident in the MERPs with an average increase in size of 25 residues (or about 13%) over the Basal group. Our sequence comparisons show that it is in terms of the amino acid composition that the groups show highly significant differences. The predicted isoelectric point of the Basal group varies from 5 to 6.45. However, the FishBrain Epds show an acidic isoelectric point of around 5.1, while the isoelectric point of MERPs and FishTj Epds is around 6.5. The different isoelectric points are mainly due to a larger number of acidic residues and a decrease in basic residues found in FishBrain Epds, making the ratio of acidic to basic residues almost double that of other Epds.
Other differences among the Epd groups are noticeable. FishTj Epds have a significantly higher aliphatic index than the other groups. However, at the amino acid level, it is surprising that the number of Phe residues is almost halved in the MERPs in comparison to other groups. Similarly, the number of Trp residues in the molecules shows significant differences among the groups. The Basal species and the FishBrain molecules have an average of 2 Trps in their sequences, which increases to 3.2 in FishTj and to 6.6 in MERPs; this latter change is astonishing given that Trp is the largest and rarest amino acid. A similar trend is observed with the number of Pro residues, although the number only increases 50% between the Basal group and the MERPs. An additional comparative analysis of the predicted amino acid modifications in Epd proteins, including N-myristoylation, N-glycosylation and phosphorylation sites is available [see additional file 2
Apart from the amino acid residues that characterize the protein family (Figure ); there are several amino acid features that typify each one Epd subgroup. These group-specific signatures were revealed using sequence logos generated from sub-alignments containing only the members of each group (Figure ) [see additional file 2
: Additional_Text]. This analysis clearly shows a divergent pattern of amino acid usage and conservation among the Epd family subgroups. This pattern suggests differences in selective constraints, likely arising from divergence in structural and functional aspects of the proteins phenotype.
Figure 5 Comparative analysis of the four ependymin protein family groups. WebLogos were created from alignments that only included the sequences belonging to each ependymin group: (A) FishBrain – 33 sequences. (B) FishTj – 12 sequences. (C) MERPs (more ...)
Analysis of selective pressures acting over the ependymin protein family
The fact that some Basal group sequences were found in protostomes (mollusks), lead us to suggest that this group represents the evolutionary origin of the Epd molecules. Alternatively, it may indicate that the Epd protein evolved in an ancestor of the protostomes and deuterostomes, but expanded into a gene family only within the deuterostomes. Under either scenario, the MERPs, FishBrain and FishTj groups represent the more derived members of the Epd protein family.
Different selective pressures acting over each Epd paralog group could have favored fixation of different sequences in each metazoan genome. To investigate the divergence of Epd groups, we used the rate of nonsynonymous (dN
) and synonymous (dS
) nucleotide substitution ratio (ω
), as implemented in codon models of molecular evolution [47
]. In this statistical approach, an ω
< 1 indicates the action of purifying selection (i.e., a selective constraint against mutations that negatively impact the function of the protein); ω
= 1 is consistent with neutral evolution, and ω
> 1 indicates positive Darwinian selection (i.e., favoring the fixation of beneficial amino acid changes) [49
We used the likelihood ratio test (LRT) statistic to determine if the selective pressure is significantly different between postduplication (PD) and postspeciation (PS) branches in the Epd phylogeny (Figure ). We fitted two different PD-PS models to our data (see Methods) and contrasted each PD-PS model against a one-ratio model (M0) that assumed that PD branches as well as PS branches were subjected to the same selective pressure. The LRTs gave significantly higher support (p
≤ 0.0003) to all the tested PD-PS models over the M0 model [see additional file 5
: Table_S4]. Estimates of PD-PS model parameters suggest that (i) just after the duplication events, the fixation of amino acid changes increased in PD branches, and (ii) the rate of amino acid evolution decreased in PS branches, presumably due to more stringent levels of purifying selection. For instance, estimates under the Mps1
model were: ω(PD)
= 1.139, ω(PS)
= 0.161 and background ωb
ratio = 0.056. In both tested PD-PS models dN
values averaged 0.089 and dS
values averaged 0.741.
Furthermore, our branch based analyses suggest that each Epd paralog group has been subjected to a different selective pressure during their natural history. The paralog models (Mp1
, and Mp3
; see Methods) that allow for paralog-specific differences in selection pressure provided a significantly better explanation of the data (p < 0.0001, [see additional file 5
: Table_S4]) than did the one-ratio model (M0) assuming no differences in selection pressure among Epd paralog groups.
Since each Epd group has particular amino acid features that differentiate it from other paralogs (Figure ); we also expected the selective pressure to vary among sites and among Epd paralog groups. Therefore, we applied site-models of codon evolution to evaluate this scenario using each of the Epd groups (FishBrain, FishTj, MERPs and Basal) as separate data sets. When we applied the one-ratio site-model (M0) to each sub-dataset, the estimates of the ω
ratio averaged over all sites for each paralog group (i.e., ωMERPs
= 0.115, ωFishBrain
= 0.182, ωFishTj
= 0.195, ωBasal
= 0.051) were all consistent with the ω
estimates previously obtained under the Mp1 Paralog model [see additional file 5
: Table_S4]. However, a LRT contrasting these separate analyses under M0 to M3, which allows among-site variation in the selective pressure, revealed significant heterogeneity in selective pressures within each member of the Epd family (p
< 0.0001, [see additional file 5
: Table_S4]). We note that this is not an unexpected result, as genes encoding functional protein products typically exhibit significant variation in selection pressure among sites. In addition, all Epd paralogs had a class of sites subjected to a rather strong purifying selection (ωo
ranging form 0.003 to 0.014) but the fraction of such sites varies widely among paralogs (from 10% to 30%, [see additional file 5
Signature residues for the entire Epd family (Figure ), such as the four strictly conserved Cys, are predicted to be critical for the common biological function of all Epd genes. In this context, we expected such residues to be localized in codon sites subjected to very strong purifying selection against nonsynonymous changes. When we applied the codon site models to the data set comprised of 70 Epd proteins (see Methods), we found that a LRT contrasting the site-models M0 and M3 was highly significant (p
< 0.0001, [see additional file 5
: Table_S4]), providing support for considerable variability in the selective pressures acting within the member genes of the Epd family. When we plotted the approximate posterior mean of the ω
ratio at each codon site of the whole Epd protein family (Figure ), the amino acids present in the most evolutionarily constrained positions with a ω
≤ 0.04 were: C45
. For all these residues except for E98
, the finding of strong evolutionary constraint is corroborated by our previous comparative analysis of sequence logos (Figure and Figure ). The E98
site appears to be evolutionarily selected for acid residues in the more derived members of the Epd protein family; since this site is occupied by a Glu in all the members of the MERPs group, by an Asp in all the members of the FishBrain and FishTj groups. But in the Basal group this position is occupied by the aliphatic amino acid Leu.
The analysis of selective pressure acting in average over all sites of the entire data set and sub-datasets [see additional file 5
: Table_S4], suggest that purifying selection (with among-site variability) has been the main influence on the evolution of the entire Epd protein family (ω0
= 0.1553) and their sub-groups (ωMERPs
= 0.115, ωFishBrain
= 0.182, ωFishTj
= 0.195, ωBasal
= 0.051). Moreover, none of the site-specific codon models (suitable for detecting adaptive evolution) implemented to the datasets, suggested the action of positive selection when the appropriate LRTs were performed (M1a vs. M2a, M7 vs. M8. data not shown). However, these site models averaged the selective pressure over all sites of the whole phylogeny or subgroups, and might have failed to detect short episodes of positive selection taking place over a few amino acid sites after a duplication event. Therefore, we implemented the branch-site models A and B [50
] to detect if positive selection was driving the evolution of some sites along specific branches of the Epd phylogeny. These models let the ω
ratio vary among sites and among lineages. We performed the Test 2 or "branch-site test of positive selection" [51
] (see Methods) contrasting the model A against itself with ω2
fixed to 1 for each PD branch as defined in Figure (FishBrain, FishTj, or MERPs). We found evidence of episodic adaptive evolution acting along the MERPs branch (p
< 0.0001; [see additional file 5
: Table_S4]). Five sites (86, 128, 196, 224 and 231 -numbered according to the WebLogo on Figure ) had a high posterior probability (> 0.95) of being positively selected considering the Bayes Empirical Bayes (BEB) method as implemented on model A [51
]. The site 86 is occupied in the MERPs group by basic amino acids with positively charged side chains (Arg, Lys, or His). In the FishBrain group this position is occupied only by polar amino acids being the more common Asn, Asp and Ser. Quite the opposite is observed for this position in the FishTj group, in which this site is occupied only by hydrophobic amino acids. However, in the Basal group this site is very variable and can be occupied by either polar or hydrophobic amino acids. The site 128 is strictly occupied by the aromatic amino acid Trp in all members of the MERPs group; it is preferentially occupied by aromatic residues (Phe, Tyr or Trp) in the FishTj and Basal groups. In contrast, in the FishBrain group this site is highly variable and usually occupied by Arg and Lys. The same analysis can be done for the sites 196, 224 and 231 that are preferentially occupied in the MERPs group by the hydroxylic amino acids Thr (sites 196 and 231) and Ser (site 224); while these sites are highly variable in the other paralog groups, being occupied mostly by non-hydroxylic polar residues.
Along the FishTj branch, we also obtained significant evidence for positive selection according to the Test 2 (p
= 0.0233; [see additional file 5
: Table_S4]), but only the amino acid site 144 was predicted with the BEB method under Model A. In the FishTj group, this site is occupied by an aliphatic amino acid (Val or Ile). In contrast, in the FishBrain group an aromatic residue (either Tyr or Phe) is usually present, except in the Fugu_Brain and the Tetraod_Br sequences that have a Ser. Interestingly, in the MERPs group, all the mammalian sequences have a Ser at position 144, but all other taxa within this group (i.e., fishes, amphibians, birds and echinoderms) have aromatic residues. Position 144 is highly variable in the Basal group and shows no clear pattern of amino acid replacement.
In contrast to the MERPs and FishTj groups, the evidence for positive selection affecting the FishBrain lineage was not as clear (p
= 0.0747; [see additional file 5
: Table_S4]), and can be considered only to be marginal support for adaptive selection or relaxed selective constraints at three sites (51, 126 and 186). In the FishBrain group, the residue in position 51 is almost always occupied by the hydroxylic amino acid Thr, except in the sequences Rhamphicht and 2Carassius which have Ile in this site. This position is occupied by non-polar amino acids in all other groups, and is always Trp in the MERPs group. The site 126 in the FishBrain group can be occupied by polar (Tyr, Ser or Cys) and non-polar amino acids (Phe); but in the MERPs group this position is only occupied by polar amino acids such as Gln, and specially the acidic amino acids Glu and Asp. The amino acid composition of this site for the FishTj and Basal groups is highly variable, without a clear pattern. Position 186 is one of four contiguous amino acids conserved in all FishBrain and FishTj groups (but not in the fish sequences belonging to the MERPs group), and in the sequences Sea_cucumb and Ciona_Tun2. These amino acids appear to be specifically inserted in FishBrain and FishTj sequences and later on by convergence, the sequences from echinoderms and one of the Ciona epd
genes might have acquired it. Alternatively, these four amino acids could have been lost specifically in the ancestral MERP sequence and also in certain Basal groups, with exception of the echinoderms and one of the Ciona
Almost all the available sequences from the FishBrain group have a Cys in position 186, which is one amino acid before the third common Cys (C188
). Only in Percomorpha and Salmoniformes is this Cys substituted by Gly. These cysteines, which are very close together in the Epd linear structure, are predicted by the Disulfind server [30
] to form disulfide bonds. The consistent occurrence of Gly, a small amino acid with a very high conformational flexibility, between the C186
would permit a disulfide bridge between them, since Gly does not present a steric obstacle. Alternatively, although with a lower confidence of connectivity, C186
may participate in shuffling reactions forming potentially an array of disulfide intermediate species as result of its binding with other cysteines [53
]. Only when the crystallographic structure of proteins belonging to the different Epd protein subgroups is obtained and careful functional assays are performed, we will be able to test if the different possible transition stages of disulfide bonding are having an impact on the functional divergence among Epds paralogs.
Detection of functional divergence among ependymin paralogs
Detectable differences in the site-rate of amino acid replacement between Epd paralog groups can give us an idea of the grade of functional divergence generated since the duplicated genes diverged, splitting ancestral functions or generating new ones, and consequently succeeded avoiding pseudogenization. The coefficient of evolutionary functional divergence (θ) obtained for each Epd paralog pair comparison was significantly greater than zero (Table ), indicating that there is significant heterogeneity in the amino acid site-specific rate of evolution among Epd paralogs. This result further supports the estimates obtained at the codon level by applying Paralog models and site-models to separate datasets of the paralogs (see previous section). That is, that each Epd group has been subjected to different functional constraints in specific amino acid sites and therefore, functional divergence among them can be inferred.
Maximum likelihood estimates of the coefficient of functional divergence (θ) from pairwise comparisons between ependymin groups
The six possible pairwise comparisons among Epd paralog groups were performed (Table ). The comparison between the FishBrain and MERPs groups showed the highest value for θ
(0.85 ± 0.12), suggesting that these two groups have diverged considerably more at the functional level. This estimate is supported by the fact that the expression pattern of these proteins is the more dissimilar, being the proteins belonging to the FishBrain group expressed exclusively in brain tissue [13
], and the ones from the MERPs group expressed in several tissues including brain [7
]. High θ
values are also found for each one of the pairwise comparisons that can be done against the Basal group, suggesting that if the proteins belonging to the Basal group are in fact the root of the epd
gene tree, the derived groups have significantly diverged functionally from the ancestral gene function, expression pattern or both. This measure of functional divergence among the groups (FishBrain, FishTj and MERPs) that were contrasted with the Basal group was not equidistant: the predicted functional divergence between the FishTj group and the Basal group (θ
= 0.81 ± 0.15) is higher than that for FishBrain vs. Basal (θ
= 0.63 ± 0.13) or MERPs vs. Basal (θ
= 0.70 ± 0.21). As could be expected from the gene tree (Figure ), the smallest θ
value (being however also significantly greater than zero) was obtained for the FishBrain vs. FishTj comparison (θ
= 0.35 ± 0.09). The inferred phylogenetic gene tree suggests us that the duplication that gave rise to the FishBrain and FishTj paralog epd
genes was posterior to the appearance of the MERPs group. Thus, the FishTj and FishBrain shared more time together and although the FishBrain group already has a very restricted pattern of expression that contrasts with the ample array of tissues from which FishTj genes have been isolated, we might suppose that some functional overlap may remain. Of course, it will be interesting to corroborate this assumption with experimental data where knocking out one of the genes is compensated (at least partially) by the other. Interestingly, the estimated coefficient of functional divergence between the FishTj and the MERPs group (θ
= 0.52 ± 0.09) is not as high as we could suppose from the gene tree, although this would suggest different functional roles it can not exclude that some of these functions overlap.
Posterior Bayesian analysis predicted several amino acid positions that may account for the inferred functional divergence among the Epd paralog groups (Figure ). Even though, sites with an estimated posterior probability (PP) < 0.8 have been experimentally proved to be important for the observed functional divergence between the two major Caspase subfamilies [54
]; the cutoff value for residue selection is an empirical decision and is expected to depend on the intrinsic properties of the protein family being analyzed. Thus, while in [54
] they obtained 21 candidate functional divergence-related sites using 0.61 as cutoff value, the same cutoff value predicts 53 sites for the Epd data. Since no 3D structure of any Epd protein is yet available, we lack a way to verify where these sites would be located nor how the rate-shift in these sites contributed to structural and functional divergence among the Epd paralogs. Nonetheless, we can further narrow our criteria for site prediction expecting that in the case of the Epd data, sites predicted with a more stringent PP (i.e., > 0.9) be in fact functional divergence-related sites that can serve as a discrete starting point for future functional characterization of the Epd proteins. Twenty-three rate-shifted amino acid sites (8.8% of total sites) were predicted with a PP > 0.9 of being functional divergence-related sites for any of the six possible pairwise comparisons among Epd paralog groups (Figure ). The rate-shifted site 161 (the amino acid after the common Trp of the Epds -see Figure ), for example, is predicted with a PP = 0.99 to be able to functionally distinguish the members of the FishBrain from the members of the MERPs group, with a PP = 0.93 to distinguish the FishTj from the MERPs group, and with intermediate PPs to distinguish FishBrain from Basal (PP = 0.68), FishTj from Basal (PP = 0.79) and MERPs from Basal (PP = 0.82); but this same site only has a PP = 0.14 to discriminate FishBrain from FishTj proteins. When this site was localized in the subgroups alignments (site indicated by an arrow in the Figure ), it is clear that being highly variable, the site is not useful for discriminating between FishBrain and FishTj proteins. In contrast, the MERPs Epd proteins have a Ser fixed in this position which is within the stretches of amino acids diagnostic of this group (QEWS
DR--aR--E-WXGxyT, underlined in the MERPs WebLogo, Figure ). In addition, in the Basal group site 161 is occupied with almost equal occurrence by Ser, Asn, His, Tyr, or Arg. Consequently, once the 3D structure of Epd proteins from the differences subgroups are available, and site-directed functional experiments can be done, we expect to be able to corroborate that the rate-shifted site 161 is somehow important to the particular function of the MERPs Epds.
Figure 6 Prediction of functional divergence-related sites among ependymin protein groups. Site-specific profile showing the amino acids predicted to have a posterior probability [P(S1|X) > 0.90] of being functional divergence-related sites in any of the (more ...)
A similar analysis can be performed for all the other 22 rate-shifted predicted sites. It is worth noting that sites 128 and 144, predicted to be under positive selection in the MERPs and FishTj lineages respectively by using branch-site codon models (see previous section), were also predicted to be functional divergence-related sites using this very different approach. As previously noted, the site 128 is preferentially occupied by aromatic residues in the FishTj, Basal and MERPs groups, being always occupied by a Trp in the latter; but in the FishBrain group this site is very variable. This explains why this site has a PP = 0.90 of being related with the type 1 functional divergence between the MERPs and FishBrain groups, and has intermediate PPs for the other possible comparisons: FishBrain vs. FishTj (PP = 0.72), FishBrain vs. Basal (PP = 0.77), FishTj vs. Basal (PP = 0.74), MERPs vs. Basal (PP = 0.63) and FishTj vs. MERPs (PP = 0.43). It is also noteworthy that the presence of aromatic amino acids at this site can distinguish the Epd proteins that are exclusively expressed in the brain, from the proteins that have a wider expression pattern. Additionally, the presence of a Trp at this site appears to be important for the function of the Epds belonging to the MERPs group as suggested by the convergent results obtained using codon models and the search for type 1 functional divergence related sites (see Methods).
Position 144 was predicted to be under positive selection and additionally to be related to functional divergence. This site has a PP = 0.92 for discriminating between the FishBrain and the MERPs groups and with less confidence can distinguish between other groups: FishTj vs. Basal (PP = 0.78), FishTj vs. Brain (PP = 0.77), FishBrain vs. Basal (PP = 0.64), MERPs vs. Basal (PP = 0.61) and FishTj vs. MERPs (PP = 0.22). Contrary to what occurs with site 128, the relationship of aromatic amino acids and tissue specific expression appears to be inverse in site 144. Thus, the preferential occurrence of aromatic amino acids in the FishBrain group (particularly Tyr) appears to be important for the functional divergence of FishBrain group as compared with the others.