|Home | About | Journals | Submit | Contact Us | Français|
Conceived and designed the experiments: AL. Performed the experiments: CAC AB AL. Analyzed the data: CAC AB AL. Contributed reagents/materials/analysis tools: CAC AB AL. Wrote the paper: CAC AB AL.
The evolution of oxygenic photosynthesis during Precambrian times entailed the diversification of strategies minimizing reactive oxygen species-associated damage. Four families of oxygen-carrier proteins (hemoglobin, hemerythrin and the two non-homologous families of arthropodan and molluscan hemocyanins) are known to have evolved independently the capacity to bind oxygen reversibly, providing cells with strategies to cope with the evolutionary pressure of oxygen accumulation. Oxygen-binding hemerythrin was first studied in marine invertebrates but further research has made it clear that it is present in the three domains of life, strongly suggesting that its origin predated the emergence of eukaryotes.
Oxygen-binding hemerythrins are a monophyletic sub-group of the hemerythrin/HHE (histidine, histidine, glutamic acid) cation-binding domain. Oxygen-binding hemerythrin homologs were unambiguously identified in 367/2236 bacterial, 21/150 archaeal and 4/135 eukaryotic genomes. Overall, oxygen-binding hemerythrin homologues were found in the same proportion as single-domain and as long protein sequences. The associated functions of protein domains in long hemerythrin sequences can be classified in three major groups: signal transduction, phosphorelay response regulation, and protein binding. This suggests that in many organisms the reversible oxygen-binding capacity was incorporated in signaling pathways. A maximum-likelihood tree of oxygen-binding hemerythrin homologues revealed a complex evolutionary history in which lateral gene transfer, duplications and gene losses appear to have played an important role.
Hemerythrin is an ancient protein domain with a complex evolutionary history. The distinctive iron-binding coordination site of oxygen-binding hemerythrins evolved first in prokaryotes, very likely prior to the divergence of Firmicutes and Proteobacteria, and spread into many bacterial, archaeal and eukaryotic species. The later evolution of the oxygen-binding hemerythrin domain in both prokaryotes and eukaryotes led to a wide variety of functions, ranging from protection against oxidative damage in anaerobic and microaerophilic organisms, to oxygen supplying to particular enzymes and pathways in aerobic and facultative species.
Before the evolution of oxygenic photosynthesizers, sources of free oxygen were scarce . Free molecular oxygen constitutes 21% of present-day terrestrial atmosphere and its main source is, essentially, oxygenic photosynthesis. Accumulation of free atmospheric oxygen during the Precambrian [1–3] is, undoubtedly, one of the major changes in the history of the planet and may be considered the most significant biogeochemical process after the origin of life itself. Oxygen-dependent metabolism evolved first in bacteria and is pervasive in contemporary eukaryotes.
The evolution of metazoans was constrained by the oxygen requirements of tissues [4–6]. Therefore, oxygen-carrier proteins that maintain a continuous delivery of oxygen while avoiding autoxidation as well as the formation and accumulation of reactive oxygen species  became essential for the development of animals. Four evolutionarily unrelated families of oxygen-carrier proteins are known: hemoglobin, hemerythrin and the two non-homologous families of molluscan and arthropodan hemocyanins. Such diverse assortment can only be understood in terms of the selective pressure imposed on the biosphere by free oxygen.
Hemerythrin is a small 118 amino acid protein classically studied in four metazoan phyla: Sipuncula, Brachiopoda, Priapulida and Annelida [7,8]; it has also been identified in other eukaryotes as well as in bacterial and archaeal genomes [9–12]. However, the molecular evolution of hemerythrin and its relationship to the geochemical history of the Earth has been largely ignored. In this work, we studied the phylogenetic distribution of hemerythrin-like sequences in 2521 completely sequenced bacterial, archaeal and eukaryotic genomes, and correlated the possible evolutionary scenarios with what is currently known about the structure and function of the hemerythrin domain in both prokaryotes and eukaryotes.
Oxygen-binding hemerythrin (O2-binding Hr) homologs were identified based on the statistical significance of aligning O2-binding Hr sequences from two annelid species, Phascolopsis gouldii  and Themiste hennahi , with non-mutated protein sequences annotated as hemerythrin or hemerythrin-like proteins in the PDB. The sequences of two O2-binding Hrs, from the proteobacteria Methylococcus capsulatus  and Desulfovibrio vulgaris , were statistically similar to the annelid sequences (Table 1), and were used as queries in subsequent sequence similarity searches. In contrast, the N-terminal domain of the human F-box and leucine-rich repeat protein 5 (FBXL5), which possess an iron-coordination site that does not bind oxygen [17–19], was not significantly similar to any of the annelid hemerythrin sequences used here as reference (Table 1). We constructed a profile Hidden Markov Model (S1 Fig) with 148 nodes, exclusively based on O2-binding Hr homologues, which constitute a divergent monophyletic sub-group of the hemerythrin/HHE (histidine, histidine, glutamic acid) cation-binding domain (Fig 1).
O2-binding Hr sequences were identified in 367/2236 bacterial, 21/150 archaeal and 4/135 eukaryotic genomes (Fig 2). Sequences with subject coverage lower than 85% were considered long sequences and the presence of additional protein domains was investigated using the Pfam-A database (Figs (Figs33 and and4).4). O2-binding Hr homologues were found in the same proportion as single-domain and as long protein sequences in archaeal, bacterial and eukaryotic genomes overall. A total of 56 different domain architectures were identified in long O2-binding Hr sequences (Fig 4 and S2 Fig). The largest group of long O2-binding Hr sequence architectures (56.9%) contained N-terminal or C-terminal regions of variable size with no clear homologs in Pfam-A. The most frequent location of the O2-binding Hr domain in long protein sequences is the protein termini (117 proteins at the C-terminus, 52 proteins at the N-terminus), suggesting a later incorporation by gene fusion events. The most frequent architectures were: 1) N-terminal methyl accepting chemotaxis protein domain (MCP signal) with a C-terminal O2-binding Hr domain (33 protein sequences); 2) N-terminal O2-binding Hr domain with a C-terminal diguanylate cyclase (GGDEF) domain (22 protein sequences); and 3) N-terminal histidine kinases, adenyl cyclases, methyl-accepting proteins and phosphatases (HAMP) domain with a MCP signal and a C-terminal O2-binding Hr domain (14 protein sequences). The Gene Ontology terms associated to the Pfam-A domains identified corresponded to: 1) signal transduction (43%); 2) phosphorelay response regulation (7%); and 3) protein binding (6%).
Of the 2521 cellular genomes that were analyzed in this work, a total of 392 (367 bacterial, 21 archaeal and 4 eukaryotic genomes) encoded for O2-binding Hr sequences (Fig 2). The number of O2-binding Hr copies in a genome varies widely, in particular, species of Spirochaetes, Chlorobi, Acidobacteria, Firmicutes-Clostridia, and α-, β- and δ- Proteobacteria present several copies of single-domain O2-binding Hr sequences (Fig 3). The greatest number of single-domain O2-binding Hr copies was found in Magnetospirillum magneticum AMB-1, a facultative α-Proteobacteria encoding 15 paralogous sequences (Fig 3).
The sample studied here included archaeal genomes only from the Crenarchaeota and Euryarchaeota phyla due to an under-representation of Archaea in genome databases and the low agreement between the databases consulted (See the Methods section). Within Crenarchaeota, O2-binding Hr was found only in five species of the order Desulfurococcales within long protein sequences. In Euryarchaeota, five species of the Methanomicrobia class contained single-domain O2-binding Hr, while six species of the Methanococci class had either single-domain O2-binding Hr or long sequences consisting of an O2-binding Hr domain with an AMP-binding domain or with a short orphan elongation. Eukaryotic O2-binding Hr sequences were found in Micromonas pusilla CCMP1545 and Acanthamoeba castellanii Neff only as single-domain sequences, and in Nematostella vectensis and Naegleria gruberi as both single-domain and long protein sequences. The profile Hidden Markov Model used in this work identified a negligible number of O2-binding Hr sequences in eight bacterial phyla (Thermotogae, Actinobacteria, Tenericutes, Chlamydiae, Planctomycetes, Bacteroidetes and Fusobacteria) and in the class Bacilli of Firmicutes (Fig 2). With the exception of Fusobacteria and Tenericutes, the homologous hemerythrin/HHE cation-binding domain was found in species from the phylogenetic groups where O2-binding Hr was not present (S3 Fig). In early-branching Bacteria (Aquificae, Chloroflexi, Deinococcus-Thermus and Cyanobacteria) and Firmicutes, single-domain O2-binding Hr accounted for 91.6% of the sequences identified. In contrast, in later bacterial groups (Acidobacteria, Chlorobi, Spirochaetes, and Proteobacteria) only 31.9% of the O2-binding Hr homologues were single-domain sequences.
As shown in Fig 2, the sequences of single-domain HHE cation-binding hemerythrin and single-domain O2-binding Hr, which we have analyzed in this work, are the outcome of an ancient gene duplication predating the explosive divergence of Bacteria during Precambrian times. The maximum-likelihood tree in Fig 2 also shows that the single-domain O2-binding Hr is a well-defined, monophyletic, highly divergent group distinct from the hemerytrin/HHE cation-binding domain.
The tree in Fig 5 shows two well-defined single-domain O2-binding Hr derived groups, a and b, formed by sequences encoded mostly by anaerobic species. The derived clade a includes sequences from Deinococcus-Thermus; Cyanobacteria; Firmicutes; Spirochaetes; and δ-Proteobacteria. The derived clade b is formed by sequences from Acidobacteria; Chlorobi; α-, β-, γ- and δ-Proteobacteria, as well as from the archaeal Methanospirillum hungatei JF-1, where it is present most likely because of a lateral gene event. Sequences in both derived clusters diverged from an early gene duplication of the single-domain O2-binding Hr gene, and also appear to have been subject of lateral gene transfer, gene loss, gene duplication and orthologous replacement events.
There are several cases in which more than one O2-binding Hr sequence can be identified in a single cellular genome. This may be due to lateral gene transfer events or, most likely, to paralogous duplication events (Table 2).
The many cases in which highly similar single-domain O2-binding Hr copies are closely grouped in the same cluster in the phylogenetic tree suggest recent paralogous duplication events. For instance, Magnetospirillum magneticum AMB-1, a facultative proteobacteria, encodes for 15 single-domain O2-binding Hr gene copies, some of which are located in the same cluster, while others are found in distant clusters: amb4136 is represented by a single branch; amb4265, amb2654, amb4171, amb2239 and amb1415 are in cluster ε; amb1987, amb0226, amb1549 and amb2249, in cluster ι; amb3418, amb4296 and amb3966 in cluster η; amb0569 in cluster ζ; and amb1952 in cluster b1, indicating duplication events that occurred at different times during evolution of M. magneticum AMB-1 (Fig 5). Alternatively, the peculiarities of the distribution of one or more O2-binding Hr gene copies can be explained by lateral gene transfer events.
The overall topology of the single-domain O2-binding Hr phylogenetic tree is in clear disagreement with the 16/18S rRNA reference tree (Fig 5). This indicates that the significance of O2-binding Hr as a phylogenetic marker is hindered by a complex history of gene losses, gene duplications, paralogous replacements and lateral gene transfer events. The base of the tree is characterized by a highly branching pattern of single-domain O2-binding Hr sequences encoded by aerobic organisms including Bacteria, Archaea and Eukarya (Fig 5 and Table 3). The ample distribution of highly divergent O2-binding Hr sequences among aerobic organisms is probably best understood by the adaptive value of oxygen-binding proteins in a Precambrian environment that was becoming increasingly oxidizing as time went by.
Agreement with the topology of the species tree was observed mainly at the sub-group level of the O2-binding Hr sequence tree. For instance, as shown in Fig 5 and Table 3, sequences from Aquificae form a minor sub-group (sub-group α), suggesting that a single-domain O2-binding hemerythrin was present in their common ancestor. A comparable situation may be seen in the θ sub-group, which can be interpreted as the outcome of vertically inherited single-domain O2-binding Hr in species of Cyanobacteria subclass Oscillatoriophycideae. In the case of archaeal sequences, Methanococci and Methanomicrobia form two separate groups. Sequences from Methanococci, and bacterial sequences from Anaerolinea thermophila UNI-1, Persephonella marina EX-H1, Dechlorosoma suillum PS and Acidovorax ebreus TPSY form cluster μ. A distinct independent sub-group ν is formed by sequences from Methanomicrobia, and bacterial sequences from Melioribacter roseus P3M-2, Geobacter lovleyi SZ and Pelobacter carbinolicus DSM 2380. The topology of the single domain O2-binding Hr tree, together with the absence of single-domain O2-binding Hr in most archaeal groups, could indicate the acquisition by Methanomicrobia and Methanococci of the single-domain O2-binding Hr sequences in two independent events of lateral gene transfer. Eukaryotic sequences from Nematostella vectensis and Naegleria gruberi cluster together in sub-group λ.
The HHE cation-binding domain was first predicted by bioinformatics methods as a domain composed of two helical regions and a conserved HHE cation-binding site . The hemerythrin-like domain family is a repetition of the HHE cation-binding domain, which folds into an up-and-down bundle of four left-handed helices [14,23]. The molecular function of proteins containing the hemerythrin/HHE cation-binding domain, including metazoan oxygen-carrier hemerythrins, is often related to O2 or reactive oxygen species responses [10,16,24–29].
The search for O2-binding Hr domain homologs was performed using a manually curated profile Hidden Markov Model, and resulted in the identification of sequences of 86 to 2425 amino acids length, using a profile that contained 148 nodes (S1 Fig), highly weighting the position of the iron-coordinating amino acids in X-ray solved structures of O2-binding Hrs. Sequences with subject coverage lower than 85% were considered long sequences. To identify the presence of possible additional domains, we searched the protein profile database Pfam-A, which confirmed known additional domains in 37.2% of long sequences. In agreement with previous bioinformatics searches [9,11], the most frequent domain found in long O2-binding Hr sequences was the MCP signal domain (Fig 4). The characterization of the hemerythrin domain of the methyl-accepting chemotaxis protein dcrH from Desulfovibrio vulgaris has shown that the four-helix bundle fold and the amino acids of the active site are conserved . It has been proposed that the hemerythrin domain in this structure could have a role in signal transduction .
O2-binding Hr sequences were identified in two archaeal groups (Euryarchaeota and Crenarchaeota), ten bacterial groups (Aquificae, Deinococcus-Thermus, Cyanobacteria, Chloroflexi, Firmicutes, Acidobacteria, Chlorobi, Spirochaetes and Proteobacteria), and four eukaryotic species (Naegleria gruberi, Micromonas pusilla CCMP1545, Nematostella vectensis and Acanthamoeba castellanii). It is also known to be present in marine invertebrates (Sipuncula, Brachiopoda, Priapulida and Annelida) . By far, the highest number of O2-binding Hr sequences is found in Proteobacteria (Figs (Figs22 and and33).
Single-domain and long O2-binding Hr-containing sequences are not randomly distributed in the phylogenetic tree, but exhibit a differential distribution among taxonomic groups, particularly in Bacteria, where two separate groups can be distinguished. In the first bacterial group, that includes deep-branching bacteria and the Firmicutes-Clostridia clade, O2-binding Hr homologues are predominantly single-domain sequences (91.6%). The expression of the single-domain HerA hemerythrin in the microaerophilic bacteria Campylobacter jejuni  reduces its susceptibility to oxygen and hydrogen peroxide-mediated damage of two iron-sulphur cluster enzymes (pyruvate:acceptor oxydoreductase and 2-oxoglutarate:acceptor oxidoreductase). This suggest that single-domain O2-binding Hr may also be involved in the prevention of oxygen-mediated damage in microaerophilic aquificales and in anaerobic clostridia, and probably reflect an evolutionary adaptation to the presence of free molecular oxygen. The absence of O2-binding Hr homologues in Mollicutes and the class Bacilli of Firmicutes, two bacterial groups closely related to clostridial species, is probably due to secondary loss.
The second bacterial group includes Acidobacteria, Chlorobi, Spirochaetes and Proteobacteria. Within this group, long O2-binding Hr sequences (68.1%) are more frequent than single-domain (31.9%) O2-binding Hr. In Proteobacteria, which is the most diverse bacterial phylum, the study of the O2-binding Hr domain led to the identification of three different functions: a) as a 135-amino acid domain in a chemotactic protein of Desulfovubrio vulgaris ; b) as a single-domain protein McHr in Methylococcus capsulatus, that supplies oxygen to the membrane-bound methane monooxygenase ; and c) as a single-domain protein HerA in Campylobacter jejuni, which protects iron-sulphur cluster enzymes from oxidative damage . This exemplifies how the O2-binding Hr domain may have been coopted into specific physiological functions by different species, particularly in later-branching bacteria, where it was incorporated into a wide variety of sequences. In some cases, the additional sequences within long O2-binding Hr sequences are group-specific. For instance, long O2-binding Hr sequences from Acidobacteria and Chlorobi exhibit elongations that do not match domains in the Pfam-A database, and two architectures of the long O2-binding Hr sequences from Spirochaetes genomes contain a domain called Cache_1, which is an acronym for calcium channels and chemotaxis receptor.
Variations at the iron-coordinating amino acid residues in 106 single-domain O2-binding Hr sequences were observed (S4 Fig). This is quite evident in sequences from the Deinococcus-Thermus, Cyanobacteria, Firmicutes-Clostridia and from the α-, β- and δ- Proteobacteria clades. Amino acid substitutions may modify the reversible oxygen-binding ability of hemerythrins , or even the native metal-ion preference. Molecular characterization of the variants could clarify whether sequence variations at the iron-coordination site produce loss of function, neo-functionalization, or alternative iron-binding mechanisms.
Hemerythrin homologues have an ample biological distribution, and are present in the three domains of life. However, as shown here, their distribution is not universal, which is consistent with previous genomic searches, and may be explained by frequent gene losses [11,12]. Although the topology of the O2-binding Hr tree indicates intense horizontal gene transfers, gene duplications and differential gene loss (Fig 5), it is clear that O2-binding Hr is widely distributed in Clostridia and all five classes of Proteobacteria, suggesting that it is an ancient Precambrian trait that was already present before the divergence of Firmicutes and Proteobacteria. Alternatively, the evolution of O2-binding Hr may have occurred in a divergent phylogenetic group, and was horizontally transferred, through independent events, to the individual orders of Clostridia and/or Proteobacteria. This alternative scenario represents a less parsimonious explanation to the overall O2-binding Hr distribution. Instead, horizontal gene transfer could explain the cases where the presence of O2-binding Hr is circumscribed to a particular family or genus, as appears to be the case for archaean O2-binding Hr.
The four known evolutionary unrelated oxygen-carrier protein families (hemoglobin, hemerytrhin and the two non-homologous molluscan and arthropodan hemocyanins) represent a polyphyletic response to the selective pressure imposed by the accumulation of free oxygen during the Precambrian. Like other molecular mechanisms involved in the protection and repair of oxygen-induced damage that evolved during Precambrian times , the ample biological distribution of O2-binding Hr probably reflects its adaptive significance in an environment became increasingly oxidizing. The biological group where hemerythrin first evolved remains unknown, but the distribution of O2-binding Hr sequences suggest that its capacity to reversibly bind oxygen was exploited first by prokaryotic species to fulfill a range of different needs, ranging from protection against oxidative damage to oxygen supply to particular enzymes and pathways. More specifically, the available data suggests that it may have originated prior to the divergence of Firmicutes and Proteobacteria. Thereafter, the incorporation of the O2-binding Hr domain into preexisting proteins, combined with other mechanisms involved in the protection against oxidative damage, may have allowed a functional diversification of the protein repertoire particularly in the proteobacteria, one of the most diverse groups of bacteria. The evolution of O2-binding Hr sequences appears to parallel the evolution of strategies allowing the incorporation of oxygen to biological processes as a consequence of accumulation of free oxygen in the Precambrian oceans and atmosphere.
Eleven non-mutated hemerythrin and hemerythrin-like amino acid sequences retrieved from the Protein Data Bank (PDB) database  were compared to the hemerythrin sequence from Themiste hennahi (PDB codes 2MHR) and from Phascolopsis gouldii (PDB code 1I4Y) with the program PRSS version 36.3.8c (number of shuffles: 200; scoring matrix: Blosum50; open gap penalty: -10; extension gap penalty: -2) from UVa FASTA Server [34,35]. Expect value lower than 1e-3 of one-to-one alignments was used as cutoff for homologous sequences.
An initial BLAST search against the Kegg Genes database release 75.1[20,36] identified 365 protein sequences with expect value < 1e-5 and subject coverage > 95%. The sequences were aligned with MAFFT G-INS-I algorithm  with default parameters (gap opening penalty: 1.53; offset: 0.0). A profile Hidden Markov Model (pHMM) was constructed with HMMER 3.1b1 . The Pfam hemerythrin profile (http://pfam.xfam.org/family/hemerythrin)  has a low restriction to include sequences with changes at the iron coordination amino acids observed in oxygen-binding hemerythrins (S1 Fig). The distribution of the hemerythrin domain homologues found by the Pfam profile is shown in S3 Fig. The hand curated hemerythrin pHMM was scanned against 2521 genomes from the Kegg Genes database release 75.1 that were unequivocally traced to an entry of the SSU RefNR99 guide tree from the SILVA SSU database release 119  based on taxonomy identifiers from both databases.
Each sequence was scanned with hmmscan from the HMMER 3.1b1 software using the database of protein families Pfam-A version 28.0 . The hemerythrin profile in Pfam-A was substituted with the pHMM that we generated. Domain overlapping was solved with a pearl script preferring domains with lower expect values. Gene Ontology information on molecular function of Pfam domains was obtained from the Pfam web page .
The maximum-likelihood (ML) tree of single-domain oxygen-binding hemerythrin sequences and of hemerythrin HHE cation-binding domain sequences were constructed with PHYML . The parameters for the construction of the trees (Table 4) were automatically selected with SMS: Smart Model Selection using the Akaike Information Criterion at the South of France bioinformatics platform (http://www.atgc-montpellier.fr). Branch support was given by the approximate likelihood-ratio test (aLRT) for branches . Internal nodes with low support (< 0.6) were collapsed. Additional bootstrap support is provided by a 100-replicates as shown in S5 Fig. The multiple sequence alignments were constructed with MAFFT, L-INS-i algorithm using the default parameters (gap opening penalty: 1.53; offset: 0.0). The maximum-likelihood tree of hemerythrin HHE cation-binding domain sequences was midpoint-rooted; the maximum-likelihood of single-domain oxygen-binding hemerythrin sequences was rooted following the topology of the hemerythrin HHE cation-binding domain tree. Tree visualization was made with Interactive Tree Of Life V1.0 .
Logo representation of the multiple sequence alignment used as seed to calculate the profile Hidden Markov models. The vertical axis indicates the information content of a sequence position. The one-letter notation for amino acid sequences was used. Glutamic and aspartic acid (purple), histidine (cyan). (A) O2-binding Hr model. Positions of the iron-coordinating amino acids are indicated by arrows. Position and length of helical structures were predicted by Ali2D. (B) Pfam-A hemerythrin model.
Protein domains identified in long O2-binding hemerythrin sequences are designated by their short Pfam-A family name. The number of sequences showing a particular architecture is indicated after a tabular space.
Phylogenetic tree based on a small subunit rRNA guide tree containing only completely sequenced species. Bacterial and archaeal species are collapsed on the phylum level. Eukaryotic species are collapsed together. n: number of species within the collapsed branch. The red bar is proportional to the number of species with at least one hemerythrin HHE cation-binding domain sequence in each group. The total number of genomes with at least one HHE cation-binding domain sequence is indicated by a purple number next to the bar.
Claudia Alvarez is a doctoral student in the Programa de Doctorado en Ciencias Biomédicas, Universidad Nacional Autónoma de México (UNAM), and received fellowship 290175 from CONACYT. We are indebted to J. Peter Gogarten for insightful discussions and to Luis Montaño, Sara Islas and Ricardo Hernández for their help with the manuscript. Support from DGAPA-PAPIIT(IN223916) is gratefully acknowledged.
Claudia Alvarez is a doctoral student in the Programa de Doctorado en Ciencias Biomédicas, Universidad Nacional Autónoma de México (UNAM), and received fellowship 290175 from CONACYT.
All relevant data are within the paper and its Supporting Information files.