The ALDH superfamily shows considerable diversity among vertebrate genomes, with species in the current study showing between 14 and 25 putatively protein-encoding genes. Many of the gene duplications discussed here probably encode functional proteins. There are also a number of duplication events that give rise to non-functional pseudogenes. Names were assigned to the 'new genes' and 'pseudogenes' (Table ) according to the ALDH nomenclature system established in 1999 [14]. The species-specific nomenclature system was used for zebrafish genes [15]. Pseudogenes were also named according to the standardised protocol [20].
In the cow genome, ALDH1A3P1 resembles the product of a partial gene duplication event. The coding region would translate a peptide sharing 100 per cent sequence identity with the 127 carboxy-terminal AAs of the full-length parent gene. Such a high degree of sequence identity is suggestive of a relatively recent evolutionary duplication. Even if the truncated gene does encode the 127-AA peptide, however, that peptide lacks many highly conserved residues required for ALDH activity. Thus, the truncated peptide would probably be targeted for rapid degradation. As such, this gene represents a non-functional pseudogene and has been named accordingly.
ALDH1B1 is present in mammals but missing from birds and fish. The high degree of AA sequence conservation between ALDH2 and ALDH1B1 suggests that the latter may be the product of a gene duplication event that occurred some time after the avian-mammalian split around 310 MYA. Future analyses should consider other species, including amphibians and reptiles, in order to verify and more accurately pinpoint this evolutionary event.
Analysis of the aldh2 gene duplications in zebrafish indicates that these represent protein-coding genes and not pseudogenes. As mentioned above, translation of either duplicate would result in a full-length peptide. The aldh2.2 gene would encode a product 95.2 per cent identical to that of the parent gene aldh2.1 and is therefore considered a new gene. The aldh2.3 homologue shares ~99.6 per cent sequence identity with aldh2.2 and is therefore likely to represent a more evolutionarily recent duplication of aldh2.2. All three protein products include the conserved ALDH motifs and residues required for enzyme activity.
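The identity values quoted throughout this section are derived from pairwise alignments of the predicted peptides. As an illustration only, the sketch below computes percent identity from two pre-aligned sequences. The function name, the choice of denominator (columns in which neither sequence has a gap) and the toy sequences are assumptions made for this example; the exact convention used in this study is not stated here, and other denominators (eg the length of the shorter sequence) give different values.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither aligned sequence has a gap.

    Assumption: gaps are written as '-' and both inputs come from the same
    pairwise alignment, so they have equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0  # ungapped columns
    matches = 0   # ungapped columns with identical residues
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == "-" or res_b == "-":
            continue
        compared += 1
        if res_a == res_b:
            matches += 1

    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Hypothetical toy alignment (not real ALDH sequence):
# 5 ungapped columns, 4 of them identical.
toy_identity = percent_identity("MKV-LA", "MKVQLG")
```

With this convention, a truncated duplicate can score 100 per cent against its parent over the aligned region even though it covers only part of the full-length protein, which is why length and conserved motifs are considered alongside identity.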
The ALDH3 family showed the greatest variability among species. ALDH3A1 facilitates cell cycle regulation and scavenging of reactive oxygen species, and acts as a corneal crystallin by filtering UV irradiation in the eye. ALDH3A1 is missing from birds and fish but is present in every mammalian genome analysed in this study, suggesting that the gene evolved some time after 310 MYA. ALDH3A1 is conserved among mammals and shows no apparent duplications. In some species, such as rabbit, it appears that ALDH1A1 is expressed as a corneal crystallin instead of ALDH3A1 [31]. Interestingly, zebrafish is the only species in this study that apparently lacks both ALDH3A1 and ALDH1A1. Studies have suggested that zebrafish use scinla (cytosolic gelsolin) as a corneal crystallin instead [32-34].
Zebra finch ALDH3A3 encodes a full-length peptide that shares 84.1 per cent similarity with the ALDH3A2 parent gene. Zebrafish has three aldh3a2 duplications, which include two full-length genes (aldh3a2.2 and aldh3a2.3) and a significantly truncated partial duplication (aldh3a2p1). The degree of sequence identity that Aldh3a2.2 and Aldh3a2.3 share with the parent peptide (64.9 per cent and 70.9 per cent, respectively) suggests that they diverged sufficiently long ago to be considered new ALDH3A family members. They also share 64.9 per cent identity with each other and less than 60 per cent identity with zebra finch ALDH3A3, suggesting that all three genes are paralogues rather than orthologues. Zebra finch ALDH3A5 should also be considered a new functional ALDH family member. In addition, the zebrafish pseudogene aldh3a2p1, if translated, would share the highest degree of sequence identity with aldh3a2.3. Thus, the pseudogene most likely reflects a more recent partial duplication of this gene.
ALDH3B1 is duplicated in both cow and zebra finch. The cow ALDH3B4-encoded protein would be full length and share 85.4 per cent identity to ALDH3B1, suggesting that it is a new ALDH3B family member. Zebra finch ALDH3B5 shares an extremely high degree of homology with the amino-terminus of ALDH3B1. However, it lacks ~150 AAs that comprise the carboxy-terminus needed for enzyme oligomerisation. The truncated protein would still contain the conserved motifs required for ALDH activity. Until more experimental evidence becomes available, the ALDH3B5 gene should be considered as putatively functional.
The mouse and rat Aldh3b3 genes appear to represent new orthologous ALDH family members; the genes reside in syntenic chromosomal regions and share a high degree (83.4 per cent) of sequence identity with one another. The two proteins are more divergent than the rodent ALDH3B2 orthologues, which share 89.9 per cent sequence identity.
Aldh5a1 is another duplicated ALDH gene within the zebrafish genome. The duplication aldh5a1.2 resides on the same chromosome as the aldh5a1.1 parent gene, and the two share 100 per cent sequence identity. Aldh5a1.2 encodes a peptide containing an additional 22 amino-terminal and 88 carboxy-terminal residues. It also shares greater sequence identity with the human ALDH5A1 orthologue than Aldh5a1.1 (65.5 per cent versus 51.4 per cent). This suggests that aldh5a1.2 might actually be the parent gene and aldh5a1.1 a slightly truncated version formed as the result of gene duplication.
As mentioned above, the macaque ALDH7A1P5 genomic sequence lacks intronic regions, suggesting that a reverse transcriptase-mediated event gave rise to this pseudogene (ie having no adjacent promoter or other regulatory sequences). Four additional ALDH7A1 pseudogenes have been identified on chromosomes 5q14 (ALDH7A1P1), 2q31 (ALDH7A1P2), 7q36 (ALDH7A1P3) and 10q21 (ALDH7A1P4) [19]. Macaque ALDH7A1P5 is located on Chr 14, which is not syntenic with human Chr 11 and does not share common origins with any of the human pseudogenes. Therefore, the event that gave rise to ALDH7A1P5 must have taken place within the last 25 million years.
Three full-length ALDH9A1 homologues were identified in zebrafish. The Aldh9a1.2 peptide shares 71.2 per cent and 70.3 per cent identity with Aldh9a1.1 and Aldh9a1.3, respectively. Aldh9a1.3 is 94.9 per cent identical to the parent Aldh9a1.1 peptide, suggesting that this duplication was a relatively recent event when compared with the duplication that gave rise to Aldh9a1.2. Hence, aldh9a1.1, aldh9a1.2 and aldh9a1.3 represent three distinct protein-coding ALDH9 family members. The zebrafish genome also contains two copies of aldh18a1, which are found in very close proximity on Chr 12. Both genes are considered protein coding and would give rise to peptides of the same length which share 100 per cent sequence identity, suggesting a relatively recent duplication event.
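The reasoning applied to the three Aldh9a1 peptides can be sketched as a small lookup over the pairwise identities reported above. The table layout and the function name are hypothetical. The inference that the most similar pair reflects the most recent duplication also assumes broadly comparable rates of divergence among the copies, which has not been tested here.

```python
# Pairwise AA identities (per cent) as reported in the text above.
aldh9a1_identity = {
    frozenset({"Aldh9a1.1", "Aldh9a1.2"}): 71.2,
    frozenset({"Aldh9a1.2", "Aldh9a1.3"}): 70.3,
    frozenset({"Aldh9a1.1", "Aldh9a1.3"}): 94.9,
}


def most_similar_pair(identity_table):
    """Return the pair with the highest identity and that identity value."""
    best_pair = max(identity_table, key=identity_table.get)
    return tuple(sorted(best_pair)), identity_table[best_pair]


pair, value = most_similar_pair(aldh9a1_identity)
print(pair, value)
```

Under these assumptions the pair returned is Aldh9a1.1 and Aldh9a1.3, in line with the interpretation that aldh9a1.3 arose more recently than aldh9a1.2.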
ALDH gene-naming conventions dictate that (i) ALDH superfamily members sharing more than ~40 per cent AA identity belong to the same family (eg ALDH1A, ALDH1B, etc), and (ii) ALDH family members that share greater than 60 per cent AA identity belong to the same subfamily (eg ALDH1A1, ALDH1A2, etc). This provides a convenient and systematic naming system for an entire superfamily. Interestingly, this does not always indicate homology properly; these rules in the cytochrome P450 (CYP) gene superfamily are known to break down when one includes evolutionarily distantly related animals [27]. For example, whereas zebrafish Aldh3d1 and Aldh3b1 share only 50 per cent AA identity, HomoloGene evidence and alignments suggest that aldh3d1 is probably a duplication of aldh3b1 (data not shown). Although aldh3d1 has diverged considerably, it is likely to be more closely related to aldh3b1 than the naming convention would suggest.
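The two identity thresholds can be expressed as a simple decision rule. The sketch below is illustrative only: the function and constant names are hypothetical, the ~40 per cent boundary is approximate in the convention itself, and, as the Aldh3d1 example shows, the label it returns reflects naming rather than evolutionary origin.

```python
FAMILY_THRESHOLD = 40.0     # approximate ("~40 per cent") in the convention
SUBFAMILY_THRESHOLD = 60.0


def naming_relationship(identity_pct: float) -> str:
    """Classify two ALDH proteins by pairwise AA identity (per cent)."""
    if not 0.0 <= identity_pct <= 100.0:
        raise ValueError("identity must be between 0 and 100")
    if identity_pct > SUBFAMILY_THRESHOLD:
        return "same subfamily"
    if identity_pct > FAMILY_THRESHOLD:
        return "same family, different subfamily"
    return "different family"


# Zebrafish Aldh3d1 versus Aldh3b1 (50 per cent identity, from the text):
print(naming_relationship(50.0))
```

For the 50 per cent identity between zebrafish Aldh3d1 and Aldh3b1, the rule returns "same family, different subfamily", which matches their names but says nothing about aldh3d1 probably having arisen from aldh3b1.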
Many of these proteins have been defined based on genomic or dbEST data and have not been studied extensively. Many records remain in databases that are listed as 'protein-coding' but which instead may represent pseudogenes of various types. Furthermore, although the genes here do not have internal stop codons, it is difficult to determine without functional analysis whether they carry other inactivating mutations or whether they experience selective pressure. Although automated prediction and naming of ALDH proteins from completely sequenced genomes have yielded a great deal of information in a short amount of time, the alignment, curation and naming of these genes remain an important task. The fact that no new human ALDH genes have been identified over the past six years and that most other vertebrates seem to have settled close to this number suggests that identification of ALDH superfamily members in vertebrates is nearing completion. Determining the function and biological importance of each family member still requires additional work, however. As more information becomes available, the web database resource at http://www.aldh.org (the aldehyde dehydrogenase gene superfamily resource center) [35] will be updated to reflect our current understanding of this diverse and essential gene superfamily.