Analysis of the genomic α-gliadin genes from diploid species that represent the ancestral genomes of bread wheat
The typical structure of an α-gliadin is depicted in Figure 1. Because the sequences at the 5' end (signal peptide) and the 3' end of the genes are highly conserved within the α-gliadin gene family, different members of the family can be obtained by a PCR-based method on genomic DNA of various wheat species (Table 1). The accessions used were Triticum monococcum, which represents the A genome; T. speltoides (two accessions) and T. longissima, which represent relatives of the B genome; and T. tauschii, as representative of the D genome of wheat. We included both T. speltoides and T. longissima to represent the B genome because these species are thought to be related to its as yet unknown ancestor. The PCR yielded 230 unique DNA clones with high similarity to known α-gliadin genes (Table 1) that were not present in the public databases. Only 31 of these sequences contained an uninterrupted full open reading frame (full-ORF) α-gliadin gene. The great majority of the obtained sequences contained one or more internal stop codons or (rarely) a frameshift mutation (Table 1). We refer to the latter sequences as pseudogenes. Remarkably, only pseudogenes and no full-ORF genes were obtained from T. longissima.
Figure 1 Schematic structure of an α-type gliadin protein. The protein consists of a short N-terminal signal peptide (S) followed by a repetitive domain (R) and a longer non-repetitive domain (NR1 and NR2), separated by two polyglutamine repeats (Q1 and (more ...)
Table 1 Number of unique full open reading frame (full-ORF) sequences and sequences with one or more stop codons (pseudogenes) obtained from various diploid Triticum species. Accession numbers are given between brackets.
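The distinction between full-ORF genes and pseudogenes made above can be illustrated with a minimal Python sketch. It is not the procedure used in this study. The function name classify_orf is hypothetical, the input is assumed to be a coding sequence already trimmed to run from the start codon to the terminal stop codon, and a length that is not a multiple of three is used only as a crude proxy for a frameshift.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_orf(dna):
    """Classify a trimmed coding sequence (hypothetical helper).

    Returns a (label, internal_stop_positions) tuple, where positions are
    0-based codon indices of stop codons found before the final codon.
    """
    seq = dna.upper()
    # Assumption: a length that breaks the reading frame signals a frameshift.
    if len(seq) % 3 != 0:
        return "pseudogene (frameshift)", []
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # The last codon is allowed to be the regular terminal stop codon.
    internal = [i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]
    label = "pseudogene (internal stop)" if internal else "full-ORF"
    return label, internal


if __name__ == "__main__":
    # Toy sequences, not taken from the study.
    for toy in ("ATGCAGCAATGA", "ATGTAGCAATGA", "ATGCAGCATGA"):
        print(toy, classify_orf(toy))
```

In practice a frameshift would be called from an alignment against intact genes rather than from sequence length alone, so the third class in this sketch should be read as a placeholder.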
A phylogenetic analysis of the deduced amino acid sequences of the full-ORF α-gliadin genes demonstrated a clear clustering of the sequences according to their genome of origin (Figure 2). The sequences derived from the A genome (T. monococcum) and those from the D genome (T. tauschii) each formed a separate cluster of relatively closely related genes in the phylogenetic tree. The sequences originating from the two T. speltoides accessions (B genome) formed a relatively diverse cluster. All five sequences derived from the two different accessions of T. speltoides differed from each other. Accordingly, the greater diversity of the B genome sequences is not an artifact of using more than one representative accession.
Figure 2 Dendrogram of a ClustalX alignment of the obtained full-ORF α-gliadin deduced proteins, which are indicated by their accession numbers (see Table 1). A PAM350 matrix and the neighbor joining method were used. Bootstrap values (of 1000 replications) (more ...)
To investigate whether the observed clustering of the sequences can be related to specific domains of the α-gliadin gene (Figure 1), the first repetitive domain (R), the first (NR1) and the second non-repetitive domain (NR2) were used separately in a phylogenetic analysis (not shown). In all cases the sequences clustered according to their genome of origin, and again the A genome (T. monococcum) and D genome (T. tauschii) sequences clustered separately in two groups of closely related sequences, whereas the sequences originating from the B genome (T. speltoides) formed a more diverse group with nodes of high bootstrap values. Only for domain NR2 were no significant bootstrap values attached to the nodes within this group.
The two polyglutamine repeat domains were analyzed for differences in the average number of glutamine residues. Figure 3 shows large and significant differences between the average lengths of the polyglutamine repeats depending on the genome of origin. The A genome (T. monococcum) coded for a significantly larger average number of glutamine residues in the first polyglutamine repeat than the B and D genomes. In the second polyglutamine repeat, the B genome showed a significantly larger number of glutamine residues than the other two genomes (Figure 3). The analysis of the repeat domains indicates that nearly all α-gliadin sequences can be assigned to one of the three genomes using only the combination of both repeat lengths.
Figure 3 Analysis of the two glutamine repeats in the 31 obtained full-ORF α-gliadin proteins from diploid wheat species, according to the genome of origin. The average number of the glutamine residues in the first (Q1) and second repeat (Q2) are shown (more ...)
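The repeat-length comparison can be sketched in Python as follows. This is an illustration under stated assumptions, not the analysis behind Figure 3: the function names polyq_runs and mean_q1_q2 are hypothetical, a repeat is taken to be an uninterrupted run of at least min_len glutamines (Q), and the first two such runs in a protein are treated as Q1 and Q2. Real polyglutamine domains may be interrupted by other residues, which this simple pattern would split.

```python
import re
from statistics import mean


def polyq_runs(protein, min_len=4):
    """Return the lengths of uninterrupted glutamine runs, in sequence order."""
    pattern = r"Q{%d,}" % min_len
    return [len(m.group()) for m in re.finditer(pattern, protein)]


def mean_q1_q2(proteins, min_len=4):
    """Average length of the first (Q1) and second (Q2) run over a set of proteins.

    Proteins with fewer than two qualifying runs are skipped (an assumption).
    """
    pairs = []
    for protein in proteins:
        runs = polyq_runs(protein, min_len)
        if len(runs) >= 2:
            pairs.append((runs[0], runs[1]))
    return mean(q1 for q1, _ in pairs), mean(q2 for _, q2 in pairs)


if __name__ == "__main__":
    # Toy fragments, not sequences from the study.
    toy = ["AAQQQQQQLPQQQQSS", "AAQQQQQQQQLPQQQQQQSS"]
    print(mean_q1_q2(toy))
```

Grouping the input proteins by genome of origin and calling mean_q1_q2 once per group would give the per-genome averages that the text compares.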
Analysis of the pseudogenes
The great majority of the α-gliadin genes contained one or more internal stop codons. We refer to them as pseudogenes, although we cannot predict from the genomic data whether a subset is being expressed. A question is how and when these pseudogenes evolved. Therefore, we determined their position in the clustering of the three genomes, and the relationship with intact ORFs in the same loci. These pseudogenes are structurally similar to the full-ORF genes. The stop codons were nearly always located at positions where the full-ORF genes contained a glutamine codon. In 77.2% of the cases, the stop codon was the result of a C to T change when compared with the full-ORF genes, converting a CAG or CAA codon for glutamine into a TAG or TAA stop codon. In addition, we observed that 15.5% of the stop codons were caused by a T to A change, converting the codon for leucine (TTG) into a stop codon (TAG). Besides these predominant substitutions we observed some C to A, C to G, G to T, and G to A changes. Twenty of the 199 pseudogenes contained a frameshift mutation (two were obtained from T. monococcum (A genome), two from T. tauschii (D genome) and 16 from T. longissima and the two T. speltoides accessions (B genome)). The changes into stop codons were not distributed randomly across the amino acid residue positions in the sequences, and they were not distributed evenly across the various diploid species. A high percentage of stop codons occurred jointly in one pseudogene, and many pseudogenes from one species contained the same set of stop codons, suggesting that they were duplicated after the mutations had created the stop codons (Figure 4). A dendrogram of the deduced amino acid sequences of the great majority of non-frameshift pseudogenes, including the deduced amino acids downstream of the internal stop codon, closely resembled that of the full-ORF sequences.
Only eleven percent of all pseudogene sequences clustered separately from the rest of the sequences of the same genome of origin.
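The classification of stop codons by the underlying nucleotide change (for example C to T in CAG to TAG, or T to A in TTG to TAG) can be sketched as below. This is a simplified illustration, not the method used to obtain the percentages above. The function names are hypothetical, and the two sequences are assumed to be in frame, gap-free and of equal length, with the reference standing in for a full-ORF gene.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_substitution(ref_codon, alt_codon):
    """Return e.g. 'C>T' if a single base change turned ref_codon into a stop codon.

    Returns None when alt_codon is not a stop, when ref_codon already is one,
    or when the codons differ at more than one position.
    """
    ref, alt = ref_codon.upper(), alt_codon.upper()
    if alt not in STOP_CODONS or ref in STOP_CODONS:
        return None
    diffs = [(r, a) for r, a in zip(ref, alt) if r != a]
    if len(diffs) != 1:
        return None
    ref_base, alt_base = diffs[0]
    return f"{ref_base}>{alt_base}"


def tally_stop_changes(reference, pseudogene):
    """Count stop-creating single-base changes between two aligned coding sequences."""
    counts = Counter()
    for i in range(0, min(len(reference), len(pseudogene)) - 2, 3):
        change = stop_substitution(reference[i:i + 3], pseudogene[i:i + 3])
        if change is not None:
            counts[change] += 1
    return counts


if __name__ == "__main__":
    # Toy codons: CAG->TAG (C>T), TTG->TAG (T>A), CAA unchanged.
    print(tally_stop_changes("CAGTTGCAA", "TAGTAGCAA"))
```

Summing such tallies over all pseudogenes and dividing by the total number of stop codons would give proportions of the kind reported in the text.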
Figure 4 Distribution of stop codons in the pseudogenes according to the amino acid position in the sequences. The positions of the stop codons are not distributed evenly across the various diploid species. The A genome sequences have a high percentage of stop (more ...)
To study the selection pressure on the obtained sequences, the numbers of synonymous (Ks) and non-synonymous (Ka) substitutions per site were calculated from pairwise comparisons among the obtained full-ORF gene sequences and the pseudogene sequences (Figure 5). The trendlines indicated a relative excess of synonymous substitutions compared to non-synonymous substitutions and showed a stronger excess for the full-ORF genes. Consequently, the mean Ka/Ks ratio for the genes was significantly lower than that of the pseudogenes (t test; P < 0.0005), indicating the occurrence of selection.
Figure 5 The relation of the relative numbers of synonymous substitutions (Ks) and non-synonymous substitutions (Ka) per site for pairwise comparisons among full-ORF α-gliadins and pseudogene sequences. The dotted line represents a Ka/Ks ratio of 1. Linear (more ...)
Since the first stop codons occur in various positions in the pseudogenes, it was not feasible to select a large number of sequences of sufficient and similar length to compare the selection pressure of the sequences up to the first stop codon with that of the sequences beyond it.
Analysis of sequences from hexaploid bread wheat
If the features described above that distinguish the α-gliadin genes from the different diploid genomes are present in the same way in hexaploid wheat, it would be possible to assign the sequences, as well as the known T cell stimulatory epitopes of α-gliadins from hexaploid wheat, to one of the three loci on chromosome 6A, 6B, or 6D. Since many hexaploid sequences are present in the public EMBL/GenBank/DDBJ databases, we tested this by using the deduced amino acid sequences of 56 of these full-ORF genes to build a phylogenetic tree (accession numbers are given in Table 2). The sequences of hexaploid wheat clustered into three different groups (data not shown), as did the sequences obtained in this study, separated by a very high bootstrap value (998/1000). Joint analysis with our full-ORF sequences from the diploid species showed that the three groups coincide, and this allowed us to assign each of the database sequences to one of the three Gli-2 loci (Table 2).
Table 2 Number of T cell stimulatory epitopes present in full-ORF α-gliadin genes originating from T. aestivum according to the deduced genome of origin. Sequences are obtained from the public databases. The α-gliadin locus is on chromosome 6, (more ...)
Analysis of CD-toxic epitopes
Our phylogenetic analyses show that the α-gliadin genes are distinct in their sequence conservation depending on the genomic origin. Are these patterns also reflected in the occurrence of T cell stimulatory epitopes in the genes depending on their genomic origin? Table 3 shows the number of perfect matches to the four epitopes studied in the obtained full-ORF genes and in the pseudogenes. The results demonstrate that the set of epitopes is indeed distinct for each genome. Firstly, in the A genome (T. monococcum) sequences, the epitopes glia-α9 and glia-α20 were present in all 17 different full-ORF genes and in 39 (glia-α9) and 38 (glia-α20) of the 44 pseudogenes. However, the epitopes glia-α and glia-α2 were absent. The database sequences from hexaploid T. aestivum that were assigned to chromosome 6A showed the same trend in epitope occurrence (Table 2). Secondly, in the five full-ORF sequences obtained from the B genome species, epitopes were completely absent except for two genes that contained the epitope glia-α only. Correspondingly, only four of the 20 hexaploid wheat database sequences that were assigned to chromosome 6B contained epitope glia-α, whereas all others were without epitopes. Of the pseudogenes we obtained from the B genome species, 17% contained the glia-α epitope and only 3% the glia-α2 epitope, but these pseudogenes did contain the epitopes glia-α9 and glia-α20 at frequencies of 53% and 55%, respectively. Finally, in the 11 full-ORF sequences and the 64 pseudogenes obtained from the D genome, all four epitopes occurred frequently. This also applied to the five hexaploid wheat database sequences assigned to chromosome 6D.
Table 3 Number of T cell stimulatory toxic epitopes present in full-ORF genes (upper panel) and pseudogenes (lower panel). N is the total number of genes used in the analyses.
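Counting perfect epitope matches in a deduced protein can be sketched in Python as follows. This is a minimal illustration rather than the procedure used in this study, and the function name count_perfect_matches is hypothetical. Only the two epitope sequences quoted in the captions of Figures 6 and 7 are included; the other epitopes could be added to the dictionary in the same way. A lookahead pattern is used so that overlapping copies of an epitope are each counted.

```python
import re

# Epitope sequences as quoted in the figure captions of this section.
EPITOPES = {
    "glia-α": "QGSFQPSQQ",
    "glia-α2": "PQPQLPYPQ",
}


def count_perfect_matches(protein, epitopes=EPITOPES):
    """Return the number of exact (possibly overlapping) matches per epitope."""
    hits = {}
    for name, peptide in epitopes.items():
        # A zero-width lookahead matches at every start position of the peptide.
        pattern = f"(?={re.escape(peptide)})"
        hits[name] = len(re.findall(pattern, protein))
    return hits


if __name__ == "__main__":
    # Toy fragments, not sequences from the study.
    intact = "AAQGSFQPSQQAA"
    disrupted = "AAQGSFRPSQQAA"  # Q to R at the fifth epitope residue
    print(count_perfect_matches(intact))
    print(count_perfect_matches(disrupted))
```

Because only exact matches are counted, a single amino acid substitution within an epitope removes the hit, which mirrors the assumption stated in the text that one substitution is sufficient to abolish T cell stimulation.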
Each epitope had its own position in the α-gliadin protein. Glia-α was in all cases present in the second non-repetitive domain (NR2), whereas glia-α2, glia-α9 and glia-α20 were all found in the first repetitive domain (R). A closer look at these sequences revealed that a single nucleotide polymorphism (SNP), which resulted in an amino acid change in a particular epitope, was present in most or all genes originating from one of the three genomes. For example, Figure 6 shows that the glia-α epitope in all of the full-ORF genes derived from the A genome was disrupted at the fifth amino acid of the epitope by the presence of an arginine (R) instead of a glutamine (Q). In three B genome sequences the glia-α epitope was disrupted at the second amino acid of the epitope by the presence of a valine (V) instead of a glycine (G). A detailed overview of the presence of the epitopes glia-α2, glia-α9 and glia-α20 in the obtained full-ORF sequences is shown in Figure 7.
Figure 6 Partial detailed alignment of the obtained full-ORF α-gliadin proteins. The figure shows the disruption of epitope glia-α (QGSFQPSQQ) by a single amino acid change in all T. monococcum (A genome) sequences and three of the T. speltoides (more ...)
Figure 7 Partial detailed alignment of the obtained full-ORF α-gliadin proteins, showing the disruption of epitope glia-α2 (PQPQLPYPQ) in all T. speltoides (B genome), T. monococcum (A genome) and three T. tauschii (D genome) sequences. Secondly (more ...)
Here we show for the first time that large differences exist in the content of predicted T cell epitopes (glia-α, glia-α2, glia-α9, glia-α20) in full-ORF genes and pseudogenes from the diploid species. This phenomenon was also observed in hexaploid wheat. None of the diploid A genome sequences and none of the sequences from chromosome 6A in hexaploid bread wheat contained the glia-α and glia-α2 epitopes (Tables 2 and 3). In contrast, the sequences from the D genome contained all four epitopes at high frequencies, both in the diploid species and in hexaploid bread wheat. For the B genome, the five diploid and 20 hexaploid full-ORF sequences rarely contained the epitope glia-α and did not contain any of the other three epitopes. Based on this analysis, we predict that among the α-gliadin proteins, those coded by the B genome are the least likely to stimulate CD4 T cells. Remarkably, the pseudogenes revealed the presence of all the epitopes. In these analyses we have assumed that a single amino acid substitution is sufficient to prevent such peptides from stimulating the T cells, especially since the substitutions often concern a glutamine residue. Glutamine residues can be deamidated to glutamic acid by tTG in the human gut, providing the negative charges necessary to enhance binding in the DQ2 groove [9