|Home | About | Journals | Submit | Contact Us | Français|
Disease response genes (DRGs) diverge under recurrent positive selection as a result of a molecular arms race between hosts and pathogens. Most of these studies were conducted in animals, and few defense genes have been shown to evolve adaptively in plants. To test for adaptation in the molecules mediating disease resistance in the cereals, we first combined information from the expression pattern of Sorghum bicolor genes and from divergence to the full genome of rice to identify candidate DRGs. We then used evolutionary analyses of orthologous gene sets from several grass species, to determine whether the DRGs show signals of positive selection and the residues targeted. We found 140 divergent genes upregulated under biotic stress in S. bicolor by evaluating the relative abundance of expressed sequence tags in different libraries and comparing them with rice genes. For 10 of these genes, we found sets of orthologs including sequences from rice and three other cereals; six genes showed a pattern of substitution that was consistent with positive selection. Three of these genes, a thaumatin, a peroxidase, and a barley mlo homolog, are known antifungal proteins. The other three genes with evidence of positive selection were a MCM-1 agamous deficiens SRF- (MADS) box transcription factor, an eIF5 translation initiation factor, and a gene of unknown function but with evidence of expression during stress. Permutation analyses, using different ortholog and paralog sequences, consistently identified five positively selected codons in the peroxidase, a member of a cluster of genes and a large gene family. We mapped the positively selected residues onto the structure of the peroxidase and thaumatin and found that all sites are on the surface of these proteins and several are close to biochemically determined active sites. Identifying new positively selected plant disease resistance genes and the critical amino acid sites provides a basis for functional studies that may increase our understanding of their underlying molecular mechanisms of action. Additionally, it may lead to the identification of individuals having variation at functionally important sites, as well as eventually using this information in the rational design and engineering of proteins involved in plant disease resistance.
Evolutionary theory predicts that the long-term interaction between host and pathogen populations should lead to an arms race, preventing organisms in either population from achieving an optimal genotype (Maynard Smith 1998). This competition for increased fitness in the face of constant change is expected to lead to recurrent events of positive selection whenever advantageous mutations appear that increase the capacity for disarming the other member of the antagonistic couple.
The unambiguous identification of positive selection is one of the most important endeavors in modern population genetics and comparative genomics (Nielsen 2005). Recently, numerous studies have identified positively selected disease response genes (DRGs) in several species of animals (Nielsen 2005; Schlenke and Begun 2005). However, although a large number and diversity of angiosperm DRGs have been described (Hammond-Kosack and Jones 1997; Michelmore and Meyers 1998; van Loon et al. 2006), only those from three categories have been shown to have evolved under positive selection by looking at the pattern of nucleotide substitutions in orthologous genes. The first category corresponds to putative pathogen recognition proteins of the nucleotide-binding site and leucine-rich repeat (LRR) domain type (NBS-LRR) and is divided into two subcategories depending on the domain present at the N-terminus, namely, the Toll-Interleukin 1 Receptor or the coiled-coil domain, that is, the classical, dominant, disease resistance (R) genes (Mondragón-Palomino et al. 2002). The second category consists of pathogen-wounding enzymes, namely, chitinases (Bishop et al. 2000; Tiffin 2004) and glucanases (Bishop et al. 2005) that degrade fungal-cell walls and therefore act as pathogen-wounding proteins but have traditionally been called pathogenesis-related proteins or PRPs due to their synthesis in plants as a response to microbial attack (Bowles 1990; Sticher et al. 1997). The third category consists of polygalacturonase inhibitor proteins (PGIPs), which bind pathogen-derived enzymes that degrade the pectic oligomers in the cell wall, restraining their action (Stotz et al. 2000).
Although these types of plant DRGs include numerous genes (e.g., in rice the number of canonical R genes is ca. 600; Goff et al. 2002; Yu et al. 2002) and are essential for the functioning of several disease resistance mechanisms, they represent two extremes of the process, that is, the recognition of the pathogen at the beginning of infection and the molecular clash with the pathogen in a direct and often highly specific manner. Many genes that are important in these as well as other stages of defense, such as transcription and translation factors, have not been assessed for signals of historical molecular adaptation in plants. Interestingly, Bakker et al. (2008) studied recently the genetic diversity at 27 loci in Arabidopsis involved in the Salicylic Acid, Jasmonic Acid, and Ethylene disease signaling transduction pathways and found that most are under strong purifying selection, with only hints of positive selection in eight genes. Such an unexpected result, contrasting with the pattern of polymorphism found for R genes in the same panel (Bakker et al. 2006), is intriguing and warrants further investigation of molecular evolution in genes from signal transduction pathways and other downstream elements of disease resistance. It is essential to determine whether other types of plant DRGs, known or yet unannotated as such, have evolved under positive selection, particularly considering the strong and constant pressure imposed by pathogens, or whether there are unknown constraints to positive selection in the genome of plants that have lead to the current dearth of examples of adaptive evolution in plant genes downstream of NBS-LRR receptors.
Evidence for adaptive evolution can be detected by maximum likelihood (ML) codon-based analyses of the ratio of nonsynonymous substitutions per nonsynonymous site to that of synonymous substitutions per synonymous site. These tests have revealed numerous genes in many organisms for which particular domains or even single codons have evolved under positive selection (Nielsen and Yang 1998; Yang et al. 2000; Nielsen 2005). This method is most powerful provided that there are sets of sequences from orthologous genes available for several species (Nielsen 2005), which is increasingly the case in the grasses.
The recent exponential increase in the availability of sequences from various cereal species in public databases, along with information on their expression patterns, provides an opportunity to identify and compare orthologous genes—which may also have a common function—across the grasses. Information available from the annotated rice genome provides a framework for correctly identifying, organizing, and comparing the orthologs from other grass species and for transferring useful information across the cereals (Jaiswal et al. 2002). Additionally, the phylogeny of cereals and a few other grasses has been thoroughly studied using numerous morphological and biochemical characters (Barker et al. 2001); therefore, a reliable consensus phylogeny for subfamilies and tribes exists that can be used as a null model to study the pattern of substitutions in the genes of interest in an evolutionary framework. When comparing genes from the cereal species, the ML approaches to modeling gene evolution have the additional benefit of looking at millions of years of gene evolution along the different lineages within Poaceae, since the most recent common ancestor between rice and sorghum (the most diverged lineages evaluated here) existed at least 50 Ma (Barker et al. 2001; Prasad et al. 2005). More importantly, these methods are less sensitive to problems caused by changes in the patterns of genetic diversity resulting from nonequilibrium demography, which can mimic any and all of those generated by selection (Nielsen 2005).
A useful pathosystem for identifying candidate disease resistance genes in the grasses is that of Sorghum bicolor and its fungal pathogen Colletotrichum sublineolum, the causal agent of anthracnose. Different strains of anthracnose affect the stalk, foliage, panicle, and grain of sorghum, and losses of up to 50% have been reported. This fungal pathogen has high genetic diversity (Rosewich et al. 1998) and its ability to change rapidly through meiosis and parameiosis, generating strains with increased pathogenicity (Souza-Paccola et al. 2003), suggests it may have had an important role as a selective factor and consequently as a driving force behind rapid gene evolution in its host. Analysis of the inheritance of anthracnose resistance (Mehta et al. 2005) suggests the involvement of several major genes (so far unknown), as well as a complex genetic basis also suggested by the variation for resistance in landraces from several locations in Africa (Erpelding and Prom 2006). Furthermore, different but closely related species, Colletotrichum falcatum and Colletotrichum graminicola, infect sugarcane (Suman et al. 2005) and maize (Crouch et al. 2006), respectively, suggesting that these fungal pathogens might have coevolved with their host plants for perhaps several millions of years, because sorghum, sugarcane, and maize are members of the tribe Andropogoneae (Poaceae; supplementary fig. 1, Supplementary Material online). Such antagonistic coevolution would have left its mark on the genomes and the molecules involved, increasing the probability of identifying genes that have evolved under selection in the cereals due to the pressure imposed by this pathogen.
In addition to the recently sequenced genome (Phytozome), there are over 200,255 expressed sequence tag (ESTs) available for S. bicolor, from 20 different and nonnormalized EST libraries. Two of these libraries were made after inoculating either resistant or susceptible sorghum plants with C. sublineolum, the anthracnose pathogen (Pratt et al. 2005), producing incompatible and compatible reactions, respectively. These two libraries have been used to conduct EST-based differential expression analyses using custom-designed software and to identify genes upregulated as a result of inoculation (Pratt et al. 2005). In this paper, we describe the identification of biotic stress induced genes from Sorghum, some of which were not previously known to be involved in response to anthracnose or other (fungal) diseases. We also generate sets of putative orthologous genes in the cereals and demonstrate that several show statistical evidence for recurrent positive selection. Finally, we identify the position of positively selected amino acid residues in the tertiary structure of thaumatin and peroxidase, both previously described PRPs.
In S. bicolor, the Biotic Stress subgroup of EST libraries is comprised of three EST libraries called the Pathogen induced Incompatible (PI1, resistant reaction, 9,533 ESTs), the Pathogen Infected Compatible (PIC1, susceptible reaction, 10,209 ESTs), and the Salicylic Acid treated library (SA1, 5801 ESTs) (see Pratt et al. 2005 for a detailed description of the 20 S. bicolor EST tissue- and treatment-specific libraries and analysis software). We used the program MAGIC Gene Discovery (Laboratory for Genomics and Bioinformatics, University of Georgia) to identify uniscripts constructed using ESTs that came primarily from these three libraries. A uniscript is defined here as a unique splicing form of a gene model made from ESTs that can come from only one (tissue or treatment specific genes) or many EST libraries (e.g., housekeeping genes). We selected uniscripts having more than 50% of ESTs expressed in the biotic stress subgroup of libraries and with at least three ESTs composing the uniscript. A believability score was generated by Pratt et al. (2005) for each of the uniscripts, based on the method of Stekel et al. (2000). This believability score reflects the probability that a gene is upregulated in a particular library and we report it for the genes used in this study. We also used MAGIC to obtain a control set of 700 constitutively expressed uniscripts, having a library ratio of 0.05 (no more than 5% of ESTs from any of the 20 libraries) and more than 20 ESTs.
We used the transeq program (EMBOSS; http://www.ebi.ac.uk/emboss/transeq/) to translate all the 773 pathogen induced uniscripts, as well as the control set, in all six frames and compared these translations with the rice proteins (trabecular meshwork inducible glucocorticoid response, TIGR) using BlastP to obtain the most likely orthologs in rice and to identify highly expressed and divergent genes. In order to identify known protein domains in the uniscripts, we used parallel InterProScan, implemented at the Computational Biology Service Unit (CBSU, Cornell University) and compared this annotation with that precalculated by the MAGIC gene discovery tool, which compared all the uniscripts with the PIR database using TBlastX. We then obtained the Gene Ontology (GO Slim) annotation for the rice genes from the TIGR website, using the LOC_Os identification codes in Batch download.
To identify putative orthologous genes in other cereal species, particularly in sugarcane, maize, barley, wheat, and rice, we used parallel-BlastN (PBlastN), as implemented in the computer cluster of the Computational Biology Service Unit (Cornell University), and conducted a similarity search with default parameters using the set of Sorghum biotic stress upregulated (BSU) genes (n = 773) as a query and all the nucleotide Genbank sequences (nr) as database. To identify uniscripts with hits to the species of interest, we used SQL queries specifying the scientific names of the acceptable target species. We used SQL queries to select those uniscripts that had an E value suggestive of moderately rapid evolution (where the best hit was in the interval: 10−50 > E >10−5) and then used Microsoft Excel to parse the SQL results and to identify those uniscripts with hits to three or more species having mRNAs and particularly those having prior information suggesting a function in disease resistance.
To corroborate best hits and reciprocal best hits, we used the sequences from all species in the selected sets of orthologous genes in subsequent BlastN searches using other databases such as the EST_others and genome survey sequences (GSSs). Good hits to ESTs, NCBI unigenes, and genomic sequences such as High CoT and methylation filtration sequences were downloaded using Batch download from NCBI and used to construct strict contigs in Sequencher (Codon Codes, Ann Arbor, MI) with identity greater than 95% and overlap of more than 20 bases, allowing for large gaps to align the genomic sequences to the expressed sequences. These contigs were then used as input in FGenesH (Salamov and Solovyev 2000) for ab initio gene prediction and translation of the open reading frames (ORFs) into protein sequences. These predicted genes were compared with the predicted genes in rice, which were obtained also by FGenesH (Goff et al. 2002; Yu et al. 2002) and have been corroborated in many cases by full-length mRNA analyses. This strategy was initially used to identify full ORFs in sorghum, but because this work was conducted while the sorghum genome was being sequenced, we used the early genome drafts to corroborate our results and to obtain further gene models (Phytozome, http://www.phytozome.net/sorghum).
In order to select the most likely ortholog from each species, we required that all the exons aligned well from the beginning to the end with few gaps. If more than one such gene from a species was found in a cluster of orthologs, we chose the one that was the closest match to sorghum over the entire length of the coding region.
Sequences that appeared in unexpected positions in the species phylogeny, for instance, a maize gene with greater divergence to sorghum than that of the rice ortholog, were considered old, divergent paralog genes. This strategy assumes that most of the substitutions are neutral and therefore the genealogy reflects the phylogenetic relationships of the species, whereas those sites that are under selection are noninformative or are very few compared with the neutral sites and therefore do not affect the overall topology.
We realigned the nucleotide sequences based on the protein sequence alignments using protal2dna (Schuerer K, Letondal C, Pasteur Institute bioinformatics) or tranalign (EMBOSS package). These nucleotide sequence alignments were used to identify the genes and the particular codons under positive selection using the ML methods described by Nielsen and Yang (1998) and Yang et al. (2000) and implemented in the codeml algorithm from the PAML package (version 3.13), as well as by the Fixed Effects Likelihood (FEL) and Random Effects Likelihood (REL) models described by Kosakovsky Pond et al. (2005) and implemented in the online version of the HyPhy package (http://www.datamonkey.org). Regions with gaps were not included in the analyses of these orthologs and we refer to these sets as NG (No Gaps).
To assess the effects of including paralogs in the determination of adaptive evolution, we generated two additional sets of peroxidase genes (Paralog Sets 1 and 2, PS1 and PS2). We downloaded all the peroxidases from sorghum and other cereal species with high to moderate matches to the upregulated sorghum gene, and made phylogenetic trees using the UPGMA algorithm with 2,000 bootstrap replications (as implemented in Megalign). We used all the obtained paralogs from all the species together in order to identify those genes from different species that clustered together in a subtree resembling the species phylogeny as an indication of orthology, as well as paralogs showing different levels of divergence.
We used the orthologous gene sets to construct likelihood ratio tests (LRTs) to determine whether these genes have evolved under selection. We first compared a null model (M0) with a single dN/dS (ω) rate with one with more classes (M3) to test whether the latter model fits the data better. Then, we made an LRT for the case where the reduced model includes 10 classes of ω within the beta distribution, that is, where classes range from strong negative selection to neutrality in the interval between 0 and 1 (M7), and the unrestricted model where there is an additional class of sites where ω can be greater than 1 (M8), a strong indication of positive selection.
Candidate genes thaumatin (2_8981) and peroxidase (2_7192) have close homologs for which the crystal structure of the protein has been determined, and we downloaded their PDB files (peroxidase 1H5D, thaumatin 1RQW) from the protein databank (www.rcsb.org_pdb). We used Swiss Deep View and Modeller (in parallel at CBSU) to align the candidate gene protein product sequence to that of the homolog and to minimize the energy in the model. We then used Deep View to map the amino acid sites predicted to be under selection to the 3D model of the protein.
Of a total of 57,673 S. bicolor uniscripts, 773 showed increased level of expression in the biotic stress subgroup of EST libraries under the conditions defined in MAGIC Gene Discovery, that is, those uniscripts have more than 50% of their ESTs belonging to this subgroup and more than three ESTs in total. On average, these BSU uniscripts have a length of 739.9 base pairs (standard deviation, SD = 103.1 bp), are made up of 7.1 ESTs (SD = 13.2), and have a biotic stress ratio of 0.67 (SD = 0.16). A few of these uniscripts (n = 9, 1.2%) appear to be contaminations of fungal origin, based on TBlastX to the genome of Neurospora crassa, and they were eliminated from the analysis.
In order to annotate the BSU genes and to determine how divergent they are, we used BlastP to compare them with rice proteins (TIGR v.3) and identified 505 sorghum BSU uniscripts with significant (E value < 10−5) hits in rice (fig. 1 and supplementary fig. 2, Supplementary Material online). The mean divergence, in terms of E value, was 10−45, with a SD of E = 10−23. The best fit line shows a subtle increase in the level of expression related to evolutionary conservation, particularly for the lowest categories of expression (three to six ESTs, fig. 1). However, overall, no significant correlation exists between the level of expression and the degree of conservation for this set of sorghum BSU uniscripts (Correlation coefficient = −0.026). Most genes (92%) had less than 10 ESTs, regardless of their divergence, and most conserved genes (BlastP, E < 10−45) show a low level of expression. Interestingly, however, there are several genes that show high expression and divergence (lower right quadrant of fig. 1), suggesting that conditionally expressed DRGs may include some members that have diverged under positive selection.
A comparison of the Gene Ontology (GO) annotation for the best hit in rice for all the 505 sorghum BSU uniscripts (supplementary fig. 3, Supplementary Material online) and a set of 700 constitutively expressed genes (supplementary fig. 4, Supplementary Material online) revealed that although kinase activity is the main molecular function identified in both groups, there was a 6% increase in the BSU genes. Two other main categories, DNA binding and protein binding, increased as well (both 3%), whereas catalytic activity, hydrolase activity and transcription activity all decreased substantially (14%, 8%, and 7% less, respectively). Many categories are exclusively present in the BSU gene set, including calmodulin binding (3%), oxygen binding (3%), carbohydrate transporter (2%), lipase (2%), lipoxygenase (1%), and calcium- and calmodulin-dependent protein kinase activity (1%). Of these 505 BSU genes, there were several with annotation strongly suggesting a role in disease response, including NBS-LRR genes (n = 7), NB-ARC domain genes (n = 5), lectins (n = 3), peroxidases (n = 2), thaumatins (n = 2), and one gene of each of the following: Enhanced Disease Resistance (EDR1) homolog; chitin inducible gene; Pathogenesis Related Protein 1b, and MLO homolog. Interestingly, PGIPs were not found in the BSU gene set but were present in the constitutively expressed control gene set.
In order to test for signals of selection in the BSU uniscripts, we attempted to identify orthologs in other cereal species for each of the 773 BSU uniscripts (see Materials and Methods), and used codon-based ML tests of neutrality and positive selection. From these sets, we selected for further analysis genes that met the following criteria:
Application of these criteria yielded 140 genes for further analysis. For each of these candidates, we attempted to reconstruct the full-length ORFs by additional BlastN searches against the GSSs and EST_others databases, that is, genomic and expressed sequences (NCBI, 06/2005). Moreover, these additional sequences allowed us to empirically verify the gene prediction made by FGENESH by comparing the genomic and expressed sequences. Evidence of a role in plant defense in any of the species was also used to assign priority in the reconstruction of the set of orthologs, because this would indicate a conserved function in different species and prima facie evidence of a correct annotation as a DRG in sorghum and other cereals. Only sets with complete ORFs were used to test for positive selection, as this is one of the conditions needed to assure orthology and a good alignment. We were able to reconstruct a total of 10 sets of orthologous ORFs using this strategy.
The 10 sets of orthologs were tested for evidence of positive selection using various models in the PAML package (table 1). Six of the 10 genes were moderately divergent also at the protein level (BlastP E < 10−50) when comparing sorghum with rice and all of them have sites predicted to be under positive selection (table 1). On the contrary, four genes were not as divergent when the complete peptide was used and had BlastP E values greater than 10−50 when compared with rice proteins (table 2).
Although three of the more conserved genes, 2_12005, 2_6529, and 2_11011, have 50% of their ESTs coming from the Biotic Stress group of EST libraries (supplementary fig. 5, Supplementary Material online), their annotation does not suggest a role in disease resistance mechanisms and were found to be under strong purifying selection only (table 1). Another moderately conserved gene, 2_9185, appears to be evolving primarily under purifying selection as well. However, this gene is highly expressed (fig. 1) and 74% of its ESTs come from biotic stress libraries (supplementary fig. 5, Supplementary Material online). Additionally, this gene belongs to a family that has been implicated in disease resistance (Langlois-Meurinne et al. 2005; Wisser et al. 2005).
The comparison of model M3 with M0 (table 1) shows that all 10 genes have variation for dN/dS (ω) ratios among their codons, that is, selective constraints are heterogeneous, as expected. For the four conserved genes, model M8, which allows for some sites to have ω greater than 1.0, was not better than M7, indicating that all amino acids belong to classes ranging from strict purifying selection to neutrality. For the six moderately divergent genes, the M8 model was significantly better than M7, providing strong evidence for positive selection at some codons. We also used the FEL and REL algorithms from HyPhy (Kosakovsky Pond and Frost 2005) and found evidence of positive selection only for the peroxidase set. Two sites, 105E and 117Q had dN–dS values of 2.24 and 1.15, respectively. These two sites were also identified using codeml as positively selected with posterior probabilities of 0.95 and 0.99, but three other sites identified consistently and with high posterior probability (>0.90) were not identified as such by HyPhy. All other sets of genes were found to have sites evolving only under significative negative selection by HyPhy.
Out of the six genes that showed evidence of positive selection, an average of 5.3 codons per gene (SD = 3.0) had posterior probabilities greater than 0.90 of belonging to the class of sites with ω > 1. Thus, taking gene size into account, 2.34% of the codons per gene are estimated to have evolved adaptively in this set of genes. In total, there are 32 codons with ω > 1, and an average of 4.2 different amino acids per site. Assuming that a single nonsynonymous substitution lead to the observed residues, there are a total of 134.4 fixed substitutions for the six genes, which divided by the sum of the length of the branches in the phylogeny (172–253 My, supplementary fig. 1, Supplementary Material online) (Ramakrishna et al. 2002; Paterson et al. 2004; Rice Chromosome 3 Sequencing 2005), gives an estimated 0.53–0.78 adaptive substitutions per million years, that is, approximately five to eight adaptive mutations becoming fixed every 10 My across all species used in the analysis. Also, if there are 2.34% adaptive codons per gene, 4.2 substitutions/codon, 300 amino acids in the protein and approximately 12.000 genes involved in disease resistance, there should be approximately 1.39–2.05 adaptive mutations fixed every 1,000 years in any of the lineages in the phylogeny (supplementary fig. 1, Supplementary Material online). This analysis is only taking into account the replacement substitutions and not silent substitutions, which could also be adaptive (Komar et al. 1999; Kimchi-Sarfaty et al. 2007). Additionally, there are many insertions and deletions in the multiple species alignments and it is likely that some of them could have a functional effect. Finally, there is a chance that changes in regulatory regions, which have not been covered here, could also have an important effect in increasing the adaptive mutation rate to levels similar to those of pathogens (Herring et al. 2006; Taubes 2008).
Three of the genes identified here as evolving adaptively are known DRGs. Thaumatins, peroxidases, and barley mlo homologs have been previously shown to undergo changes in expression due to pathogen attack and to have a role in disease resistance (see Discussion). Although both thaumatin (2_8981) and peroxidase (2_7192) genes are mainly expressed in resistant plants (Pathogen Incompatible, PI1 library; supplementary fig. 5, Supplementary Material online), the former has a higher level of expression, which may suggest rapid response to pathogen stimuli or genetically determined upregulation during early seedling development. In fact, for uniscript 2_8981, 12 ESTs came from the PI1 library and 5 from other unrelated libraries (Steckel reliability value, R = 28.2; Pratt et al. 2005), indicating a low to moderate level of constitutive expression that increases significantly after inoculation with the pathogen. Because these ESTs were found only in challenged but healthy plants, it is arguable that they might play a significant role in resisting the pathogen's attack. Although there are many paralogs for both thaumatins and peroxidases in the genome of sorghum, only two of each of these genes appear to be upregulated in the available libraries.
The mlo homolog (2_11684) was the only one of these known DRGs expressed in SA-treated plants. In the genome of S. bicolor, 2_11684 is a member of a small family with 12 genes. This gene is a singleton in a chromosomal region, just like other seven members in the family, whereas other four genes are clustered together (Phytozome; Sbi_0.29522). Remarkably, 2_11684 has a large number of small internal exons surrounded by large introns, suggesting the possibility of alternative splicing.
The other three positively selected genes have not been previously implicated in disease responses and here we present additional information on them.
This single-copy gene is highly upregulated in S. bicolor after inoculation with C. sublineolum and is very divergent with respect to its rice ortholog (fig. 1). Ten of its 14 ESTs come from the Biotic Stress libraries, particularly from susceptible plants (supplementary fig. 5, Supplementary Material online). ML analysis at this gene identified four main regions of high amino acid sequence conservation, where most codons evolve under purifying selection (fig. 2). Two highly conserved regions have the sequence SESPYFGSSVHYG (located at residues 100–110) and ATRGDWWQGSLYY at the C-terminus of the protein. However, there are nine hypervariable sites (posterior probability > 0.90), making SESPY the gene with the most sites predicted to have evolved under selection in our analysis. Positions 51 and 93, for instance, occur amidst regions of strict conservation across all species of grasses strongly suggesting a functional role.
This gene is a member of the MADS-domain family, which encode eukaryotic transcription factors with sequence-specific DNA-binding characteristics. It belongs to the Myocyte Enhancer Factor 2 like subfamily of MADS genes, the plant-specific MIKC-type, and includes a K-box domain. This family of eukaryotic transcription factors include a DNA-binding and dimerization domain and has been shown to be important in homeotic regulation in plants (Marchler-Bauer et al. 2009) (NCBI, Conserved Domain Database, 07/2007). The rice gene Os01g0922800 (RefSeq peptide NP_001045235.1) is the ortholog of S. bicolor 3g044170 gene model (Phytozome) that corresponds to the uniscript 2_4543. There are many paralogs in rice and Arabidopsis and the few whose functions are known are involved in flower organ specification (Arora et al. 2007). Arora et al. (2007) studied the expression profiles of the MADS-box genes in rice and Arabidopsis and found that most of the basic structures and functions are conserved, but there is evidence of new function acquisition, duplication and subfunctionalization. There is no evidence in those model genomes of involvement of the ortholog of 2_4543 in disease response, and thus, it may be that the three ESTs found only in the PIC1 library were there just by chance. This gene is overall the most divergent of those presented here (table 2) and although it shows high conservation in the amino terminal portion of the protein that corresponds to the DNA-binding MADS-box domain, the ML tests suggest there are three codons that have changed repeatedly through nonsynonymous substitutions along the protein. The carboxy terminal portion of this protein is hypervariable and was not included in the ML analysis due to the difficulty of obtaining a satisfactory alignment and to the false positive errors this can generate. Such variability could reflect either a lack of constraint, positive selection or both.
In S. bicolor, 57.5% of the 14 ESTs that make up this uniscript come from the Biotic Stress libraries (table 2) and most of them were found in the salicylic acid (SA1) library, suggesting a change in expression response due to this plant defense signal molecule (supplementary fig. 5, Supplementary Material online). 2_3804 is an eIF5 and contains a putative zinc-binding C4 finger and the InterPro domains Armadillo-type fold (InterPro:IPR016024) as well as translation initiation factor IF2/IF5, N-terminal (InterPro:IPR016189).
Crystal structures have been determined for protein homologss of two of the genes identified here as having evolved under positive selection, namely, the thaumatin (PDB: 1RQW) and peroxidase (PDB: 1H5D) genes. Figures 3 and and44 show models of the sorghum proteins based on these available structures (see Materials and Methods). The Root Mean Square is 0.00 Å (using 150 alpha carbons) for thaumatin and 0.10 Å (with 264 alpha C) for peroxidase, which means that the protein sequences aligned very well to their respective homolog's structure, because the alpha carbons of both are very close in space. Moreover, the final total energy of the models was −7,080.58 kJ/mol for thaumatin and −5,223.24 kJ/mol for peroxidase, and there were no amino acids clashing with each other. We used these structures to identify the position of the positively selected amino acid residues in the tertiary structure. In both cases, all the amino acid sites identified as being under positive selection are located on the surface of the protein, where they may be in contact with the solvent and other molecules. In the case of the thaumatin protein, most of the positively selected residues are located in the loops of the structure and on the fringe (fig. 3A and B), toward the concave side of the protein (fig. 3C). No positively selected sites were found either on the central concave portion or on the opposite side of the protein (fig. 3D). The core of thaumatin-like proteins is made up by the central portion of the peptide sequence and folds into 11 beta sheets (De Vos et al. 1985). Shatters et al. (2006) found that this core is highly conserved and that most of the variation observed across very divergent species is found in the surface exposed loops, which is consistent with our determination of the position of putatively selected residues in these regions of the protein (fig. 3).
The structure and biochemistry of the horse radish peroxidase C has been extensively studied and the location of the active site pocket and key functional residues are known (Gajhede et al. 1997). A hypervariable region between alpha helices F and G, the F′ and F″ helices, identified in Class III peroxidases (higher plant peroxidases, (Gajhede et al. 1997) (corresponding to residues 153–180 in our model) appears to be conserved in our set of species (fig. 4). However, we found a hypervariable region, the E helix, containing three amino acid sites with posterior probabilities greater than 0.90 of being under positive selection and in which conserved and variable residues alternate. This helix is located to the side of the active site and several of these amino acid sites interact directly with the heme group (Gajhede et al. 1997). Additionally, in the peroxidase (fig. 4), seven of the eight positively selected sites are located within 20 Å of the oxygen atoms in the propionate side chain of the tetrapyrrole or heme ring (in red), and primarily toward one side of the protein. Four residues, Gln117, Ser121, Leu122, and Ser124, are located in alpha helix E (fig. 4B) located 90° to the side of the catalytic site and some amino acids in this alpha helix may be in contact with and help coordinate the “heme” group, due to the positioning of the heme and in particular due to the proximity of that helix to the methyl and vinyl side chains of one of the pyrrole rings. Remarkably, in the sorghum peroxidase, two positively selected sites, Lys43 and Ser148, are located at the edge of the active site cleft, 8 and 6 Å, respectively, from the oxygen atoms in the propionate side chain, suggesting that they may be important in the catalytic activity of the protein (fig. 4). Moreover, an additional positively selected site is 7 Å from one of the calcium ions, essential for catalysis in peroxidases.
Two of the genes identified as having evolved under positive selection are members of gene families that include copies showing high similarity, a feature in common with many other plant genes. Peroxidases and thaumatins have paralogs in multiple different positions in the genomes of plants and can be arranged as clusters or single-copy genes. Thus, unknowingly using paralogs in the evolutionary analyses could cause over or underestimations of the number of genes found to be under positive selection, as well as of the residues involved in that process. Including paralogs could also result in the identification of signals of adaptation due to subfunctionalization, instead of adaptation (in the form of increased resistance to the pathogen) due to changes in true orthologous genes, the main goal of this study. In order to assess these possibilities, we first used the information available from the rice and sorghum genome projects (Gramene; Phytozome) to determine whether the genes used from these two species were located in syntenic positions. We then implemented a paralog permutation strategy to evaluate the robustness of the analysis to the use of increasingly divergent paralogs.
In both peroxidase and thaumatin, the genes used in the previously described analyses are located in the corresponding regions in the genomes of rice and sorghum (supplementary figs. 6 and 7, Supplementary Material online). Both kinds of genes are arranged locally as small clusters in both genomes: peroxidases in chromosomes 7 (26.67 Mb) and 2 (76.6 Mb) of japonica rice and sorghum, respectively, and thaumatins in chromosomes 12 (26.8 Mb) and 8 (54.4 Mb) in the same order. Within these clusters, each of the copies in rice are the best reciprocal ortholog, according to Blast searches (precomputed at Gramene), to all the copies in the sorghum cluster, and vice versa, that is, members of a cluster are very similar to each other and equally divergent to each one of the copies in the other species (supplementary fig 6 and 7, Supplementary Material online). Interestingly, however, a phylogeny of the peroxidase genes from several cereal species (supplementary fig. 8, Supplementary Material online), including genes located in clusters found in syntenic positions in sorghum and rice (including 2_7192, the sorghum BSU gene) shows that the genes at both flanks of the clusters are more similar to its ortholog in the other species, than to genes within its own genome, that is, the genes at the edges of the cluster have a 1 to 1 relationship of orthology, whereas those in the center of the cluster vary in their similarity. Genes in positions 1 and 2 from left in both species’ clusters, as well as those in the right flank of the cluster have a single ortholog. The sorghum gene that corresponds to the 2_7192 uniscript is located in the center of the cluster (Sb02g042860.1) and it has the same position, third from left, as the rice gene used in the initial comparison (Os_NP_001060628.1). However, this rice gene clusters only with its neighbor, Os_NP_001060629.1, suggesting a recent duplication or concerted evolution. The next genes from left, Sb02g042870.1, in sorghum, and Os_NP_001060631.1, in rice, appear in distant positions in the phylogeny. Additionally, there are two gene fragments or pseudogenes in sorghum, (Sb02g042855.1 and Sb02g042875.1) amidst the complete homologous ORFs. Hence, the sorghum and rice genes used in codeml evolutionary analyses are the correct orthologs based on reciprocal Blast best hits, synteny and genealogy.
Because the method used to identify the orthologs in these two species was the same used to identify orthologs in other cereal species, it may be safe to argue that there is a low probability that we used paralogs in the analyses. Nevertheless, it is a possibility because we did not have enough information to determine synteny for the genes of other species. Therefore, in order to assess the effects of using of paralogs on the number of positively selected sites, we conducted a series of permutations with increasingly divergent paralogs and compared the results. To do this, we used the peroxidase phylogeny (supplementary fig. 8, Supplementary Material online) to generate two different data sets by first, fixing the orthologous genes for sorghum and rice, and changing the other three sequences (Paralog Set 1, PS1, supplementary data: Alignments, Supplementary Material online) and, second, leaving only the sorghum gene unchanged (Paralog Set 2, PS2). The first data set (PS1) included the most divergent sequences from each species, as indicated by their long branches, but found in the position in the genealogy corresponding to the species phylogeny (supplementary fig. 8, Supplementary Material online). The second data set (PS2) included more divergent paralogs, located in outer branches of the phylogeny, which do not correspond to the species tree and that can be expected to have diverged faster or for a longer time (supplementary fig. 8, Supplementary Material online). We used the full-length alignment of the coding regions in all the replicates for the ML analyses for these two sets, without eliminating the regions with gaps, but we visually inspected and modified the ClustalW alignments to improve them. We then compared the results of these two sets with those from the set of orthologs without gaps (NG, average of seven different runs, supplementary data: Comparison of codeml results for 2_7192, Supplementary Material online).
The analysis of Paralog Set 1 (PS1) showed that there were 13 and 4 putative positively selected sites with posterior probability greater than 0.5 and 0.9, respectively, compared with the 17 and 6 sites identified in the orthologs without gaps (NG) analysis. The slight difference is not statistically significant (Fisher exact test, P > 0.05), suggesting that the choice of sequences did not affect the number of sites identified to be under positive selection. Remarkably, however, there were five sites that were identified as being under positive selection in both the NG and PS1 sets, even though the sequences and alignments used were different. Four of these five sites had posterior probability >0.88 in the NG set and they include K43 and S148, located at the edges of the active site cleft (fig. 4A), as well as Q117 and L122 located in the hypervariable E helix (fig. 4B). Additionally, S246 had posterior probabilities of 0.73 and 0.95 in the NG and PS1 sets, respectively and is located in the L alpha helix (amino acid numbers refer to the NG alignment).
For the most divergent paralogs (PS2), the number of positively selected sites decreased to 11 and 1 sites with posterior probabilities >0.5 and >0.9, respectively, as the average of four replicates with the same sequences but with different unrooted trees provided by the user. Surprisingly, there were five sites identified in common both with the NG set and four with PS1. These sites are P40, K43, Q117, L122, and S148. The amino acid residue P40 had a posterior probability of 0.74 in NG but had an average of 0.998 in PS2, which makes it interesting, particularly due to its close position to the active site cleft. All the other residues had posterior probabilities ranging from 0.5 to 0.77 (supplementary data: Comparison of codeml results, Paralog Set 2, Supplementary Material online). All sites except P40, would be considered doubtful in PS2 due to their low posterior probability, were not for the fact that five of them were also found in the other analyses with different sequences and alignments.
In total, there were three sites (105E, 121S, and 124S) with posterior probability >0.9 in the NG set not found in either PS1 or PS2 and one site (227S) in PS1 not identified as under selection in NG. Interestingly, the three sites with posterior probability >0.9 found exclusively in the NG set (105E, 121S, and 124S) are located in the hypervariable E helix, where other two residues (117Q and 122L) were identified as positively selected in the three sets of sequences (NG, PS1, and PS2).
In this study, we identified candidate defense response genes by combining information from their expression level, proportion of sequences obtained from the Biotic Stress subgroup of EST libraries, interspecific divergence between sorghum and rice, and prior annotation of the genes in the species involved. We then assessed whether these candidate or known DRGs showed a pattern of substitutions consistent with adaptive evolution. We did this under the hypothesis that millions of years of antagonistic coevolution between cereal-hosts and fungal-pathogen populations may have left a signal of positive selection in these genes. Our strategy for the identification of DRGs took advantage of the increasing abundance of genomic sequences across the grasses, as well as expression data from one or several different species. This allowed us to identify 773 genes whose expression patterns may change in response to the fungal pathogen C. sublineolum and to test a subset of 10 genes for evidence of adaptive evolution. Using this approach, we demonstrate for the first time that six classes of DRGs identified here evolve adaptively, including three previously known PRPs and three genes not implicated before in disease response, all of which are particularly interesting candidates for functional validation studies.
In our initial screen, we selected genes having 50% of their ESTs coming from the Biotic Stress group of libraries and a minimum of three ESTs, on the assumption that genes involved in disease resistance would be enriched in this set. The comparison of the annotation for this set of genes with a control set composed of constitutively expressed S. bicolor genes, revealed that there are several functional categories of genes that show significantly different patterns of expression, in particular, catalytic activity, hydrolase activity, and transcription were substantially reduced, suggesting a decrease in basal metabolic functions and gene expression during active defense or pathogenesis. On the contrary, kinase activity increased, suggesting active signal transduction and amplification of the signal of pathogen attack, events typical of hypersensitive responses and pathogenesis (Frye et al. 2000; Popescu et al. 2009). In sugarcane, kinases are one of the most abundant domains found in ESTs; many of these are found in genes involved in disease resistance; and a few kinase containing genes were exclusive to biotic interaction libraries (Vettore et al. 2003). Other GO categories were present only in the BSU set, including genes with calmodulin-binding activity, involved in calcium signaling which is one of the first events in pathogen detection (Heo et al. 1999); oxygen binding, probably associated with the reactive oxidative burst (McDowell and Dangl 2000); and lipases, possibly involved in the generation of salicylic acid or jasmonic acid precursors, important in signal transduction molecular mechanisms and in the establishment of systemic acquired resistance (Salzman et al. 2005). Additionally, many of the uniscripts represent genes known to be involved in disease resistance or have domains typical of that kind of genes. Some of the most important categories of genes found include: LRR domain proteins, kinases, transcription factors, and several kinds of PRPs such as chitinases, glucanases, peroxidases, and thaumatins (Sticher et al. 1997).
By using BlastN and BlastP values between sorghum and rice orthologs as a measure of divergence, we were able to identify genes that have above average divergence over the 50 Ma because these two species had a common ancestor (Paterson et al. 2004) (supplementary fig. 1, Supplementary Material online), possibly due to recurrent events of positive selection, as expected in an arms race scenario. Among these genes, ML analyses provided statistical support for six genes with selectively driven elevated levels of amino acid substitution, consistent with our hypothesis. The accuracy and power of the kind of analysis used here has been thoroughly studied (Anisimova et al. 2001, 2002) and compared with other tests (Wong et al. 2004). The codeml ML method has been shown to outperform other methods in several different selective scenarios and the LRTs used to compare the models have been determined to be conservative, showing very low type I error rates (Wong et al. 2004). Additionally, because the M7 model is very flexible and includes sites evolving in a range from strict purifying selection to neutrality, the comparison of M8 versus M7 is a very stringent test of positive selection (Anisimova et al. 2001). In addition to the codeml analyses, we tested for evidence of positive selection for these genes using the FEL and REL methods implemented in HyPhy. We only found evidence of positive selection for the peroxidase gene using the FEL method. The sites identified by the FEL method were the same as those identified by codeml, which gives strong support to the hypothesis that peroxidases evolve under positive selection due to their importance in basal and broad spectrum disease resistance (Johrde and Schweizer 2008). Also, the fact that this was the only gene identified by FEL as having positively selected sites is not surprising because the peroxidase gene had the highest value of ω in codeml, and therefore the strongest signal of positive selection, even though only five taxa were used in the alignment. Additionally, all codon-based ML methods for detecting positive selection have low power when the number of taxa is small, but the discrepancy between codeml and the methods implemented in HyPhy suggest either lower power in HyPhy or to higher false positives in codeml. Because there was an agreement in the sites identified in the peroxidase gene using different sets with codeml, but HyPhy found only the two sites with the highest posterior probability in codeml, it appears that the low number of sequences in the alignment resulted in a lower power for detecting positive selection in HyPhy. Both FEL and REL algorithms appear to have been optimized using much larger and divergent data sets from viral effector proteins and might not perform effectively with smaller and more conserved sets as those used in this study. However, as more plant genomes get sequenced, all of these methods will become very useful to develop databases of results of positive selection tests that can be actualized on the fly, and that can allow the rapid study of the cereal “selectome,” just as has been done with vertebrate genome sequence data (Nickel et al. 2007; Proux et al. 2008), and to identify interesting candidates for genetic, biochemical, and physiological studies.
The permutation analysis we performed for peroxidases was useful to identify amino acid residues that are consistently identified as having evolved under positive selection, even though the analyses were done with different sets of sequences and with or without removing the gaps in the sequences. These permutations represent multiple independent biological replicates of the evolutionary experiment of positive selection and give confidence that the identified residues are functionally important. The position of the positively selected residues in critical sites of the crystal structure of the proteins further reinforces that possibility. The use of paralogs led to the identification of a few of the same sites, even in sets including very divergent paralogs (P40 in PS2), suggesting that those sites may have also evolved adaptively during the process of subfunctionalization. Importantly, the inclusion of paralogs does not lead to an increased number of false positives, as all the sites with posterior probabilities >0.9 in PS1 and PS2 were also found in the original set NG, and when we used more divergent paralogs in the analysis, the number of sites identified as positively selected decreased to one site with posterior probability >0.9. This may be due to the effect of having substitutions distributed in different positions in the paralog sequences as a consequence of different selective pressures and evolutionary histories. In multiple alignments of paralogs, there are many substitutions located in different positions of the alignment for each sequence, instead of several substitutions in the same few positions repeatedly, as is the case in the orthologs (fig. 2), and the ML analysis does not identify those sites as positively selected. Therefore, not only the number of positively selected sites predicted decreased when using very divergent paralogs, but the strength of the prediction was lower as well. Thus, using multiple highly divergent paralogs leads to low power to detect positively selected sites because such set of sequences would more closely resemble neutrally evolving sequences. Moreover, although several other sites had lower posterior probabilities, the fact that five sites were reliably identified in the three sets of sequences makes it unlikely that these sites are false positives. It is important to note, however, that poor alignments with the same sequences can lead to the identification of many false positives, as has been reported in the literature.
Thaumatins can apparently not only increase the permeability of hyphal membranes but can also hydrolyze polymeric B-1,3-glucans (Grenier et al. 1999). Additionally, in potato, the overexpression of tobacco PR-5, a thaumatin-like protein, delayed the onset of symptoms after inoculation (Strange 2003) and increased the resistance to Phytophthora infestans (Sticher et al. 1997). Some thaumatins can be induced by osmotic stress or developmentally during cherry ripening (Fils-Lycaon et al. 1996) and are abundant in the seed aril of Thaumatococcus danielii, where the thaumatin is a sweet protein (De Vos et al. 1985), which suggests a double functionality in seed dispersal and defense from fungi, or at least an ancestral role in defense and a neofunctionalization for use in seed dispersal. In groundnut, thaumatin-like proteins were induced by foliar application of Pseudomonas fluorescens and appear to have had a role in the control of leaf spot and rust diseases (Meena et al. 2000). Transgenic wheat plants expressing either a rice-derived thaumatin-like, as well as other line coexpressing chitinase and glucanase transgenes, both categories previously shown to evolve adaptively (Bishop et al. 2000, 2005), showed increased resistance under greenhouse conditions, in the form of delayed onset of disease symptoms. However, these lines did not show any increased resistance to wheat scab under strong inoculation pressure in the field with Fusarium graminarum, suggesting an inefficient initial response to pathogen attack (Anand et al. 2003). Indeed, a large number of PRPs form part of a quantitative resistance mechanism and act synergistically to deter pathogens (van Loon and van Strien 1999).
Peroxidases are enzymes that generate highly toxic molecules collectively referred to as reactive oxygen intermediates (ROIs), such as superoxide anion and hydrogen peroxide, essential in the initial part of the general defense mechanism known as the reactive oxygen burst. This mechanism, common to all eukaryotes, is the first line of defense because it does not require transcription or translation to occur because peroxidases can be stored in the peroxisome, and leads rapidly to programmed cell death, which isolates and kills the pathogen. ROIs may also be used to kill an unrecognized pathogen when other signal transduction pathways reveal that the cell is under attack (Johrde and Schweizer 2008). The S. bicolor 2_7192 peroxidase is expressed mainly as a result of pathogen attack, that is, ESTs were observed only in the PI1 and PIC1 libraries but not in SA1 (supplementary fig. 5, Supplementary Material online). In agreement with this result, in barley, fungal pathogens elicit the increased expression of several PRPs including peroxidase, although the level of salicylic acid (SA) remains low (Vallélian-Bindschedler et al. 1998). Additionally, generation of ROIs occurs both in compatible and incompatible reactions, initially as a weak response, but a second more intense oxidative burst occurs exclusively in incompatible interactions 3–6 h after inoculation (Strange 2003). This is consistent with the observed upregulation of the 2_7192 peroxidase in S. bicolor, where the gene is found mainly in the incompatible reaction (supplementary fig. 5, Supplementary Material online).
In barley, the mlo locus confers broad spectrum resistance to several isolates of the fungal pathogen Blumeria (= Erysiphe) graminis f. sp. hordei, which causes barley mildew. Interestingly, this form of resistance is recessive in inheritance and has proven to be durable (Strange 2003). In tomato, a naturally occurring loss-of-function mutant of the barley mlo homolog also causes broad spectrum resistance to powdery mildew (Bai et al. 2008). Büschges et al. (1997) have suggested that the Mlo wild type allele, which encodes a membrane anchored 60-kDa protein, acts as a negative regulator of defense and that a complete inactivation of this gene leads to constitutively active mechanisms of disease including the expression of multiple PRPs. Additionally, the recessive phenotype shows lesions in axenically grown plants, which indicates an extreme sensitivity to offensive environmental factors, including UV light. Multiple alleles for this gene have been generated by mutagenesis conferring different degrees of lesion mimic phenotype and disease protection. Furthermore, some recombinants of these alleles restore the wild type ORF and the susceptible phenotype (Büschges et al. 1997). Although the sites identified by mutagenesis as being important for the function of this protein in barley do not correspond to the sites identified by us as being under positive selection, this was expected because the former sites have a large deleterious effect in the function of the protein. These mutations usually give a fitness penalty in the absence of the pathogen making it unlikely that they could become fixed in natural populations (Büschges et al. 1997).
The S. bicolor uniscript 2_3804 codes for an eIF5), a protein that makes critical connections with the 40S ribosome in vivo, forms a multifactor complex with eIF1, eIF2, and eIF3 that stimulates Met-tRNAiMet binding to 40S ribosomes and promotes scanning or AUG recognition (Valasek et al. 2003). eIF5-mediated GTP hydrolysis of the ternary complex (ribosome, eIF2, Met–tRNA, and GTP) occurs during the binding of the 60S ribosomal subunit to form the functional 80S ribosome (Robaglia and Caranta 2006). No disease phenotypes have been associated with mutations in eIF5 and because of its essential function in translation it is possible that most induced mutations would be lethal. Accordingly, it is reasonable to expect that such a protein could be an important target for pathogen molecules intended to subvert the host plant. In fact, translation initiation factors are specifically used by potyviruses (Robaglia and Caranta 2006) and naturally occurring variation at eIF4 results in recessive resistance to potyviruses in Capsicum (Yeam et al. 2007) and in Hordeum (Stein et al. 2005). There are examples of recessive inheritance of resistance to anthracnose in S. bicolor (Mehta et al. 2005) and it would be interesting to determine if mutations in translation factors including eIF5 are the underlying cause.
Finally, the ortholog of the single-copy SESPY gene (S. bicolor 2_8705) in rice (the full-length cDNA AK121844.1, also described as the unigene Os11961) is under a Sheath Blight QTL, but there are also several NBS-LRR genes close to this sheath blight QTL (Wisser et al. 2005). An ortholog of this gene, identified through reciprocal Blast comparisons, is also found in Arabidopsis thaliana, AT5G59080 and is annotated as an unknown protein. Luhua et al. (2008) found that constitutive expression of this gene in A. thaliana conferred significantly enhanced tolerance to oxidative stress but not to osmotic or salinity stress. Moreover, the maize ortholog, determined by a match to this gene in NCBI using 2_8705 as a query, has been implicated in stress resistance (Jia et al. 2006). Therefore, the statistical support for adaptive evolution in this gene is strong and there is evidence originating from several species suggesting that this gene is involved in disease/stress resistance and may warrant further analysis.
Although purifying selection rules the evolution of the vast majority of amino acid sites of the studied genes, positive selection at a few critical residues seems to be an important component of the evolution of many genes involved in disease resistance, including those essential for pathogen or disease recognition, signal transduction pathways, and direct attack toward the pathogen. Products of at least three of six positively selected genes presented here appear to play some role in plant defense, and all are interesting candidates for functional validation through silencing (Kessler et al. 2004), biochemical tests of molecular interaction with pathogen effector molecules (Huitema et al. 2004; Tian et al. 2004, 2005, 2007), as well as directed mutagenesis of target sites identified through evolutionary and structural analyses. Such studies may reveal the specific molecular interactions that are involved, providing hypotheses for the nature of selective forces driving evolution at these loci. For instance, Bishop et al. (2005) showed that the positively selected sites found in plant polygalacturonase inhibitors correspond to sites proven to be functionally important in biochemical studies and that other sites predicted to be under positive selection may also be essential due to their position in the active site of the enzyme.
The determination that some of these genes have evolved under positive selection suggests that they may be targets for pathogen's virulence genes, that is, pathogen-derived molecules designed to avoid recognition or undermine host defenses. Alternatively, they can be parts of direct defense mechanisms that change rapidly to adapt to an ever-changing pathogen. Thus, the identification of the conserved regions and particularly the derived mutations that made these genes more advantageous to the different grass lineages reveal the critical regions of the protein, as well as pathways that pathogens tend to attack. This information may help identify, conserve, and use individuals from natural populations that possess functionally different alleles and to quantify the extant genetic diversity at functional sites. This information can also lead to the rational design and engineering of proteins involved in plant disease resistance.
The number of candidates identified in this study is rather small for a genomewide scale analysis and this was due to the limitations of the data available for several of the large and repetitive genome species, particularly maize, sugarcane, and wheat, at the time of analysis. Although there are numerous ESTs from these species, these data produce many incomplete transcripts for an unknown fraction of the total genes, making it difficult to determine orthology through reciprocal best hits in Blast comparisons. Additionally, sequencing errors are also common in the kind of data used and it is not a trivial matter to differentiate them from very similar paralogs. Therefore, in this study, we preferred to err in the side of caution and be very careful to avoid including false positives caused by the mentioned limitations. Nevertheless, the fact that 6 of these 10 genes are under positive selection is quite striking and provides 5 new candidates for future functional analyses.
Another factor that could affect the identification of DRGs using the strategy presented here is the degree of conservation. A few canonical Resistance genes (NBS-LRRs) appear to evolve more slowly and to maintain old alleles, likely due to the effects of balancing selection (McDowell and Simon 2006). Also, Bakker et al. (2008) conducted a molecular diversity survey of 27 defense response genes involved in the salicylic acid, jasmonic acid, and ethylene pathways and found that most of these genes show purifying selection. Thus, by focusing on moderately to rapidly divergent genes, it is likely that we missed several legitimate and important DRGs. In fact, there are several highly upregulated but conserved genes in sorghum (fig. 1) that could be important in disease response and warrant further investigation. It is possible that a molecular population genetic analysis of such conserved genes will show a similar scenario as that described by Bakker et al. (2008), because low divergence should be correlated to low nonsynonymous polymorphism, as a result of strict purifying selection. One of the genes described here as only having signals of purifying selection but with prior information of a role in defense, glycosyl transferase (2_9185), was determined to be more conserved than the cutoff value we used in this study (BlastP, E = 10−50) once we reconstructed the complete ORF. This gene was a false positive for divergence but may not be a false positive for disease resistance, even though its evolution is ruled by negative selection. Highly conserved genes, in which nonsynonymous substitutions would lead to reduced fitness most of the time, such as those essential for defense signal transduction pathways (Bakker et al. 2008), are probably guarded by R genes in order to protect the individual plant at the expense of individual cells (Dangl and Jones 2001; Van der Hoorn et al. 2002).
In our study, however, we decided to identify candidate disease resistance genes, not only by the information on the literature, but also based on high upregulation under pathogen attack and particularly based on high divergence. Therefore, this fundamental difference in our approach makes it more likely to identify genes under positive selection. Indeed, a few of the genes identified using this strategy and that showed high values of dN/dS ratio are known PRPs, namely, thaumatins and peroxidases. Other genes found here may also have an important role in quantitative disease resistance and can now be functionally tested.
Additionally, we feel that this study demonstrates the feasibility of identifying genes that may be important in other kinds of adaptive process, for example, drought tolerance in sorghum, and that it should be simple and useful to implement this kind of analyses in automated pipelines, for instance in Gramene, that will allow end users to identify functional polymorphisms and take full advantage of the rapid advances in genome sequencing and the exponential increase in high-quality sequence data, full-length cDNAs, proteomic analyses, expression profiles, and several other kinds of functional genomic studies currently being conducted in the cereals. As the number of full-length coding sequences increases for the different cereal genomes, this type of analysis will be easier to do for thousands of genes and it is likely that signals of adaptive evolution will be found in many more DRGs involved in several aspects of disease resistance.
This work was supported by the National Science Foundation (0115903 to S.K.). We thank Sharon Mitchell, Alex Casa, Peter Moffett, and three anonymous reviewers for valuable comments on the manuscript. We also want to give special thanks to Lee Pratt for the use of MAGIC and valuable comments during the development of this project. Many thanks to Jarek Pillardy, from CBSU, who provided suggestions and guidance with the bioinformatics analysis; Chenwei Lin helped with Perl scripts to parse the InterProScan output; and Immanuel Yap for help with the use of GRAMENE. A.Z.M. is supported by the Costa Rica-USA Foundation for Cooperation (CRUSA) through a special Fulbright scholarship in biotechnology and by the Institute of Genomic Diversity (Cornell University, Ithaca, NY).