Rapid development of sequencing technologies and bioinformatic tools has made complete genome sequencing possible for many species, providing a starting point for unraveling the tremendous genetic variation and diversity at the genome scale. In several model organisms examined to date, such as human, mouse, Arabidopsis, rice, and maize, genome-wide patterns of genetic variation can be captured by sampling a relatively small number of genomes [14]. By resequencing two sweet and one grain sorghum inbred lines, we uncovered nearly two million SNPs and indels, along with large numbers of PAVs and CNVs. This is a first report on the genome-wide patterns of genetic variation in sorghum, which will be valuable for further genotype-phenotype studies and for molecular breeding of this important C4 crop.
Our study shows that the proportions of genic SNPs located in coding regions, intronic regions, and UTRs are 42.3%, 50.2%, and 7.5%, respectively. Compared to Arabidopsis and rice [14], the intronic regions of sorghum genes harbor more SNPs. This might be related to the larger size of sorghum introns: the average intron size is 168 bp in Arabidopsis and 397 bp in rice, but 444 bp in sorghum. Our results also demonstrate that, in sorghum, the proportions of large-effect SNPs resulting in premature stop codons, alteration of initiation methionine residues, and disruption of splicing donor or acceptor sites are remarkably similar to those reported so far in Arabidopsis and maize [22], but different from those in rice [20]. Furthermore, we found that 16 SNPs removed annotated stop codons and resulted in longer open reading frames, a number substantially smaller than that reported in maize (1,087).
It is known that transposable elements are abundant in sorghum as well as in other cereal genomes [40]. Because the genome annotation is not perfect, caution should be exercised when analyzing the effects of SNPs. Indeed, we found that transposase genes, pseudogenes and low-confidence genes tended to have high non-synonymous-to-synonymous ratios in comparison with bona fide genes. This was reflected in the Pfam SNP annotations as well as in the analysis of so-called large-effect SNPs, which are predicted to disable gene functions. Most of the SNPs resided in receptor-like kinases, PPR repeats, disease resistance NB-ARC genes and other genes with multiple effects on stress responses. These genes also exhibited high non-synonymous-to-synonymous ratios, further supporting the notion from studies in other species that an arms race in plant-pathogen interactions results in diversification of the pathogen- or microbe-associated molecular pattern recognition receptors in plant genomes [54]. Notably, the highest non-synonymous substitution ratios were found in X8 domain and glycoside hydrolase family 17 (glucan endo-1,3-beta-glucosidase) genes, which has not been reported in Arabidopsis, rice [20] or maize [22]. Current annotations show that few low-confidence genes are included in these two Pfam gene families, although we cannot rule out the possibility that these genes are pseudogenes or are truncated by transposable elements. Further studies are required to determine whether they are related to specific biological processes in sorghum. However, the function of these genes in carbohydrate binding and in cell wall biosynthesis provides clues for manipulating genes of interest for biofuel production.
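As an illustration of the comparison described above, the following minimal sketch ranks gene families by the raw ratio of non-synonymous to synonymous SNP counts. It assumes that the ratio is a simple count ratio, which may differ from the normalization actually used in this study; the function names, family names and counts are hypothetical and are not taken from our data.

```python
from typing import Dict, List, Optional, Tuple


def nonsyn_syn_ratio(nonsyn: int, syn: int) -> Optional[float]:
    """Return the raw non-synonymous / synonymous SNP count ratio.

    Returns None when there are no synonymous SNPs, because the
    ratio is undefined in that case.
    """
    if nonsyn < 0 or syn < 0:
        raise ValueError("SNP counts must be non-negative")
    if syn == 0:
        return None
    return nonsyn / syn


def rank_families(counts: Dict[str, Tuple[int, int]]) -> List[Tuple[str, float]]:
    """Sort families by descending ratio, dropping undefined ratios.

    `counts` maps a family name to (non-synonymous, synonymous) SNP counts.
    """
    defined = []
    for name, (nonsyn, syn) in counts.items():
        ratio = nonsyn_syn_ratio(nonsyn, syn)
        if ratio is not None:
            defined.append((name, ratio))
    return sorted(defined, key=lambda item: item[1], reverse=True)


# Hypothetical counts, for illustration only (not results from this study).
example_counts = {
    "family_A": (30, 10),
    "family_B": (12, 12),
    "family_C": (5, 0),  # no synonymous SNPs: ratio undefined
}
ranked = rank_families(example_counts)
```

Families without synonymous SNPs are excluded from the ranking rather than assigned an arbitrary value, so that small families do not dominate the top of the list.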
In sorghum, the 14 gene families enriched with large-effect SNPs comprise genes encoding DUF proteins of unknown function or include transposons, which appear to be nonfunctional but may affect genetic variation at the genome scale. Furthermore, gene families involved in biotic and abiotic stress tolerance, which do not contain transposons, are also enriched with large-effect SNPs. For instance, over-expression of lecithin:cholesterol acyltransferase can increase lipid metabolism and membrane fluidity and hence resistance to heat and/or cold shock (United States Patent Application 20050150007), whereas chalcone synthase in flavonoid biosynthesis and stilbene synthases in phytoalexin biosynthesis play important roles in sorghum disease resistance [56]. None of these gene families was reported to be enriched with large-effect SNPs in Arabidopsis, rice or maize. This could be due to genome- or species-specific diversity, or could result from the prediction algorithms used. Alternatively, it may be related to the limited number of sorghum lines used, which have diverse genetic relationships.
This effort also uncovered substantial numbers of indels and PAVs in the sorghum genomes. Indels whose lengths are not multiples of 3 bp were particularly uncommon in coding regions but relatively common in non-coding regions. This implies that most frameshift mutations are harmful to sorghum survival. The spectrum of gene families affected by indels and PAVs was similar to that of large-effect SNPs. This suggests that, although the origins and scales of the affected genome segments may differ, SNPs, indels and PAVs may share similar survival and distribution patterns, at least in terms of the gene families affected. CNV studies in plants lag behind those in animal and human models. Recent studies in maize showed the potential contribution of CNVs to the heterosis of this crop during domestication and to disease responses [22]. CNVs also shaped the genome diversity of progeny in the immediately following generations in Arabidopsis. In the sorghum genomes, CNVs were present in several thousand genes, and some of the commonly affected genes are involved in basic biological functions as well as in sugar- and bioenergy-associated traits. How this variation is associated with phenotypic variation is a new direction for future research.
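The 3 bp rule for indels discussed above can be sketched as follows. This is a minimal illustration, assuming that indels are given as VCF-style reference and alternative allele strings that share an anchor base; the function names and the example alleles are hypothetical and do not come from our call set.

```python
from typing import Iterable, Tuple


def is_frameshift(ref: str, alt: str) -> bool:
    """Return True if the indel length is not a multiple of 3 bp.

    The indel length is the absolute difference between the lengths
    of the reference and alternative alleles.
    """
    delta = abs(len(ref) - len(alt))
    if delta == 0:
        raise ValueError("alleles have equal length; not an indel")
    return delta % 3 != 0


def frameshift_fraction(indels: Iterable[Tuple[str, str]]) -> float:
    """Return the fraction of indels that would shift the reading frame."""
    indel_list = list(indels)
    if not indel_list:
        return 0.0
    shifted = sum(is_frameshift(ref, alt) for ref, alt in indel_list)
    return shifted / len(indel_list)


# Hypothetical alleles, for illustration only.
coding_indels = [("A", "ATTT"), ("ACGT", "A"), ("A", "AT")]                # lengths 3, 3, 1
noncoding_indels = [("A", "AT"), ("ATT", "A"), ("A", "ACCC"), ("AG", "A")]  # lengths 1, 2, 3, 1

coding_fraction = frameshift_fraction(coding_indels)
noncoding_fraction = frameshift_fraction(noncoding_indels)
```

Comparing the two fractions between coding and non-coding regions gives a simple summary of the depletion of frameshift indels in coding sequence.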
The resequenced sorghum lines comprised two elite sweet sorghum lines and one local elite Chinese grain sorghum line. We identified genetic variation in 1,442 genes differentiating sweet and grain sorghum. Some of these genes are involved in the starch and sucrose metabolism pathway and in the lignin- and coumarin-biosynthesis-associated phenylpropanoid biosynthesis pathway; these are obvious candidates for sugar and biofuel production and deserve further study. Five genes in the starch and sucrose metabolism pathway were identified; they are located on chromosomes 2, 6 and 9. In the phenylpropanoid biosynthesis pathway, the cinnamyl-alcohol dehydrogenase gene (Sb06g028240, encoding EC 1.1.1.195) on chromosome 6 plays a central role in lignin biosynthesis. Previous genetic analyses have identified several quantitative trait loci controlling stem Brix content, grain yield, plant height and biomass on the same chromosomes [35]. However, because links between the physical genome map and the genetic linkage maps are lacking, it is hard to judge whether these genes and quantitative trait loci co-localize, and further genetic and functional genomic studies are required to characterize the functions of these genes. Some of these gene families and pathways may not be directly associated with sugar and biofuel traits, but may rather reflect variation inherited from their different origins and/or caused by breeding selection. It is known that sweet sorghums are of polyphyletic origin, arising from the kafir, caudatum, bicolor and other grain sorghum types [37]. Furthermore, using the BTx623 genome as a reference, the Chinese kaoliang line Ji2731 was found to harbor considerably more genetic variation than the other two lines (Additional file 6). Further genome-wide analysis with a panel of sweet and grain sorghum lines, close relatives of sorghum, and Chinese kaoliang is required to clarify these complex relationships.